CN110334300A - Text-assisted reading method for public opinion analysis
- Publication number: CN110334300A
- Application number: CN201910621253.3A
- Authority: CN (China)
- Prior art keywords: text, algorithm, analysis, sentence, public opinion
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a text-assisted reading method for public opinion analysis and belongs to the field of natural language processing. The method first extracts the main text from web pages of various types using a unified approach; it then performs named entity recognition on the text for entities such as persons, places, and organizations and highlights them; finally, it automatically summarizes the text, ranks the sentences by importance, and highlights them. The invention addresses the difficulty of extracting key points and the low reading efficiency that public opinion analysts face when reading large volumes of text. It can be used for assisted reading in public opinion analysis and lets readers grasp the gist of a text quickly.
Description
Technical Field
The present invention relates to text-assisted reading methods and belongs to the field of natural language processing.
Background
Public opinion analysis is a technique that collects public opinion information in real time and analyzes it statistically to help decision makers reach sound decisions. The information collected can be roughly divided into structured data (such as social networks and emission indices) and unstructured data (such as user comments and news text), with unstructured data making up the majority. For a given public opinion event, in addition to statistical data analysis, analysts often have to read a large number of news reports before they can produce a reasonably comprehensive analysis and summary, which easily leads to visual fatigue. Advances in natural language processing make it possible for machines to assist human reading. The text-assisted reading system for public opinion analysis proposed by the present invention attempts to address the following difficulties:
First, public opinion news usually comes from web pages, but the page structures and character encodings of different news sites are inconsistent, and page structures change over time, so extracting content is difficult.
Second, public opinion news and editorials generally involve important information such as persons, regions, and organizations. Because Chinese has neither the spaces between words nor the capitalized entity initials found in English, it is hard to focus on these elements when reading large amounts of text.
Third, public opinion news is voluminous, so reading is inefficient and the gist of an article is hard to grasp.
Summary of the Invention
To solve the problems of difficult content extraction and low efficiency in existing public opinion analysis techniques, the present invention provides a text-assisted reading method for public opinion analysis.
The method is implemented through the following technical solution:
Step 1: extract the main text from web pages of various types;
Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;
Step 3: automatically summarize the text, then rank the sentences by importance and highlight them.
The most prominent features and significant beneficial effects of the present invention are as follows:
The method processes a web page through main-text extraction, named entity recognition, and automatic summarization in sequence, then renders the results onto the original page and the "reading mode" page and refreshes the display. Readers can grasp the key points of an entire article by reading only the highlighted sentences, which saves reading time and at least doubles the efficiency of public opinion analysis. Readers who want to understand the relationships among the actors in a news event can do so quickly with the help of the highlighted entity words. Because all processing is layered on the original text, users can still read the original quickly and conveniently.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the stacked-HMM named entity recognition model of the present invention;
Fig. 3 is a schematic diagram of the BiLSTM+CRF named entity recognition model of the present invention, in which human denotes a person, space denotes a place, and institute denotes an organization;
Fig. 4 is a schematic diagram of the CBOW principle, in which w(t) denotes the word at the current position;
Fig. 5 is a schematic diagram of the Skip-Gram principle;
Fig. 6 shows the original web page in the example of the present invention;
Fig. 7 shows the original web page after the text-assisted reading function is turned on in the example of the present invention;
Fig. 8 shows the "reading mode" page after the text-assisted reading function is turned on in the example of the present invention.
Detailed Description of the Embodiments
Embodiment 1: This embodiment is described with reference to Fig. 1. The text-assisted reading method for public opinion analysis provided by this embodiment comprises the following steps:
Step 1: extract the main text from web pages of various types using a unified approach;
In practice, the HTML structure of pages on major news sites is complex and inconsistent. Viewed from the outside, the pages contain distractions such as layout elements, pictures, special effects, and advertisements; viewed from the code, they contain intricately nested tags, inconsistent character encodings, and similar problems. The purpose of this step is to extract the effective text paragraphs from the page, remove the distractions, and restore the page to a clean and concise "reading mode", which also simplifies the subsequent natural language processing.
Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;
The purpose of this step is to extract the important entity elements from unstructured text and present them in a form that matches visual attention. News articles and editorials usually revolve around an event, and an event has basic elements such as time, place, and persons (or organizations). In English, proper nouns are written with capitalized initials, and the spaces between words make their boundaries clearer. Chinese has neither feature, so readers facing large amounts of text find it hard to focus on the important elements of a sentence. Named entity recognition combined with appropriate highlighting can alleviate this problem.
Step 3: automatically summarize the text, then rank the sentences by importance and highlight them;
The purpose of this step is to mine the important sentences from unstructured text and present them visually on the original text to help the reader read quickly.
Embodiment 2: This embodiment differs from Embodiment 1 in that the main-text extraction in step 1 is carried out by DOM tree parsing. The procedure comprises the following steps:
Step 1.1: obtain the HTML (Hyper Text Markup Language) of the original web page and detect its encoding; if the encoding is not UTF-8 (many Chinese web pages are encoded in GB2312), convert it to UTF-8;
Step 1.2: predefine several groups of regular expressions and assign the web page tags to groups; the main groups are the "up-weighting group" (tags likely to contain main text, such as body, article, content), the "down-weighting group" (tags unlikely to contain main text, such as footnote, media, meta), and other groups;
Step 1.3: build a DOM tree from the HTML, where DOM is the Document Object Model;
Step 1.4: delete the elements in the DOM that hold non-text content;
Step 1.5: traverse all elements in the DOM; if an element is a <div> tag, recursively traverse all elements nested inside it; by adding to or subtracting from the weights according to the groups, recombine and assemble the content of the page body.
The other steps and parameters are the same as in Embodiment 1.
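Steps 1.1 to 1.5 can be illustrated with a minimal sketch that uses only the Python standard library. It is not the patented implementation: the regular expressions, the weights of plus or minus 25, the list of skipped tags, and the names `decode_html`, `BodyScorer`, and `extract_main_text` are all assumptions chosen for illustration. Encoding detection is reduced to a UTF-8 attempt with a GB18030 fallback, and text is credited only to its innermost container rather than being propagated recursively.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag groups; the patterns and weights are illustrative only.
UPWEIGHT = re.compile(r"article|body|content|main|text", re.I)
DOWNWEIGHT = re.compile(r"footnote|footer|media|meta|sidebar|advert", re.I)
SKIP_TAGS = {"script", "style", "noscript", "iframe", "form"}  # non-text elements
CONTAINERS = {"div", "article", "section"}


def decode_html(raw: bytes) -> str:
    """Step 1.1 (simplified): try UTF-8, then fall back to GB18030."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # GB18030 is a superset of GB2312, which many Chinese pages use.
        return raw.decode("gb18030", errors="replace")


class BodyScorer(HTMLParser):
    """Steps 1.2 to 1.5 (simplified): score containers by group and text length."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_nodes = []   # (weight, text parts) of containers still open
        self.skip_depth = 0    # > 0 while inside a non-text element
        self.candidates = []   # (score, text) of every closed container

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in CONTAINERS and not self.skip_depth:
            # Match the groups against the tag name plus its id and class values.
            hint = " ".join([tag] + [v for k, v in attrs if k in ("id", "class") and v])
            weight = 25 * bool(UPWEIGHT.search(hint)) - 25 * bool(DOWNWEIGHT.search(hint))
            self.open_nodes.append((weight, []))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth or not self.open_nodes:
            return
        elif tag == "p":
            self.open_nodes[-1][1].append("\n")  # keep paragraph breaks
        elif tag in CONTAINERS:
            weight, parts = self.open_nodes.pop()
            text = "".join(parts).strip()
            self.candidates.append((weight + len(text) / 10, text))

    def handle_data(self, data):
        if not self.skip_depth and self.open_nodes:
            self.open_nodes[-1][1].append(data)


def extract_main_text(raw: bytes) -> str:
    """Return the text of the best-scoring container, or "" if none was found."""
    parser = BodyScorer()
    parser.feed(decode_html(raw))
    parser.close()
    if not parser.candidates:
        return ""
    return max(parser.candidates, key=lambda c: c[0])[1]
```

Calling `extract_main_text(raw_bytes)` on a downloaded page returns the text of the container with the highest score. A production system would need real encoding detection and a proper DOM library; both are outside the scope of this sketch.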
Embodiment 3: This embodiment differs from Embodiment 2 in that the named entity recognition in step 2 is generally treated as a sequence labeling task. It can use a traditional machine learning algorithm, which is slightly less accurate but faster, or a deep learning algorithm, which is slightly more accurate but slower; the choice depends on actual needs. The traditional machine learning algorithms include a named entity recognition model based on stacked HMMs and a named entity recognition model based on CRF, where HMM denotes the hidden Markov model and CRF denotes the conditional random field. The deep learning algorithms include a named entity recognition model based on BiLSTM and a named entity recognition model based on BiLSTM+CRF, where BiLSTM denotes the bidirectional long short-term memory network.
The other steps and parameters are the same as in Embodiment 2.
Embodiment 4: This embodiment differs from Embodiment 3 as follows. Although the HMM is the simplest sequence labeling model, it is fast and highly extensible. Because production environments demand fine-grained named entity recognition, multi-layer stacked HMMs remain one of the widely used methods.
In the stacked HMM of the stacked-HMM named entity recognition model:
The layer-1 HMM performs word segmentation; because of the training corpus, this layer tends to split obscure place names (such as county-level and township-level names) and person names apart.
The layer-2 HMM coarsely recognizes place names and person names on top of layer 1; pattern matching is then applied to its output to produce a second, automatically annotated corpus, which serves as the training corpus for the next layer;
The layer-3 HMM is trained on the second automatically annotated corpus to recognize place names and person names finely;
The layer-4 HMM recognizes organization names. Organization names are somewhat more complex because they generally contain place names (and a few even contain person names). After place names and person names have been recognized, pattern matching is applied again for another round of automatic annotation, and the result is used as the training corpus for recognizing organization names.
Therefore, recognizing person names and place names finely requires 3 HMM layers, and recognizing organization names reasonably finely requires 4 HMM layers. The training process is shown in Fig. 2 (layers 2 and 3 are merged into one layer in the figure). The labels in the figure for the position of a character within an entity are: B for the beginning of an entity (Begin), I for the inside of an entity (Inside), E for the end of an entity (End), S for a single-character entity (Single), and O for a non-entity character (Outside). The labels in the figure for the entity type are: Nh for a person name, Ns for a place name, and Ni for an organization name. Position and type labels can be combined. For example, in "Jinan District Procuratorate", "Jinan District" is labeled "B-Ni", meaning it is the beginning of an organization name, and "Procuratorate" is labeled "E-Ni", meaning it is the end of an organization name.
Training with a CRF is similar; once the corresponding feature templates have been defined, training can begin.
The other steps and parameters are the same as in Embodiment 3.
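The combined position and type labels described above can be made concrete with a small helper. This is only a sketch of the labeling scheme, not part of the stacked-HMM training itself; the function name `bioes_tags` and the span format `(start, end_exclusive, type)` are assumptions introduced for illustration.

```python
def bioes_tags(tokens, entities):
    """Label each token with a combined position-type tag.

    tokens:   list of tokens (characters or words)
    entities: list of (start, end_exclusive, entity_type) spans over tokens
    """
    tags = ["O"] * len(tokens)  # O: outside any entity
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S-{etype}"        # S: single-token entity
        else:
            tags[start] = f"B-{etype}"        # B: beginning of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"        # I: inside the entity
            tags[end - 1] = f"E-{etype}"      # E: end of the entity
    return tags


tokens = ["Jinan District", "Procuratorate", "issued", "a", "notice"]
print(bioes_tags(tokens, [(0, 2, "Ni")]))
```

For the organization name in the example sentence, the first token receives "B-Ni" and the second "E-Ni", matching the labels described for Fig. 2; all remaining tokens are labeled "O".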
Embodiment 5: This embodiment differs from Embodiment 3 in that the BiLSTM-based named entity recognition model, shown in Fig. 3, mainly comprises the following 3 layers:
Layer 1 is the embedding layer, which converts the characters of a sentence into character vectors; the character vectors can be randomly initialized and then updated during training, or pretrained vectors available online can be used;
Layer 2 is the BiLSTM layer. The parameters in each LSTM (long short-term memory network) unit are first randomly initialized, then the character vectors are fed into the LSTM units one time step at a time for recurrent computation. The BiLSTM runs the LSTM once in the forward direction (forward LSTM) and once in the backward direction (backward LSTM); the two results are concatenated, regularized, and sent to the output layer;
Layer 3 is the output layer, a simple softmax (normalized exponential function) output layer. Note that the labels for place names, person names, and organization names can contain one another. If the containment relations matter, two output branches can be used here and trained simultaneously; if they do not, the contained entities can be ignored and a single output used.
The other steps and parameters are the same as in Embodiment 3 or 4.
Embodiment 6: This embodiment differs from Embodiment 5 in that the BiLSTM+CRF named entity recognition model adds a CRF layer between the BiLSTM layer and the output layer of the BiLSTM-based model. The data passed on by the BiLSTM layer go through the CRF layer, which computes the final labels and sends them to the output layer.
The other steps and parameters are the same as in Embodiment 5.
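One way to see what a CRF layer adds is to look at decoding. A per-token softmax picks each label independently, whereas a CRF also scores label-to-label transitions and picks the best whole sequence, typically with the Viterbi algorithm. The sketch below shows that decoding step in plain Python. The emission and transition scores are made-up numbers, not trained values, and the function name `viterbi_decode` is an assumption; a real CRF layer would also learn start and end transitions.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][i]:   score of label i at time step t (e.g. BiLSTM output)
    transitions[i][j]: score of moving from label i to label j
    """
    n = len(labels)
    score = list(emissions[0])   # best score of any path ending in each label
    backptr = []                 # backptr[t][j]: best previous label for j
    for emit in emissions[1:]:
        prev_best, new_score = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            prev_best.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        backptr.append(prev_best)
        score = new_score
    # Follow the back-pointers from the best final label.
    last = max(range(n), key=lambda i: score[i])
    path = [last]
    for prev_best in reversed(backptr):
        last = prev_best[last]
        path.append(last)
    return [labels[i] for i in reversed(path)]


labels = ["O", "B-Ni", "E-Ni"]
BAD = -10.0  # hypothetical penalty for transitions that break the tag scheme
transitions = [
    [0.0, 0.0, BAD],   # O    -> O, B-Ni, E-Ni
    [BAD, BAD, 0.0],   # B-Ni -> O, B-Ni, E-Ni
    [0.0, 0.0, BAD],   # E-Ni -> O, B-Ni, E-Ni
]
emissions = [[1.0, 0.9, 0.0], [0.0, 0.0, 1.0]]
print(viterbi_decode(emissions, transitions, labels))
```

With these scores, choosing the best label per token would give "O" followed by "E-Ni", which is not a valid tag sequence. The transition penalties make the decoder prefer "B-Ni" followed by "E-Ni" instead.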
Embodiment 7: This embodiment differs from Embodiment 1 in that the highlighting in steps 2 and 3 can be applied either to the original web page or to the "reading mode" page (that is, the page produced by step 1, which contains only the main text and none of the distractions such as pictures, special effects, and advertisements). The highlighting is produced by adding effects to the corresponding words and sentences in the HTML.
Highlighting person names and place names:
As noted earlier, recognizing named entities in Chinese text is harder than in English, mainly because Chinese has neither word separators nor capitalized initials. In fact, Chinese punctuation includes a mark called the "proper name mark" ("_") that addresses this problem. It was specified in the 1919 "Proposal for the Promulgation of New-Style Punctuation Marks", and the "Use of Punctuation Marks" standard issued by the State Bureau of Technical Supervision in 1995 recommends it for classical texts to mark person names, place names, and dynasty names.
The present invention follows this usage: named entities of the person-name and place-name classes are marked with the proper name mark and additionally presented in bold and in color.
Highlighting organization names:
Organization names could use the proper name mark in the same way as person names and place names. However, organization names may contain place names, so reusing the mark would produce unbroken runs of underlining. In addition, organization names appear in articles far less often than person names and place names yet are equally important for public opinion analysis, so a more prominent symbol is preferable. In this embodiment, organization names are presented with a surrounding box.
Highlighting the summary:
A background color is set on the sentences of the original text that automatic summarization selects, and the importance score of each sentence is mapped to the brightness of the background color to reflect how important the sentence is.
The other steps and parameters are the same as in Embodiment 1.
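The three kinds of highlighting can be sketched as HTML generation with inline CSS. The colors, the CSS properties, the lightness range of 95% to 60%, and the function names below are assumptions chosen for illustration; the patent only specifies an underline-style proper name mark with bold and color for person and place names, a box for organization names, and a background whose brightness reflects sentence importance. In this sketch a higher score gives a darker background.

```python
from html import escape

# Hypothetical inline styles keyed by entity type (Nh: person, Ns: place, Ni: organization).
ENTITY_STYLES = {
    "Nh": "text-decoration: underline; font-weight: bold; color: #1a5fb4",
    "Ns": "text-decoration: underline; font-weight: bold; color: #26a269",
    "Ni": "border: 1px solid #c01c28; padding: 0 2px",
}


def highlight_entity(text, etype):
    """Wrap an entity word in a span styled according to its type."""
    return f'<span style="{ENTITY_STYLES[etype]}">{escape(text)}</span>'


def highlight_sentence(sentence_html, score, max_score):
    """Wrap a summary sentence in a span whose background reflects its importance.

    sentence_html is assumed to be already-escaped HTML (it may contain entity spans).
    """
    ratio = score / max_score if max_score > 0 else 0.0
    ratio = min(max(ratio, 0.0), 1.0)
    lightness = round(95 - 35 * ratio)  # 95% for the least, 60% for the most important
    return (
        f'<span style="background-color: hsl(50, 100%, {lightness}%)">'
        f"{sentence_html}</span>"
    )


sentence = highlight_entity("Jinan District Procuratorate", "Ni") + " issued a notice."
print(highlight_sentence(sentence, score=1.4, max_score=1.4))
```

Because the output is ordinary HTML, the same spans can be injected into either the original page or the "reading mode" page.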
Embodiment 8: This embodiment differs from Embodiment 1 in that, because the summary results must be displayed on the original text (for example, by highlighting), the automatic summarization in step 3 uses an unsupervised extractive summarization algorithm. Unsupervised extractive summarization algorithms include graph-based mining algorithms (such as TextRank) and clustering-based algorithms.
The other steps and parameters are the same as in Embodiment 1.
Embodiment 9: This embodiment differs from Embodiments 1 to 8 in that the automatic summarization in step 3 uses the TextRank algorithm, which belongs to the graph-based mining algorithms. TextRank builds a graph representation of the text and then mines the graph to find the key nodes (that is, the important sentences). It comprises the following steps:
First, split the document into sentences and represent each sentence as a vector;
Then, compute the sentence similarity matrix: for the vectors of any two sentences in the text, compute their similarity with the cosine formula and collect the results into a similarity matrix. The whole text can then be regarded as an undirected weighted connected graph G whose nodes are the sentences and whose edges are the similarities between sentences;
Finally, mine the important nodes of G with the PageRank algorithm. The formula is:
$$WS(V_t) = (1 - c) + c \sum_{V_j \in In(V_t)} \frac{w_{jt}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
where c is the damping coefficient, which can generally be set to 0.85; V_t is the t-th node in graph G (a node is a sentence of the text); In(V_t) is the set of nodes that point to node V_t; Out(V_j) is the set of nodes that node V_j points to; and w_{jt} is the weight of the edge between node V_j and node V_t. WS(V_t) on the left side of the formula is the weight sum (Weight Sum) of node V_t, and the summation on the right side is the contribution of each adjacent node to this node;
The formula is applied iteratively to all nodes in the graph until all weights stabilize. Finally, the N nodes with the highest weight sums are selected, and their N corresponding sentences are output as the summary.
The other steps and parameters are the same as in Embodiments 1 to 8.
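The iteration described above can be sketched in plain Python as follows. The sketch assumes the sentence vectors are already available and have non-negative components, so that cosine similarities are non-negative; the function names, the convergence tolerance, and the iteration cap are assumptions introduced for illustration. Because the graph is undirected, In(V_t) and Out(V_t) are both simply the neighbors of V_t.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def textrank(vectors, c=0.85, tol=1e-6, max_iter=200):
    """Return the weight sum WS of every sentence node."""
    n = len(vectors)
    if n == 0:
        return []
    # Similarity matrix of the undirected weighted graph, without self-loops.
    w = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out_sum = [sum(row) for row in w]  # total edge weight leaving each node
    ws = [1.0] * n
    for _ in range(max_iter):
        new_ws = [
            (1 - c) + c * sum(w[j][t] / out_sum[j] * ws[j]
                              for j in range(n) if out_sum[j] > 0)
            for t in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new_ws, ws)) < tol
        ws = new_ws
        if converged:
            break
    return ws


def top_sentences(sentences, vectors, n_top):
    """Pick the n_top highest-scoring sentences, returned in document order."""
    scores = textrank(vectors)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_top]
    return [(sentences[i], scores[i]) for i in sorted(best)]
```

Returning the selected sentences in document order, together with their scores, suits the highlighting use case: the scores can be mapped directly to background brightness on the original text.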
Embodiment 10: This embodiment differs from Embodiment 9 in that there are many ways to represent a sentence as a vector and the method is optional; the BM25 algorithm (Best Match) or an algorithm based on distributed representations can be used;
The BM25 algorithm is commonly used to score search relevance. Its formula is:
$$Score(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$
where Q is the query string (Query); q_i is the i-th word in the query string Q (for Chinese, each word obtained after word segmentation), i = 1 ... n; n is the number of words in the query string Q; d is a search result document; W_i is the weight of q_i; and R(q_i, d) is the relevance score of q_i with respect to the search result document d. Both W_i and R(q_i, d) can be designed freely. A typical design is W_i = IDF(q_i), where IDF() is the inverse document frequency; R(q_i, d) is more flexible and expresses the relevance of the word q_i to the document d;
Algorithms based on distributed representations generally first obtain word vectors by pretraining on a corpus and then average the word vectors of the words in a sentence to obtain the sentence vector. Methods for pretraining word vectors on a corpus include Skip-Gram and CBOW, whose principles are shown in Fig. 4 and Fig. 5:
CBOW (Fig. 4) predicts the center word from the context words, and Skip-Gram (Fig. 5) predicts the context words from the center word. The distributed word representations obtained by this training capture the semantic information of the words. Finally, the sentence vector is built from the word vectors by removing the stop words and taking a simple average.
The other steps and parameters are the same as in Embodiment 9.
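Both options can be sketched in a few lines of Python. For BM25, the text leaves W_i and R(q_i, d) open; the sketch uses one common design, a smoothed IDF for W_i and the usual term-frequency saturation with the parameters k1 and b for R(q_i, d), and these choices, like the function names, are assumptions rather than something the text prescribes. In a summarization setting, one plausible use is to treat each sentence in turn as the query and the other sentences as the documents. For the distributed representation, the word vectors are assumed to have been pretrained elsewhere (for example with Skip-Gram or CBOW) and are passed in as a plain dictionary.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in docs against the tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(word for d in docs for word in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            if q not in tf:
                continue
            # W_i: smoothed inverse document frequency (one common choice).
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5) + 1.0)
            # R(q_i, d): term frequency saturated by k1 and normalized by length via b.
            r = tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * r
        scores.append(score)
    return scores


def sentence_vector(words, word_vectors, stop_words=frozenset()):
    """Average the vectors of the non-stop words that have a vector."""
    kept = [word_vectors[w] for w in words
            if w not in stop_words and w in word_vectors]
    if not kept:
        return []
    dim = len(kept[0])
    return [sum(v[k] for v in kept) / len(kept) for k in range(dim)]
```

The averaged vectors returned by `sentence_vector` can be fed to a cosine similarity to build the sentence graph, while the BM25 scores can serve as edge weights directly.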
Example
The beneficial effects of the present invention are verified with the following example:
This example follows the flow shown in Fig. 1. An assisted reading system for public opinion analysis is built from two modules, a front-end plug-in and a back-end algorithm module. The front-end module can optionally be installed as a browser plug-in; it downloads the raw HTML of the current page, sends it to the back end, and renders the results that the back end returns. The back-end algorithm module contains the natural language processing algorithms: main-text extraction, named entity recognition, and automatic summarization.
A news web page is opened, as in Fig. 6; a news commentary published on "People's Net" on March 5 is used here as the example.
When the front-end plug-in detects the start command, it sends the raw HTML of the current page to the back end. The back end passes the page through each algorithm in turn: main-text extraction, named entity recognition, and automatic summarization. The results are used to render the original page and the "reading mode" page visually, and the rendered pages are returned to the front end, which refreshes the display.
The effect of turning on the text-assisted reading function on the original page is shown in Fig. 7. The original page contains distractions such as layout elements, pictures, special effects, and advertisements; it can be processed into a concise "reading mode", whose effect is shown in Fig. 8. Named entities of the person-name and place-name classes are marked with the proper name mark and presented in bold and in different colors; organization names are highlighted with a surrounding box. Highlighted words indicate entity words, and the highlighted background indicates the importance of a sentence. Readers can grasp the key points of the whole article by reading only the sentences with a colored background, which improves the efficiency of public opinion analysis; readers who want to understand the relationships among the actors in a news event can do so quickly with the help of the highlighted entity words. Because the visualization is layered on the original text, reading the full text in detail remains convenient.
The present invention can also have various other embodiments. Without departing from the spirit and substance of the present invention, those skilled in the art can make various corresponding changes and modifications in accordance with the present invention, and all such changes and modifications shall fall within the scope of protection of the appended claims.
Claims (10)
1. A text-assisted reading method for public opinion analysis, characterized in that it comprises the following steps:
Step 1: extract the main text from web pages of various types;
Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;
Step 3: automatically summarize the text, then rank the sentences by importance and highlight them.
2. The text-assisted reading method for public opinion analysis according to claim 1, characterized in that the main-text extraction in step 1 is carried out by DOM tree parsing, the procedure comprising the following steps:
Step 1.1: obtain the HTML of the original web page and detect its encoding; if the encoding is not UTF-8, convert it to UTF-8;
Step 1.2: assign the web page tags to groups;
Step 1.3: build a DOM tree from the HTML, where DOM is the Document Object Model;
Step 1.4: delete the elements in the DOM that hold non-text content;
Step 1.5: traverse all elements in the DOM; if an element is a <div> tag, recursively traverse all elements nested inside it; by adding to or subtracting from the weights according to the groups, recombine and assemble the content of the page body.
3. The text-assisted reading method for public opinion analysis according to claim 2, characterized in that the named entity recognition in step 2 can use a traditional machine learning algorithm or a deep learning algorithm; the traditional machine learning algorithms include a named entity recognition model based on stacked HMMs and a named entity recognition model based on CRF, where HMM denotes the hidden Markov model and CRF denotes the conditional random field; the deep learning algorithms include a named entity recognition model based on BiLSTM and a named entity recognition model based on BiLSTM+CRF, where BiLSTM denotes the bidirectional long short-term memory network.
4. The text-assisted reading method for public opinion analysis according to claim 3, characterized in that, in the stacked HMM of the stacked-HMM named entity recognition model:
the layer-1 HMM performs word segmentation;
the layer-2 HMM coarsely recognizes place names and person names on top of layer 1, and pattern matching is then applied to its output to produce a second, automatically annotated corpus;
the layer-3 HMM is trained on the second automatically annotated corpus to recognize place names and person names finely;
the layer-4 HMM recognizes organization names.
5. The text-assisted reading method for public opinion analysis according to claim 3, characterized in that the BiLSTM-based named entity recognition model comprises the following 3 layers:
layer 1 is the embedding layer, which converts the characters of a sentence into character vectors; the character vectors can be randomly initialized and then updated during training, or pretrained vectors available online can be used;
layer 2 is the BiLSTM layer; the parameters in each LSTM unit are first randomly initialized, then the character vectors are fed into the LSTM units one time step at a time for recurrent computation; the LSTM is run once in the forward direction and once in the backward direction; the two results are concatenated, regularized, and sent to the output layer;
layer 3 is the output layer, which is a softmax output layer.
6. The text-assisted reading method for public opinion analysis according to claim 5, characterized in that the BiLSTM+CRF named entity recognition model adds a CRF layer between the BiLSTM layer and the output layer of the BiLSTM-based named entity recognition model; the data passed on by the BiLSTM layer go through the CRF layer, which computes the final labels and sends them to the output layer.
7. The text-assisted reading method for public opinion analysis according to claim 1, characterized in that the highlighting in steps 2 and 3 can be applied either to the original web page or to the "reading mode" page, and is produced by adding effects to the corresponding words and sentences in the HTML.
8. The text-assisted reading method for public opinion analysis according to claim 1, characterized in that the automatic summarization in step 3 uses an unsupervised extractive summarization algorithm; unsupervised extractive summarization algorithms include graph-based mining algorithms and clustering-based algorithms.
9. the text aid reading method described in any one towards the analysis of public opinion according to claim 1~8, which is characterized in that
The autoabstract is carried out using the TextRank algorithm belonged in the mining algorithm based on figure in step 3;The TextRank
Algorithm be it is a kind of text be built into figure indicate, then excavated using figure to find the algorithm of key node;It specifically includes as follows
Step:
Firstly, document is first divided into sentence, and by sentence expression at vector form;
Then, it calculates sentence similarity matrix: to the vector of two sentences any in text, being calculated using cosine formula similar
Degree, is aggregated into similarity matrix;Entire text can be considered as using sentence between node, sentence similarity for while it is undirected have the right while
Connected graph G;
Finally, important-node mining is performed on G using the PageRank algorithm; the calculation formula is as follows:
$$WS(V_t) = (1 - c) + c \sum_{V_j \in In(V_t)} \frac{w_{jt}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
where $c$ denotes the damping coefficient, $V_t$ denotes the t-th node in graph G, $In(V_t)$ denotes the set of
nodes pointing to node $V_t$, $Out(V_j)$ denotes the set of nodes that node $V_j$ points to, $w_{jt}$ denotes the
weight of the edge connecting node $V_t$ and node $V_j$, and $WS(V_t)$ denotes the weight sum of node $V_t$; the
summation term on the right-hand side represents the contribution of each adjacent node to this node;
The above formula is applied iteratively to update all nodes in the graph until all weights stabilize; finally,
the N nodes with the highest weight sums are selected, and their corresponding N sentences are output as the summary.
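The steps of claim 9 can be sketched as follows. This is a minimal sketch, not the applicant's implementation: the function names `cosine` and `textrank`, the default damping coefficient 0.85, the convergence tolerance, and the iteration cap are assumptions, and the sentence vectors are assumed to be non-negative (for example bag-of-words counts) so that edge weights are non-negative.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def textrank(vectors: Sequence[Sequence[float]], top_n: int,
             c: float = 0.85, tol: float = 1e-6, max_iter: int = 200) -> List[int]:
    """Return the indices of the top_n sentences, in original document order."""
    n = len(vectors)
    if n == 0:
        return []
    # Similarity matrix: w[j][t] is the weight of the edge between sentences j and t.
    w = [[cosine(vectors[j], vectors[t]) if j != t else 0.0 for t in range(n)]
         for j in range(n)]
    out_sum = [sum(row) for row in w]  # total edge weight leaving each node
    ws = [1.0] * n                     # initial weight sum of every node
    for _ in range(max_iter):
        new_ws = []
        for t in range(n):
            # Contribution of every neighbour j, normalised by j's total edge weight.
            contrib = sum(w[j][t] / out_sum[j] * ws[j]
                          for j in range(n) if out_sum[j] > 0)
            new_ws.append((1 - c) + c * contrib)
        converged = max(abs(a - b) for a, b in zip(new_ws, ws)) < tol
        ws = new_ws
        if converged:
            break
    top = sorted(range(n), key=lambda i: ws[i], reverse=True)[:top_n]
    return sorted(top)
```

Returning the selected indices in document order keeps the extracted summary readable; a caller would map them back to the original sentences.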
10. The text-aided reading method for public opinion analysis according to claim 9, characterized in that the
representation of sentences as vectors can use the BM25 algorithm or a learning algorithm based on distributed
representations;
The calculation formula of the BM25 algorithm is as follows:
$$Score(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$
where $Q$ denotes the query string; $q_i$ denotes the i-th word in $Q$, $i = 1, \dots, n$; $n$ is the number of
words in $Q$; $d$ denotes a search result document; $W_i$ denotes the weight of $q_i$; $R(q_i, d)$ denotes the
relevance score of $q_i$ with respect to the search result document $d$;
The learning algorithm based on distributed representations first obtains word vectors by pre-training on a
corpus, and then averages the word vectors of the words in a sentence to obtain the sentence vector.
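Both options in claim 10 can be sketched briefly. The claim leaves $W_i$ and $R(q_i, d)$ unspecified, so the code below assumes the common Okapi BM25 choices: an IDF-style weight (in its non-negative `log(1 + ...)` variant) and a term-frequency saturation term with parameters `k1` and `b`. The function names, the parameter defaults, and the `embeddings` dictionary are hypothetical; in practice the word vectors would come from a pre-trained model.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bm25_score(query: Sequence[str], doc: Sequence[str],
               corpus: Sequence[Sequence[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Score(Q, d) = sum of W_i * R(q_i, d) over the words q_i of the query."""
    n_docs = len(corpus)
    total_len = sum(len(d) for d in corpus)
    if n_docs == 0 or total_len == 0:
        return 0.0
    avg_len = total_len / n_docs
    tf = Counter(doc)
    score = 0.0
    for q in query:
        df = sum(1 for d in corpus if q in d)  # number of documents containing q
        w = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # assumed W_i (IDF)
        f = tf[q]                                           # frequency of q in doc
        r = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_len))  # assumed R
        score += w * r
    return score


def average_vector(words: Sequence[str],
                   embeddings: Dict[str, Sequence[float]], dim: int) -> List[float]:
    """Average the word vectors of the known words; zero vector if none are known."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

For summarization, each sentence can play the role of the document `d` while another sentence plays the role of the query `Q`, so that `bm25_score` yields a sentence-to-sentence relevance value; `average_vector` instead yields a dense sentence vector suitable for cosine similarity.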
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910621253.3A CN110334300A (en) | 2019-07-10 | 2019-07-10 | Text aid reading method towards the analysis of public opinion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110334300A true CN110334300A (en) | 2019-10-15 |
Family
ID=68145988
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910621253.3A Pending CN110334300A (en) | 2019-07-10 | 2019-07-10 | Text aid reading method towards the analysis of public opinion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110334300A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160019A (en) * | 2019-12-30 | 2020-05-15 | 中国联合网络通信集团有限公司 | Public opinion monitoring method, device and system |
CN112989811A (en) * | 2021-03-01 | 2021-06-18 | 哈尔滨工业大学 | BilSTM-CRF-based historical book reading auxiliary system and control method thereof |
CN113297826A (en) * | 2020-06-28 | 2021-08-24 | 上海交通大学 | Method for marking on natural language text |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102541874A (en) * | 2010-12-16 | 2012-07-04 | 中国移动通信集团公司 | Webpage text content extracting method and device |
CN109753660A (en) * | 2019-01-07 | 2019-05-14 | 福州大学 | A kind of acceptance of the bid webpage name entity abstracting method based on LSTM |
CN109800386A (en) * | 2017-11-17 | 2019-05-24 | 奥多比公司 | Highlight the key component of text in document |
Non-Patent Citations (3)
Title |
---|
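裴大帅2021: "NLP机构名识别中的层叠式HMM架构" [Cascaded HMM architecture for organization name recognition in NLP], 《新浪博客》 [Sina Blog] * |
郭正斌: "面向社会安全事件的知识图谱构建方法研究" [Research on knowledge graph construction methods for social security events], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] * |
陈老师或波哥: "网页内容高亮的实现" [Implementing web page content highlighting], 《简书》 [Jianshu] * |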
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160019A (en) * | 2019-12-30 | 2020-05-15 | 中国联合网络通信集团有限公司 | Public opinion monitoring method, device and system |
CN111160019B (en) * | 2019-12-30 | 2023-08-15 | 中国联合网络通信集团有限公司 | Public opinion monitoring method, device and system |
CN113297826A (en) * | 2020-06-28 | 2021-08-24 | 上海交通大学 | Method for marking on natural language text |
CN112989811A (en) * | 2021-03-01 | 2021-06-18 | 哈尔滨工业大学 | BilSTM-CRF-based historical book reading auxiliary system and control method thereof |
CN112989811B (en) * | 2021-03-01 | 2022-09-09 | 哈尔滨工业大学 | History book reading auxiliary system based on BiLSTM-CRF and control method thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Sentiment analysis of Chinese micro-blog text based on extended sentiment dictionary | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
WO2021114745A1 (en) | Named entity recognition method employing affix perception for use in social media | |
CN103049435B (en) | Text fine granularity sentiment analysis method and device | |
CN109857990A (en) | A kind of financial class notice information abstracting method based on file structure and deep learning | |
CN110020189A (en) | A kind of article recommended method based on Chinese Similarity measures | |
CN110334300A (en) | Text aid reading method towards the analysis of public opinion | |
CN103678412A (en) | Document retrieval method and device | |
CN111444704B (en) | Network safety keyword extraction method based on deep neural network | |
CN109086355A (en) | Hot spot association relationship analysis method and system based on theme of news word | |
Wang et al. | Learning morpheme representation for mongolian named entity recognition | |
Tohidi et al. | A Practice of Human-Machine Collaboration for Persian Text Summarization | |
Liu et al. | A parallel computing-based deep attention model for named entity recognition | |
Xu et al. | ALSEE: a framework for attribute-level sentiment element extraction towards product reviews | |
Le-Hong | Diacritics generation and application in hate speech detection on Vietnamese social networks | |
CN114265936A (en) | Method for realizing text mining of science and technology project | |
CN108595466B (en) | Internet information filtering and internet user information and network card structure analysis method | |
Feifei et al. | Bert-based Siamese network for semantic similarity | |
Nasim et al. | Evaluation of clustering techniques on Urdu News head-lines: A case of short length text | |
Behere et al. | Text summarization and classification of conversation data between service chatbot and customer | |
Mohnot et al. | Hybrid approach for Part of Speech Tagger for Hindi language | |
Hua et al. | A character-level method for text classification | |
CN116049437A (en) | Element extraction method of document-level low-resource scene based on self-label and prompt | |
Jiang et al. | A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems | |
CN112445887A (en) | Method and device for realizing machine reading understanding system based on retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191015 |