CN108897749A - Method for abstracting web page information and system based on syntax tree and text block density - Google Patents
- Publication number
- CN108897749A (application CN201810355382.8A)
- Authority
- CN
- China
- Prior art keywords
- node
- text
- tree
- information
- title
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a web page information extraction method based on syntax trees and text block density, comprising: obtaining the title text information of a web page; setting a screening threshold, computing the text block density of all nodes of the web page, taking each node whose text block density exceeds the screening threshold as an acquisition node, and extracting the node text information of each acquisition node; if the number of acquisition nodes is 1, extracting the node text information as the target information; if the number of acquisition nodes is greater than 1, converting the title text information and the node text information into a title deep syntax tree and node deep syntax trees, respectively, each of which uniquely expresses the semantics of its sentence; and obtaining the overall similarity between each node deep syntax tree and the title deep syntax tree, and extracting as the target information the node text information corresponding to the maximum overall similarity.
Description
Technical field
The invention belongs to the field of network data acquisition, and in particular relates to a web page information extraction method and system based on syntax-tree semantic recognition and the tag text block density ratio.
Background technique
With the rapid development of information technology, more and more data is electronic, and fast, effective network data acquisition has become particularly important. Effective network data acquisition is necessary for an enterprise to analyze its market environment and customer demand, and enterprises with efficient data acquisition capability are highly competitive in the big data era. Efficient data acquisition also bears on national political security. As information technology matures, information in many forms spreads rapidly across the network, the population of Internet users keeps growing, public opinion changes quickly, and ideas and speech spread faster still. This poses new challenges for public opinion monitoring and makes efficient collection of network information particularly important; supplying network research offices with the network data they need in this situation bears on national political security.
Driven by these technical needs, network data acquisition has flourished in recent years, and acquisition techniques for different kinds of web page data keep emerging. One of the harder technical problems is the extraction of short text from a web page. When the body text of a page is short, identifying the main information becomes more difficult: compared with a page that has a long body text, a short body is harder to distinguish from junk information in the page such as advertisements (the "noise"). During web page information screening, the short body is therefore more likely to be filtered out as junk by mistake, while advertising information is wrongly extracted as the body text.
Summary of the invention
In view of the above problems, the invention proposes a web page information extraction method based on syntax trees and text block density, comprising: obtaining the title text information of a web page by regular expression; setting a screening threshold, computing the text block density of all nodes of the web page, taking each node whose text block density exceeds the screening threshold as an acquisition node, and extracting the node text information of each acquisition node; if the number of acquisition nodes is 1, extracting the node text information as the target information; if the number of acquisition nodes is greater than 1, converting the title text information and the node text information into a title syntax tree and node syntax trees, respectively, by a probabilistic context-free grammar model; converting the title syntax tree and the node syntax trees, by a synchronous tree substitution grammar, into a title deep syntax tree and node deep syntax trees that uniquely express sentence semantics; and computing the overall similarity between the title deep syntax tree and each node deep syntax tree, and extracting as the target information the node text information corresponding to the maximum overall similarity.
In the web page information extraction method of the invention, the text block density is obtained as follows:

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$

where $TBD(v)$ is the text block density of node $v$, $v.children$ is the set of child nodes of node $v$, $v_i$ is a child node of node $v$, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in the text block of child node $v_i$, $TN_{v_i}$ is the number of tags contained in the text block of child node $v_i$, and $LTN_{v_i}$ is the number of hyperlink tags contained in the text block of child node $v_i$.
In the web page information extraction method of the invention, the following steps are performed when the number of acquisition nodes is greater than 1:
extracting the title word vectors $t_i$ of the title deep syntax tree, and the text word vectors $a_i$ of each node deep syntax tree whose structure is identical to that of the title deep syntax tree; computing the word vector similarity of title word vector $t_i$ and text word vector $a_i$ as their cosine similarity,

$$S_i = \cos(t_i, a_i) = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$$

and obtaining the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$; where $0 < i \le n$, $n$ is a positive integer, and $n$ is the number of nodes of the title deep syntax tree.
The invention further relates to a web page information extraction system based on syntax trees and text block density, comprising:
a text information obtaining module, configured to obtain the title text information of a web page by regular expression and to obtain the node text information of the acquisition nodes, which includes setting a screening threshold, computing the text block density of all nodes of the web page, taking each node whose text block density exceeds the screening threshold as an acquisition node, and extracting the node text information of each acquisition node;
a first target information obtaining module, configured to obtain the target information when the number of node text information items is 1;
a second target information obtaining module, configured to obtain and extract the target information when the number of node text information items is greater than 1, wherein the title text information and the node text information are converted into a title syntax tree and node syntax trees, respectively, by a probabilistic context-free grammar model; the title syntax tree and the node syntax trees are converted by a synchronous tree substitution grammar into a title deep syntax tree and node deep syntax trees that uniquely express sentence semantics; and the overall similarity between the title deep syntax tree and each node deep syntax tree is obtained, and the node text information corresponding to the maximum overall similarity is extracted.
In the text information obtaining module of the web page information extraction system of the invention, the text block density is obtained as follows:

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$

where $TBD(v)$ is the text block density of node $v$, $v.children$ is the set of child nodes of node $v$, $v_i$ is a child node of node $v$, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in the text block of child node $v_i$, $TN_{v_i}$ is the number of tags contained in the text block of child node $v_i$, and $LTN_{v_i}$ is the number of hyperlink tags contained in the text block of child node $v_i$.
In the web page information extraction system of the invention, the second target information obtaining module specifically comprises:
a word vector obtaining module, configured to obtain the title word vectors $t_i$ of the title deep syntax tree, and the text word vectors $a_i$ of each node deep syntax tree whose structure is identical to that of the title deep syntax tree;
a similarity obtaining module, configured to compute the word vector similarity of title word vector $t_i$ and text word vector $a_i$ as their cosine similarity,

$$S_i = \cos(t_i, a_i) = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$$

and to obtain the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$; where $0 < i \le n$, $n$ is a positive integer, and $n$ is the number of nodes of the title deep syntax tree.
Brief description of the drawings
Fig. 1 is a flow chart of the web page information extraction method of an embodiment of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the invention clearer, the web page information extraction method and system based on syntax trees and text block density proposed by the invention are described in further detail below with reference to the drawing. It should be understood that the specific implementations described here only explain the invention and do not limit it.
The invention aims to overcome the deficiencies of the prior art by proposing a data acquisition method that handles both long and short web page text. The result-selection stage of the traditional data acquisition algorithm based on the tag text density ratio is adjusted: instead of selecting a single target tag, the method selects several target tags whose densities are similar, so that a short body text is no longer filtered out by mistake. The news title, which is easy to obtain with high accuracy and which summarizes the semantics of the body text, serves as the reference against which body text semantics are compared, giving semantics-based identification of the web page body a standard of comparison. The candidate texts and the article title are parsed into syntax trees in preparation for the later semantic matching. All constructed syntax trees then undergo a tree transformation that extracts their principal components, keeps the key words and structures that carry the meaning (such as subject, predicate, and object), and gives sentences with different structure but identical semantics a uniform form, in preparation for comparing text semantics. Finally, the preprocessed body text syntax trees are matched as a whole against the news title syntax tree, and semantic recognition determines which short text is the acquisition target.
Fig. 1 is a flow chart of the web page information extraction method of an embodiment of the invention. As shown in Fig. 1, the steps of the web page information extraction method of the invention are as follows:
Step S1: obtain the title text information of the web page;
Step S2: run the web page tag text density algorithm;
Step S3: set a screening threshold, screen for acquisition nodes, and extract the node text information of the acquisition nodes;
Step S4: determine the number of acquisition nodes;
Step S5: if there is only one acquisition node, extract the node text information of that acquisition node as the target information;
Step S6: if the number of acquisition nodes is greater than 1, process the title text information and the node text information to generate a title syntax tree and node syntax trees, respectively;
Step S7: normalize all syntax trees to generate a title deep syntax tree and node deep syntax trees, respectively;
Step S8: compute the overall similarity between the title deep syntax tree and each node deep syntax tree;
Step S9: select the node text information corresponding to the maximum overall similarity as the target information and extract it.
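The branch in steps S4, S5, and S9 can be illustrated with a minimal Python sketch. The function names `select_target` and `word_overlap` are hypothetical, and `word_overlap` is only a placeholder for the deep-syntax-tree similarity of steps S6 to S8, which the sketch does not implement.

```python
from typing import Callable, Optional, Sequence


def select_target(
    title: str,
    node_texts: Sequence[str],
    similarity: Callable[[str, str], float],
) -> Optional[str]:
    """Pick the target text from the texts of the acquisition nodes."""
    if not node_texts:
        return None                      # no node passed the screening threshold
    if len(node_texts) == 1:
        return node_texts[0]             # step S5: a unique candidate is the target
    # Step S9: otherwise keep the candidate most similar to the title.
    return max(node_texts, key=lambda text: similarity(title, text))


def word_overlap(a: str, b: str) -> float:
    """Placeholder similarity (Jaccard overlap of words), NOT the patent's measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0
```

Passing the similarity function as a parameter keeps the selection logic independent of how the similarity is actually computed.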
Specifically, in the embodiment of the invention, data whose position is fixed and whose form is uniform, such as the article title, author, and publication time, are first matched by regular expressions dedicated to identifying them, yielding information such as the title of the article.
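As a rough illustration of this step, the sketch below pulls the title out of raw HTML with a regular expression. The patent does not disclose its actual expressions, so the pattern `TITLE_RE` and the function `extract_title` are assumptions made for illustration only.

```python
import re
from html import unescape
from typing import Optional

# Hypothetical pattern: matches the content of the first <title> element.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> Optional[str]:
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(unescape(match.group(1)).split())
```

A production system would likely need site-specific patterns for author and publication time as well; those are omitted here.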
Next, the text block density TBD of each node is computed according to formula (1):

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}} \tag{1}$$

where TBD (text block density) is the text block density. Let $v$ be a node in the web page parse tree $T$ and let $Blk(v)$ be the text block rooted at node $v$. The text block density $TBD(v)$ of node $v$ is defined as the sum, over all child nodes of $v$, of the ratio of the number of non-link text characters to the number of non-link tags in the text block rooted at that child node.
CN (ContentNumber) is the text block character count, i.e. the number of text characters the text block contains. In general, the text under a body text block is concentrated, so its character length is large; the text under a noise text block is scattered, so its character length is small.
LCN (LinkContentNumber) is the text block hyperlink character count, i.e. the number of hyperlink characters the text block contains. A body text block contains little hyperlink text, while a noise text block contains relatively much.
TN (Tag Number) is the text block tag count, i.e. the number of tags the text block contains. A body text block holds mostly continuous text and so has few tags; a noise text block holds scattered text and so has many tags.
LTN (LinkTagNumber) is the text block hyperlink tag count, i.e. the number of hyperlink tags the text block contains. Text under hyperlink tags is mostly noise; a body text block contains few hyperlink tags, while a noise text block contains many.
After the text block density TBD of all nodes has been obtained, a screening threshold is set; every node whose TBD value exceeds the screening threshold is designated an acquisition node, and the text information in all such acquisition nodes is extracted.
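The density computation and threshold screening can be sketched as follows. This is a minimal illustration under several assumptions that the text does not state: the DOM is modelled by a hypothetical `Node` class, a hyperlink tag is any `a` element, the tag count of a block includes its root, and a zero denominator is replaced by 1 to avoid division by zero.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""                                  # text directly inside this element
    children: list["Node"] = field(default_factory=list)


def block_counts(node: Node, inside_link: bool = False):
    """Return (CN, LCN, TN, LTN) for the text block rooted at node."""
    is_link = node.tag == "a"
    in_link = inside_link or is_link
    cn = len(node.text)
    lcn = len(node.text) if in_link else 0          # characters under a hyperlink
    tn, ltn = 1, (1 if is_link else 0)              # count the root tag itself
    for child in node.children:
        c_cn, c_lcn, c_tn, c_ltn = block_counts(child, in_link)
        cn, lcn, tn, ltn = cn + c_cn, lcn + c_lcn, tn + c_tn, ltn + c_ltn
    return cn, lcn, tn, ltn


def tbd(node: Node) -> float:
    """Sum over children of (CN - LCN) / (TN - LTN), as in formula (1)."""
    total = 0.0
    for child in node.children:
        cn, lcn, tn, ltn = block_counts(child)
        total += (cn - lcn) / max(tn - ltn, 1)      # assumption: guard zero denominator
    return total


def all_nodes(root: Node):
    """Yield every node of the tree in preorder."""
    yield root
    for child in root.children:
        yield from all_nodes(child)


def acquisition_nodes(root: Node, threshold: float) -> list:
    """Nodes whose text block density exceeds the screening threshold."""
    return [n for n in all_nodes(root) if tbd(n) > threshold]
```

Note that under formula (1) a leaf node has density 0, because the sum runs over child nodes; the density of a paragraph is credited to its parent.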
If the number of acquisition nodes obtained through the screening threshold is 1, i.e. the obtained text information is unique, the node text information of that acquisition node is extracted as the target information. If the number of acquisition nodes obtained through the screening threshold is greater than 1, i.e. the obtained text information is not unique, the title text information and the node text information of the acquisition nodes are processed, and the node text information most similar to the title text information is extracted as the target information.
In the embodiment of the invention, the title text information and the node text information are preprocessed with a probabilistic context-free grammar (PCFG) model: PCFG analysis generates the title syntax tree of the title text information and the node syntax trees of the node text information. PCFG is a commonly used model for natural language syntactic analysis. Parsing with a PCFG proceeds as with a non-probabilistic context-free grammar, expanding from the nonterminal symbols, except that a probability is computed for each distinct parse tree. When a sentence is ambiguous, these probabilities decide which parse is selected, the criterion being maximum generation probability. Let $T$ be a candidate parse tree; when the sentence is ambiguous, the parse $T^*$ is selected by probability, i.e.:

$$T^* = \arg\max_{T} P(T)$$

The generation probability of a candidate parse tree $T$ is the product of the conditional probabilities of all rules required to generate $T$:

$$P(T) = \prod_{r \in T} P(r)$$

where $r$ is a rule and $P(r)$ is the probability of that rule.
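The two formulas can be checked on a toy example. The sketch below assumes a hand-written rule table `RULE_PROB` with made-up probabilities and represents each candidate parse simply as the list of rules it uses (lexical rules are left out); none of this comes from the patent.

```python
import math

# Hypothetical rule probabilities P(r); a real PCFG estimates them from a treebank.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("N",)): 0.8,
    ("PP", ("P", "NP")): 1.0,
}


def tree_probability(rules) -> float:
    """P(T): product of the probabilities of all rules used to generate T."""
    return math.prod(RULE_PROB[rule] for rule in rules)


def best_parse(candidates: dict) -> str:
    """T*: name of the candidate parse with the highest generation probability."""
    return max(candidates, key=lambda name: tree_probability(candidates[name]))


# Two parses of an ambiguous sentence such as "I saw the man with a telescope":
# the prepositional phrase attaches either to the verb phrase or to the noun phrase.
NP = ("NP", ("N",))
CANDIDATES = {
    "vp_attach": [("S", ("NP", "VP")), NP, ("VP", ("VP", "PP")),
                  ("VP", ("V", "NP")), NP, ("PP", ("P", "NP")), NP],
    "np_attach": [("S", ("NP", "VP")), NP, ("VP", ("V", "NP")),
                  ("NP", ("NP", "PP")), NP, ("PP", ("P", "NP")), NP],
}
```

With these made-up numbers the verb-phrase attachment wins (about 0.108 against about 0.072); different rule probabilities could reverse the choice.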
As a mature natural language analysis model, PCFG has some ability to resolve ambiguity and generates syntax trees with high accuracy. Because of the Markov property of the model itself, it does not consider context and is therefore insensitive to data sparsity, so its analysis results are fairly robust.
All syntax trees generated above (the title syntax tree and the node syntax trees) are then processed further. The embodiment of the invention uses a synchronous tree substitution grammar (STSG) to convert the title syntax tree and the node syntax trees into the title deep syntax tree and the node deep syntax trees, respectively.
The transformational grammar referred to here is a theory of the relation between the syntax of a sentence and its semantics. It holds that every natural language sentence has two structures, a deep structure and a surface structure. The surface structure is the text visible to the eye as recorded in a document, i.e. the actual word sequence. The deep structure differs from the surface structure and is what actually determines the meaning of the sentence. Several sentences with the same meaning but different surface structures correspond to the same deep structure.
For example: My lunch today was a hamburger.
Today I had a hamburger for lunch.
Although the two sentences differ in structure, their underlying deep structure is identical, because they express the same meaning.
STSG is a rule self-learning algorithm based on syntax trees: it learns grammar rules from syntax trees on its own. It converts the surface structure of a sentence into its deep structure, so that sentences with different syntax but the same meaning yield the same syntax tree.
The STSG basic rule extraction algorithm is as follows:
Input: a syntax tree pair <T(f), T(e), A>, where A is the word alignment between T(f) and T(e).
Initialize an empty basic rule set P.
For each node p of T(f) and each node q of T(e):
t(p) is the subtree of T(f) rooted at p;
t(q) is the subtree of T(e) rooted at q;
A(t(p), t(q)) is the part of the word alignment A that involves t(p) and t(q);
if <t(p), t(q), A(t(p), t(q))> satisfies the word alignment constraint and the syntactic constraint,
then add <t(p), t(q), A(t(p), t(q))> to the rule set P.
Output: the basic rule set P.
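The enumeration above can be sketched in Python under simplifying assumptions. The text does not define the two constraints, so the sketch takes the word alignment constraint to mean that no alignment link crosses the boundary of the subtree pair and that at least one link lies inside it, and it omits the syntactic constraint entirely. Trees are nested `(label, children)` tuples with plain strings as word leaves; all function names are hypothetical.

```python
def subtree_spans(tree):
    """Return [(label, start, end)] for every internal node, spans half-open."""
    result = []

    def walk(node, start):
        if isinstance(node, str):            # a leaf covers exactly one word
            return start + 1
        label, children = node
        end = start
        for child in children:
            end = walk(child, end)
        result.append((label, start, end))
        return end

    walk(tree, 0)
    return result


def consistent(src_span, tgt_span, alignment):
    """Assumed word alignment constraint: links never cross the span boundary."""
    (s_lo, s_hi), (t_lo, t_hi) = src_span, tgt_span
    linked = False
    for i, j in alignment:
        in_src = s_lo <= i < s_hi
        in_tgt = t_lo <= j < t_hi
        if in_src != in_tgt:                 # link leaves the subtree pair
            return False
        linked = linked or in_src
    return linked                            # require at least one link inside


def extract_rules(src_tree, tgt_tree, alignment):
    """Collect every subtree pair that passes the alignment check."""
    rules = []
    for p, s_lo, s_hi in subtree_spans(src_tree):
        for q, t_lo, t_hi in subtree_spans(tgt_tree):
            if consistent((s_lo, s_hi), (t_lo, t_hi), alignment):
                rules.append(((p, s_lo, s_hi), (q, t_lo, t_hi)))
    return rules
```

For the pair "my lunch is a hamburger" and "I had a hamburger" with the alignment my/I, is/had, a/a, hamburger/hamburger, this yields four subtree pairs: the two noun phrase pairs, the verb phrase pair, and the sentence pair.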
Using the trained STSG algorithm, all sentences corresponding to the title syntax tree and the node syntax trees are standardized, converting the title syntax tree and the node syntax trees into the title deep syntax tree and the node deep syntax trees that uniquely express sentence semantics. Where the text in a single tag yields several syntax trees, they are standardized one by one and attributed to that tag.
Word vectors are a language model for natural language processing. The core idea is to map the different characters or words of a language, according to semantic criteria, to high-dimensional vectors whose components are real numbers. The abstract semantic relations between words are embodied in the relations between their word vectors, so a computer can approximate those abstract semantic relations through concrete computation. The directional similarity between word vectors likewise reflects the semantic similarity between words.
After the title deep syntax tree and the node deep syntax trees have been obtained, trained word vectors are used to compare the semantic similarity between syntax trees. The title deep syntax trees and the node deep syntax trees that come from the same text are matched against each other.
Matching uses preorder traversal of the trees. Let $T$ be the title deep syntax tree, and let $A_1, A_2, A_3, \ldots, A_m$ be the $m$ syntax trees, derived from the text deep syntax trees $L_k$, that are to be matched against $T$, where $m$ is the number of candidate texts, $m$ and $k$ are positive integers, and $0 < k \le m$. The $m$ trees are matched in order. $T$ and $A_k$ are traversed in preorder synchronously; if $T$ and $A_k$ have the same structure, computation of the similarity of the two trees begins, otherwise $A_k$ is skipped. When computing the similarity, for nodes $a \in T$ and $b \in A_k$ at the same position, let their word vectors be $t_i$ and $a_i$; the cosine similarity of $t_i$ and $a_i$ gives the word vector similarity $S_i$. If $T$ has $n$ nodes in total, $0 < i \le n$, then the overall similarity of $A_k$ and $T$ is

$$S = \prod_{i=1}^{n} S_i = S_1 \cdot S_2 \cdot S_3 \cdots S_n$$

Finally, the $A_k$ with the maximum similarity to $T$ is taken as the acquisition target, and its node text information is extracted as the target information.
The specific algorithm is as follows:
Input: the title tree $T$ and the set $\{A_1, A_2, A_3, \ldots, A_m\}$ of trees corresponding to the texts $L_k$.
1. For each tree $A_k$ in the set:
2. Traverse $A_k$ and $T$ synchronously in preorder, where $t_i$ and $a_i$ are the word vectors of the nodes currently visited and $S$ is the semantic similarity of $T$ and $L_k$, initialized to $S = 1$;
3. If $t_i$ and $a_i$ are both non-empty, compute $S_i = \cos(t_i, a_i)$ and update $S = S \cdot S_i$; otherwise skip $A_k$ and return to step 1;
4. Repeat the above to obtain the $S$ corresponding to every tree $A_k$.
Output: the overall similarity of $T$ and each $L_k$.
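A minimal sketch of the matching procedure follows. It assumes that deep syntax trees are nested `(word, children)` tuples and that `vectors` is a lookup table of pretrained word vectors; both representations, and the function names, are illustrative assumptions rather than details from the text.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def preorder(tree):
    """Yield (word, number_of_children) for each node in preorder."""
    word, children = tree
    yield word, len(children)
    for child in children:
        yield from preorder(child)


def overall_similarity(title_tree, node_tree, vectors):
    """Product of per-node cosine similarities, or None if the shapes differ."""
    t_nodes = list(preorder(title_tree))
    a_nodes = list(preorder(node_tree))
    # Two ordered trees have the same structure exactly when their preorder
    # sequences of child counts are equal.
    if [k for _, k in t_nodes] != [k for _, k in a_nodes]:
        return None
    s = 1.0
    for (t_word, _), (a_word, _) in zip(t_nodes, a_nodes):
        s *= cosine(vectors[t_word], vectors[a_word])
    return s


def pick_target(title_tree, candidates: dict, vectors):
    """Key of the candidate tree with the highest overall similarity, if any."""
    best_key, best_s = None, None
    for key, tree in candidates.items():
        s = overall_similarity(title_tree, tree, vectors)
        if s is not None and (best_s is None or s > best_s):
            best_key, best_s = key, s
    return best_key
```

Because the overall similarity is a product, a single dissimilar node pair pulls the score of a candidate down sharply, which is worth keeping in mind when choosing word vectors.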
After the overall similarities have been computed, the node text information whose overall similarity with the title text information is the maximum is extracted as the final acquisition target text.
The invention takes the title, which is easy to acquire by template matching with high accuracy and which summarizes the semantics of the body text at a high level, as the semantic matching standard; it performs semantic matching on multiple short texts that may be the body text and takes the text with the highest semantic matching degree as the acquisition target. Where the acquisition target is difficult to screen accurately using information such as web page tags, the invention uses the semantics of the candidate short texts as the filtering information, providing a semantics-based web page data acquisition method and overcoming the earlier limitation that acquisition methods could not use the semantics of the acquisition target itself for identification.
Claims (10)
1. A web page information extraction method based on syntax trees and text block density, characterized by comprising:
obtaining the title text information of a web page; setting a screening threshold, computing the text block density of all nodes of the web page, taking each node whose text block density exceeds the screening threshold as an acquisition node, and extracting the node text information of the acquisition node;
if the number of acquisition nodes is 1, extracting the node text information as the target information;
if the number of acquisition nodes is greater than 1, converting the title text information and the node text information into a title deep syntax tree and node deep syntax trees, respectively, that uniquely express sentence semantics; obtaining the overall similarity between each node deep syntax tree and the title deep syntax tree, and extracting as the target information the node text information corresponding to the maximum overall similarity.
2. The web page information extraction method of claim 1, characterized in that the title text information is obtained by regular expression.
3. The web page information extraction method of claim 1, characterized in that the text block density is obtained as follows:

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$

where $TBD(v)$ is the text block density of node $v$, $v.children$ is the set of child nodes of node $v$, $v_i$ is a child node of node $v$, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in the text block of child node $v_i$, $TN_{v_i}$ is the number of tags contained in the text block of child node $v_i$, and $LTN_{v_i}$ is the number of hyperlink tags contained in the text block of child node $v_i$.
4. The web page information extraction method of claim 1, characterized in that a probabilistic context-free grammar model is used to convert the title text information and the node text information into a title syntax tree and node syntax trees, respectively, and a synchronous tree substitution grammar is used to convert the title syntax tree and the node syntax trees into the title deep syntax tree and the node deep syntax trees, respectively.
5. The web page information extraction method of claim 1, characterized in that the overall similarity is obtained by the following steps:
extracting the title word vectors $t_i$ of the title deep syntax tree, and the text word vectors $a_i$ of each node deep syntax tree whose structure is identical to that of the title deep syntax tree;
computing the word vector similarity of title word vector $t_i$ and text word vector $a_i$ as their cosine similarity,

$$S_i = \cos(t_i, a_i) = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$$

and obtaining the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$;
where $0 < i \le n$, $n$ is a positive integer, and $n$ is the number of nodes of the title deep syntax tree.
6. A web page information extraction system based on syntax trees and text block density, characterized by comprising:
a text information obtaining module, configured to obtain the title text information of a web page and the node text information of the acquisition nodes, which includes setting a screening threshold, computing the text block density of all nodes of the web page, taking each node whose text block density exceeds the screening threshold as an acquisition node, and extracting the node text information of the acquisition node;
a first target information obtaining module, configured to obtain and extract the target information when the number of node text information items is 1;
a second target information obtaining module, configured to obtain and extract the target information when the number of node text information items is greater than 1, wherein the title text information and the node text information are converted into a title deep syntax tree and node deep syntax trees, respectively, that uniquely express sentence semantics; the overall similarity between each node deep syntax tree and the title deep syntax tree is obtained; and the node text information corresponding to the maximum overall similarity is taken as the target information.
7. The web page information extraction system of claim 6, characterized in that, in the text information obtaining module, the title text information is obtained by regular expression.
8. The web page information extraction system of claim 6, characterized in that, in the text information obtaining module, the text block density is obtained as follows:

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$

where $TBD(v)$ is the text block density of node $v$, $v.children$ is the set of child nodes of node $v$, $v_i$ is a child node of node $v$, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in the text block of child node $v_i$, $TN_{v_i}$ is the number of tags contained in the text block of child node $v_i$, and $LTN_{v_i}$ is the number of hyperlink tags contained in the text block of child node $v_i$.
9. The web page information extraction system of claim 6, characterized in that, in the second target information obtaining module, a probabilistic context-free grammar model is used to convert the title text information and the node text information into a title syntax tree and node syntax trees, respectively, and a synchronous tree substitution grammar is used to convert the title syntax tree and the node syntax trees into the title deep syntax tree and the node deep syntax trees, respectively.
10. The web page information extraction system of claim 6, characterized in that the second target information obtaining module further comprises:
a word vector obtaining module, configured to obtain the title word vectors $t_i$ of the title deep syntax tree, and the text word vectors $a_i$ of each node deep syntax tree whose structure is identical to that of the title deep syntax tree;
a similarity obtaining module, configured to compute the word vector similarity of title word vector $t_i$ and text word vector $a_i$ as their cosine similarity,

$$S_i = \cos(t_i, a_i) = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$$

and to obtain the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$; where $0 < i \le n$, $n$ is a positive integer, and $n$ is the number of nodes of the title deep syntax tree.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810355382.8A CN108897749A (en) | 2018-04-19 | 2018-04-19 | Method for abstracting web page information and system based on syntax tree and text block density |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810355382.8A CN108897749A (en) | 2018-04-19 | 2018-04-19 | Method for abstracting web page information and system based on syntax tree and text block density |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108897749A true CN108897749A (en) | 2018-11-27 |
Family
ID=64342530
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810355382.8A Pending CN108897749A (en) | 2018-04-19 | 2018-04-19 | Method for abstracting web page information and system based on syntax tree and text block density |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108897749A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115391711A (en) * | 2022-10-28 | 2022-11-25 | 中新宽维传媒科技有限公司 | Webpage text information extraction method, device, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002091163A1 (en) * | 2001-05-08 | 2002-11-14 | Eizel Technologies, Inc. | Reorganizing content of an electronic document |
CN102184189A (en) * | 2011-04-18 | 2011-09-14 | 北京理工大学 | Webpage core block determining method based on DOM (Document Object Model) node text density |
CN104598577A (en) * | 2015-01-14 | 2015-05-06 | 晶赞广告(上海)有限公司 | Extraction method for webpage text |
CN105095466A (en) * | 2015-07-31 | 2015-11-25 | 山东大学 | Web text information extraction method |
CN107229668A (en) * | 2017-03-07 | 2017-10-03 | 桂林电子科技大学 | A kind of text extracting method based on Keywords matching |
-
2018
- 2018-04-19 CN CN201810355382.8A patent/CN108897749A/en active Pending
Non-Patent Citations (1)
Title |
---|
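Sun Fei: "Research on a Web Page Core Block Extraction Algorithm Based on DOM Node Text Density", China Master's Theses Full-text Database, Information Science and Technology * |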
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776711B (en) | Chinese medical knowledge map construction method based on deep learning | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
Li et al. | Markuplm: Pre-training of text and markup language for visually-rich document understanding | |
US9430742B2 (en) | Method and apparatus for extracting entity names and their relations | |
CN101079024B (en) | Special word list dynamic generation system and method | |
CN112667940B (en) | Webpage text extraction method based on deep learning | |
CN110609983B (en) | Structured decomposition method for policy file | |
CN110598203A (en) | Military imagination document entity information extraction method and device combined with dictionary | |
CN107590219A (en) | Webpage personage subject correlation message extracting method | |
Konstas et al. | Inducing document plans for concept-to-text generation | |
CN100552673C (en) | Open type document isomorphism engines system | |
CN106528583A (en) | Method for extracting and comparing web page main body | |
CN110609998A (en) | Data extraction method of electronic document information, electronic equipment and storage medium | |
CN102270212A (en) | User interest feature extraction method based on hidden semi-Markov model | |
CN111061882A (en) | Knowledge graph construction method | |
CN116127090B (en) | Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction | |
CN109508458A (en) | The recognition methods of legal entity and device | |
CN113569050A (en) | Method and device for automatically constructing government affair field knowledge map based on deep learning | |
CN111178080B (en) | Named entity identification method and system based on structured information | |
CN105389303B (en) | A kind of automatic fusion method of heterologous corpus | |
CN114997288A (en) | Design resource association method | |
CN107145591B (en) | Title-based webpage effective metadata content extraction method | |
CN106372232B (en) | Information mining method and device based on artificial intelligence | |
CN110929518A (en) | Text sequence labeling algorithm using overlapping splitting rule | |
CN114298048A (en) | Named entity identification method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||