CN108897749A - Method for abstracting web page information and system based on syntax tree and text block density - Google Patents

Info

Publication number
CN108897749A
Authority
CN
China
Prior art keywords
node
text
tree
information
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810355382.8A
Other languages
Chinese (zh)
Inventor
舒琦赟
汪立东
刘晓飞
王慧
俞晓明
赵忠华
刘悦
王卿
程学旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS and National Computer Network and Information Security Management Center
Priority to CN201810355382.8A
Publication of CN108897749A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/258: Heading extraction; Automatic titling; Numbering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a web page information extraction method based on syntax trees and text block density, comprising: obtaining the title text information of a web page; setting a screening threshold, calculating the text block density of every node of the web page, taking each node whose text block density is greater than the screening threshold as an acquisition node, and extracting the node text information of the acquisition nodes; if there is exactly one acquisition node, extracting its node text information as the target information; if there is more than one acquisition node, converting the title text information and the node text information into a title deep grammar tree and node deep grammar trees, respectively, each of which uniquely expresses the semantics of its sentence; and obtaining the overall similarity between each node deep grammar tree and the title deep grammar tree, and extracting the node text information corresponding to the maximum overall similarity as the target information.

Description

Method for abstracting web page information and system based on syntax tree and text block density
Technical field
The invention belongs to the field of network data acquisition, and in particular relates to a web page information extraction method and system based on syntax tree semantic recognition and tag text block density ratio.
Background art
With the rapid development of information technology, data is increasingly electronic, and fast, effective network data collection technology has become particularly important. Effective network data collection is necessary for an enterprise to judge its market environment and customer demand, and enterprises with efficient data acquisition capabilities are highly competitive in the big data era. Efficient data acquisition technology also bears on national political security. Now that information technology is maturing, information in many forms spreads rapidly over the network, the body of Internet users keeps growing, public opinion changes quickly, and ideas and speech spread ever faster. This poses a new challenge for the ability to monitor public opinion. In this situation, efficient collection of network information becomes particularly important: it provides the necessary network data for network research departments and bears on national political security.
Driven by these technical needs, network data acquisition has flourished in recent years, and acquisition techniques for different kinds of web page data have emerged one after another. One of the harder technical problems is the extraction of short text from web pages. When the body text of a page is short, identifying the main information becomes more difficult: compared with a page with long body text, the short body is harder to tell apart from useless "noise" in the page such as advertisements. When web page information is screened, the body is then more likely to be filtered out by mistake as junk information, while some advertising text is wrongly extracted as the body.
Summary of the invention
In view of the above problems, the present invention proposes a web page information extraction method based on syntax trees and text block density, comprising: obtaining the title text information of a web page by a regular expression; setting a screening threshold, calculating the text block density of every node of the web page, taking each node whose text block density is greater than the screening threshold as an acquisition node, and extracting the node text information of the acquisition nodes; if the number of acquisition nodes is 1, extracting that node text information as the target information; if the total number of acquisition nodes is greater than 1, converting the title text information and the node text information into a heading syntax tree and node syntax trees, respectively, by a probabilistic context-free grammar (PCFG) model; converting the heading syntax tree and the node syntax trees, by a synchronous tree substitution grammar (STSG), into a title deep grammar tree and node deep grammar trees that uniquely express sentence semantics; and calculating the overall similarity between the title deep grammar tree and each node deep grammar tree, and extracting the node text information corresponding to the maximum overall similarity as the target information.
In the web page information extraction method of the present invention, the text block density is obtained as follows:
$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$
where TBD(v) is the text block density of node v, v.children is the set of child nodes of node v, $v_i$ is a child node of node v, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in that text block, $TN_{v_i}$ is the number of tags contained in that text block, and $LTN_{v_i}$ is the number of hyperlink tags contained in that text block.
In the web page information extraction method of the present invention, the following steps are performed when the total number of acquisition nodes is greater than 1:
Extract the title term vectors $t_i$ of the title deep grammar tree and the text term vectors $a_i$ of each node deep grammar tree whose structure is identical to that of the title deep grammar tree; compute the term vector similarity $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ of title term vector $t_i$ and text term vector $a_i$, and obtain the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$, where $0 < i \le n$, n is a positive integer, and n is the number of nodes of the title deep grammar tree.
The invention further relates to a web page information extraction system based on syntax trees and text block density, comprising:
A text information obtaining module, for obtaining the title text information of a web page by a regular expression and the node text information of the acquisition nodes; this includes setting a screening threshold, calculating the text block density of every node of the web page, taking each node whose text block density is greater than the screening threshold as an acquisition node, and extracting the node text information of the acquisition nodes;
A first target information obtaining module, for obtaining the target information when the number of node text information items is 1;
A second target information obtaining module, for obtaining and extracting the target information when the number of node text information items is greater than 1; here the title text information and the node text information are converted into a heading syntax tree and node syntax trees, respectively, by a probabilistic context-free grammar (PCFG) model; the heading syntax tree and the node syntax trees are converted, by a synchronous tree substitution grammar (STSG), into a title deep grammar tree and node deep grammar trees that uniquely express sentence semantics; and the overall similarity between the title deep grammar tree and each node deep grammar tree is obtained, and the node text information corresponding to the maximum overall similarity is extracted.
In the web page information extraction system of the present invention, the text information obtaining module obtains the text block density as follows:
$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$
where TBD(v) is the text block density of node v, v.children is the set of child nodes of node v, $v_i$ is a child node of node v, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in that text block, $TN_{v_i}$ is the number of tags contained in that text block, and $LTN_{v_i}$ is the number of hyperlink tags contained in that text block.
In the web page information extraction system of the present invention, the second target information obtaining module specifically comprises:
A term vector obtaining module, for obtaining the title term vectors $t_i$ of the title deep grammar tree and the text term vectors $a_i$ of each node deep grammar tree whose structure is identical to that of the title deep grammar tree;
A similarity obtaining module, for computing the term vector similarity $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ of title term vector $t_i$ and text term vector $a_i$ and obtaining the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$, where $0 < i \le n$, n is a positive integer, and n is the number of nodes of the title deep grammar tree.
Brief description of the drawings
Fig. 1 is a flowchart of the web page information extraction method according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the web page information extraction method and system based on syntax trees and text block density proposed by the present invention are described in further detail below with reference to the accompanying drawing. It should be understood that the specific implementations described here only explain the present invention and do not limit it.
The present invention aims to address the deficiencies of the prior art and proposes a data acquisition method that handles web pages with both long and short body text. The result-selection step of the traditional acquisition algorithm based on tag text density ratio is adjusted: instead of selecting a single target tag, the method selects multiple target tags with similar tag densities, so that short body text is no longer filtered out by mistake. The news title, which is easy to obtain, can be acquired with high precision, and summarizes the semantics of the body text, serves as the comparison object for body text semantics, which gives the semantics-based selection of web page text a semantic standard of comparison. Syntax tree analysis is performed on the multiple candidate texts and the article title to build syntax trees, in preparation for the later semantic matching. All the syntax trees built are then transformed: the principal components of each syntax tree are extracted, keeping the keywords and structures of critical significance such as subject, predicate, and object, so that sentences with different structure but identical semantics are unified in form, in preparation for the later comparison of text semantics. Finally, whole-tree semantic matching is performed between the preprocessed body text syntax trees and the title syntax tree, and semantics determine which short text is the acquisition target.
Fig. 1 is a flowchart of the web page information extraction method according to an embodiment of the present invention. As shown in Fig. 1, the steps of the web page information extraction method of the present invention are as follows:
Step S1: obtain the title text information of the web page;
Step S2: run the web page tag text density algorithm;
Step S3: set a screening threshold, screen for acquisition nodes, and extract the node text information of the acquisition nodes;
Step S4: determine the number of acquisition nodes;
Step S5: if there is only one acquisition node, extract the node text information of that acquisition node as the target information;
Step S6: if the number of acquisition nodes is greater than 1, process the title text information and the node text information separately to generate the heading syntax tree and the node syntax trees;
Step S7: normalize all syntax trees to generate the title deep grammar tree and the node deep grammar trees, respectively;
Step S8: calculate the overall similarity between the title deep grammar tree and each node deep grammar tree;
Step S9: select the node text information corresponding to the maximum overall similarity as the target information and extract it.
Specifically, in the embodiment of the present invention, the title and similar information of the article are first obtained by matching with regular expressions dedicated to identifying data whose position is fixed and whose form is uniform, such as the article title, author, and publication time.
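The title matching step can be illustrated with a minimal sketch. The source does not give its regular expressions, so the pattern below is an assumption: it supposes the title sits in the page's `<title>` element, and the names `TITLE_RE` and `extract_title` are hypothetical.

```python
import html
import re
from typing import Optional

# Assumed pattern: the source does not publish its expressions, so this
# simply targets the <title> element of the page.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page_html: str) -> Optional[str]:
    """Return the whitespace-normalized page title, or None if no match."""
    match = TITLE_RE.search(page_html)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = html.unescape(match.group(1))
    return " ".join(text.split())
```

A production system would likely keep one such expression per site template, since the source stresses that these fields have a fixed position and uniform form.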
Next, the text block density TBD of each node is calculated according to formula (1):
$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}} \qquad (1)$$
where TBD (Text Block Density) is the text block density. Let v be a node in the web page parse tree T and Blk(v) the text block rooted at node v; the text block density TBD(v) of node v is defined as the sum, over all child nodes of v, of the ratio of the number of non-link text characters to the number of non-link tags in the text block rooted at that child node.
CN (ContentNumber) is the text block character count, i.e., the number of text characters the text block contains. In general, the text under a body text block is concentrated and its character length is large, while the text under a noise text block is scattered and its character length is small.
LCN (LinkContentNumber) is the text block hyperlink character count, i.e., the number of hyperlink characters the text block contains. A body text block contains little hyperlink text, while a noise text block contains more.
TN (Tag Number) is the text block tag count, i.e., the number of tags the text block contains. A body text block is mostly continuous text with few tags, while a noise text block is scattered text with many tags.
LTN (LinkTagNumber) is the text block hyperlink tag count, i.e., the number of hyperlink tags the text block contains. Text under hyperlink tags is mostly noise; a body text block contains few hyperlink tags, while a noise text block contains many.
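The definitions above can be sketched in code as follows. This is an illustration under stated assumptions, not the patented implementation: the `Node` class is a hypothetical stand-in for a DOM element, the root tag of each text block is counted in TN, hyperlinks are taken to be `a` elements, and a child whose non-link tag count is zero is skipped to avoid division by zero (the source does not say how that case is handled).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical minimal DOM element: a tag, its own text, its children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def block_counts(node: Node, inside_link: bool = False) -> Tuple[int, int, int, int]:
    """Return (CN, LCN, TN, LTN) for the text block rooted at node."""
    is_link = node.tag == "a"
    in_link = inside_link or is_link
    cn = len(node.text)             # text characters
    lcn = cn if in_link else 0      # characters that sit under a hyperlink
    tn = 1                          # assumption: the block's root tag counts
    ltn = 1 if is_link else 0       # hyperlink tags
    for child in node.children:
        c_cn, c_lcn, c_tn, c_ltn = block_counts(child, in_link)
        cn, lcn, tn, ltn = cn + c_cn, lcn + c_lcn, tn + c_tn, ltn + c_ltn
    return cn, lcn, tn, ltn


def text_block_density(v: Node) -> float:
    """TBD(v): sum over children of non-link characters / non-link tags."""
    total = 0.0
    for child in v.children:
        cn, lcn, tn, ltn = block_counts(child)
        non_link_tags = tn - ltn
        if non_link_tags > 0:       # assumption: skip degenerate blocks
            total += (cn - lcn) / non_link_tags
    return total
```

Under these assumptions, a container holding long plain paragraphs scores high, while a navigation list whose text sits entirely inside links scores zero.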
After the text block density TBD of every node is obtained, a screening threshold is set; every node whose TBD value is greater than the screening threshold is designated an acquisition node, and the text information in all acquisition nodes is extracted.
If the screening threshold yields exactly one acquisition node, i.e., the text information obtained is unique, the node text information of that acquisition node is extracted as the target information. If the screening threshold yields more than one acquisition node, i.e., the text information obtained is not unique, the title text information and the node text information of the acquisition nodes are processed to find the node text information most similar to the title text information, which is extracted as the target information.
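The screening and branching logic just described might be sketched as below. The function name `select_target`, the `(tbd, text)` pair representation, and the pluggable `similarity` callable are all assumptions made for illustration; the source does not say what happens when no node exceeds the threshold, so returning `None` in that case is also an assumption.

```python
from typing import Callable, List, Optional, Tuple


def select_target(
    candidates: List[Tuple[float, str]],
    threshold: float,
    title: str,
    similarity: Callable[[str, str], float],
) -> Optional[str]:
    """Pick the target text from (tbd, node_text) pairs.

    similarity(title, text) stands in for the syntax-tree-based overall
    similarity of the method; any scoring function can be plugged in here.
    """
    # Acquisition nodes: text block density strictly greater than threshold.
    acquired = [text for tbd, text in candidates if tbd > threshold]
    if not acquired:
        return None  # assumption: this case is not covered by the source
    if len(acquired) == 1:
        return acquired[0]  # unique candidate, no semantic comparison needed
    # Several candidates: keep the one most similar to the title.
    return max(acquired, key=lambda text: similarity(title, text))
```

Keeping the similarity function as a parameter separates the density screening from the later semantic matching, which mirrors the split between steps S3 to S5 and steps S6 to S9.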
In the embodiment of the present invention, the title text information and the node text information are preprocessed with a probabilistic context-free grammar (PCFG) model: PCFG analysis generates the heading syntax tree of the title text information and the node syntax trees of the node text information. PCFG is a common model for syntactic analysis of natural language. Its parsing algorithm is the same as that of a non-probabilistic context-free grammar, expanding from nonterminal symbols, and for each distinct PCFG parse tree it computes the corresponding probability. When a sentence is ambiguous, these probabilities decide which parse is selected, the criterion being maximum generation probability. Let T be a candidate parse tree; when the sentence is ambiguous, the parse result T* is selected by probability, i.e.:
$$T^{*} = \arg\max_{T} P(T)$$
The generation probability of a candidate parse tree T is the product of the conditional probabilities of all the rules required to generate T:
$$P(T) = \prod_{r \in T} P(r)$$
where r is a rule and P(r) is the probability of that rule.
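The two relations above (a tree's probability as a product of rule probabilities, and selection of the most probable candidate) can be sketched as follows. The tree encoding as nested tuples, the rule table, and the function names are hypothetical; a real PCFG parser would also enumerate the candidate trees, which this sketch takes as given.

```python
from math import prod
from typing import Dict, Iterator, List, Tuple

# A tree is a nested tuple (label, child, ...); a leaf child is a plain str.
Rule = Tuple[str, Tuple[str, ...]]


def rules_of(tree: tuple) -> Iterator[Rule]:
    """Yield every rule (lhs, rhs) used to generate the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_of(child)


def tree_probability(tree: tuple, rule_prob: Dict[Rule, float]) -> float:
    """P(T): product of the probabilities of all rules used in T."""
    return prod(rule_prob[rule] for rule in rules_of(tree))


def best_parse(candidates: List[tuple], rule_prob: Dict[Rule, float]) -> tuple:
    """T*: the candidate tree with the maximum generation probability."""
    return max(candidates, key=lambda t: tree_probability(t, rule_prob))
```

Practical implementations usually sum log-probabilities instead of multiplying, to avoid floating-point underflow on long sentences.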
PCFG is a mature natural language analysis model with a certain ability to disambiguate, and it generates syntax trees with high precision. Moreover, because of the Markov property of the model itself, context is not considered, so the model is insensitive to data sparsity and its analysis results have a certain robustness.
All the syntax trees generated above (the heading syntax tree and the node syntax trees) are then processed further. The embodiment of the present invention uses a synchronous tree substitution grammar (STSG) to convert the heading syntax tree and the node syntax trees into the title deep grammar tree and node deep grammar trees, respectively.
The transformational grammar referred to here is a theory of the relationship between sentence syntax and sentence semantics. It holds that every natural language sentence has two structures, deep and surface. The surface structure is the text visible to the eye as recorded in a document, i.e., the actual word sequence. The deep structure differs from the surface structure and is what actually determines the semantics of a sentence. Multiple sentences with identical semantics but different surface structures correspond to the same deep sentence structure.
For example: My lunch today was a hamburger.
I had a hamburger for lunch today.
Although the two sentences differ in structure, their underlying deep structure is identical, because they express the same meaning.
STSG is a rule self-learning algorithm based on syntax trees: it learns grammar rules automatically from syntax trees. It converts the surface structure of a sentence into its deep structure, so that sentences with different syntax but identical semantics yield the same sentence syntax tree.
The STSG basic rule extraction algorithm is as follows:
Input: a syntax tree pair <T(f), T(e), A>, where A is the alignment relation between T(f) and T(e).
Create an empty basic rule set P.
For each node p of T(f) and each node q of T(e):
t(p) is the subtree of T(f) rooted at p;
t(q) is the subtree of T(e) rooted at q;
A(t(p), t(q)) is the word alignment relation in A that involves t(p) and t(q).
If <t(p), t(q), A(t(p), t(q))> satisfies the word alignment constraint and the syntax constraint,
then add <t(p), t(q), A(t(p), t(q))> to the rule set P.
Output: the basic rule set P
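A rough sketch of this extraction loop is given below. The source does not define the word alignment constraint or the syntax constraint, so the sketch substitutes one common consistency check as an assumption: a subtree pair is kept only if it has at least one alignment link and no link connects a word inside one subtree to a word outside the other. The syntax constraint is not modeled. The tree encoding and function names are hypothetical.

```python
from typing import List, Set, Tuple

# A tree is a nested tuple (label, child, ...); a leaf child is a plain str.
Span = Tuple[tuple, int, int]  # (subtree, first leaf index, end index)


def collect_spans(tree: tuple, start: int, out: List[Span]) -> int:
    """Record (subtree, start, end) for every subtree; leaves fill [start, end)."""
    end = start
    for child in tree[1:]:
        if isinstance(child, str):
            end += 1
        else:
            end = collect_spans(child, end, out)
    out.append((tree, start, end))
    return end


def extract_basic_rules(tf: tuple, te: tuple, alignment: Set[Tuple[int, int]]):
    """Return rules <t(p), t(q), A(t(p), t(q))> passing the assumed check."""
    f_spans: List[Span] = []
    e_spans: List[Span] = []
    collect_spans(tf, 0, f_spans)
    collect_spans(te, 0, e_spans)
    rules = []
    for tp, fs, fe in f_spans:
        for tq, es, ee in e_spans:
            # Links touching either subtree, and links lying inside both.
            touching = {(i, j) for i, j in alignment
                        if fs <= i < fe or es <= j < ee}
            inside = {(i, j) for i, j in touching
                      if fs <= i < fe and es <= j < ee}
            # Assumed constraint: some link exists and none crosses out.
            if touching and touching == inside:
                rules.append((tp, tq, inside))
    return rules
```

Here `alignment` holds pairs of leaf positions, one index into the words of T(f) and one into the words of T(e).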
The trained STSG algorithm is used to standardize all sentences corresponding to the heading syntax tree and the node syntax trees, converting them into the title deep grammar tree and node deep grammar trees that uniquely express sentence semantics. When the text in a single tag yields multiple syntax trees, they are standardized one by one and attributed to that tag.
A term vector (word vector) is a language model for natural language processing. Its core idea is to map the different characters or words of a language to high-dimensional vectors according to semantic criteria, each dimension being a real number. The abstract semantic relations between words are reflected in the relations between their term vectors, which lets a computer approximate abstract semantic relations through concrete calculation. The directional similarity between term vectors likewise reflects the semantic similarity between words.
After the title deep grammar tree and the node deep grammar trees are obtained, trained term vectors are used to compare the semantic similarity between syntax trees. The title deep grammar tree is matched against all node deep grammar trees that come from the same text.
Matching uses preorder traversal of the trees. Let T be the title deep grammar tree and $A_1, A_2, A_3, \ldots, A_m$ the m syntax trees, derived from the text deep grammar trees $L_i$, that are to be matched against T; m is the number of candidate texts, m and i are positive integers, and $0 < i \le m$. These m trees are matched in order. T and $A_i$ are traversed in preorder synchronously; if T and $A_i$ have the same structure, the similarity of the two trees is calculated, otherwise $A_i$ is skipped. When calculating the similarity, for the nodes $a \in T$ and $b \in A_i$ at the same position, let their term vectors be $t_i$ and $a_i$ (the subscript here indexes node positions) and compute the cosine similarity of $t_i$ and $a_i$ to obtain the term vector similarity $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$. If T has n nodes in total, $0 < i \le n$, then the overall similarity of tree $A_i$ and tree T is $S = \prod_{i=1}^{n} S_i$. Finally, the $A_i$ with the maximum similarity to T is taken as the acquisition target, and its node text information is extracted as the target information.
The specific algorithm is as follows:
Input: the title tree T and the set $\{A_1, A_2, A_3, \ldots, A_m\}$ of trees corresponding to the texts $L_i$
1. Take the next tree $A_i$ from the set;
2. Traverse $A_i$ and T synchronously in preorder; $t_i$ and $a_i$ are the term vectors of the nodes being visited, and S is the semantic similarity of T and $L_i$, initialized to S = 1;
3. If $t_i$ and $a_i$ are both non-empty, compute $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ and update $S = S \cdot S_i$; otherwise skip $A_i$ and return to step 1;
4. Obtain the S corresponding to every tree $A_i$ in this way;
Output: the overall similarity of T and each $L_i$.
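The matching procedure can be sketched as follows, under several assumptions: each tree is a hypothetical `(word, [children])` pair, "same structure" is taken to mean the same preorder sequence of child counts, `vectors` is a plain dictionary standing in for a trained term vector model, and a tree containing a word without a vector is skipped. None of these encodings come from the source.

```python
import math
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

# A tree is (word, [child trees]).
Tree = Tuple[str, list]


def preorder(tree: Tree) -> Iterator[Tuple[str, int]]:
    """Yield (word, number of children) for each node in preorder."""
    word, children = tree
    yield word, len(children)
    for child in children:
        yield from preorder(child)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def overall_similarity(title: Tree, node: Tree,
                       vectors: Dict[str, Sequence[float]]) -> Optional[float]:
    """Product of per-node cosine similarities, or None if the tree is skipped."""
    t_nodes, a_nodes = list(preorder(title)), list(preorder(node))
    # Equal preorder child-count sequences imply equal tree shapes.
    if [k for _, k in t_nodes] != [k for _, k in a_nodes]:
        return None
    s = 1.0
    for (t_word, _), (a_word, _) in zip(t_nodes, a_nodes):
        if t_word not in vectors or a_word not in vectors:
            return None  # assumption: a missing vector skips this tree
        s *= cosine(vectors[t_word], vectors[a_word])
    return s


def best_match(title: Tree, trees: List[Tree],
               vectors: Dict[str, Sequence[float]]) -> Optional[int]:
    """Index of the tree with the maximum overall similarity to the title."""
    scored = [(overall_similarity(title, a, vectors), i) for i, a in enumerate(trees)]
    scored = [(s, i) for s, i in scored if s is not None]
    return max(scored)[1] if scored else None
```

Because the overall similarity is a product, a single dissimilar node pair pulls the score toward zero, so the measure rewards trees that agree with the title at every position.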
After the overall similarities are calculated, the node text information whose overall similarity with the title text information is the maximum is selected and extracted as the final acquisition target text.
The present invention takes the title as the semantic matching standard, because the title is easy to acquire by template matching, is acquired with high precision, and summarizes the semantics of the body text at a high level. Semantic matching is performed on the multiple short texts that may be the body text, and the text with the highest semantic match is taken as the acquisition target. Where the acquisition target is hard to screen accurately from information such as web page tags, the semantics of the multiple candidate short texts serve as the screening information. This provides a semantics-based web page data acquisition method and overcomes the limitation of earlier acquisition methods, which could not use the semantics of the acquisition target itself for identification.

Claims (10)

1. A web page information extraction method based on syntax trees and text block density, characterized by comprising:
obtaining the title text information of a web page; setting a screening threshold, calculating the text block density of every node of the web page, taking each node whose text block density is greater than the screening threshold as an acquisition node, and extracting the node text information of the acquisition node;
if the number of acquisition nodes is 1, extracting the node text information as the target information;
if the number of acquisition nodes is greater than 1, converting the title text information and the node text information into a title deep grammar tree and node deep grammar trees, respectively, that uniquely express sentence semantics; and obtaining the overall similarity between each node deep grammar tree and the title deep grammar tree, and extracting the node text information corresponding to the maximum overall similarity as the target information.
2. The web page information extraction method of claim 1, characterized in that the title text information is obtained by a regular expression.
3. The web page information extraction method of claim 1, characterized in that the text block density is obtained as follows:
$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$
where TBD(v) is the text block density of node v, v.children is the set of child nodes of node v, $v_i$ is a child node of node v, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in that text block, $TN_{v_i}$ is the number of tags contained in that text block, and $LTN_{v_i}$ is the number of hyperlink tags contained in that text block.
4. The web page information extraction method of claim 1, characterized in that a probabilistic context-free grammar (PCFG) model is used to convert the title text information and the node text information into a heading syntax tree and node syntax trees, respectively, and a synchronous tree substitution grammar (STSG) is used to convert the heading syntax tree and the node syntax trees into the title deep grammar tree and the node deep grammar trees, respectively.
5. The web page information extraction method of claim 1, characterized in that the overall similarity is obtained by the following steps:
extracting the title term vectors $t_i$ of the title deep grammar tree and the text term vectors $a_i$ of the node deep grammar tree whose structure is identical to that of the title deep grammar tree;
computing the term vector similarity $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ of title term vector $t_i$ and text term vector $a_i$, and obtaining the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$;
where $0 < i \le n$, n is a positive integer, and n is the number of nodes of the title deep grammar tree.
6. A web page information extraction system based on syntax trees and text block density, characterized by comprising:
a text information obtaining module, for obtaining the title text information of a web page and the node text information of the acquisition nodes; this includes setting a screening threshold, calculating the text block density of every node of the web page, taking each node whose text block density is greater than the screening threshold as an acquisition node, and extracting the node text information of the acquisition node;
a first target information obtaining module, for obtaining and extracting the target information when the number of node text information items is 1;
a second target information obtaining module, for obtaining and extracting the target information when the number of node text information items is greater than 1; here the title text information and the node text information are converted into a title deep grammar tree and node deep grammar trees, respectively, that uniquely express sentence semantics; the overall similarity between each node deep grammar tree and the title deep grammar tree is obtained, and the node text information corresponding to the maximum overall similarity is taken as the target information.
7. The web page information extraction system of claim 6, characterized in that the text information obtaining module obtains the title text information by a regular expression.
8. The web page information extraction system of claim 6, characterized in that, in the text information obtaining module, the text block density is obtained as follows:
$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$
where TBD(v) is the text block density of node v, v.children is the set of child nodes of node v, $v_i$ is a child node of node v, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in that text block, $TN_{v_i}$ is the number of tags contained in that text block, and $LTN_{v_i}$ is the number of hyperlink tags contained in that text block.
9. The web page information extraction system of claim 6, characterized in that, in the second target information obtaining module, a probabilistic context-free grammar (PCFG) model is used to convert the title text information and the node text information into a heading syntax tree and node syntax trees, respectively, and a synchronous tree substitution grammar (STSG) is used to convert the heading syntax tree and the node syntax trees into the title deep grammar tree and the node deep grammar trees, respectively.
10. The web page information extraction system of claim 6, characterized in that the second target information obtaining module further comprises:
a term vector obtaining module, for obtaining the title term vectors $t_i$ of the title deep grammar tree and the text term vectors $a_i$ of the node deep grammar tree whose structure is identical to that of the title deep grammar tree;
a similarity obtaining module, for computing the term vector similarity $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ of title term vector $t_i$ and text term vector $a_i$ and obtaining the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$; where $0 < i \le n$, n is a positive integer, and n is the number of nodes of the title deep grammar tree.
CN201810355382.8A 2018-04-19 2018-04-19 Method for abstracting web page information and system based on syntax tree and text block density Pending CN108897749A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810355382.8A CN108897749A (en) 2018-04-19 2018-04-19 Method for abstracting web page information and system based on syntax tree and text block density

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810355382.8A CN108897749A (en) 2018-04-19 2018-04-19 Method for abstracting web page information and system based on syntax tree and text block density

Publications (1)

Publication Number Publication Date
CN108897749A true CN108897749A (en) 2018-11-27

Family

ID=64342530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810355382.8A Pending CN108897749A (en) 2018-04-19 2018-04-19 Method for abstracting web page information and system based on syntax tree and text block density

Country Status (1)

Country Link
CN (1) CN108897749A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115391711A * | 2022-10-28 | 2022-11-25 | 中新宽维传媒科技有限公司 | Webpage text information extraction method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2002091163A1 * | 2001-05-08 | 2002-11-14 | Eizel Technologies, Inc. | Reorganizing content of an electronic document
CN102184189A * | 2011-04-18 | 2011-09-14 | 北京理工大学 | Webpage core block determining method based on DOM (Document Object Model) node text density
CN104598577A * | 2015-01-14 | 2015-05-06 | 晶赞广告(上海)有限公司 | Extraction method for webpage text
CN105095466A * | 2015-07-31 | 2015-11-25 | 山东大学 | Web text information extraction method
CN107229668A * | 2017-03-07 | 2017-10-03 | 桂林电子科技大学 | A kind of text extracting method based on Keywords matching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
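Title
Sun Fei (孙飞): "Research on a Web Page Core Block Extraction Algorithm Based on DOM Node Text Density" (《基于DOM节点文本密度的网页核心块抽取算法研究》), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》) *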

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination