CN107463696A - A kind of method of Webpage largest block extraction - Google Patents

A kind of method of Webpage largest block extraction

Info

Publication number
CN107463696A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710694534.2A
Other languages
Chinese (zh)
Inventor
苑聪虎
程国艮
李世奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mandarin Technology (beijing) Co Ltd
Original Assignee
Mandarin Technology (beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-08-15
Filing date
2017-08-15
Publication date
2017-12-12
Application filed by Mandarin Technology (beijing) Co Ltd
Priority to CN201710694534.2A
Publication of CN107463696A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/154 Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention belongs to the field of web page design and discloses a method for Webpage largest block extraction. The method first obtains the web page source code and uses regular expressions to replace everything that is not displayed on the page, such as tags, with blank lines. It then counts the characters in each line of text, which yields a per-line distribution function. Next, every 3 consecutive lines are combined into one block whose value is the sum of the 3 line lengths, giving a new, more concentrated text distribution function. Finally, the points where this concentrated function rises or drops sharply are located, and the web page body text that they delimit is extracted and saved. The invention can greatly reduce the labor and maintenance costs of web page content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.

Description

A method for Webpage largest block extraction
Technical field
The invention belongs to the field of web page design, and in particular relates to a method for Webpage largest block extraction.
Background
At present, page extraction mainly relies on extraction rules customized for each website. The page to be crawled is analyzed first, and an extraction rule is written by hand according to the structure of that page. Rules are written in XPath or as regular expressions. The page content is then extracted according to the written rule and saved locally, which completes the extraction of one web page. The maintenance cost of this approach is excessive: every website whose content is crawled needs its own set of rules, and once the format of a web page changes, the rules written for that website become invalid, which makes page content extraction difficult.
In summary, the problems of the prior art are as follows. Current techniques extract content with rules specific to particular web pages, so extracting content from a large number of web pages involves writing, optimizing and maintaining a large number of extraction rules. This consumes substantial human resources, and the original extraction rules fail as soon as the web page format changes. The present invention is proposed to avoid extraction program failures caused by page changes, to reduce development and maintenance costs, and to improve the intelligence and adaptive ability of web page content extraction.
Summary of the invention
In view of the problems in the prior art, the invention provides a method for Webpage largest block extraction.
The invention is implemented as a method for Webpage largest block extraction that specifically includes:
Obtaining the html code of the web page, then preprocessing it: decoding the content, removing scripts and removing special characters. After the html source is obtained, the content is decoded according to the encoding declared by the website; if the website declares no content encoding in its meta tags, the content is decoded as utf8 by default. After decoding, the regular expression '<script[^>]*>.*</script>' is used to remove the scripts enclosed in <script></script>, and special comment text <!--...--> is removed as well. The escaped special characters &nbsp; &lt; &gt; &amp; &quot; &apos; and so on are unescaped into the corresponding space, <, >, &, ", ' and so on;
Removing the formatting tags to obtain the text of the whole page: the regular expression "<[^>]*>" is applied to the text to remove all tags in the html. In the resulting rough text distribution, the tags and scripts that were processed have become blank lines;
Calculating the distribution function of the number of characters per line block;
Finding the points of sharp rise and sharp drop from the changes in the distribution function to obtain the valuable text;
Processing the text: according to the line numbers obtained, the content of each line is stitched together in line-number order to form the required body content. The body content obtained this way still contains empty lines, so the regular expression '^\n' is used to replace them with the empty string. The article body is finally obtained.
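The preprocessing and tag-removal steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function name rough_text, the non-greedy matching, the DOTALL and IGNORECASE flags, and the choice to unescape entities after tag removal (so that an escaped &lt;b&gt; in the visible text is not mistaken for a tag) are all assumptions added here. Each removed span is replaced by the newlines it contained, so the line numbers of the remaining text are unchanged.

```python
import html
import re

# Patterns modelled on the regular expressions quoted in the text.
# Non-greedy matching and the DOTALL/IGNORECASE flags are assumptions added
# so that multi-line scripts are matched one element at a time.
SCRIPT_RE = re.compile(r"<script[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")


def _keep_newlines(match):
    """Replace a match with only its newlines so line numbers stay aligned."""
    return "\n" * match.group(0).count("\n")


def rough_text(page_source):
    """Return the page text with scripts, comments and tags blanked out."""
    text = SCRIPT_RE.sub(_keep_newlines, page_source)
    text = COMMENT_RE.sub(_keep_newlines, text)
    text = TAG_RE.sub(_keep_newlines, text)
    # html.unescape maps &lt; &gt; &amp; &quot; &apos; to their characters;
    # &nbsp; becomes U+00A0, which is turned into a plain space here.
    return html.unescape(text).replace("\xa0", " ")
```

Lines that held only markup come out empty, which is what the later line-length statistics rely on.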
Further, calculating the distribution function of the number of characters per line block includes: establishing, on the rough text, a text length distribution function based on line number. Plotted on coordinate axes, the x-axis represents the text line number, LN, and the y-axis represents the text length of that line, LL;
Then, with a step of 3, the text lengths of every 3 consecutive lines are merged into one block, so that there are LN/3 blocks. This concentrates the text lengths and makes the sharp rise and sharp drop points easier to distinguish.
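A short Python sketch of the line-length function and the step-3 merge is given below. The names line_lengths and merge_blocks are hypothetical, and two details are assumptions not stated in the text: whitespace at the ends of a line is not counted, and a trailing group of fewer than 3 lines still forms a block.

```python
def line_lengths(rough_text):
    """LL for each line number LN: characters left after trimming whitespace."""
    return [len(line.strip()) for line in rough_text.split("\n")]


def merge_blocks(lengths, step=3):
    """Sum every `step` consecutive line lengths into one block length."""
    return [sum(lengths[i:i + step]) for i in range(0, len(lengths), step)]
```

For instance, eight lines of lengths 2, 0, 7, 0, 0, 0, 30, 41 would be merged into three blocks, and the jump between the second and third block is much easier to see than in the per-line values.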
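Further, finding the points of sharp rise and sharp drop from the changes in the distribution function to obtain the valuable text includes:
Let the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$. The sum of the squared deviations of the points from this curve is:
$$R^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right]^2$$
To find the values of $a$ that fit the data, take the partial derivative of the above expression with respect to each $a_j$ ($j = 0, 1, \dots, k$) and set it to zero:
$$-2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right] x_i^j = 0$$
After simplification, for each $j$:
$$a_0 \sum_{i=1}^{n} x_i^{j} + a_1 \sum_{i=1}^{n} x_i^{j+1} + \dots + a_k \sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i$$
Writing the fitting problem in matrix form gives:
$$X = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix}, \quad A = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Here $XA = Y$ is solved in the least-squares sense, so $A = (X^{T} X)^{-1} X^{T} Y$. This yields the coefficient matrix $A$, that is, the parameters of the fitted curve, and hence the fitted curve itself. Differentiating the fitted curve gives its gradient function:
$$y' = a_1 + 2 a_2 x + \dots + k a_k x^{k-1}$$
All line numbers of the page are substituted into this function to obtain the gradient of the curve at each point. The differences between adjacent gradients are then calculated, their absolute values are taken, and the line number with the largest difference is found;
The line number with the largest difference is the point where the text rises or drops sharply. This gives the page body, and the text is finally extracted.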
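The least-squares fit can be sketched in plain Python by building the normal equations and solving them with Gaussian elimination. This is an illustrative sketch, not the applicant's code: the name fit_polynomial is hypothetical, and the sketch assumes there are more distinct x values than the polynomial degree so that the system is solvable. For real pages with hundreds of lines and a high degree, the powers of x grow large, so rescaling x or using a numerical library would be advisable.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [a0, ..., ak] of y = a0 + a1*x + ... + ak*x^k."""
    size = degree + 1
    # Normal equations: for each row r, sum_c a_c * sum_i x_i^(r+c) = sum_i x_i^r * y_i.
    matrix = [[sum(x ** (r + c) for x in xs) for c in range(size)]
              for r in range(size)]
    rhs = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(size)]

    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(matrix[r][col]))
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, size):
            factor = matrix[row][col] / matrix[col][col]
            for c in range(col, size):
                matrix[row][c] -= factor * matrix[col][c]
            rhs[row] -= factor * rhs[col]

    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        known = sum(matrix[row][c] * coeffs[c] for c in range(row + 1, size))
        coeffs[row] = (rhs[row] - known) / matrix[row][row]
    return coeffs
```

In this setting xs would be the line (or block) numbers and ys the corresponding text lengths.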
Another object of the invention is to provide a Webpage largest block extraction system.
The invention exploits the fact that the useful information displayed on a web page is largely concentrated in contiguous blocks. The largest-block method can locate the body content of a page quickly and effectively and extract it. Existing rule-based methods for extracting web page body content involve a heavy workload: if one website needs several rules, the workload for thousands of websites is very large, and maintaining that many rules also requires substantial cost.
For example, to extract website content with the existing rule-based method, 1000 websites require 10 people to write the corresponding rules, which takes about one week, and maintaining the content crawling requires another 2 people. The invention requires no rule writing and needs only 1 person to maintain the crawler, which saves the workload of 11 people.
In addition, rule-based methods face the risk of website page structure changes: once a page changes, all the rules previously designed for that website fail. The invention can greatly reduce the labor and maintenance costs of web page content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.
Brief description of the drawings
Fig. 1 is a flow chart of the method for Webpage largest block extraction provided by an embodiment of the invention.
Fig. 2 is a schematic diagram of the distribution obtained for a page by largest-block analysis, provided by an embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Current techniques extract content with rules specific to particular web pages, so extracting content from a large number of web pages involves writing, optimizing and maintaining a large number of extraction rules. This consumes substantial human resources, and the original extraction rules fail as soon as the web page format changes. The present invention is proposed to avoid extraction program failures caused by page changes, to reduce development and maintenance costs, and to improve the intelligence and adaptive ability of web page content extraction.
The application principle of the invention is described in detail below with reference to the drawings.
The method for Webpage largest block extraction provided by an embodiment of the invention includes:
The web page source code is obtained first, and regular expressions are used to replace everything that is not displayed on the page, such as tags, with blank lines. The number of characters in each line is then counted, which forms a function distribution like the one in Fig. 2. The distribution formed this way is a single curve. Next, blocks are combined with a step of 3 (the length of each block is the sum of 3 line lengths), which forms a new, more concentrated text distribution function. Finally, the points of sharp rise and sharp drop are located from this concentrated function, and the web page body text that they delimit is extracted and saved.
As shown in Fig. 1, the method for Webpage largest block extraction provided by an embodiment of the invention specifically includes:
S101: Obtain the html code of the web page, then preprocess it (decode the content, remove scripts, remove special characters).
S102: Remove the formatting tags to obtain the text of the whole page.
S103: Calculate the distribution function of the number of characters per line block.
S104: Find the points of sharp rise and sharp drop from the changes in the distribution function to obtain the valuable text.
S105: Process the text: remove the blank lines within the text to obtain the effective body content, and save it.
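Step S105 can be illustrated with the following Python sketch, which joins the selected lines and then deletes empty lines with a regular expression. The name assemble_body and the inclusive, 0-based first/last arguments are hypothetical. The pattern is read here as '^\n' (a newline at the very start of a line), and the MULTILINE flag is an assumption needed for '^' to match at every line start rather than only at the start of the string.

```python
import re

# '^\n' matches a newline that sits at the start of a line, i.e. an empty line.
BLANK_LINE_RE = re.compile(r"^\n", re.MULTILINE)


def assemble_body(lines, first, last):
    """Join lines[first..last] (inclusive, 0-based) and drop empty lines."""
    body = "\n".join(lines[first:last + 1]) + "\n"
    return BLANK_LINE_RE.sub("", body).rstrip("\n")
```

Lines that contain only spaces are not matched by this pattern, so a stricter variant might strip each line before joining.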
The invention is further described below with reference to a specific embodiment.
The method for Webpage largest block extraction provided by an embodiment of the invention includes:
1) The html source is obtained by crawling the target site with a general-purpose crawler, and the returned content is converted into string data.
2) After the html source is obtained, the content is decoded according to the encoding declared by the website; if the website declares no content encoding in its meta tags, the content is decoded as utf8 by default. After decoding, regular expressions are used to remove the scripts enclosed in <script></script> and special comment text <!--...-->. A regular expression is then applied to the text to remove all tags in the html. This produces the rough text distribution, in which the tags, scripts and so on processed above have all become blank lines.
3) A text length distribution function based on line number is established on the rough text. Plotted on coordinate axes, the x-axis represents the text line number, LN, and the y-axis represents the text length of that line, LL. Then, to make the jumps in the text distribution stand out, the text lengths of every 3 consecutive lines are merged with a step of 3 into one block (for example, if line 1 has length 1, line 2 has length 4 and line 3 has length 5, the first block has length 10 after merging, and likewise for lines 4, 5, 6 and so on). There are therefore LN/3 blocks. This concentrates the text lengths so that the sharp rise and sharp drop points are easier to distinguish.
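The encoding fallback described in step 2) could look like the following Python sketch. It is an assumption-laden illustration rather than the applicant's code: the name decode_page, the regular expression used to find a charset declaration in a meta tag, the 4096-byte search window and the errors="replace" policy are all choices made here for the example.

```python
import re

# Looks for charset=... inside a <meta> tag near the start of the page.
META_CHARSET_RE = re.compile(
    rb"<meta[^>]*charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def decode_page(raw_bytes, default="utf-8"):
    """Decode raw page bytes using the meta charset, else the default."""
    match = META_CHARSET_RE.search(raw_bytes[:4096])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw_bytes.decode(encoding, errors="replace")
    except LookupError:
        # The page declared a charset name Python does not know:
        # fall back to the default encoding.
        return raw_bytes.decode(default, errors="replace")
```

The returned string is what the later script, comment and tag removal steps operate on.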
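Largest-block analysis of a page yields a distribution like the one shown in Fig. 2.
From this distribution function, the sharp rise point and sharp drop point of the text are found. The principle behind the curve is as follows.
Let the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$. The sum of the squared deviations of the points from this curve is:
$$R^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right]^2$$
To find the values of $a$ that fit the data, take the partial derivative of the above expression with respect to each $a_j$ ($j = 0, 1, \dots, k$) and set it to zero:
$$-2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right] x_i^j = 0$$
After simplification, for each $j$:
$$a_0 \sum_{i=1}^{n} x_i^{j} + a_1 \sum_{i=1}^{n} x_i^{j+1} + \dots + a_k \sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i$$
Writing the fitting problem in matrix form gives:
$$X = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix}, \quad A = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Here $XA = Y$ is solved in the least-squares sense, so $A = (X^{T} X)^{-1} X^{T} Y$. This yields the coefficient matrix $A$, that is, the parameters of the fitted curve, and hence the fitted curve itself. Differentiating the fitted curve gives its gradient function:
$$y' = a_1 + 2 a_2 x + \dots + k a_k x^{k-1}$$
All line numbers of the page are substituted into this function to obtain the gradient of the curve at each point. The differences between adjacent gradients are then calculated, their absolute values are taken, and the line number with the largest difference is found.
This line number is the point where the text rises or drops sharply, for example lines 150-225 in Fig. 2. The page body is thus obtained, and extracting the text completes the process.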
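The gradient comparison described above can be sketched in Python as follows, given coefficients [a0, ..., ak] of an already fitted curve. The names gradient and largest_gradient_change are hypothetical. The text does not spell out how the start and the end of the body (such as lines 150-225 in Fig. 2) are separated, so this sketch only returns the single position with the largest change between adjacent gradients; treating the returned position as one boundary is an assumption.

```python
def gradient(coeffs, x):
    """Evaluate y' = a1 + 2*a2*x + ... + k*ak*x^(k-1) at x."""
    return sum(i * a * x ** (i - 1) for i, a in enumerate(coeffs) if i > 0)


def largest_gradient_change(coeffs, xs):
    """Return the x in xs where |y'(x) - y'(previous x)| is largest."""
    grads = [gradient(coeffs, x) for x in xs]
    diffs = [abs(grads[i] - grads[i - 1]) for i in range(1, len(grads))]
    best = max(range(len(diffs)), key=diffs.__getitem__)
    return xs[best + 1]
```

A fuller version would probably look for one strong positive change and one strong negative change and take the lines between them as the body.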
The invention exploits the fact that the useful information displayed on a web page is largely concentrated in contiguous blocks. The largest-block method can locate the body content of a page quickly and effectively and extract it. Existing rule-based methods for extracting web page body content involve a heavy workload: if one website needs several rules, the workload for thousands of websites is very large, and maintaining that many rules also requires substantial cost. In addition, rule-based methods face the risk of website page structure changes: once a page changes, all the rules previously designed for that website fail. The invention can greatly reduce the labor and maintenance costs of web page content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (5)

  1. A method for Webpage largest block extraction, characterized in that the method includes:
    first obtaining the web page source code, and using regular expressions to replace everything that is not displayed on the page, such as tags, with blank lines;
    then calculating the function formed by the number of characters in each line, the function distribution so formed being a single curve;
    then combining blocks with a step of 3, each block being the sum of 3 line lengths, to form a new, more concentrated text distribution function;
    then locating the points of sharp rise and sharp drop from the concentrated function, extracting the web page body text that they delimit, and saving it.
  2. 2. the method for Webpage largest block extraction as claimed in claim 1, it is characterised in that the Webpage largest block The method of extraction, is specifically included:
    Webpage html codes are obtained, the pretreatment then encoded to content, remove script, removed spcial character;Get Encoded after html source codes according to website and carry out research content, canonical is used after the completion of coding '<script[^>]*>.*</ script>' remove<script></script>Comprising script and special annotation text<!--...-->, for escape Te Shuzifu &nbsp;&lt;&gt;&amp;&quot;&apos escapes into corresponding space,<、>、&、”、’;
    Remove the text that format tags obtain full page;To text canonical "<[^>]*>" label is removed, remove html In all label;The script of the label of processing is set to become blank line using rough text distribution;
    Calculate the distribution function of row block number of characters;
    The point to rise sharply with rapid drawdown is found according to distribution function change, obtains valuable text;
    Text is handled, according to obtained line number, the content that every row represents is stitched together to form body matter according to line number, if To body matter in also have null;Then with canonical ' ^ n ' be substituted for sky, remove null;Finally obtain article main body.
  3. The method for Webpage largest block extraction as claimed in claim 2, characterized in that calculating the distribution function of the number of characters per line block includes: establishing, on the rough text, a text length distribution function based on line number; plotted on coordinate axes, the x-axis represents the text line number, LN, and the y-axis represents the text length of that line, LL;
    then, with a step of 3, merging the text lengths of every 3 consecutive lines into one block, so that there are LN/3 blocks, which concentrates the text lengths and makes the sharp rise and sharp drop points easier to distinguish.
  4. The method for Webpage largest block extraction as claimed in claim 2, characterized in that finding the points of sharp rise and sharp drop from the changes in the distribution function to obtain the valuable text includes:
    letting the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$, the sum of the squared deviations of the points from this curve being:
    $$R^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right]^2;$$
    to find the values of $a$ that fit the data, taking the partial derivative of the above expression with respect to each $a_j$ ($j = 0, 1, \dots, k$) and setting it to zero:
    $$-2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right] x_i^j = 0;$$
    which after simplification gives, for each $j$:
    $$a_0 \sum_{i=1}^{n} x_i^{j} + a_1 \sum_{i=1}^{n} x_i^{j+1} + \dots + a_k \sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i;$$
    writing the fitting problem in matrix form:
    $$X = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix}, \quad A = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix};$$
    where $XA = Y$ is solved in the least-squares sense, so $A = (X^{T} X)^{-1} X^{T} Y$, which yields the coefficient matrix $A$, that is, the parameters of the fitted curve, and hence the fitted curve; differentiating the fitted curve gives its gradient function:
    $$y' = a_1 + 2 a_2 x + \dots + k a_k x^{k-1};$$
    substituting all line numbers of the page into this function to obtain the gradient of the curve at each point, then calculating the differences between adjacent gradients, taking their absolute values, and finding the line number with the largest difference;
    the line number with the largest difference being the point where the text rises or drops sharply, which gives the page body, and finally extracting the text.
  5. A Webpage largest block extraction system using the method for Webpage largest block extraction as claimed in claim 1.
CN201710694534.2A 2017-08-15 2017-08-15 A kind of method of Webpage largest block extraction Pending CN107463696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710694534.2A CN107463696A (en) 2017-08-15 2017-08-15 A kind of method of Webpage largest block extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710694534.2A CN107463696A (en) 2017-08-15 2017-08-15 A kind of method of Webpage largest block extraction

Publications (1)

Publication Number Publication Date
CN107463696A true CN107463696A (en) 2017-12-12

Family

ID=60549634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710694534.2A Pending CN107463696A (en) 2017-08-15 2017-08-15 A kind of method of Webpage largest block extraction

Country Status (1)

Country Link
CN (1) CN107463696A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102456050A (en) * 2010-10-27 2012-05-16 中国移动通信集团四川有限公司 Method and device for extracting data from webpage
CN102955818A (en) * 2011-08-31 2013-03-06 镇江诺尼基智能技术有限公司 Method for acquiring full names in Chinese from Web page
US20140095463A1 (en) * 2012-06-06 2014-04-03 Derek Edwin Pappas Product Search Engine

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JAIRUSCHAN: "最小二乘法多项式曲线拟合原理与实现", 《HTTPS://BLOG.CSDN.NET/JAIRUSCHAN/ARTICLE/DETAILS/7517773》 *
谭守标 等: "Web信息抽取及知识表示系统的研究与实现", 《计算机系统应用》 *
陈鑫: "基于行块分布函数的通用网页正文抽取", 《百度文库》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271145A (en) * 2018-09-03 2019-01-25 科大国创软件股份有限公司 Fast regular method for customizing based on pythonQT and intelligent algorithm
CN109271145B (en) * 2018-09-03 2021-12-14 科大国创软件股份有限公司 Quick rule customizing method based on pythonQT and intelligent algorithm
CN110881056A (en) * 2018-09-05 2020-03-13 百度在线网络技术(北京)有限公司 Method and device for pushing information
CN113537091A (en) * 2021-07-20 2021-10-22 东莞市盟大塑化科技有限公司 Webpage text recognition method and device, electronic equipment and storage medium
CN113537091B (en) * 2021-07-20 2024-05-03 东莞盟大集团有限公司 Webpage text recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100040 Shijingshan Road, Shijingshan District, Beijing, No. 20, 16 layer 1601

Applicant after: Chinese translation language through Polytron Technologies Inc

Address before: 100040 Shijingshan District railway building, Beijing, the 16 floor

Applicant before: Mandarin Technology (Beijing) Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20171212