CN107463696A - A kind of method of Webpage largest block extraction - Google Patents
A method for extracting the largest block of a web page
- Publication number
- CN107463696A (application number CN201710694534.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/154—Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention belongs to the field of web page design and discloses a method for extracting the largest block of a web page. The method first obtains the web page source code and uses regular expressions to replace everything that is not displayed on the page, such as tags, with blank lines. It then counts the characters on each line, which yields a per-line text-length distribution function whose plot forms a curve. Next, lines are combined into blocks with a step of 3, each block holding the total length of its 3 lines, which yields a new and more concentrated text distribution function. Finally, the points where this denser function rises and falls sharply are found and used to extract the web page body text, which is generated and saved. The invention can greatly reduce the labor and maintenance costs of web page content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.
Description
Technical field
The invention belongs to the field of web page design, and in particular relates to a method for extracting the largest block of a web page.
Background art
At present, page extraction mainly relies on rules customized for each website. The text pages to be crawled are analyzed first, and extraction rules are handwritten according to the page structure. The rules are written in XPath or as regular expressions; the content of a specific page is then extracted according to these rules and saved locally, which completes the extraction of one web page. The maintenance cost of this approach is very high, because every crawled website needs its own set of rules. Once the page format changes, the rules written for that website become invalid, which makes page content extraction difficult.
In summary, the problems of the prior art are as follows. Current techniques extract content with rules specific to particular web pages, so extracting content from a large number of pages involves writing, optimizing and maintaining a large number of extraction rules. This consumes substantial human resources, and the existing rules fail as soon as the page format changes. The present invention is proposed to prevent extraction programs from failing when pages change, to reduce development and maintenance costs, and to improve the intelligence and adaptability of web page content extraction.
Summary of the invention
In view of the problems of the prior art, the invention provides a method for extracting the largest block of a web page.
The invention is realized as a method for extracting the largest block of a web page, which specifically includes:
Obtaining the HTML code of the web page and preprocessing it: decoding the content, removing scripts, and removing special characters. After the HTML source is obtained, the content is decoded according to the encoding declared by the website; if the page has no meta tag that declares the content encoding, the content is decoded as UTF-8 by default. After decoding, the regular expression '<script[^>]*>.*</script>' is used to remove the scripts enclosed in <script></script>, and the special comment text <!--...--> is removed as well. The escaped special characters &nbsp;, &lt;, &gt;, &amp;, &quot;, &apos; and the like are converted into the corresponding characters: space, <, >, &, ", ' and so on;
Removing the formatting tags to obtain the text of the whole page: the regular expression "<[^>]*>" is applied to the text to remove all tags in the HTML. In the resulting rough text distribution, the lines that held the processed tags and scripts have become blank lines;
Calculating the distribution function of the line-block character counts;
Finding the points of sharp rise and sharp fall from the changes of the distribution function, and obtaining the valuable text;
Processing the text: according to the line numbers obtained, the content of each line is joined in line-number order to form the required body content. The body content obtained this way still contains empty lines, so the regular expression '^\n' is used to replace them with the empty string. The article body is finally obtained.
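The preprocessing and tag-removal steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `rough_text` and the helper `_blank_keep_newlines` are hypothetical, the script pattern is made non-greedy and multi-line so that one match cannot swallow the text between two separate script elements, and entities are unescaped after tag removal (rather than before) so that an escaped `&lt;` cannot be mistaken for a tag.

```python
import html
import re
from typing import Optional

# Patterns modelled on the ones quoted in the text; the non-greedy ".*?" and
# the DOTALL flag are assumptions added so that multi-line scripts match.
SCRIPT_RE = re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")


def _blank_keep_newlines(match):
    # Drop the markup but keep its newlines, so removed markup turns into
    # blank lines and line numbers still match the original source.
    return "\n" * match.group(0).count("\n")


def rough_text(raw: bytes, encoding: Optional[str] = None) -> str:
    """Return the rough text of a page: markup removed, line layout preserved."""
    # Fall back to UTF-8 when the page declares no encoding.
    text = raw.decode(encoding or "utf-8", errors="replace")
    text = SCRIPT_RE.sub(_blank_keep_newlines, text)
    text = COMMENT_RE.sub(_blank_keep_newlines, text)
    text = TAG_RE.sub(_blank_keep_newlines, text)
    # html.unescape maps &nbsp; to U+00A0, so convert that to a plain space.
    return html.unescape(text).replace("\xa0", " ")
```

Keeping the newlines of removed markup is what lets the later steps treat line numbers as positions in the page.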
Further, calculating the distribution function of the line-block character counts includes: establishing a line-number-based text-length distribution function over the rough text. The function is plotted on coordinate axes, where the x-axis represents the text line number, LN, and the y-axis represents the text length of that line, LL;
The line text lengths are then merged with a step of 3, every 3 consecutive lines forming one block, so that there are LN/3 blocks. This concentrates the text lengths and makes the sharp-rise and sharp-fall points easier to distinguish.
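A short Python sketch of the line-length distribution and the step-3 merge is given below. The function names `line_lengths` and `block_lengths` are hypothetical, and two details are assumptions the text does not settle: surrounding whitespace is stripped before counting, and a final block with fewer than 3 lines is kept as it is.

```python
def line_lengths(rough):
    """LL for each line number LN: character count of every line of the rough text."""
    # Assumption: leading and trailing whitespace does not count as text.
    return [len(line.strip()) for line in rough.split("\n")]


def block_lengths(lengths, step=3):
    """Merge consecutive lines into non-overlapping blocks of `step` lines."""
    # Each block holds the total length of its lines; the last block may be short.
    return [sum(lengths[i:i + step]) for i in range(0, len(lengths), step)]
```

For example, line lengths of 1, 4 and 5 merge into a single block of length 10.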
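Further, finding the points of sharp rise and sharp fall from the changes of the distribution function and obtaining the valuable text includes:
Let the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$. The sum of the squared distances from each point to this curve, i.e. the sum of squared deviations, is:

$$R^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right]^2$$

To find the values of $a$ that fit the data, the partial derivative of the formula above is taken with respect to each $a_j$ ($j = 0, \dots, k$) and set to zero, which gives:

$$-2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right] x_i^j = 0$$

After simplification:

$$a_0 \sum_{i=1}^{n} x_i^{j} + a_1 \sum_{i=1}^{n} x_i^{j+1} + \dots + a_k \sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i$$

These equations are then written in matrix form, which simplifies to:

$$\begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

That is, $XA = Y$, so $A = (X^{T}X)^{-1}X^{T}Y$. This gives the coefficient matrix $A$, i.e. the parameters of the fitted curve, and hence the fitted curve itself. Differentiating the fitted curve gives its gradient function:

$$y' = a_1 + 2 a_2 x + \dots + k a_k x^{k-1}$$

Substituting all line numbers of the page gives the gradient of the curve at each point. The difference between each pair of neighboring gradients is then calculated and its absolute value taken, and the line number with the largest difference is found;
This line number is the point where the text rises or falls sharply. The page body is thus obtained, and the text is finally extracted.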
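The fitting and gradient-difference step can be sketched in plain Python as follows. The function names `polyfit`, `gradient` and `sharpest_change` are hypothetical, and the polynomial degree (3 by default here) is an assumption, since the text does not state which degree is used. The sketch solves the normal equations $(X^{T}X)A = X^{T}Y$ by Gauss-Jordan elimination; for high degrees or large line numbers this system becomes ill-conditioned, so a production version would probably rescale x or use a numerical library.

```python
def polyfit(xs, ys, degree):
    """Least-squares coefficients [a0, ..., ak] from the normal equations."""
    n = degree + 1
    # Augmented matrix [X^T X | X^T Y], built from power sums of the data.
    m = [
        [sum(x ** (r + c) for x in xs) for c in range(n)]
        + [sum((x ** r) * y for x, y in zip(xs, ys))]
        for r in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[r][n] / m[r][r] for r in range(n)]


def gradient(coeffs, x):
    """Evaluate y' = a1 + 2*a2*x + ... + k*ak*x^(k-1) at x."""
    return sum(j * a * x ** (j - 1) for j, a in enumerate(coeffs) if j > 0)


def sharpest_change(xs, ys, degree=3):
    """Return (x, diff): where the gradient changes most between neighbors."""
    coeffs = polyfit(xs, ys, degree)
    grads = [gradient(coeffs, x) for x in xs]
    diffs = [abs(grads[i + 1] - grads[i]) for i in range(len(grads) - 1)]
    best = max(range(len(diffs)), key=diffs.__getitem__)
    return xs[best], diffs[best]
```

Here `xs` would be the line (or block) numbers and `ys` the corresponding text lengths.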
Another object of the invention is to provide a web page largest-block extraction system.
The invention exploits the fact that the useful information displayed on a web page is mostly concentrated in continuous blocks. The largest-block method can locate the body content of a page quickly and effectively and extract it. Existing rule-based methods for extracting web page body content involve a heavy workload. If a single website needs several rules, the workload for thousands of websites is very large, and maintaining that many rules also requires substantial investment.
For example, extracting website content with existing rule-based methods requires about 10 people to write the corresponding rules for 1000 websites, which takes roughly a week, and another 2 people to maintain the content crawling. The invention does not require any rules to be written and needs only 1 person for crawler maintenance, which saves the workload of 11 people.
In addition, there is the risk that the page structure of a website changes: once it does, all the rules previously designed for that website fail. The invention can greatly reduce the labor and maintenance costs of web page content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.
Brief description of the drawings
Fig. 1 is a flowchart of the web page largest-block extraction method provided by an embodiment of the invention.
Fig. 2 is a schematic diagram of the distribution obtained for a page through largest-block analysis, provided by an embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Current techniques extract content with rules specific to particular web pages, so extracting content from a large number of pages involves writing, optimizing and maintaining a large number of extraction rules. This consumes substantial human resources, and the existing rules fail as soon as the page format changes. The invention is proposed to prevent extraction programs from failing when pages change, to reduce development and maintenance costs, and to improve the intelligence and adaptability of web page content extraction.
The principle of the invention is described in detail below with reference to the accompanying drawings.
The web page largest-block extraction method provided by an embodiment of the invention includes:
First, the web page source code is obtained, and regular expressions are used to replace everything that is not displayed on the page, such as tags, with blank lines. The number of characters on each line is then counted, which gives a distribution function similar to Fig. 2; the plot of this distribution forms a curve. Lines are then combined into blocks with a step of 3 (the length of each block is the total length of its 3 lines), which gives a new and more concentrated text distribution function. Finally, the points of sharp rise and sharp fall are found from this denser function and used to extract the web page body text, which is generated and saved.
As shown in Fig. 1, the web page largest-block extraction method provided by an embodiment of the invention specifically includes:
S101: Obtain the HTML code of the web page and preprocess it (decode the content, remove scripts, remove special characters).
S102: Remove the formatting tags to obtain the text of the whole page.
S103: Calculate the distribution function of the line-block character counts.
S104: Find the points of sharp rise and sharp fall from the changes of the distribution function, and obtain the valuable text.
S105: Process the text: remove the blank lines between text lines to obtain the effective body content, and save it.
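Step S105 can be illustrated with the following Python sketch, which joins the lines of the detected body range and removes the blank lines left behind by stripped markup. The function name `assemble_body` is hypothetical, and the 0-based, inclusive `start` and `end` indices are an assumption about how the line numbers from S104 are represented. The pattern also treats whitespace-only lines as blank, which slightly extends the '^\n' pattern quoted in the description.

```python
import re


def assemble_body(lines, start, end):
    """Join lines[start..end] (0-based, inclusive) and drop blank lines."""
    body = "\n".join(lines[start:end + 1]) + "\n"
    # With MULTILINE, "^" matches at every line start, so each empty or
    # whitespace-only line is removed together with its newline.
    return re.sub(r"^[ \t]*\n", "", body, flags=re.MULTILINE)
```

The result is the body text with one content line per output line, ready to be saved.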
The invention is further described below with reference to a specific embodiment.
The web page largest-block extraction method provided by an embodiment of the invention includes:
1) Obtaining the HTML source: the target site is crawled with a general-purpose crawler, and the returned content is converted into string data.
2) After the HTML source is obtained, the content is decoded according to the encoding declared by the website; if the page has no meta tag that declares the content encoding, the content is decoded as UTF-8 by default. After decoding, regular expressions are used to remove the scripts enclosed in <script></script> and the special comment text <!--...-->. A regular expression is then applied to the text to remove all tags in the HTML. This yields the rough text distribution, in which the tags, scripts and other content processed above have all become blank lines.
3) A line-number-based text-length distribution function is established over the rough text. The function is plotted on coordinate axes, where the x-axis represents the text line number, LN, and the y-axis represents the text length of that line, LL. Then, to make the jumps in the text distribution stand out, the line text lengths are merged with a step of 3 (for example, if line 1 has length 1, line 2 has length 4 and line 3 has length 5, the first block has length 10 after merging; lines 4, 5 and 6 form the next block, and so on). Every 3 lines form one block, so there are LN/3 blocks. This concentrates the text lengths, so that the sharp-rise and sharp-fall points can be distinguished better.
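Through largest-block analysis, a page yields a distribution like the one shown schematically in Fig. 2.
The points where the text rises and falls sharply are found from the distribution function above. The principle of the curve is as follows.
Let the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$. The sum of the squared distances from each point to this curve, i.e. the sum of squared deviations, is:

$$R^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right]^2$$

To find the values of $a$ that fit the data, the partial derivative of the formula above is taken with respect to each $a_j$ ($j = 0, \dots, k$) and set to zero, which gives:

$$-2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right] x_i^j = 0$$

After simplification:

$$a_0 \sum_{i=1}^{n} x_i^{j} + a_1 \sum_{i=1}^{n} x_i^{j+1} + \dots + a_k \sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i$$

These equations are then written in matrix form, which simplifies to:

$$\begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

That is, $XA = Y$, so $A = (X^{T}X)^{-1}X^{T}Y$. This gives the coefficient matrix $A$, i.e. the parameters of the fitted curve, and hence the fitted curve itself. Differentiating the fitted curve gives its gradient function:

$$y' = a_1 + 2 a_2 x + \dots + k a_k x^{k-1}$$

Substituting all line numbers of the page gives the gradient of the curve at each point. The difference between each pair of neighboring gradients is then calculated and its absolute value taken, and the line number with the largest difference is found.
This line number is the point where the text rises or falls sharply, such as lines 150 to 225 in Fig. 2. The page body is thus obtained, and extracting the text completes the process.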
The invention exploits the fact that the useful information displayed on a web page is mostly concentrated in continuous blocks. The largest-block method can locate the body content of a page quickly and effectively and extract it. Existing rule-based methods for extracting web page body content involve a heavy workload. If a single website needs several rules, the workload for thousands of websites is very large, and maintaining that many rules also requires substantial investment. In addition, there is the risk that the page structure of a website changes: once it does, all the rules previously designed for that website fail. The invention can greatly reduce the labor and maintenance costs of web page content extraction and adapts well: the extraction method remains effective when the content structure of a website changes.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (5)
- 1. A method for extracting the largest block of a web page, characterized in that the method includes: first obtaining the web page source code, and using regular expressions to replace everything in the source code that is not displayed on the page, such as tags, with blank lines; then calculating the function formed by the number of characters on each line, the distribution of which forms a curve; then combining lines into blocks with a step of 3, the length of each block being the total length of its 3 lines, to form a new and more concentrated text distribution function; then finding the points of sharp rise and sharp fall from the denser function, extracting the web page body text accordingly, generating the text and saving it.
- 2. The method for extracting the largest block of a web page as claimed in claim 1, characterized in that the method specifically includes: obtaining the HTML code of the web page and preprocessing it by decoding the content, removing scripts and removing special characters; after the HTML source is obtained, decoding the content according to the website encoding; after decoding, using the regular expression '<script[^>]*>.*</script>' to remove the scripts enclosed in <script></script> and the special comment text <!--...-->, and converting the escaped special characters &nbsp;, &lt;, &gt;, &amp;, &quot;, &apos; into the corresponding space, <, >, &, ", '; removing the formatting tags to obtain the text of the whole page, the regular expression "<[^>]*>" being applied to the text to remove all tags in the HTML; using the rough text distribution, in which the processed tags and scripts have become blank lines; calculating the distribution function of the line-block character counts; finding the points of sharp rise and sharp fall from the changes of the distribution function, and obtaining the valuable text; processing the text, in which, according to the line numbers obtained, the content of each line is joined in line-number order to form the body content; where the body content obtained still contains empty lines, using the regular expression '^\n' to replace them with the empty string and remove them; and finally obtaining the article body.
- 3. The method for extracting the largest block of a web page as claimed in claim 2, characterized in that calculating the distribution function of the line-block character counts includes: establishing a line-number-based text-length distribution function over the rough text, the function being plotted on coordinate axes where the x-axis represents the text line number, LN, and the y-axis represents the text length of that line, LL; then merging the line text lengths with a step of 3, every 3 consecutive lines forming one block, so that there are LN/3 blocks, which concentrates the text lengths and distinguishes the sharp-rise and sharp-fall points.
- 4. The method for extracting the largest block of a web page as claimed in claim 2, characterized in that finding the points of sharp rise and sharp fall from the changes of the distribution function and obtaining the valuable text includes: letting the fitted curve be $y = a_0 + a_1 x + \dots + a_k x^k$, the sum of the squared distances from each point to this curve, i.e. the sum of squared deviations, being:

$$R^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right]^2;$$

to find the values of $a$ that fit the data, taking the partial derivative of the formula above with respect to each $a_j$ ($j = 0, \dots, k$) and setting it to zero, which gives:

$$-2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \dots + a_k x_i^k \right) \right] x_i^j = 0;$$

the result after simplification being:

$$a_0 \sum_{i=1}^{n} x_i^{j} + a_1 \sum_{i=1}^{n} x_i^{j+1} + \dots + a_k \sum_{i=1}^{n} x_i^{j+k} = \sum_{i=1}^{n} x_i^{j} y_i;$$

then writing the equations in matrix form, where $XA = Y$, so that $A = (X^{T}X)^{-1}X^{T}Y$, which gives the coefficient matrix $A$, i.e. the parameters of the fitted curve, and hence the fitted curve; differentiating the fitted curve to obtain its gradient function:

$$y' = a_1 + 2 a_2 x + \dots + k a_k x^{k-1};$$

substituting all line numbers of the page to obtain the gradient of the curve at each point, then calculating the difference between each pair of neighboring gradients, taking its absolute value, and finding the line number with the largest difference; this line number being the point where the text rises or falls sharply, whereby the page body is obtained and the text is finally extracted.
- 5. A web page largest-block extraction system using the method for extracting the largest block of a web page as claimed in claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710694534.2A CN107463696A (en) | 2017-08-15 | 2017-08-15 | A kind of method of Webpage largest block extraction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710694534.2A CN107463696A (en) | 2017-08-15 | 2017-08-15 | A kind of method of Webpage largest block extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107463696A true CN107463696A (en) | 2017-12-12 |
Family
ID=60549634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710694534.2A Pending CN107463696A (en) | 2017-08-15 | 2017-08-15 | A kind of method of Webpage largest block extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107463696A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271145A (en) * | 2018-09-03 | 2019-01-25 | 科大国创软件股份有限公司 | Fast regular method for customizing based on pythonQT and intelligent algorithm |
CN110881056A (en) * | 2018-09-05 | 2020-03-13 | 百度在线网络技术(北京)有限公司 | Method and device for pushing information |
CN113537091A (en) * | 2021-07-20 | 2021-10-22 | 东莞市盟大塑化科技有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102456050A (en) * | 2010-10-27 | 2012-05-16 | 中国移动通信集团四川有限公司 | Method and device for extracting data from webpage |
CN102955818A (en) * | 2011-08-31 | 2013-03-06 | 镇江诺尼基智能技术有限公司 | Method for acquiring full names in Chinese from Web page |
US20140095463A1 (en) * | 2012-06-06 | 2014-04-03 | Derek Edwin Pappas | Product Search Engine |
-
2017
- 2017-08-15 CN CN201710694534.2A patent/CN107463696A/en active Pending
Non-Patent Citations (3)
Title |
---|
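JAIRUSCHAN: "Principle and implementation of least-squares polynomial curve fitting", 《HTTPS://BLOG.CSDN.NET/JAIRUSCHAN/ARTICLE/DETAILS/7517773》 * |
Tan Shoubiao et al.: "Research and implementation of a Web information extraction and knowledge representation system", 《Computer Systems & Applications》 * |
Chen Xin: "General web page body text extraction based on the line-block distribution function", 《Baidu Wenku》 * |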
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271145A (en) * | 2018-09-03 | 2019-01-25 | 科大国创软件股份有限公司 | Fast regular method for customizing based on pythonQT and intelligent algorithm |
CN109271145B (en) * | 2018-09-03 | 2021-12-14 | 科大国创软件股份有限公司 | Quick rule customizing method based on pythonQT and intelligent algorithm |
CN110881056A (en) * | 2018-09-05 | 2020-03-13 | 百度在线网络技术(北京)有限公司 | Method and device for pushing information |
CN113537091A (en) * | 2021-07-20 | 2021-10-22 | 东莞市盟大塑化科技有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
CN113537091B (en) * | 2021-07-20 | 2024-05-03 | 东莞盟大集团有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Tire defect detection using fully convolutional network | |
CN110968667B (en) | Periodical and literature table extraction method based on text state characteristics | |
CN105183801B (en) | web page text extracting method and device | |
CN106156239B (en) | Table extraction method and device | |
Sun et al. | Dom based content extraction via text density | |
CN102184189B (en) | Webpage core block determining method based on DOM (Document Object Model) node text density | |
CN107590219A (en) | Webpage personage subject correlation message extracting method | |
CN106709032A (en) | Method and device for extracting structured information from spreadsheet document | |
CN105653668A (en) | Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment | |
CN107463696A (en) | A kind of method of Webpage largest block extraction | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
CN105320734B (en) | A kind of web page core content extracting method | |
CN102915361B (en) | Webpage text extracting method based on character distribution characteristic | |
CN105389329A (en) | Open source software recommendation method based on group comments | |
CN103491116A (en) | Method and device for processing text-related structural data | |
CN108733405A (en) | The method and apparatus that training webpage distribution indicates model | |
CN105677638A (en) | Web information extraction method | |
CN106528068A (en) | Webpage content reconstruction method and system | |
CN114528811B (en) | Article content extraction method, device, equipment and storage medium | |
CN112559929B (en) | Method, electronic device and medium for extracting webpage target information | |
CN106528509A (en) | Webpage information extracting method and apparatus | |
CN104462061A (en) | Word extraction method and word extraction device | |
CN103942224A (en) | Method and device for acquiring annotation rule of webpage blocks | |
CN108694192B (en) | Webpage type judging method and device | |
CN103729354B (en) | web information processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100040 Shijingshan Road, Shijingshan District, Beijing, No. 20, 16 layer 1601; Applicant after: Chinese translation language through Polytron Technologies Inc; Address before: 100040 Shijingshan District railway building, Beijing, the 16 floor; Applicant before: Mandarin Technology (Beijing) Co., Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171212 |