CN109885754A - A kind of acquisition method of internet unstructured text data - Google Patents
A kind of acquisition method of internet unstructured text data
- Publication number
- CN109885754A (application number CN201910123191.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention discloses a method for collecting unstructured text data from the Internet and relates to the technical field of data services. A target web page is obtained using the Octopus collector, the fields to be collected are determined, pagination is obtained, and a loop is constructed to parse the link to each detail page and fetch the detail pages. Data are then extracted according to the text type of each detail page: for text-type detail pages, data are extracted with regular expressions and formatted; for table-type detail pages, data are extracted with XPath and formatted; for detail pages containing both text and tables, data are extracted with a combination of regular expressions and XPath and formatted. Formatted extracted data are thereby obtained.
Description
Technical field
The present invention discloses a method for collecting unstructured text data from the Internet and relates to the technical field of data services.
Background art
Unstructured data are data whose field lengths are variable and in which the record of each field may consist of repeatable or non-repeatable subfields. This form can handle structured data (such as numbers and symbols) and is even better suited to full text, images, sound, video, hypermedia and similar information.
The Internet now holds a vast amount of data, and the Octopus collector is commonly used to crawl it. The Octopus collector is a visual collection tool: the items to collect are chosen by clicking target positions on the page, so the required data can be collected quickly and easily from structured web pages. On most web pages, however, unstructured data predominate, especially unstructured text whose content and layout are both unstructured, so the target cannot be located accurately by a sequential (positional) XPath. The present invention provides a method for collecting unstructured text data from the Internet. With its collection strategy, a variety of complex unstructured Internet web data sources can be accessed, which makes system configuration more flexible and data capture more accurate, ensures that data are read validly and efficiently, and effectively addresses the processing of large-scale unstructured data sources.
Summary of the invention
To address the problems of the prior art, the present invention provides a method for collecting unstructured text data from the Internet that is versatile, easy to implement, and widely applicable.
The specific scheme proposed by the present invention is as follows:
A method for collecting unstructured text data from the Internet: obtain the target web page using the Octopus collector, determine the fields to be collected, obtain pagination, and construct a loop that parses the link to each detail page and fetches the detail pages; then extract data according to the text type of each detail page. For text-type detail pages, extract the data with regular expressions and format them; for table-type detail pages, extract the data with XPath and format them; for detail pages containing both text and tables, extract the data with a combination of regular expressions and XPath and format them. The result is formatted extracted data.
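The routing by detail page type described here can be sketched as follows. This is only an illustrative sketch using the Python standard library: the source does not say how a page's text type is decided, so the heuristic (presence of table cells, plus a minimum amount of text outside any table), the function name classify_detail_page and the threshold min_text_chars are all assumptions.

```python
from html.parser import HTMLParser


class _PageProfile(HTMLParser):
    """Counts table cells and collects text that sits outside any table."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.cell_count = 0
        self.free_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag in ("td", "th"):
            self.cell_count += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        # Only text outside tables counts as free-running paragraph text.
        if self.table_depth == 0 and data.strip():
            self.free_text.append(data.strip())


def classify_detail_page(html, min_text_chars=40):
    """Return 'text', 'table' or 'mixed' (hypothetical heuristic)."""
    profile = _PageProfile()
    profile.feed(html)
    has_table = profile.cell_count > 0
    has_text = len(" ".join(profile.free_text)) >= min_text_chars
    if has_table and has_text:
        return "mixed"
    return "table" if has_table else "text"


# Which extraction technique the method applies to each page type.
EXTRACTOR_FOR_TYPE = {
    "text": "regular expressions",
    "table": "XPath",
    "mixed": "regular expressions + XPath",
}
```

A caller would look up EXTRACTOR_FOR_TYPE[classify_detail_page(html)] to choose the extraction rule set for each fetched detail page.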
In the method, pagination is obtained and a loop is constructed to traverse the list pages of every page; each detail page is parsed as required, and unneeded links are excluded from collection.
In the method, to obtain pagination and construct the loop, the link and parameter attributes of the request sent by the web page are read from the Network panel of the web page debugging interface, the loop links are constructed by cyclically changing the parameter value, and the element //a[text()='next page'] (link text shown in translation) is used to obtain the detail page links of the next page.
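Constructing loop links by cyclically changing one request parameter can be sketched as below. This is a minimal sketch, not the collector's own mechanism: the helper name paged_urls and the example.com host are hypothetical, and the parameter name nowPage is borrowed from the URL given in Embodiment 2.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_urls(first_url, page_param, first_page, last_page):
    """Yield one list-page URL per page by rewriting a single query parameter."""
    parts = urlsplit(first_url)
    # keep_blank_values preserves empty parameters such as "qymc=".
    query = parse_qsl(parts.query, keep_blank_values=True)
    for page in range(first_page, last_page + 1):
        new_query = [(k, str(page) if k == page_param else v) for k, v in query]
        yield urlunsplit(parts._replace(query=urlencode(new_query)))


# Hypothetical list URL; only the nowPage value changes between iterations.
urls = list(paged_urls("http://example.com/list.htm?nowPage=1&orgId=0&qymc=", "nowPage", 1, 3))
```

Each generated URL would then be fetched in turn and its detail page links parsed.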
In the method, after data have been extracted with a regular expression and formatted, regular expressions are applied again to extract further data.
In the method, when data are extracted with XPath, if the table element tags of a table-type detail page have no explicit id or class name, the position is located using the text() node test and the contains function, and the text of the next sibling element is then captured using the following or following-sibling axis.
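Locating a value by its label text rather than by id or class can be illustrated as follows. The standard library's ElementTree supports neither contains() nor the following-sibling axis, so this sketch emulates in plain Python roughly what an expression such as //td[contains(text(), 'label')]/following-sibling::td[1] selects. The helper name value_after_label and the sample table are hypothetical, and the sample is assumed to be well-formed markup; real pages would normally need an HTML-aware XPath engine.

```python
import xml.etree.ElementTree as ET


def value_after_label(root, label):
    """Return the text of the first <td> sibling after a <td> whose text contains label."""
    for row in root.iter("tr"):
        cells = list(row)
        for i, cell in enumerate(cells):
            if cell.tag == "td" and label in (cell.text or ""):
                # Emulates following-sibling::td[1]: the next <td> in the same row.
                for sibling in cells[i + 1:]:
                    if sibling.tag == "td":
                        return (sibling.text or "").strip()
    return None


# Hypothetical table with no id or class on any element.
SAMPLE = """
<table>
  <tr><td>Enterprise name</td><td> Example Co., Ltd </td></tr>
  <tr><td>Penalty decision date:</td><td>2018-07-11</td></tr>
</table>
"""
root = ET.fromstring(SAMPLE.strip())
decision_date = value_after_label(root, "decision date")
```

Because the lookup keys on the label text, it keeps working when the row order or the number of rows differs between pages.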
A tool for collecting unstructured text data from the Internet comprises an Octopus acquisition unit and an analysis unit.
The Octopus acquisition unit obtains the target web page, determines the fields to be collected, obtains pagination, and constructs a loop that parses the link to each detail page and fetches the detail pages. The analysis unit classifies the detail pages by text type. The Octopus acquisition unit then extracts data from text-type detail pages with regular expressions and formats them, extracts data from table-type detail pages with XPath and formats them, and extracts data from detail pages containing both text and tables with a combination of regular expressions and XPath and formats them, yielding formatted extracted data.
In the tool, the Octopus acquisition unit obtains pagination and constructs a loop to traverse the list pages of every page, parses each detail page as required, and excludes unneeded links from collection.
In the tool, the Octopus acquisition unit obtains pagination and constructs the loop: it reads the link and parameter attributes of the request sent by the web page from the Network panel of the web page debugging interface, constructs the loop links by cyclically changing the parameter value, and uses the element //a[text()='next page'] to obtain the detail page links of the next page.
The advantages of the present invention are as follows:
The present invention provides a method for collecting unstructured text data from the Internet that offers a good solution for collecting unstructured Internet web pages, particularly pages with long text paragraphs and pages with grid-like tables. Regular expressions locate the data within long text paragraphs while XPath locates table elements; this greatly improves how well the collection rules match the pages and ensures the accuracy, consistency and completeness of the collected results.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the present invention;
Fig. 2 is a schematic diagram of pagination in Embodiments 1 and 2;
Fig. 3 is a schematic diagram of the detail pages in Embodiment 1;
Fig. 4 is a schematic diagram of the page details in Embodiment 1.
Detailed description of the embodiments
The present invention provides a method for collecting unstructured text data from the Internet: obtain the target web page using the Octopus collector, determine the fields to be collected, obtain pagination, and construct a loop that parses the link to each detail page and fetches the detail pages; then extract data according to the text type of each detail page. For text-type detail pages, extract the data with regular expressions and format them; for table-type detail pages, extract the data with XPath and format them; for detail pages containing both text and tables, extract the data with a combination of regular expressions and XPath and format them. The result is formatted extracted data.
A tool for collecting unstructured text data from the Internet that corresponds to the above method is also provided. It comprises an Octopus acquisition unit and an analysis unit.
The Octopus acquisition unit obtains the target web page, determines the fields to be collected, obtains pagination, and constructs a loop that parses the link to each detail page and fetches the detail pages. The analysis unit classifies the detail pages by text type. The Octopus acquisition unit then extracts data from text-type detail pages with regular expressions and formats them, extracts data from table-type detail pages with XPath and formats them, and extracts data from detail pages containing both text and tables with a combination of regular expressions and XPath and formats them, yielding formatted extracted data.
The present invention is further described below with reference to the drawings and specific embodiments so that those skilled in the art can better understand and practice it; the embodiments given do not limit the invention.
Embodiment 1: collecting Zaozhuang real estate pre-sale information using the method or tool of the present invention.
Step 1: Obtain the URL of the Zaozhuang housing and real estate information site and determine the fields to be collected.
Website URL: http://www.zzzzfdc.com.cn/site/news/11/news_11_1_0.html
Fields to collect: announcement number, announcement date, developer, commercial housing pre-sale permit number, project name, project location, pre-sale area, collection URL, collection time.
Step 2: Obtain pagination, as shown in Fig. 2.
Construct a loop that repeatedly clicks "next page". Element position:
//a[text()='next page']
Step 3: Collect the detail pages, as shown in Fig. 3.
Only commercial housing pre-sale license bulletins are to be collected, so the real estate development and operation license bulletins must be excluded. An advanced XPath is therefore used to locate the target detail links and construct the loop: the XPath contains function matches a substring within an attribute value. Element position (literal text shown in translation):
//a[contains(@title, 'presell license bulletin')]
Open the links in the loop one by one.
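The effect of filtering links on a substring of the title attribute can be sketched with the Python standard library. This is an illustration only: it reproduces the selection made by contains(@title, ...) in plain Python rather than through an XPath engine, and the class name TitleFilteredLinks, the sample list page and its href values are hypothetical.

```python
from html.parser import HTMLParser


class TitleFilteredLinks(HTMLParser):
    """Collects href values of <a> tags whose title contains a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Attribute values may be None when an attribute has no value.
        if self.keyword in (attr.get("title") or "") and attr.get("href"):
            self.links.append(attr["href"])


# Hypothetical list page mixing two kinds of bulletin.
LIST_PAGE = """
<ul>
  <li><a href="/news/1.html" title="Commercial house presell license bulletin (No. 1)">...</a></li>
  <li><a href="/news/2.html" title="Real estate development license bulletin">...</a></li>
  <li><a href="/news/3.html" title="Commercial house presell license bulletin (No. 2)">...</a></li>
</ul>
"""
parser = TitleFilteredLinks("presell license bulletin")
parser.feed(LIST_PAGE)
detail_links = parser.links
```

Only the two pre-sale bulletin links are kept, so the loop never opens the unwanted bulletin type.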
Step 4: Extract the data, as shown in Fig. 4. Because the web page is a single block of text, regular expressions are used to extract and format the data.
The regular expression used for each field is listed in Table 1 (literal text in the patterns is shown as rendered in translation).
Table 1
Field | Regular expression |
---|---|
Announcement number | The commercial house presell license bulletin .* |
Announcement date | \d+ year \d+ month \d+ day |
Developer | To (.*) development & construction |
Commercial housing pre-sale permit number | "the commercial house permit for presale" number are as follows: (.*).The project |
Project name | (.*) the project .* of development & construction makes |
Project location | The project is located at (.*), approval |
Pre-sale area | Presell area (.*), the commercial house |
The formatted result is shown in Table 2.
Table 2
Embodiment 2: collecting Credit Liaoning administrative penalty data using the method or tool of the present invention.
Step 1: Obtain the URL of the Credit Liaoning administrative penalty site and determine the fields to be collected.
Website URL:
http://218.60.149.124:8088/sgs/xyln/three.htm?nowPage=1&orgId=0&parentId=0&qymc=&xk=0&cf=16&orgName=&gj=1
Fields to collect: administrative penalty decision number, enterprise name, cause of penalty, basis of penalty, penalty category 1, penalty result, penalty decision date, penalty authority.
Step 2: Obtain pagination, as shown in Fig. 2.
Construct a loop that repeatedly clicks "next page". Element position:
//a[text()='next page']
Step 3: Obtain the detail pages.
Construct a loop and select a range of rows using the position() function. Element position:
//table[@class='list_list1 f12 m5']/tbody/tr[position()>1]
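The predicate position()>1 selects every table row except the first, which here skips the header row. The sketch below shows the same selection with the Python standard library; ElementTree does not support comparison predicates on position(), so list slicing stands in for it. The sample table, its cell contents and its href values are hypothetical, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical list table; the first row is the header.
LIST_TABLE = """
<table class="list_list1 f12 m5">
  <tbody>
    <tr><th>Enterprise name</th><th>Decision date</th></tr>
    <tr><td><a href="detail?id=1">Company A</a></td><td>2018-07-11</td></tr>
    <tr><td><a href="detail?id=2">Company B</a></td><td>2018-07-12</td></tr>
  </tbody>
</table>
"""
root = ET.fromstring(LIST_TABLE.strip())

rows = root.findall("./tbody/tr")
data_rows = rows[1:]  # same rows as tr[position()>1]: everything after the header

detail_links = [
    row.find(".//a").get("href")
    for row in data_rows
    if row.find(".//a") is not None
]
```

Each entry in detail_links corresponds to one detail page to open in the loop.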
Step 4: Extract the data. The detail pages of this site differ in form and their table field titles are inconsistent, so a plain sequential XPath such as //BODY[@class='tc']/DIV[2]/DIV[1]/DIV[2]/CENTER[1]/TABLE[1]/TBODY[1]/TR[7]/TD[2] cannot accurately locate the target field. The elements also have no distinctive class or id name, so they can only be identified and located by their text; the following-sibling axis is then used to select the sibling nodes after the current node. The advanced XPath for each field is given in Table 3.
Table 3
The formatted result is shown in Table 4.
Table 4
Field name | Extracted data |
---|---|
Administrative penalty decision number | Certainly word [2018] 2-19-002 are penalized in sweet law enforcement |
Enterprise name | Rise Home Co., Ltd in Dalian ten thousand |
Cause of penalty | On May 23rd, 2018, Home Co., Ltd's quilt ... was risen in Dalian ten thousand |
Basis of penalty | "Dalian City's urban road bridges facilities management method" ... |
Penalty category 1 | Warning, fine |
Penalty result | Illegal activities and RMB ... of imposing a fine are corrected in time limit 3 days |
Penalty decision date | 2018-07-11 |
Penalty authority | Dalian City Ganjingzi District Bureau of City Administration |
Collection URL | http://218.60.149.124:8088/sgs//xyln/f... |
Collection time | 2018-08-07 |
In the above embodiments, the present invention uses regular expressions and an advanced XPath configuration strategy to locate the target data precisely, which improves data quality and ensures the accuracy, consistency and completeness of the data.
The embodiments described above are merely preferred embodiments given to fully illustrate the present invention, and the scope of protection of the present invention is not limited to them. Equivalent substitutions or modifications made by those skilled in the art on the basis of the present invention fall within its scope of protection. The scope of protection of the present invention is defined by the claims.
Claims (8)
1. A method for collecting unstructured text data from the Internet, characterized in that:
a target web page is obtained using the Octopus collector, the fields to be collected are determined, pagination is obtained, a loop is constructed to parse the link to each detail page, the detail pages are fetched, and data are extracted according to the text type of each detail page, wherein data are extracted from text-type detail pages with regular expressions and formatted, data are extracted from table-type detail pages with XPath and formatted, and data are extracted from detail pages containing both text and tables with a combination of regular expressions and XPath and formatted, so that formatted extracted data are obtained.
2. The method according to claim 1, characterized in that pagination is obtained and a loop is constructed to traverse the list pages of every page, each detail page is parsed as required, and unneeded links are excluded from collection.
3. The method according to claim 1 or 2, characterized in that, to obtain pagination and construct the loop, the link and parameter attributes of the request sent by the web page are read from the Network panel of the web page debugging interface, the loop links are constructed by cyclically changing the parameter value, and the element //a[text()='next page'] is used to obtain the detail page links of the next page.
4. The method according to claim 3, characterized in that, after data have been extracted with a regular expression and formatted, regular expressions are applied again to extract further data.
5. The method according to claim 1 or 4, characterized in that, when data are extracted with XPath, if the table element tags of a table-type detail page have no explicit id or class name, the position is located using the text() node test and the contains function, and the text of the next sibling element is then captured using the following or following-sibling axis.
6. A tool for collecting unstructured text data from the Internet, characterized in that it comprises an Octopus acquisition unit and an analysis unit, wherein:
the Octopus acquisition unit obtains the target web page, determines the fields to be collected, obtains pagination, and constructs a loop that parses the link to each detail page and fetches the detail pages; the analysis unit classifies the detail pages by text type; and the Octopus acquisition unit extracts data from text-type detail pages with regular expressions and formats them, extracts data from table-type detail pages with XPath and formats them, and extracts data from detail pages containing both text and tables with a combination of regular expressions and XPath and formats them, so that formatted extracted data are obtained.
7. The tool according to claim 6, characterized in that the Octopus acquisition unit obtains pagination and constructs a loop to traverse the list pages of every page, parses each detail page as required, and excludes unneeded links from collection.
8. The tool according to claim 6 or 7, characterized in that the Octopus acquisition unit obtains pagination and constructs the loop by reading the link and parameter attributes of the request sent by the web page from the Network panel of the web page debugging interface, constructing the loop links by cyclically changing the parameter value, and using the element //a[text()='next page'] to obtain the detail page links of the next page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910123191.3A CN109885754A (en) | 2019-02-18 | 2019-02-18 | A kind of acquisition method of internet unstructured text data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109885754A true CN109885754A (en) | 2019-06-14 |
Family
ID=66928589
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910123191.3A Pending CN109885754A (en) | 2019-02-18 | 2019-02-18 | A kind of acquisition method of internet unstructured text data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109885754A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||