CN109885754A - A kind of acquisition method of internet unstructured text data - Google Patents

A kind of acquisition method of internet unstructured text data

Info

Publication number
CN109885754A
Authority
CN
China
Prior art keywords
page
data
details page
text
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910123191.3A
Other languages
Chinese (zh)
Inventor
张磊
单震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Group Co Ltd
Original Assignee
Inspur Software Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Group Co Ltd
Priority to CN201910123191.3A
Publication of CN109885754A

Abstract

The present invention discloses a method for collecting unstructured text data from the Internet and relates to the technical field of data services. The method uses the Octopus collector to obtain the target web page, determines the fields to be collected, obtains the pagination, constructs a loop, parses the link of each detail page, and obtains the detail pages. Data is then extracted according to the text type of each detail page: for text-type detail pages, data is extracted and formatted with regular expressions; for table-type detail pages, data is extracted and formatted with XPath; and for detail pages containing both text and tables, data is extracted and formatted with a combination of regular expressions and XPath. The result is formatted extracted data.

Description

A kind of acquisition method of internet unstructured text data
Technical field
The present invention discloses a method for collecting unstructured text data from the Internet and relates to the technical field of data services.
Background
Unstructured data refers to data whose field lengths are variable and in which the record of each field may itself be composed of repeatable or non-repeatable subfields. Such a representation can handle structured data (for example numbers and symbols) and is better suited to handling full text, images, sound, film and video, hypermedia and similar information.
The Internet now holds a vast amount of data, and the Octopus collector is commonly used to capture it. The Octopus collector is a visual collection tool: the user defines the collection items by clicking the target positions on a page, so the required data can be collected quickly and easily from structured web pages. On most web pages, however, unstructured data predominates, especially unstructured text data whose content and layout are both unstructured, so the target cannot be located accurately by a positional (sequential) XPath alone. The present invention provides a method for collecting unstructured text data from the Internet. With the collection strategy of the invention, a variety of complex unstructured Internet web data sources can be accessed, which makes system configuration more flexible and data capture more accurate, ensures that data is read validly and efficiently, and effectively addresses the processing of large-scale unstructured data sources.
Summary of the invention
In view of the problems of the prior art, the present invention provides a method for collecting unstructured text data from the Internet. The method is versatile and easy to implement and has broad application prospects.
The specific scheme proposed by the present invention is as follows:
A method for collecting unstructured text data from the Internet: the Octopus collector is used to obtain the target web page, the fields to be collected are determined, the pagination is obtained, a loop is constructed, the link of each detail page is parsed, and the detail pages are obtained. Data is extracted according to the text type of each detail page: for text-type detail pages, data is extracted and formatted with regular expressions; for table-type detail pages, data is extracted and formatted with XPath; and for detail pages containing both text and tables, data is extracted and formatted with a combination of regular expressions and XPath, yielding formatted extracted data.
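The dispatch on detail-page type described above can be sketched as follows. This is only an illustration: the source does not say how the page type is determined, so the heuristic (presence of a table element plus a minimum amount of remaining text), the 20-character threshold, and the names PageType, classify_detail_page and extraction_strategy are all assumptions.

```python
import re
from enum import Enum


class PageType(Enum):
    TEXT = "text"
    TABLE = "table"
    MIXED = "mixed"


def classify_detail_page(html: str) -> PageType:
    """Guess the detail-page type from its markup (hypothetical heuristic)."""
    has_table = re.search(r"<table\b", html, flags=re.IGNORECASE) is not None
    # Drop table blocks and remaining tags, then measure the leftover prose.
    without_tables = re.sub(
        r"<table\b.*?</table>", " ", html, flags=re.IGNORECASE | re.DOTALL
    )
    prose = re.sub(r"<[^>]+>", " ", without_tables)
    # Count non-whitespace characters so the check also works for Chinese text.
    has_prose = len("".join(prose.split())) >= 20  # assumed threshold
    if has_table and has_prose:
        return PageType.MIXED
    if has_table:
        return PageType.TABLE
    return PageType.TEXT


def extraction_strategy(page_type: PageType) -> str:
    """Map each page type to the extraction technique named in the method."""
    return {
        PageType.TEXT: "regex",
        PageType.TABLE: "xpath",
        PageType.MIXED: "regex+xpath",
    }[page_type]
```

A caller would classify each downloaded detail page once and then route it to the matching extractor.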
In the method, the pagination is obtained and a loop is constructed to traverse the list pages of all pages; each detail page is parsed as required, and unnecessary collection links are rejected.
In the method, to obtain the pagination and construct the loop, the link passed by the web page and its parameter attributes are obtained in the Network panel of the web page debugging interface, and the looped links are constructed by cyclically varying the parameter value; the //a[text()='下一页'] tag (下一页 means "next page") is used to obtain the detail page links of the next page.
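Constructing the looped links by varying one parameter value could look like the sketch below. It assumes the page number is carried in a single query parameter; the host example.com and the function name page_urls are hypothetical, and the parameter name nowPage is borrowed from the URL given in embodiment 2.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_urls(template_url: str, page_param: str, first: int, last: int):
    """Yield one URL per list page by varying a single query parameter."""
    parts = urlsplit(template_url)
    # keep_blank_values preserves empty parameters such as "qymc=".
    query = parse_qsl(parts.query, keep_blank_values=True)
    for page in range(first, last + 1):
        updated = [(k, str(page) if k == page_param else v) for k, v in query]
        yield urlunsplit(parts._replace(query=urlencode(updated)))


# Hypothetical list-page URL; only the nowPage value changes between pages.
urls = list(
    page_urls("http://example.com/list.htm?nowPage=1&orgId=0&qymc=", "nowPage", 1, 3)
)
```

Generating the URLs up front avoids depending on a clickable next-page link, which is the alternative route the paragraph mentions.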
In the method, after data extracted with a regular expression has been formatted, regular expressions are applied again to extract data from the formatted result.
In the method, when data is extracted with XPath and the table elements of a table-type detail page have no explicit id or class name, the position is located with the text() node test and the contains() function, and the following or following-sibling axis is then used to capture the text of the next element at the same level.
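The label-based lookup can be illustrated with the Python standard library. Note the swap: xml.etree.ElementTree does not implement contains() or the following-sibling axis, so the sketch emulates an expression of the form //td[contains(text(), label)]/following-sibling::td[1] in plain Python rather than evaluating it. The sample table, its field labels and the function name value_after_label are hypothetical.

```python
import xml.etree.ElementTree as ET


def value_after_label(table: ET.Element, label: str):
    """Return the text of the cell that follows the cell containing `label`.

    Emulates //td[contains(text(), label)]/following-sibling::td[1]/text().
    """
    for row in table.iter("tr"):
        cells = row.findall("td")
        # The last cell has no following sibling, so it is never a label.
        for index, cell in enumerate(cells[:-1]):
            if label in (cell.text or ""):
                return (cells[index + 1].text or "").strip()
    return None


# Hypothetical, well-formed table without id or class attributes.
SAMPLE = """
<table>
  <tr><td>Enterprise name</td><td>Example Co., Ltd.</td></tr>
  <tr><td>Decision date</td><td> 2018-07-11 </td></tr>
</table>
"""
table = ET.fromstring(SAMPLE.strip())
```

Because the lookup keys on the label text instead of on row and column indices, it keeps working when rows are reordered or added.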
A tool for collecting unstructured text data from the Internet comprises an Octopus collection unit and an analysis unit.
The Octopus collection unit obtains the target web page, determines the fields to be collected, obtains the pagination, constructs a loop, parses the link of each detail page, and obtains the detail pages. The analysis unit classifies the detail pages according to their text type. The Octopus collection unit then extracts and formats data with regular expressions for text-type detail pages, with XPath for table-type detail pages, and with a combination of regular expressions and XPath for detail pages containing both text and tables, obtaining formatted extracted data.
In the tool, the Octopus collection unit obtains the pagination and constructs a loop to traverse the list pages of all pages; each detail page is parsed as required, and unnecessary collection links are rejected.
In the tool, to obtain the pagination and construct the loop, the Octopus collection unit obtains the link passed by the web page and its parameter attributes in the Network panel of the web page debugging interface and constructs the looped links by cyclically varying the parameter value; the //a[text()='下一页'] tag (下一页 means "next page") is used to obtain the detail page links of the next page.
The beneficial effects of the present invention are:
The present invention provides a method for collecting unstructured text data from the Internet that offers a good solution for collecting unstructured Internet web pages, especially pages with long text paragraphs and grid-like tables. Regular expressions locate the data within long text paragraphs, and XPath locates table elements; combining the two greatly improves how well the collection rules match the pages and ensures the accuracy, consistency and completeness of the collected results.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the present invention;
Fig. 2 is a schematic diagram of the pagination in embodiment 1 and embodiment 2;
Fig. 3 is a schematic diagram of the detail page in embodiment 1;
Fig. 4 is a detailed schematic diagram of the page in embodiment 1.
Detailed description of the embodiments
The present invention provides a method for collecting unstructured text data from the Internet: the Octopus collector is used to obtain the target web page, the fields to be collected are determined, the pagination is obtained, a loop is constructed, the link of each detail page is parsed, and the detail pages are obtained. Data is extracted according to the text type of each detail page: for text-type detail pages, data is extracted and formatted with regular expressions; for table-type detail pages, data is extracted and formatted with XPath; and for detail pages containing both text and tables, data is extracted and formatted with a combination of regular expressions and XPath, yielding formatted extracted data.
A tool for collecting unstructured text data from the Internet, corresponding to the above method, is also provided; it comprises an Octopus collection unit and an analysis unit.
The Octopus collection unit obtains the target web page, determines the fields to be collected, obtains the pagination, constructs a loop, parses the link of each detail page, and obtains the detail pages. The analysis unit classifies the detail pages according to their text type. The Octopus collection unit then extracts and formats data with regular expressions for text-type detail pages, with XPath for table-type detail pages, and with a combination of regular expressions and XPath for detail pages containing both text and tables, obtaining formatted extracted data.
The present invention is further described below with reference to the drawings and specific embodiments so that those skilled in the art can better understand and practice it; the embodiments given do not limit the invention.
Embodiment 1: collecting Zaozhuang real estate presale information with the method or tool of the present invention.
Step 1: obtain the URL of the Zaozhuang housing and real estate information website and determine the fields to be collected.
Website URL: http://www.zzzzfdc.com.cn/site/news/11/news_11_1_0.html
Fields to collect: announcement number, announcement date, developer, commercial housing presale permit number, project name, project location, presale area, collection URL, collection time.
Step 2: obtain the pagination, as shown in Fig. 2.
Construct a loop that repeatedly clicks the "next page" (下一页) link. Element position:
//a[text()='下一页']
Step 3: collect the detail pages, as shown in Fig. 3.
Only commercial housing presale license bulletins are to be collected, so real estate development and operation license bulletins must be rejected. An advanced XPath expression is therefore used to locate the target detail links and construct the loop: the XPath contains() function matches a substring within an attribute value. Element position (the string literal is shown in English translation):
//a[contains(@title, 'presell license bulletin')]
The links in the loop are then opened one by one.
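The effect of the contains(@title, ...) filter in step 3 can be reproduced with the standard library's HTML parser, as in the sketch below. The list-page markup, the link paths and the class name TitleLinkCollector are hypothetical; the real page titles are in Chinese, so the keyword would be the Chinese phrase rather than its translation.

```python
from html.parser import HTMLParser


class TitleLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose title contains a keyword."""

    def __init__(self, keyword: str):
        super().__init__()
        self.keyword = keyword
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        title = attributes.get("title") or ""
        href = attributes.get("href")
        # Keep the link only when the title attribute contains the keyword.
        if href and self.keyword in title:
            self.links.append(href)


# Hypothetical list page mixing wanted and unwanted bulletin types.
LIST_PAGE = """
<ul>
  <li><a href="/news/1.html" title="Commercial housing presell license bulletin No. 1">...</a></li>
  <li><a href="/news/2.html" title="Real estate development license bulletin">...</a></li>
</ul>
"""
collector = TitleLinkCollector("presell license bulletin")
collector.feed(LIST_PAGE)
```

Only the first link survives the filter, which mirrors how the XPath expression rejects the unwanted bulletin type before any detail page is opened.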
Step 4: extract the data, as shown in Fig. 4. Because the page body is a single passage of text, regular expressions are used for extraction and formatting.
Extraction uses the regular expressions in Table 1 (apart from the date characters 年, 月 and 日, literal text in the patterns is shown in English translation).
Table 1
Field | Regular expression
Announcement number | The commercial house presell license bulletin.*
Announcement date | \d+年\d+月\d+日
Developer | To (.*) development & construction
Commercial housing presale permit number | " the commercial house permit for presale " number are as follows: (.*).The project
Project name | (.*) the project .* of development & construction makes
Project location | The project is located at (.*), approval
Presale area | Presell area (.*), the commercial house
Table 2 is obtained after formatting.
Table 2
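The extract-then-format step can be sketched as below. The announcement text is invented for illustration (the real pages are in Chinese), the patterns are simplified English analogues of the Table 1 rules with non-greedy groups, and the assumption that dates are written as year 年, month 月, day 日 is mine; the names extract_fields and format_date are hypothetical.

```python
import re

# Hypothetical announcement text; the real pages are in Chinese.
TEXT = (
    "Presell license bulletin No. 2018-12, 2018年7月11日. "
    "The project is located at 1 Example Road, approval granted. "
    "Presell area 1234.5 square meters, commercial housing."
)

# One pattern per field; group 1 (or the whole match if there is no group)
# is taken as the field value.
PATTERNS = {
    "announcement_date": r"\d+年\d+月\d+日",
    "project_location": r"The project is located at (.*?), approval",
    "presell_area": r"Presell area (.*?), commercial",
}


def extract_fields(text, patterns):
    """Apply each field pattern to the text and collect the matches."""
    record = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        if match is None:
            record[field] = None
        else:
            record[field] = match.group(1) if match.groups() else match.group(0)
    return record


def format_date(raw):
    """Normalize a 'YYYY年M月D日' date to ISO 'YYYY-MM-DD'."""
    if raw is None:
        return None
    match = re.fullmatch(r"(\d+)年(\d+)月(\d+)日", raw)
    if match is None:
        return raw  # leave unrecognized values untouched
    year, month, day = (int(part) for part in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


record = extract_fields(TEXT, PATTERNS)
record["announcement_date"] = format_date(record["announcement_date"])
```

Keeping extraction and formatting as separate passes makes it easy to apply a second round of regular expressions to the formatted values when a field needs it.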
Embodiment 2: collecting Credit Liaoning administrative penalty data with the method or tool of the present invention.
Step 1: obtain the URL of the Credit Liaoning administrative penalty page and determine the fields to be collected.
Website URL:
http://218.60.149.124:8088/sgs/xyln/three.htm?nowPage=1&orgId=0&ParentId=0&qymc=&xk=0&cf=16&orgName=&gj=1
Fields to collect: administrative penalty decision document number, enterprise name, cause of the penalty, basis of the penalty, penalty category 1, penalty result, penalty decision date, penalizing authority.
Step 2: obtain the pagination, as shown in Fig. 2.
Construct a loop that repeatedly clicks the "next page" (下一页) link. Element position:
//a[text()='下一页']
Step 3: obtain the detail pages.
Construct a loop and select the rows with a position() range. Element position:
//table[@class='list_list1 f12 m5']/tbody/tr[position()>1]
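The tr[position()>1] selection skips the header row of the list table. The XPath subset in xml.etree.ElementTree has no position() comparison, so the sketch below emulates it with list slicing. Only the class value comes from the source; the row contents, the link targets and the function name data_rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical list table; only the class attribute value is from the source.
LIST_TABLE = """
<table class="list_list1 f12 m5">
  <tbody>
    <tr><th>Enterprise</th><th>Detail</th></tr>
    <tr><td>Example Co.</td><td><a href="detail?id=1">view</a></td></tr>
    <tr><td>Sample Ltd.</td><td><a href="detail?id=2">view</a></td></tr>
  </tbody>
</table>
"""


def data_rows(table):
    """Return every <tr> after the first, like tr[position() > 1]."""
    rows = table.findall("./tbody/tr")
    return rows[1:]


table = ET.fromstring(LIST_TABLE.strip())
# Each remaining row contributes one detail-page link to the loop.
detail_links = [row.find(".//a").get("href") for row in data_rows(table)]
```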
Step 4: extract the data.
The detail pages of this website differ in form, and the table field titles are also inconsistent, so a plain positional XPath path such as //BODY[@class='tc']/DIV[2]/DIV[1]/DIV[2]/CENTER[1]/TABLE[1]/TBODY[1]/TR[7]/TD[2] cannot locate the target field accurately. The elements also have no distinctive class or id name, so they can only be identified and located by their text. The following-sibling axis is then used to select the sibling node after the current node. The advanced XPath path of each field is shown in Table 3.
Table 3
The formatted result is shown in Table 4.
Table 4
Field name | Extracted data
Administrative penalty decision document number | Gan Law Enforcement Penalty Decision No. [2018] 2-19-002
Enterprise name | Rise Home Co., Ltd in Dalian ten thousand
Cause of the penalty | On May 23, 2018, Rise Home Co., Ltd in Dalian ten thousand was ...
Basis of the penalty | "Dalian City Urban Road and Bridge Facilities Management Measures" ...
Penalty category 1 | Warning; fine
Penalty result | Ordered to correct the illegal act within 3 days and fined RMB ...
Penalty decision date | 2018-07-11
Penalizing authority | Dalian City Ganjingzi District Bureau of City Administration
Collection URL | http://218.60.149.124:8088/sgs//xyln/f...
Collection time | 2018-08-07
In the above embodiments, the present invention uses regular expressions and an advanced XPath configuration strategy to locate the target data precisely, which improves data quality and ensures the accuracy, consistency and completeness of the data.
The embodiments described above are only preferred embodiments given to fully illustrate the present invention, and the protection scope of the invention is not limited to them. Equivalent substitutions or modifications made by those skilled in the art on the basis of the present invention fall within its protection scope. The protection scope of the present invention is defined by the claims.

Claims (8)

1. A method for collecting unstructured text data from the Internet, characterized in that
the Octopus collector is used to obtain the target web page, the fields to be collected are determined, the pagination is obtained, a loop is constructed, the link of each detail page is parsed, and the detail pages are obtained; data is extracted according to the text type of each detail page, wherein for text-type detail pages data is extracted and formatted with regular expressions, for table-type detail pages data is extracted and formatted with XPath, and for detail pages containing both text and tables data is extracted and formatted with a combination of regular expressions and XPath, yielding formatted extracted data.
2. The method according to claim 1, characterized in that the pagination is obtained and a loop is constructed to traverse the list pages of all pages, each detail page is parsed as required, and unnecessary collection links are rejected.
3. The method according to claim 1 or 2, characterized in that, to obtain the pagination and construct the loop, the link passed by the web page and its parameter attributes are obtained in the Network panel of the web page debugging interface, the looped links are constructed by cyclically varying the parameter value, and the //a[text()='下一页'] tag (下一页 means "next page") is used to obtain the detail page links of the next page.
4. The method according to claim 3, characterized in that, after data extracted with a regular expression has been formatted, regular expressions are applied again to extract data.
5. The method according to claim 1 or 4, characterized in that, when data is extracted with XPath and the table elements of a table-type detail page have no explicit id or class name, the position is located with the text() node test and the contains() function, and the following or following-sibling axis is then used to capture the text of the next element at the same level.
6. A tool for collecting unstructured text data from the Internet, characterized in that it comprises an Octopus collection unit and an analysis unit,
wherein the Octopus collection unit obtains the target web page, determines the fields to be collected, obtains the pagination, constructs a loop, parses the link of each detail page, and obtains the detail pages; the analysis unit classifies the detail pages according to their text type; and the Octopus collection unit extracts and formats data with regular expressions for text-type detail pages, with XPath for table-type detail pages, and with a combination of regular expressions and XPath for detail pages containing both text and tables, obtaining formatted extracted data.
7. The tool according to claim 6, characterized in that the Octopus collection unit obtains the pagination and constructs a loop to traverse the list pages of all pages, parses each detail page as required, and rejects unnecessary collection links.
8. The tool according to claim 6 or 7, characterized in that, to obtain the pagination and construct the loop, the Octopus collection unit obtains the link passed by the web page and its parameter attributes in the Network panel of the web page debugging interface, constructs the looped links by cyclically varying the parameter value, and uses the //a[text()='下一页'] tag (下一页 means "next page") to obtain the detail page links of the next page.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910123191.3A CN109885754A (en) 2019-02-18 2019-02-18 A kind of acquisition method of internet unstructured text data

Publications (1)

Publication Number Publication Date
CN109885754A true CN109885754A (en) 2019-06-14

Family

ID=66928589

Country Status (1)

Country Link
CN (1) CN109885754A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination