CN109885754A

CN109885754A - A method for collecting unstructured text data from the Internet

Info

Publication number: CN109885754A
Application number: CN201910123191.3A
Authority: CN
Inventors: 张磊; 单震
Original assignee: Inspur Software Group Co Ltd
Current assignee: Inspur Software Group Co Ltd
Priority date: 2019-02-18
Filing date: 2019-02-18
Publication date: 2019-06-14

Abstract

The invention discloses a method for collecting unstructured text data from the Internet and relates to the technical field of data services. A target web page is obtained with the Octopus collector and the fields to be collected are determined. Pagination is obtained and a loop is constructed, the link of each detail page is parsed, and the detail pages are obtained. Data are then extracted according to the text type of each detail page: for text-type detail pages, data are extracted with regular expressions and formatted; for table-type detail pages, data are extracted with XPath and formatted; for detail pages containing both text and tables, data are extracted with a combination of regular expressions and XPath and formatted. The result is formatted extracted data.

Description

A method for collecting unstructured text data from the Internet

Technical field

The invention discloses a method for collecting unstructured text data from the Internet and relates to the technical field of data services.

Background art

Unstructured data are data whose field lengths are variable and in which the record of each field may in turn consist of repeatable or non-repeatable subfields. Such a data model can handle structured data (such as numbers and symbols) and is even better suited to full text, images, sound, film and video, and hypermedia.

The Internet now holds a vast amount of data, and the Octopus collector is commonly used to crawl it. The Octopus collector is a visual collection tool: the items to collect are chosen by clicking target positions on the page, so the required data can be collected quickly and easily from structured web pages. For most web pages, however, unstructured data dominate, especially unstructured text data whose content and layout are both unstructured, so targets cannot be located accurately by sequential XPath positioning. The invention provides a method for collecting unstructured text data from the Internet. With its collection strategy, a variety of complex unstructured Internet web page data sources can be accessed, which improves the flexibility of system configuration and the accuracy of data capture, ensures the validity and efficiency of data reading, and effectively handles large-scale unstructured data sources.

Summary of the invention

In view of the problems in the prior art, the invention provides a method for collecting unstructured text data from the Internet. The method is versatile, easy to implement, and has broad application prospects.

The specific solution proposed by the invention is as follows:

A method for collecting unstructured text data from the Internet: a target web page is obtained with the Octopus collector and the fields to be collected are determined; pagination is obtained and a loop is constructed; the link of each detail page is parsed and the detail pages are obtained; data are extracted according to the text type of each detail page. For text-type detail pages, data are extracted with regular expressions and formatted; for table-type detail pages, data are extracted with XPath and formatted; for detail pages containing both text and tables, data are extracted with a combination of regular expressions and XPath and formatted. Formatted extracted data are thereby obtained.

In the method, pagination is obtained and a loop is constructed to traverse the list pages of all pages; each detail page is parsed as required, and unnecessary collection links are discarded.

In the method, pagination is obtained and a loop is constructed as follows: the link and parameter attributes transmitted by the web page are obtained in the Network panel of the web debugging tools, loop links are constructed by cyclically varying the parameter value, and the //a[text()='next page'] element is used to obtain the detail page links of the next page.
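The construction of loop links by varying one parameter can be sketched as follows. This is a minimal illustration, not the collector's own mechanism: the host `example.com` and the helper `build_page_urls` are hypothetical, and treating `nowPage` (a parameter name that appears in the Embodiment 2 URL) as the page index is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def build_page_urls(first_url, page_param, last_page):
    """Return list-page URLs for pages 1..last_page by rewriting one query parameter."""
    parts = urlsplit(first_url)
    # keep_blank_values=True preserves empty parameters such as "qymc="
    query = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in range(1, last_page + 1):
        # Replace only the paging parameter; every other parameter is copied unchanged
        new_query = [(k, str(page) if k == page_param else v) for k, v in query]
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls


# Hypothetical first list page; only the parameter names mirror Embodiment 2
FIRST_URL = "http://example.com/list.htm?nowPage=1&orgId=0&qymc="
page_urls = build_page_urls(FIRST_URL, "nowPage", 3)
```

Each URL in `page_urls` can then be fetched in turn, which avoids depending on a clickable "next page" link when the paging parameter is visible in the request.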

In the method, after data have been extracted with a regular expression and formatted, a regular expression is applied again to the formatted result to extract the data.
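A two-pass treatment of this kind might look like the sketch below: a formatting pass first cleans the raw match, and a second regular expression then pulls the individual values out of the cleaned string. The helper `normalize_date`, the sample input, and the assumption that dates on the source pages use the Chinese 年/月/日 form are all illustrative and not taken from the patent.

```python
import re


def normalize_date(raw):
    """Format a raw date string, then re-extract its parts and return YYYY-MM-DD."""
    # Formatting pass: remove all whitespace left over from the page layout
    cleaned = re.sub(r"\s+", "", raw)
    # Second extraction pass on the formatted string (assumed 年/月/日 date form)
    match = re.search(r"(\d+)年(\d+)月(\d+)日", cleaned)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


formatted = normalize_date("2019 年 2 月 18 日")
```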

In the method, when data are extracted with XPath and the table elements of a table-type detail page have no explicit id or class name, the position is located with text() and the contains() function, and the following or following-sibling axis is then used to capture the text of the next sibling element.
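In XPath 1.0 this label-based lookup would typically be written along the lines of `//td[contains(text(), 'Decision date')]/following-sibling::td[1]`. The sketch below reproduces the same logic using only the Python standard library, whose `xml.etree.ElementTree` does not support `contains()` or the `following-sibling` axis, so the sibling step is emulated in plain Python. The table markup, the labels, and the helper `value_after_label` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail-page fragment: cells carry no id or class attributes
DETAIL_TABLE = """<table>
  <tr><td>Enterprise name</td><td>Example Co., Ltd.</td></tr>
  <tr><td>Decision date</td><td>2018-07-11</td></tr>
</table>"""


def value_after_label(root, label):
    """Find the cell whose text contains `label`; return the text of its next sibling cell."""
    for row in root.iter("tr"):
        cells = list(row)  # direct child cells of this row, in document order
        for index, cell in enumerate(cells[:-1]):
            if label in (cell.text or ""):
                # Emulates following-sibling::td[1]
                return (cells[index + 1].text or "").strip()
    return None


root = ET.fromstring(DETAIL_TABLE)
decision_date = value_after_label(root, "Decision date")
```

Locating by label text rather than by row index keeps the rule valid when different detail pages list the same fields in a different order.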

A tool for collecting unstructured text data from the Internet, comprising an Octopus collection unit and an analysis unit, wherein:

The Octopus collection unit obtains a target web page, determines the fields to be collected, obtains pagination, constructs a loop, parses the link of each detail page, and obtains the detail pages. The analysis unit classifies the detail pages according to their text type. The Octopus collection unit then extracts data from text-type detail pages with regular expressions and formats them, extracts data from table-type detail pages with XPath and formats them, and extracts data from detail pages containing both text and tables with a combination of regular expressions and XPath and formats them, thereby obtaining formatted extracted data.
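The patent does not state how detail pages are classified, so the following is only a hypothetical heuristic showing how a classification result could select the extraction strategy. The function `classify_detail_page`, the 20-character threshold, and the `STRATEGY` mapping are all assumptions made for illustration.

```python
import re

# Which extraction techniques apply to each page type, as described in the text
STRATEGY = {
    "text": ("regex",),
    "table": ("xpath",),
    "mixed": ("regex", "xpath"),
}


def classify_detail_page(html):
    """Classify a detail page as 'text', 'table', or 'mixed' (hypothetical heuristic)."""
    has_table = re.search(r"<table\b", html, re.IGNORECASE) is not None
    # Remove table blocks, then strip the remaining tags to measure free-running prose
    outside = re.sub(r"<table\b.*?</table>", " ", html, flags=re.IGNORECASE | re.DOTALL)
    prose = re.sub(r"<[^>]+>", " ", outside).strip()
    has_prose = len(prose) >= 20  # assumed threshold, not from the patent
    if has_table and has_prose:
        return "mixed"
    if has_table:
        return "table"
    return "text"


page_type = classify_detail_page("<p>" + "word " * 10 + "</p><table><tr><td>a</td></tr></table>")
techniques = STRATEGY[page_type]
```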

In the tool, the Octopus collection unit obtains pagination and constructs a loop to traverse the list pages of all pages, parses each detail page as required, and discards unnecessary collection links.

In the tool, the Octopus collection unit obtains pagination and constructs a loop as follows: the link and parameter attributes transmitted by the web page are obtained in the Network panel of the web debugging tools, loop links are constructed by cyclically varying the parameter value, and the //a[text()='next page'] element is used to obtain the detail page links of the next page.

The beneficial effects of the invention are:

The invention provides a method for collecting unstructured text data from the Internet and offers a sound solution for collecting unstructured web pages, in particular pages with long paragraphs of text and pages with grid-like tables. Regular expressions locate the data within long paragraphs of text, and XPath locates table elements. Together they greatly improve how well the collection rules match the pages and ensure the accuracy, consistency, and completeness of the collected results.

Brief description of the drawings

Fig. 1 is a flow diagram of the method of the invention.

Fig. 2 is a schematic diagram of pagination in Embodiment 1 and Embodiment 2.

Fig. 3 is a schematic diagram of the detail pages in Embodiment 1.

Fig. 4 is a detailed schematic diagram of the page in Embodiment 1.

Specific embodiments

The invention provides a method for collecting unstructured text data from the Internet: a target web page is obtained with the Octopus collector and the fields to be collected are determined; pagination is obtained and a loop is constructed; the link of each detail page is parsed and the detail pages are obtained; data are extracted according to the text type of each detail page. For text-type detail pages, data are extracted with regular expressions and formatted; for table-type detail pages, data are extracted with XPath and formatted; for detail pages containing both text and tables, data are extracted with a combination of regular expressions and XPath and formatted. Formatted extracted data are thereby obtained.

A tool for collecting unstructured text data from the Internet, corresponding to the above method, is also provided. It comprises an Octopus collection unit and an analysis unit.

The invention is further explained below with reference to the drawings and specific embodiments so that those skilled in the art can better understand and practice it. The embodiments shown do not limit the invention.

Embodiment 1: collecting Zaozhuang real estate presale information with the method or tool of the invention.

Step 1: Obtain the address of the Zaozhuang housing and real estate information website and determine the fields to be collected.

Website URL: http://www.zzzzfdc.com.cn/site/news/11/news_11_1_0.html

Fields to collect: announcement number, announcement date, developer, commercial housing presale permit number, project name, project location, presale area, collection URL, collection time.

Step 2: Obtain pagination, as shown in Fig. 2.

Construct a loop that repeatedly clicks the next page link. Element location:

//a[text()='next page']

Step 3: Collect the detail pages, as shown in Fig. 3.

Because only commercial housing presale permit announcements are to be collected, real estate development and operation permit announcements must be excluded. Advanced XPath positioning is therefore used to locate the target detail links when constructing the loop: the contains() function of XPath matches a substring within an attribute value. Element location:

//a[contains(@title, 'presale permit announcement')]

The links in the loop are then opened one by one.
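The effect of filtering on a substring of the title attribute can be sketched with the Python standard library. Because `xml.etree.ElementTree` does not implement the XPath `contains()` function, the substring test is done in Python instead. The list-page markup, the link paths, the English title text, and the helper `select_links` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page mixing two kinds of announcements
LIST_PAGE = """<ul>
  <li><a href="/news/1.html" title="Commercial housing presale permit announcement (1)">item 1</a></li>
  <li><a href="/news/2.html" title="Real estate development permit announcement">item 2</a></li>
  <li><a href="/news/3.html" title="Commercial housing presale permit announcement (2)">item 3</a></li>
</ul>"""


def select_links(root, keyword):
    """Return the href of every link whose title attribute contains `keyword`."""
    return [
        anchor.get("href")
        for anchor in root.iter("a")
        # Plain-Python equivalent of contains(@title, keyword)
        if keyword in anchor.get("title", "")
    ]


list_root = ET.fromstring(LIST_PAGE)
detail_links = select_links(list_root, "presale permit announcement")
```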

Step 4: Extract data, as shown in Fig. 4. Because the web page is a single passage of text, regular expressions are used to extract and format the data.

The regular expressions used for extraction are listed in Table 1 (literal text is shown in English translation).

Table 1

Field	Regular expression
Announcement number	The commercial housing presale permit announcement .*
Announcement date	\d+ year \d+ month \d+ day
Developer	To (.*) development & construction
Commercial housing presale permit number	" the commercial housing presale permit " number: (.*). The project
Project name	(.) the project . of development & construction makes
Project location	The project is located at (.*), approval
Presale area	Presale area (.*), the commercial housing
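Applying one capturing pattern per field to a passage of text can be sketched as below. The sample sentence, the English patterns, and the helper `extract_fields` are hypothetical stand-ins for the Chinese announcement text and rules. The sketch also uses non-greedy groups `(.*?)`, a deliberate deviation from the greedy `(.*)` shown in Table 1, so that each match stops at the first delimiter.

```python
import re

# Hypothetical patterns: one capture group per field, anchored by surrounding literal text
PATTERNS = {
    "developer": r"developed by (.*?),",
    "location": r"located at (.*?),",
    "presale_area": r"presale area of (.*?)\.",
}

# Hypothetical announcement text written as one running sentence
SAMPLE = (
    "The project is developed by Example Real Estate Co., "
    "located at 1 Example Road, with a presale area of 1200 square meters."
)


def extract_fields(text, patterns):
    """Apply each pattern to the text and return {field: first captured group or None}."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        result[field] = match.group(1).strip() if match else None
    return result


fields = extract_fields(SAMPLE, PATTERNS)
```

A field whose pattern does not match is returned as `None`, which makes pages that deviate from the expected wording easy to spot during collection.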

Table 2 is obtained after formatting.

Table 2

Embodiment 2: collecting administrative penalty data from the Credit Liaoning website with the method or tool of the invention.

Step 1: Obtain the address of the Credit Liaoning administrative penalty page and determine the fields to be collected.

Website URL:

http://218.60.149.124:8088/sgs/xyln/three.htm?nowPage=1&orgId=0&ParentId=0&qymc=&xk=0&cf=16&orgName=&gj=1

Fields to collect: administrative penalty decision document number, enterprise name, cause of penalty, basis of penalty, penalty category 1, penalty result, penalty decision date, penalty authority.

Step 2: Obtain pagination, as shown in Fig. 2.

Construct a loop that repeatedly clicks the next page link. Element location:

//a[text()='next page']

Step 3: Obtain the detail pages.

Construct a loop and select the rows with the position() function to restrict the range. Element location:

//table[@class='list_list1 f12 m5']/tbody/tr[position()>1]
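The predicate `position()>1` skips the first row of the list table, which is presumably a header row. A standard-library sketch of the same selection follows; since `xml.etree.ElementTree` does not support comparison predicates on `position()`, the header is skipped by slicing. The table content, the link paths, and the helper `data_row_links` are hypothetical, while the class name is the one given in the element location.

```python
import xml.etree.ElementTree as ET

# Hypothetical list table; the first row is a header and carries no detail link
LIST_TABLE = """<table class="list_list1 f12 m5"><tbody>
  <tr><th>Enterprise name</th><th>Decision date</th></tr>
  <tr><td><a href="/detail?id=1">Example Co. A</a></td><td>2018-07-11</td></tr>
  <tr><td><a href="/detail?id=2">Example Co. B</a></td><td>2018-07-12</td></tr>
</tbody></table>"""


def data_row_links(table):
    """Return the detail-page link of every row except the first (header) row."""
    rows = table.findall("./tbody/tr")
    links = []
    for row in rows[1:]:  # same effect as tr[position()>1]
        anchor = row.find(".//a")
        if anchor is not None:
            links.append(anchor.get("href"))
    return links


table = ET.fromstring(LIST_TABLE)
row_links = data_row_links(table)
```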

Step 4: Extract data.

The detail pages of this website differ in form, and their table field titles are inconsistent. A plain sequential XPath such as //BODY[@class='tc']/DIV[2]/DIV[1]/DIV[2]/CENTER[1]/TABLE[1]/TBODY[1]/TR[7]/TD[2] therefore cannot locate the target field accurately. The elements also have no distinctive class or id name, so they can only be located by their text. This is combined with the following-sibling axis, which selects the sibling nodes after the current node. The advanced XPath for each field is shown in Table 3.

Table 3

The result after formatting is shown in Table 4.

Table 4

Field name	Extracted data
Administrative penalty decision document number	Gan Zhifa Fajue Zi [2018] 2-19-002
Enterprise name	Rise Home Co., Ltd in Dalian ten thousand
Cause of penalty	On May 23, 2018, Rise Home Co., Ltd in Dalian ten thousand was ...
Basis of penalty	"Dalian City Measures for the Administration of Urban Road and Bridge Facilities" ...
Penalty category 1	Warning; fine
Penalty result	Correct the illegal act within 3 days and pay a fine of RMB ...
Penalty decision date	2018-07-11
Penalty authority	Dalian City Ganjingzi District Bureau of City Administration
Collection URL	http://218.60.149.124:8088/sgs//xyln/f...
Collection time	2018-08-07

In the embodiments above, the invention uses a configuration strategy of regular expressions and advanced XPath to locate the target data precisely, which improves data quality and ensures the accuracy, consistency, and completeness of the data.

The embodiments described above are only preferred embodiments given to fully illustrate the invention, and the scope of protection of the invention is not limited to them. Equivalent substitutions or modifications made by those skilled in the art on the basis of the invention fall within its scope of protection. The scope of protection of the invention is defined by the claims.

Claims

1. A method for collecting unstructured text data from the Internet, characterized in that:

a target web page is obtained with the Octopus collector and the fields to be collected are determined; pagination is obtained and a loop is constructed; the link of each detail page is parsed and the detail pages are obtained; data are extracted according to the text type of each detail page, wherein for text-type detail pages data are extracted with regular expressions and formatted, for table-type detail pages data are extracted with XPath and formatted, and for detail pages containing both text and tables data are extracted with a combination of regular expressions and XPath and formatted; and formatted extracted data are obtained.

2. The method according to claim 1, characterized in that pagination is obtained and a loop is constructed to traverse the list pages of all pages, each detail page is parsed as required, and unnecessary collection links are discarded.

3. The method according to claim 1 or 2, characterized in that pagination is obtained and a loop is constructed as follows: the link and parameter attributes transmitted by the web page are obtained in the Network panel of the web debugging tools, loop links are constructed by cyclically varying the parameter value, and the //a[text()='next page'] element is used to obtain the detail page links of the next page.

4. The method according to claim 3, characterized in that after data have been extracted with a regular expression and formatted, a regular expression is applied again to the formatted result to extract the data.

5. The method according to claim 1 or 4, characterized in that when data are extracted with XPath and the table elements of a table-type detail page have no explicit id or class name, the position is located with text() and the contains() function, and the following or following-sibling axis is then used to capture the text of the next sibling element.

6. A tool for collecting unstructured text data from the Internet, characterized in that it comprises an Octopus collection unit and an analysis unit, wherein:

the Octopus collection unit obtains a target web page, determines the fields to be collected, obtains pagination, constructs a loop, parses the link of each detail page, and obtains the detail pages; the analysis unit classifies the detail pages according to their text type; and the Octopus collection unit extracts data from text-type detail pages with regular expressions and formats them, extracts data from table-type detail pages with XPath and formats them, and extracts data from detail pages containing both text and tables with a combination of regular expressions and XPath and formats them, thereby obtaining formatted extracted data.

7. The tool according to claim 6, characterized in that the Octopus collection unit obtains pagination and constructs a loop to traverse the list pages of all pages, parses each detail page as required, and discards unnecessary collection links.

8. The tool according to claim 6 or 7, characterized in that the Octopus collection unit obtains pagination and constructs a loop as follows: the link and parameter attributes transmitted by the web page are obtained in the Network panel of the web debugging tools, loop links are constructed by cyclically varying the parameter value, and the //a[text()='next page'] element is used to obtain the detail page links of the next page.