CN108874870A

CN108874870A - Data extraction method, device, and computer-readable storage medium

Info

Publication number: CN108874870A
Application number: CN201810375770.2A
Authority: CN
Inventors: 郝保; 王海亮; 王磊; 罗引
Original assignee: Beijing Zhongke Song Polytron Technologies Inc
Current assignee: Beijing Zhongke Song Polytron Technologies Inc
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2018-11-23

Abstract

The invention discloses a data extraction method, a device, and a computer-readable storage medium. The method includes: obtaining an HTML text; extracting data of preset types from the HTML text according to preset content extraction rules; and generating structured data from the data of the preset types extracted from the HTML text. Because the content extraction rules are set in advance, the invention can perform fine-grained structured extraction on the HTML text and thereby obtain structured data containing multiple types of data, with fast extraction and high extraction precision.

Description

Data extraction method, device, and computer-readable storage medium

Technical field

The present invention relates to the technical field of big data, and in particular to a data extraction method, a device, and a computer-readable storage medium.

Background

At present, application scenarios such as public opinion analysis, propagation analysis, and data platform services all require data extraction, so that the large volume of acquired data can serve as the basis for subsequent data analysis or data service businesses.

The quality of data extraction affects the accuracy of data analysis results. Existing extraction approaches, however, simply extract data without fine granularity or classification. As a result, the extracted data is large in volume and rich in content, but the different kinds of content it contains are not distinguished. For example, existing data extraction methods do not distinguish the title, body content, publication time, source information, and publisher information contained in the data. The extracted data therefore cannot be used effectively, which adversely affects subsequent data analysis and data service businesses.

Summary of the invention

The technical problem to be solved by the present invention is to provide a data extraction method, a device, and a computer-readable storage medium, so as to solve the problem that existing data extraction methods do not extract data at fine granularity.

To solve the above technical problem, the present invention adopts the following technical solutions:

The present invention provides a data extraction method, including: obtaining an HTML text; extracting data of preset types from the HTML text according to preset content extraction rules; and generating structured data from the data of the preset types extracted from the HTML text.

Wherein, extracting the data of the preset types from the HTML text according to the preset content extraction rules includes: in the HTML text, locating with XPATH the tag position corresponding to the data of a preset type, and extracting the data of the preset type at that tag position of the HTML text; and/or, in the HTML text, obtaining the text within a preset range and extracting the data of the preset type from the text within the preset range; and/or, for the data of the preset type, performing a full-text search on the HTML text so as to extract the data of the preset type from the HTML text.

Wherein, the data of the preset types includes: title data, content data, time data, source data, and/or publisher data.

Wherein, extracting the data of the preset type from the text within the preset range includes: if multiple pieces of time data satisfying a preset condition are extracted from the text within the preset range, scoring each piece of time data using preset time keywords and retaining the time data with the highest score.

Wherein, before the structured data is generated, the method further includes: after the time data is extracted, performing time zone conversion on the time data.

Wherein, extracting the data of the preset types from the HTML text according to the preset content extraction rules includes: extracting a list of meta-information (META) node elements from the HTML text; and querying the META node element list for a source description node and extracting the source data from the source description node.

Wherein, extracting the data of the preset types from the HTML text according to the preset content extraction rules includes: searching the HTML text for a preset source keyword; and extracting the source data according to the position of the source keyword.

Wherein, searching the HTML text for the preset source keyword includes: searching for the preset source keyword at a preset position in the HTML text.

The present invention provides a data extraction device. The data extraction device includes a processor and a memory; the processor is configured to execute a data extraction program stored in the memory so as to implement the above data extraction method.

The present invention provides a computer-readable storage medium. The computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the above data extraction method.

The beneficial effects of the present invention are as follows:

Because the content extraction rules are set in advance, the present invention can use them to perform fine-grained structured extraction on the HTML text and thereby obtain structured data containing multiple types of data, with fast extraction and high extraction precision.

Brief description of the drawings

Fig. 1 is a flowchart of the data extraction method according to the first embodiment of the present invention;

Fig. 2 is a schematic diagram of data display according to the first embodiment of the present invention;

Fig. 3 is a flowchart of the steps of title data extraction according to the second embodiment of the present invention;

Fig. 4 is a flowchart of the steps of content data extraction according to the third embodiment of the present invention;

Fig. 5 is a flowchart of the steps of time data extraction according to the fourth embodiment of the present invention;

Fig. 6 is a schematic diagram of special time markers according to the fourth embodiment of the present invention;

Fig. 7 is a schematic diagram of time identifier words according to the fourth embodiment of the present invention;

Fig. 8 is a schematic diagram of common media suffixes according to the fourth embodiment of the present invention;

Fig. 9 is a schematic diagram of other common words according to the fourth embodiment of the present invention;

Fig. 10 is a flowchart of the steps of source data extraction according to the fifth embodiment of the present invention;

Fig. 11 is a flowchart of the steps of source data extraction according to the sixth embodiment of the present invention;

Fig. 12 is a structural diagram of the data extraction device according to the seventh embodiment of the present invention.

Detailed description of the embodiments

To solve the problems in the prior art, the present invention provides a data extraction method, a device, and a computer-readable storage medium. The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

Embodiment one

This embodiment provides a data extraction method. Fig. 1 is a flowchart of the data extraction method according to the first embodiment of the present invention.

Step S110: obtain an HTML (HyperText Markup Language) text.

In this embodiment, the HTML text may be an HTML source file.

In this embodiment, the types of HTML text include detail pages and list pages.

In this embodiment, the HTML text is, for example, a domestic or overseas news web page, forum page, or blog page.

Specifically, the download service of an ISP can be called to obtain the HTML text.

Step S120: extract data of preset types from the HTML text according to preset content extraction rules.

The preset content extraction rules include, but are not limited to:

In the HTML text, locating with XPATH (XML Path Language) the tag position corresponding to the data of a preset type, and extracting the data of the preset type at that tag position of the HTML text; and/or, in the HTML text, obtaining the text within a preset range and extracting the data of the preset type from the text within the preset range; and/or, for the data of the preset type, performing a full-text search on the HTML text so as to extract the data of the preset type from the HTML text.

In this embodiment, the data of the preset types includes title data, content data, time data, source data, and/or publisher data. Those skilled in the art should understand, however, that the data of the preset types in the present invention includes but is not limited to title data, content data, time data, source data, and publisher data. For example, the data of the preset types may also include the source address of the HTML text.

The tag corresponding to each type of data is obtained in advance. The tag corresponding to the data to be extracted is located in the HTML text by XPATH, and the data to be extracted can then be extracted at the located tag position.

Title data corresponds to the HTML tag <title>; content data corresponds to the HTML tag <body>; time data corresponds to the HTML tag <time>; source data corresponds to the HTML tag <source>; and publisher data corresponds to the HTML tag <address>.

For example, in the HTML text, XPATH is used to locate the <address> tag corresponding to publisher data, and the publisher data is extracted at the position of the <address> tag.
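By way of illustration only, tag-based location might be sketched as follows in Python. The mapping `TAG_BY_TYPE`, the function `extract_by_tag`, and the sample page are hypothetical names introduced here; only the tag names come from this embodiment. The standard-library `xml.etree.ElementTree` module supports a limited XPath subset and requires well-formed markup, so a real HTML page would likely need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from data type to the tag named in this embodiment.
TAG_BY_TYPE = {
    "title": "title",
    "content": "body",
    "time": "time",
    "source": "source",
    "publisher": "address",
}

def extract_by_tag(root, data_type):
    """Return the stripped text of every element matching the type's tag."""
    tag = TAG_BY_TYPE[data_type]
    # ".//tag" is an XPath expression selecting all descendants with that tag.
    return ["".join(node.itertext()).strip() for node in root.findall(f".//{tag}")]

# A small, well-formed sample page (assumed for the sketch).
SAMPLE = """<html><head><title>Sample headline</title></head>
<body><time>2015-09-26 00:00:00</time>
<address>Jane Doe</address><p>Body text.</p></body></html>"""

root = ET.fromstring(SAMPLE)
publisher = extract_by_tag(root, "publisher")
title = extract_by_tag(root, "title")
```

The function returns a list because, as the second embodiment notes for titles, a page may contain more than one element for the same data type.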

Of course, operations such as filtering and denoising may also be performed during extraction to remove useless data and make the data extraction method of the present invention more accurate.

Step S130: generate structured data from the data of the preset types extracted from the HTML text.

In this embodiment, the structured data is row data, and a two-dimensional table structure can be used to logically represent the extracted data of multiple types. In this embodiment, each row of the structured data may include title data, content data, time data, source data, publisher data, and the like.

Based on the generated structured data, the extracted data of each type can be displayed to the user, as in the data display diagram shown in Fig. 2. In the data extraction stage, the title (title data), publication time (time data), article URL (the uniform resource locator of the article, i.e. the source address), article source (source data), and author (publisher data) are extracted, and structured data is then generated. This structured data is used to present the data to the user, and the extracted data can be displayed by category in the form of rows.
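As a minimal sketch of the row-oriented structured data described for step S130, one row per HTML text could be represented and serialized as below. The class `ExtractedRow`, its field names, the function `to_table`, and the sample values are assumptions made for illustration; the patent does not prescribe a storage format.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class ExtractedRow:
    """One row of structured data; field names are illustrative."""
    title: str = ""
    content: str = ""
    pub_time: str = ""
    source: str = ""
    publisher: str = ""
    url: str = ""

def to_table(rows):
    """Serialize rows to CSV text: a header line, then one line per HTML text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(ExtractedRow)],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()

rows = [
    ExtractedRow(
        title="Sample headline",
        pub_time="2015-09-26 00:00:00",
        source="www.xinhuanet.com",
        publisher="Jane Doe",
        url="http://example.com/a",
    )
]
table = to_table(rows)
```

Each column holds one data type, so a display such as the one in Fig. 2 could be produced by reading the rows back by field name.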

In this embodiment, content extraction rules are set in advance and used to perform fine-grained structured extraction on the HTML text, so that data of multiple types can be obtained. The extraction achieved by this embodiment is fast and precise, gives the user an at-a-glance overview of the HTML text, and facilitates subsequent data analysis. This embodiment can be applied to highly concurrent big-data extraction projects. With the preset content extraction rules, this embodiment can extract the different types of data from different web pages without formulating different content extraction rules for web pages of different structures.

The extraction of title data, content data, time data, and source data is described in detail in the following embodiments.

Embodiment two

This embodiment further describes the extraction of title data.

Fig. 3 is a flowchart of the steps of title data extraction according to the second embodiment of the present invention.

Step S310: in the HTML text, locate the tag corresponding to title data using XPATH.

The tag corresponding to title data is <title>.

Step S320: extract the title data at the tag position corresponding to title data.

Step S330: determine whether exactly one piece of title data has been extracted; if so, perform step S340; if not, perform step S350.

The HTML text may contain multiple tags corresponding to title data, in which case multiple pieces of title data can be extracted according to those tags.

For example, an article may contain one main title and multiple subtitles. In this case, one main title and multiple subtitles may be extracted, and the title that best reflects the subject of the HTML text needs to be retained.

Step S340: return the extracted title data.

Step S350: according to preset HTML tag weights, return the title data corresponding to the tag with the largest weight.

A preset HTML tag weight score table is obtained. For the multiple pieces of title data extracted, the weight corresponding to each piece of title data is looked up according to its tag attributes, and the title data with the largest weight is returned.

In the HTML tag weight score table, a weight is assigned to each tag according to its tag attributes.

For example, if the weight of title 1, which is in italics, is 1 and the weight of title 2, which is in H1 size, is 5, the weight of title 2 is greater than that of title 1, so title 2 in H1 size is returned.
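Steps S330 to S350 can be sketched as a lookup against a weight table. In the Python sketch below, the table `TAG_WEIGHTS` contains only the two weights given in the example above (italic 1, H1 5); the tag names `i` and `h1`, the default weight of 0, and the function name `pick_title` are assumptions.

```python
# Hypothetical weight table: only the italic (1) and H1 (5) values
# come from the example in this embodiment.
TAG_WEIGHTS = {"i": 1, "h1": 5}

def pick_title(candidates, weights=TAG_WEIGHTS, default_weight=0):
    """candidates: list of (tag_name, title_text) pairs.

    Returns the only title if there is exactly one (S340), otherwise the
    title whose tag has the largest weight (S350), or None if empty.
    """
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0][1]
    best = max(candidates, key=lambda c: weights.get(c[0].lower(), default_weight))
    return best[1]

best = pick_title([("i", "Title 1"), ("h1", "Title 2")])
```

With the example weights, `best` is the H1 title, matching the outcome described in the text.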

Embodiment three

This embodiment further describes the extraction of content data. Because content data often contains useless data, this embodiment processes the content data further to obtain clean content.

Fig. 4 is a flowchart of the steps of content data extraction according to the third embodiment of the present invention.

Step S410: in the HTML text, locate the tag corresponding to content data using XPATH.

The tag corresponding to content data is <body>.

Step S420: obtain the tag data corresponding to content data at the tag position corresponding to content data.

The HTML text is obtained and parsed into a DOM document tree, and the tag data under the <body> tag of the HTML is obtained.

Step S430: according to a preset spam site database, check whether the tag data contains spam site information; if so, perform step S440; if not, perform step S450.

A spam site database is set up in advance, and preset spam site information is stored in it.

Spam site information includes, but is not limited to, the URL addresses of spam sites.

For example: templates of lottery websites, templates of gambling websites, and so on.

Step S440: filter the spam site information out of the tag data.

This step filters out the URL addresses of spam sites in the tag data.
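One possible reading of step S440, given only as a sketch, is to delete from the tag data every URL whose host appears in the spam site database. The host set `SPAM_HOSTS` (using reserved `.example` names), the URL pattern, and the function `strip_spam_urls` are all assumptions; the patent does not state how the database is stored or matched.

```python
import re
from urllib.parse import urlparse

# Hypothetical spam site database, reduced to a set of host names.
SPAM_HOSTS = {"lottery.example", "casino.example"}

# A deliberately simple URL pattern for the sketch.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def strip_spam_urls(text, spam_hosts=SPAM_HOSTS):
    """Remove URLs whose host is listed in the spam site database."""
    def replace(match):
        host = urlparse(match.group(0)).hostname or ""
        return "" if host in spam_hosts else match.group(0)
    return URL_RE.sub(replace, text)

cleaned = strip_spam_urls("see http://lottery.example/win and http://news.example/a")
```

URLs of hosts that are not listed are left untouched, so legitimate links in the content survive the filter.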

Step S450: using a web page extraction algorithm based on heuristic rules and unsupervised learning, extract the tag data corresponding to the main text from the tag data from which spam site information has been filtered out, and then perform step S460.

The web page extraction algorithm based on heuristic rules and unsupervised learning can be used to identify the main text. It may be MSS (Maximum Subsequence Segmentation).

Further, the MSS algorithm parses the web page into a token sequence and assigns a score to each token in the sequence, where a tag is assigned -3.25 points and a text token is assigned 1 point. The subsequence with the largest total score in the token sequence is then found, and this subsequence is the main text of the web page. Seen from another angle, the rule finds a subsequence of the HTML source string that contains as much text and as few tags as possible, because the algorithm gives tags a negative score of large absolute value (-3.25) and gives text a small positive score (1).
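The scoring rule above amounts to a maximum-sum contiguous subsequence problem, which can be solved in one pass with Kadane's algorithm. The following Python sketch uses the two scores stated in the text; the tokenizer (tags by regular expression, text split on whitespace) and all function names are assumptions, and text without spaces, such as Chinese, would need a different tokenization.

```python
import re

TAG_SCORE = -3.25   # score of one tag token, as stated in the text
TEXT_SCORE = 1.0    # score of one text token, as stated in the text

def tokenize(html):
    """Split HTML into tag tokens and whitespace-delimited word tokens."""
    tokens = []
    for part in re.split(r"(<[^>]+>)", html):
        if part.startswith("<") and part.endswith(">"):
            tokens.append(("tag", part))
        else:
            tokens.extend(("text", word) for word in part.split())
    return tokens

def max_subsequence(tokens):
    """Kadane's algorithm: (start, end) of the contiguous run with the top score."""
    best_sum, best_range = 0.0, (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, (kind, _) in enumerate(tokens):
        cur_sum += TAG_SCORE if kind == "tag" else TEXT_SCORE
        if cur_sum <= 0:
            # A non-positive running total cannot help any later run; restart.
            cur_sum, cur_start = 0.0, i + 1
        elif cur_sum > best_sum:
            best_sum, best_range = cur_sum, (cur_start, i + 1)
    return best_range

def extract_main_text(html):
    """Return the text tokens inside the highest-scoring subsequence."""
    tokens = tokenize(html)
    start, end = max_subsequence(tokens)
    return " ".join(tok for kind, tok in tokens[start:end] if kind == "text")

PAGE = ("<div><a>Home</a><a>News</a></div>"
        "<p>The quick brown fox jumps over the lazy dog today</p>"
        "<div><a>About</a></div>")
main_text = extract_main_text(PAGE)
```

In the sample page the navigation links are surrounded by tags and score poorly, while the paragraph is a long run of text tokens, so the paragraph is selected as the main text.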

Step S460: perform data extraction on the tag data corresponding to the main text, and encapsulate the extracted data as content data.

Embodiment four

This embodiment further describes the extraction of time data. When time data cannot be extracted by XPATH, it can also be extracted in the manner of this embodiment.

In this embodiment, the text within a preset range is obtained from the HTML text, and time data is extracted from the text within the preset range; alternatively, a full-text search is performed on the HTML text to extract time data from the HTML text.

A specific example of extracting time data within a preset range is given below:

Fig. 5 is a flowchart of the steps of time data extraction according to the fourth embodiment of the present invention.

Step S510: in the HTML text, locate with XPATH the tag position corresponding to title data.

Step S520: extract the title data at the tag position corresponding to the title data.

Step S530: obtain the text in the HTML text that lies within a preset range of the title data.

The position of time data in an HTML text is relatively fixed; the publication time of news in particular generally appears near the title data, so time data can be extracted according to where it appears.

In this embodiment, the text of the N (N ≥ 1) lines above the title data and/or the text between the title and the main text is obtained, so that time data can be extracted from the text within these preset ranges.

Step S540: extract time data from the text within the preset range.

In addition, some time data does not appear near the title but may appear at the end of the HTML text; in that case the preset range corresponding to time data can be set at the end of the main text.

In this embodiment, if multiple pieces of time data satisfying a preset condition are extracted from the text within the preset range, each piece of time data is scored using preset time keywords, and the time data with the highest score is retained.

The preset condition may concern whether the content of the text within the preset range exceeds a preset threshold. For example, if the text in the preset range exceeds 80 Chinese characters (or 200 English words), the text is not considered to contain time data.

The scoring criteria for scoring time data with time keywords include:

1) If the line of text containing the time data contains a special time marker, add 2 points;

2) If the line of text containing the time data contains a time identifier word, add 1 point;

3) If the line of text containing the time data contains a common media suffix, add 2 points;

4) If the line of text containing the time data contains another common word, add 2 points.

Fig. 6 shows common special time markers, Fig. 7 shows common time identifier words, Fig. 8 shows common media suffixes, and Fig. 9 shows other common words. The other common words are related to time and can be set as required.
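The four scoring criteria can be sketched as a per-line score followed by a maximum. In the Python sketch below, the point values (2, 1, 2, 2) come from the criteria above, while the keyword lists are placeholders, since the actual lists appear only in Figs. 6 to 9 and are not reproduced in the text. The time pattern and the function names are likewise assumptions.

```python
import re

# Placeholder keyword lists; the real ones are shown in Figs. 6-9.
SPECIAL_MARKERS = ["|"]                           # stand-in for Fig. 6
TIME_WORDS = ["published", "posted", "updated"]   # stand-in for Fig. 7
MEDIA_SUFFIXES = ["News", "Daily", "Times"]       # stand-in for Fig. 8
OTHER_WORDS = ["source", "author", "editor"]      # stand-in for Fig. 9

# Assumed time format: YYYY-MM-DD with an optional HH:MM[:SS] part.
TIME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2}(?::\d{2})?)?")

def score_line(line):
    """Score the line holding a time candidate using criteria 1) to 4)."""
    lower = line.lower()
    score = 0
    if any(marker in line for marker in SPECIAL_MARKERS):
        score += 2
    if any(word in lower for word in TIME_WORDS):
        score += 1
    if any(suffix in line for suffix in MEDIA_SUFFIXES):
        score += 2
    if any(word in lower for word in OTHER_WORDS):
        score += 2
    return score

def best_time(lines):
    """Return the time string from the highest-scoring line, or None."""
    candidates = []
    for line in lines:
        match = TIME_PATTERN.search(line)
        if match:
            candidates.append((score_line(line), match.group(0)))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]

LINES = [
    "Copyright 2014-01-01 all rights reserved",
    "Published 2015-09-26 08:30 | Source: Example Daily",
]
chosen = best_time(LINES)
```

A copyright date in a footer matches none of the keyword lists, whereas a byline typically matches several, which is why the byline's time is retained.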

In this embodiment, after the time data is extracted, time zone conversion may be performed on it.

1. The original time string in the HTML text is obtained by getSrcPubtime() and is stored in the format '2015-09-26 00:00:00 time zone';

2. getTransferedPubtime(String srcTimeZone, String desTimeZone): the first parameter, String srcTimeZone, is the time zone of the publication time; null may be passed if the publication time is already marked with a time zone, and otherwise the time zone can be set. The second parameter, String desTimeZone, is the target time zone and can be set as required; if null is passed, it defaults to UTC+8 (GMT+8:00). The function getTransferedPubtime(String srcTimeZone, String desTimeZone) thus converts the time data to a different time zone.
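The conversion can be illustrated with Python's standard `datetime` module. This is a sketch rather than the patented implementation: the function `get_transferred_pubtime` only loosely mirrors getTransferedPubtime, the offset notation such as "+08:00" is an assumed input format, and the case where the source zone is null because the time string already carries a zone is not handled. The GMT+8:00 default for a missing target zone is taken from the text.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TARGET = "+08:00"   # default target zone named in the text (GMT+8:00)

def parse_offset(text):
    """Turn an offset such as '+08:00' or '-05:00' into a fixed-offset tzinfo."""
    sign = -1 if text.startswith("-") else 1
    hours, minutes = text.lstrip("+-").split(":")
    return timezone(sign * timedelta(hours=int(hours), minutes=int(minutes)))

def get_transferred_pubtime(src_time, src_time_zone, des_time_zone=None):
    """Convert 'YYYY-MM-DD HH:MM:SS' from src_time_zone to des_time_zone."""
    src = parse_offset(src_time_zone)
    des = parse_offset(des_time_zone or DEFAULT_TARGET)
    naive = datetime.strptime(src_time, "%Y-%m-%d %H:%M:%S")
    # Attach the source zone, then shift to the target zone.
    return naive.replace(tzinfo=src).astimezone(des).strftime("%Y-%m-%d %H:%M:%S")

converted = get_transferred_pubtime("2015-09-26 00:00:00", "+00:00")
```

Normalizing every publication time to one target zone makes times from domestic and overseas pages directly comparable in the structured data.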

When time data is extracted, extraction by tag is tried first, then extraction within the preset range; if neither yields time data, extraction by full-text search is used.
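This priority order can be sketched as a chain of strategies tried in turn. The three strategy functions below are simplified stand-ins written with regular expressions (a real implementation would use the XPATH and preset-range logic of the earlier steps), and the 200-character window after the title is an assumed preset range.

```python
import re

TIME_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def by_tag(html):
    """Stand-in for tag-based extraction: read the <time> element."""
    match = re.search(r"<time[^>]*>(.*?)</time>", html, re.S)
    return match.group(1).strip() if match else None

def by_range(html):
    """Stand-in for preset-range extraction: look just after </title>."""
    match = re.search(r"</title>", html)
    if not match:
        return None
    hit = TIME_RE.search(html[match.end():match.end() + 200])
    return hit.group(0) if hit else None

def by_full_text(html):
    """Stand-in for full-text search: first time string anywhere."""
    hit = TIME_RE.search(html)
    return hit.group(0) if hit else None

STRATEGIES = [("tag", by_tag), ("range", by_range), ("full_text", by_full_text)]

def extract_with_fallback(html, strategies=STRATEGIES):
    """Try each strategy in order; return (strategy_name, value) of the first hit."""
    for name, strategy in strategies:
        value = strategy(html)
        if value:
            return name, value
    return None, None

result = extract_with_fallback("<title>T</title><p>2015-09-26 01:00:00</p>")
```

Returning the strategy name alongside the value makes it possible to see which rule produced each time, which can help when tuning the preset ranges.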

Besides time data, other types of data can likewise have a preset range set according to where the data commonly appears in HTML text, so that the corresponding data is extracted within that preset range.

For example, source data generally appears on the line below the title, together with a related source keyword, such as "Source: www.xinhuanet.com". In this case the source data can be extracted directly from the line below the title.

Embodiment five

This embodiment further describes the extraction of source data.

Besides locating a tag with XPATH, source data can also be extracted with other content extraction rules.

This embodiment describes extracting source data by way of META. Fig. 10 is a flowchart of the steps of source data extraction according to the fifth embodiment of the present invention.

Step S1010: extract the list of META node elements from the HTML text.

Step S1020: query the META node element list for a source description node, and extract the source data from the source description node.

META refers to the element that provides meta-information about the page, such as descriptions and keywords aimed at search engines and update frequency.

All META node elements in the HTML text are extracted to form the META node element list.

Whether a node is a source description node is determined from the node name or attributes, and the source data is extracted from the source description node.
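Steps S1010 and S1020 might be sketched with Python's standard `html.parser` as follows. The set `SOURCE_NAMES` is an assumption: the patent does not say which node names or attributes identify a source description node, and sites differ in practice. The class and function names are also illustrative.

```python
from html.parser import HTMLParser

# Assumed attribute values that mark a source description node.
SOURCE_NAMES = {"source", "og:site_name"}

class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> element (step S1010)."""
    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.metas.append(dict(attrs))

def extract_source_from_meta(html):
    """Find a source description node and return its content (step S1020)."""
    collector = MetaCollector()
    collector.feed(html)
    for meta in collector.metas:
        key = (meta.get("name") or meta.get("property") or "").lower()
        if key in SOURCE_NAMES and meta.get("content"):
            return meta["content"].strip()
    return None

PAGE = ('<html><head><meta charset="utf-8">'
        '<meta name="keywords" content="a,b">'
        '<meta name="source" content="www.xinhuanet.com">'
        '</head><body></body></html>')
meta_source = extract_source_from_meta(PAGE)
```

When no META node matches, the function returns None, which corresponds to falling through to the keyword-based approach of the sixth embodiment.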

Embodiment six

This embodiment continues the description of source data extraction.

If source data cannot be extracted by the META approach, it can be extracted by searching for a source keyword. A preset source keyword (for example, "source") is searched for in the HTML text, and the source data is extracted according to the position of the source keyword. Further, the source keyword is searched for at a preset position in the HTML text; if the source keyword is present, the source data is extracted at that preset position. The preset position may be a specified position or a preset position range.

Source data generally appears on the line below the title. For example, the line below the title contains "Source: www.xinhuanet.com". In this case "Source" can be searched for directly in the line below the title and, if it is found, the source data "www.xinhuanet.com" is extracted.

If the source keyword cannot be found at the preset position, so that source data cannot be extracted there, the source data can also be extracted by full-text search.

A more specific example is given below:

Fig. 11 is a flowchart of the steps of source data extraction according to the sixth embodiment of the present invention.

Step S1110: after the title data is extracted, search the first line of text below the title data for a preset source keyword; if it is present, perform step S1120; if not, perform step S1130.

Step S1120: extract the source data at the position of the source keyword.

Step S1130: determine whether preset candidate source data exists in the first line of text; if so, perform step S1140; if not, perform step S1160.

Source data that has already been extracted is stored in advance as candidate source data, for example www.xinhuanet.com, Phoenix Net, and www.qq.com.

Based on the pre-stored candidate source data, it is determined whether candidate source data exists in the first line of text.

The first line of text may first be denoised before it is determined whether preset candidate source data exists.

In the present invention, denoising may consist of removing preset words from the text, such as "if", prepositions, and adverbs.

Step S1140: determine whether the candidate source data also exists in the second line of text below the title data; if so, perform step S1150; if not, perform step S1160.

Step S1150: return the candidate source data that is the same in the first line of text and the second line of text.

The returned candidate source data can serve as the source data of the HTML text.

If the first line of text and the second line of text do not share the same candidate source data, step S1160 can be performed.
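Steps S1130 to S1150 can be sketched as a membership check against the pre-stored list followed by an intersection of the two lines. In the Python sketch below, the list `KNOWN_SOURCES` holds the three examples named in the text; the function names are assumptions and the denoising step is omitted.

```python
# Pre-stored candidate source data, using the examples named in the text.
KNOWN_SOURCES = ["www.xinhuanet.com", "Phoenix Net", "www.qq.com"]

def candidates_in(line, known=KNOWN_SOURCES):
    """Return the pre-stored sources that occur in the line (S1130 / S1140)."""
    return [source for source in known if source in line]

def confirm_source(first_line, second_line):
    """Return a source present in both lines (S1150), else None.

    None means neither line pair agrees, so the caller would continue
    with the full-text search of step S1160.
    """
    first = candidates_in(first_line)
    if not first:
        return None
    second = set(candidates_in(second_line))
    for source in first:
        if source in second:
            return source
    return None

confirmed = confirm_source("Phoenix Net 2015-09-26", "Reported by Phoenix Net")
```

Requiring the same candidate in both lines reduces the chance that a source name mentioned in passing is mistaken for the page's actual source.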

Step S1160: search the TEXT-type data in the HTML text and determine whether a source keyword exists; if so, perform step S1120; if not, perform step S1170.

Step S1170: based on the META node element list, with the title data as the anchor, iterate upward through five levels of nodes, denoise the text of each level of nodes iterated, calculate the length of the denoised text, and return as the source data the text whose length is less than a preset source data threshold.

If multiple texts in the five levels of nodes are shorter than the source data threshold, each text is compared with the pre-stored candidate source data, and the text for which matching candidate source data exists is taken as the source data.

The data extraction method of this embodiment is highly accurate, easy to extend, and supports highly concurrent extraction.

Embodiment seven

This embodiment provides a data extraction device. Fig. 12 is a structural diagram of the data extraction device according to the seventh embodiment of the present invention.

In this embodiment, the data extraction device 1200 includes, but is not limited to, a processor 1210 and a memory 1220.

The processor 1210 is configured to execute the data extraction program stored in the memory 1220 so as to implement the data extraction method described in Embodiments one to six.

Specifically, the processor 1210 is configured to execute the data extraction program stored in the memory 1220 so as to implement the following steps: obtaining an HTML text; extracting data of preset types from the HTML text according to preset content extraction rules; and generating structured data from the data of the preset types extracted from the HTML text.

Optionally, extracting the data of the preset types from the HTML text according to the preset content extraction rules includes: in the HTML text, locating with XPATH the tag position corresponding to the data of a preset type, and extracting the data of the preset type at that tag position of the HTML text; and/or, in the HTML text, obtaining the text within a preset range and extracting the data of the preset type from the text within the preset range; and/or, for the data of the preset type, performing a full-text search on the HTML text so as to extract the data of the preset type from the HTML text.

Optionally, the data of the preset types includes: title data, content data, time data, source data, and/or publisher data.

Optionally, extracting the data of the preset type from the text within the preset range includes: if multiple pieces of time data satisfying a preset condition are extracted from the text within the preset range, scoring each piece of time data using preset time keywords and retaining the time data with the highest score.

Optionally, before the structured data is generated, the method further includes: after the time data is extracted, performing time zone conversion on the time data.

Optionally, extracting the data of the preset types from the HTML text according to the preset content extraction rules includes: extracting a list of meta-information (META) node elements from the HTML text; and querying the META node element list for a source description node and extracting the source data from the source description node.

Optionally, extracting the data of the preset types from the HTML text according to the preset content extraction rules includes: searching the HTML text for a preset source keyword; and extracting the source data according to the position of the source keyword.

Optionally, searching the HTML text for the preset source keyword includes: searching for the preset source keyword at a preset position in the HTML text.

Embodiment eight

An embodiment of the present invention also provides a computer-readable storage medium. The computer-readable storage medium here stores one or more programs. The computer-readable storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; and it may also include a combination of the above kinds of memory.

When the one or more programs in the computer-readable storage medium are executed by one or more processors, the above data extraction method is implemented.

Specifically, the processor is configured to execute the data extraction program stored in the memory so as to implement the following steps: obtaining an HTML text; extracting data of preset types from the HTML text according to preset content extraction rules; and generating structured data from the data of the preset types extracted from the HTML text.

Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions, and substitutions are also possible; therefore, the scope of the present invention should not be limited to the above embodiments.

Claims

1. A data extraction method, characterized by comprising:

obtaining a HyperText Markup Language (HTML) text;

extracting data of preset types from the HTML text according to preset content extraction rules;

generating structured data from the data of the preset types extracted from the HTML text.

2. The method according to claim 1, characterized in that extracting the data of the preset types from the HTML text according to the preset content extraction rules comprises:

in the HTML text, locating with the XML Path Language (XPATH) the tag position corresponding to the data of a preset type, and extracting the data of the preset type at that tag position of the HTML text; and/or

in the HTML text, obtaining the text within a preset range and extracting the data of the preset type from the text within the preset range; and/or

for the data of the preset type, performing a full-text search on the HTML text so as to extract the data of the preset type from the HTML text.

3. The method according to claim 2, characterized in that the data of the preset types comprises: title data, content data, time data, source data, and/or publisher data.

4. The method according to claim 3, characterized in that extracting the data of the preset type from the text within the preset range comprises:

if multiple pieces of time data satisfying a preset condition are extracted from the text within the preset range, scoring each piece of time data using preset time keywords and retaining the time data with the highest score.

5. The method according to claim 3 or 4, characterized in that, before the structured data is generated, the method further comprises:

after the time data is extracted, performing time zone conversion on the time data.

6. The method according to claim 3, characterized in that extracting the data of the preset types from the HTML text according to the preset content extraction rules comprises:

extracting a list of meta-information (META) node elements from the HTML text;

querying the META node element list for a source description node, and extracting the source data from the source description node.

7. The method according to claim 3, characterized in that extracting the data of the preset types from the HTML text according to the preset content extraction rules comprises:

searching the HTML text for a preset source keyword;

extracting the source data according to the position of the source keyword.

8. The method according to claim 3, characterized in that searching the HTML text for the preset source keyword comprises:

searching for the preset source keyword at a preset position in the HTML text.

9. A data extraction device, characterized in that the data extraction device comprises a processor and a memory; the processor is configured to execute a data extraction program stored in the memory so as to implement the data extraction method according to any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the data extraction method according to any one of claims 1 to 8.