CN108874870A - Data extraction method, device, and computer-readable storage medium - Google Patents
Data extraction method, device, and computer-readable storage medium
- Publication number
- CN108874870A (application CN201810375770.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- text
- preset
- html
- extracted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data extraction method, a device, and a computer-readable storage medium. The method includes: obtaining HTML text; extracting data of preset types from the HTML text according to preset content extraction rules; and generating structured data from the data of the preset types extracted from the HTML text. Because the content extraction rules are set in advance, the invention can perform fine-grained, structured extraction on HTML text, obtain data of various types, and produce structured data containing them, with high extraction speed and high extraction precision.
Description
Technical field
The present invention relates to the field of big data technology, and in particular to a data extraction method, a device, and a computer-readable storage medium.
Background
At present, application scenarios such as public opinion analysis, propagation analysis, and data platform services all require data extraction, so that the large volume of data obtained can serve as the basis for subsequent data analysis or data service businesses.
The quality of data extraction affects the accuracy of data analysis results. Existing data extraction approaches, however, simply extract data without fine-grained, categorized extraction. As a result, the volume of extracted data is large, each item contains a large amount of content, and the different kinds of content within the data are not distinguished. For example, existing data extraction methods do not distinguish the title, content, publication time, source information, and publisher information contained in the data. The extracted data therefore cannot be used effectively, which also adversely affects subsequent data analysis or data service businesses.
Summary of the invention
The technical problem to be solved by the present invention is to provide a data extraction method, a device, and a computer-readable storage medium, so as to solve the problem that existing data extraction methods do not extract data at a fine granularity.
To solve the above technical problem, the present invention adopts the following technical solutions:
The present invention provides a data extraction method, including: obtaining HTML text; extracting data of preset types from the HTML text according to preset content extraction rules; and generating structured data from the data of the preset types extracted from the HTML text.
Wherein, extracting data of preset types from the HTML text according to the preset content extraction rules includes: in the HTML text, locating the tag position corresponding to the data of a preset type using XPATH, and extracting the data of the preset type at that tag position; and/or, in the HTML text, obtaining the text within a preset range and extracting the data of the preset type from the text within the preset range; and/or, for the data of the preset type, performing a full-text search on the HTML text so as to extract the data of the preset type from the HTML text.
Wherein, the data of the preset types includes: title data, content data, time data, source data, and/or publisher data.
Wherein, extracting the data of the preset type from the text within the preset range includes: if multiple pieces of time data satisfying a preset condition are extracted from the text within the preset range, scoring each piece of time data using preset time keywords and retaining the time data with the highest score.
Wherein, before the structured data is generated, the method further includes: after the time data is extracted, performing time zone conversion on the time data.
Wherein, extracting data of preset types from the HTML text according to the preset content extraction rules includes: in the HTML text, extracting a list of META (meta-information) element nodes; and, in the list of META element nodes, querying for the source description node and extracting the source data from the source description node.
Wherein, extracting data of preset types from the HTML text according to the preset content extraction rules includes: in the HTML text, searching for a preset source keyword; and extracting the source data according to the position of the source keyword.
Wherein, searching for the preset source keyword in the HTML text includes: searching for the preset source keyword at a preset position in the HTML text.
The present invention provides a data extraction device, which includes a processor and a memory; the processor is configured to execute a data extraction program stored in the memory, so as to implement the above data extraction method.
The present invention provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the above data extraction method.
The beneficial effects of the present invention are as follows:
The present invention sets content extraction rules in advance. Using these rules, it can perform fine-grained, structured extraction on HTML text, obtain data of various types, and produce structured data containing them, with high extraction speed and high extraction precision.
Brief description of the drawings
Fig. 1 is a flowchart of the data extraction method according to the first embodiment of the present invention;
Fig. 2 is a schematic diagram of data display according to the first embodiment of the present invention;
Fig. 3 is a flowchart of the steps of title data extraction according to the second embodiment of the present invention;
Fig. 4 is a flowchart of the steps of content data extraction according to the third embodiment of the present invention;
Fig. 5 is a flowchart of the steps of time data extraction according to the fourth embodiment of the present invention;
Fig. 6 is a schematic diagram of time zone markers according to the fourth embodiment of the present invention;
Fig. 7 is a schematic diagram of time marker words according to the fourth embodiment of the present invention;
Fig. 8 is a schematic diagram of common media suffixes according to the fourth embodiment of the present invention;
Fig. 9 is a schematic diagram of other common words according to the fourth embodiment of the present invention;
Fig. 10 is a flowchart of the steps of source data extraction according to the fifth embodiment of the present invention;
Fig. 11 is a flowchart of the steps of source data extraction according to the sixth embodiment of the present invention;
Fig. 12 is a structural diagram of the data extraction device according to the seventh embodiment of the present invention.
Detailed description of the embodiments
To solve the problems in the prior art, the present invention provides a data extraction method, a device, and a computer-readable storage medium. The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and do not limit it.
Embodiment one
This embodiment provides a data extraction method. Fig. 1 is a flowchart of the data extraction method according to the first embodiment of the present invention.
Step S110: obtain HTML (HyperText Markup Language) text.
In this embodiment, the HTML text may be an HTML source code file.
In this embodiment, the types of HTML text include detail pages and list pages.
In this embodiment, the HTML text is, for example, data such as domestic and overseas news web pages, forum pages, and blog pages.
Specifically, the download service of an ISP may be called to obtain the HTML text.
Step S120: extract data of preset types from the HTML text according to preset content extraction rules.
The preset content extraction rules include, but are not limited to:
In the HTML text, locating the tag position corresponding to the data of a preset type using XPATH (XML Path Language), and extracting the data of the preset type at that tag position; and/or, in the HTML text, obtaining the text within a preset range and extracting the data of the preset type from the text within the preset range; and/or, for the data of the preset type, performing a full-text search on the HTML text so as to extract the data of the preset type from the HTML text.
In this embodiment, the data of the preset types includes: title data, content data, time data, source data, and/or publisher data. Those skilled in the art should understand, however, that the data of the preset types of the present invention is not limited to these. For example, the data of the preset types may also include the source address of the HTML text.
The tag corresponding to each type of data is obtained in advance. XPATH is used to locate, in the HTML text, the tag corresponding to the data to be extracted, and the data is extracted at the located tag position.
Title data corresponds to the HTML tag <title>; content data corresponds to the HTML tag <body>; time data corresponds to the HTML tag <time>; source data corresponds to the HTML tag <source>; and publisher data corresponds to the HTML tag <address>.
For example, in the HTML text, XPATH is used to locate the tag <address> corresponding to publisher data, and the publisher data is extracted at the position of the <address> tag.
Of course, operations such as filtering and denoising may also be performed during extraction to remove useless data and make the data extraction method of the present invention more accurate.
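The tag-based rule of step S120 can be illustrated with a minimal sketch. It assumes the input is well-formed markup, because Python's standard xml.etree.ElementTree supports only a subset of XPATH and rejects malformed HTML; a production system would more likely use a tolerant HTML parser. The mapping TAG_FOR_TYPE, the function extract_by_tag, and the sample page are hypothetical names introduced for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from data type to tag, following the
# tag correspondence listed in the text above.
TAG_FOR_TYPE = {
    "title": "title",
    "content": "body",
    "time": "time",
    "source": "source",
    "publisher": "address",
}


def extract_by_tag(html_text: str) -> dict:
    """Locate each preset tag with an XPath expression and return its text."""
    root = ET.fromstring(html_text)  # assumption: the markup is well-formed
    result = {}
    for data_type, tag in TAG_FOR_TYPE.items():
        node = root.find(f".//{tag}")  # first matching descendant, if any
        if node is not None:
            # Join all nested text and collapse runs of whitespace.
            result[data_type] = " ".join("".join(node.itertext()).split())
    return result


# Made-up sample page used only to exercise the sketch.
SAMPLE = """<html><head><title>Sample headline</title></head>
<body><time>2015-09-26 00:00:00</time><source>Example Wire</source>
<address>Jane Doe</address><p>Body text.</p></body></html>"""

fields = extract_by_tag(SAMPLE)
print(fields["title"], "|", fields["publisher"])
```

Data types whose tag is absent are simply left out of the result, which leaves room for the range-based and full-text rules to fill them in later.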
Step S130: generate structured data from the data of the preset types extracted from the HTML text.
In this embodiment, structured data is row data: a two-dimensional table structure can be used to logically express the extracted data of multiple types. Each row of the structured data may include title data, content data, time data, source data, publisher data, and so on.
Based on the generated structured data, the extracted data of each type can be displayed to the user, as in the data display diagram of Fig. 2. In the data extraction stage, the title (title data), publication time (time data), article URL (the uniform resource locator of the article, i.e. the source address), article source (source data), and author (publisher data) are extracted, and structured data is then generated. Displaying this structured data to the user presents the extracted data by category, row by row.
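As a rough illustration of the row-oriented structure described for step S130, the sketch below models one row per HTML text and serializes rows as a two-dimensional table. The class ExtractedRow, its field names, and the CSV output format are assumptions made for this example; the text does not prescribe a storage format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ExtractedRow:
    """One row of structured data; the field names are hypothetical."""
    title: str = ""
    content: str = ""
    pub_time: str = ""
    source: str = ""
    publisher: str = ""
    url: str = ""


def to_table(rows) -> str:
    """Serialize rows as CSV text: a header line, then one line per HTML text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(ExtractedRow)],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()


print(to_table([ExtractedRow(title="Sample headline", source="Example Wire")]))
```

Keeping every data type in its own column is what allows the display stage to present the extracted data by category.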
This embodiment sets content extraction rules in advance and uses them to perform fine-grained, structured extraction on HTML text, thereby obtaining data of various types. The extraction is fast and precise, gives the user an overview of the HTML text at a glance, and facilitates subsequent data analysis. This embodiment can be applied to big-data concurrent extraction projects. With the preset content extraction rules, different types of data can be extracted from different web pages without formulating different content extraction rules for pages of different structures.
The extraction of title data, content data, time data, and source data is described in detail in the following embodiments.
Embodiment two
This embodiment further describes the extraction of title data.
Fig. 3 is a flowchart of the steps of title data extraction according to the second embodiment of the present invention.
Step S310: in the HTML text, locate the tag corresponding to title data using XPATH.
The tag corresponding to title data is <title>.
Step S320: extract title data at the position of the tag corresponding to title data.
Step S330: determine whether exactly one piece of title data was extracted; if so, execute step S340; if not, execute step S350.
The HTML text may contain multiple tags corresponding to title data, in which case multiple pieces of title data can be extracted according to those tags.
For example, an article may contain one main title and several subtitles. In this case one main title and several subtitles may be extracted, and the title that best reflects the theme of the HTML text needs to be retained.
Step S340: return the extracted title data.
Step S350: according to preset HTML tag weights, return the title data corresponding to the tag with the largest weight.
A preset HTML tag weight score table is obtained. For the multiple pieces of title data extracted, the weight of each is looked up according to the attributes of its tag, and the title data with the largest weight is returned.
In the HTML tag weight score table, a weight is assigned to each tag according to its attributes.
For example, if title 1 in italics has a weight of 1 and title 2 in H1 size has a weight of 5, the weight of title 2 is greater than that of title 1, so title 2 in H1 size is returned.
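Steps S330 to S350 amount to a lookup followed by a maximum. The sketch below is one possible reading: the table TAG_WEIGHTS contains only the two weights given in the example above, keyed by tag names ("i" for italics, "h1" for H1 size) that are assumptions of this sketch, and pick_title is a hypothetical helper name.

```python
# Hypothetical weight table. Only the two values come from the example in
# the text; keying them by the tag names "i" and "h1" is an assumption.
TAG_WEIGHTS = {"i": 1, "h1": 5}


def pick_title(candidates):
    """candidates: list of (tag_name, title_text) pairs.

    Return the only title if there is exactly one (step S340); otherwise
    return the title whose tag has the largest weight (step S350).
    """
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0][1]
    # Tags missing from the table get weight 0 in this sketch.
    return max(candidates, key=lambda c: TAG_WEIGHTS.get(c[0], 0))[1]


print(pick_title([("i", "Title 1"), ("h1", "Title 2")]))
```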
Embodiment three
This embodiment further describes the extraction of content data. Because content data often contains useless data, this embodiment processes the content data further to obtain clean content.
Fig. 4 is a flowchart of the steps of content data extraction according to the third embodiment of the present invention.
Step S410: in the HTML text, locate the tag corresponding to content data using XPATH.
The tag corresponding to content data is <body>.
Step S420: at the position of the tag corresponding to content data, obtain the tag data corresponding to content data.
The HTML text is obtained and parsed into a DOM tree, and the tag data under the <body> tag of the HTML is obtained.
Step S430: according to a preset spam-site database, check whether the tag data contains spam-site information; if so, execute step S440; if not, execute step S450.
A spam-site database is set up in advance, and preset spam-site information is stored in it.
Spam-site information includes, but is not limited to, the URLs of spam sites.
For example: templates of lottery sites, templates of gambling sites, and so on.
Step S440: filter the spam-site information out of the tag data.
This step filters the URLs of spam sites out of the tag data.
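A minimal sketch of steps S430 and S440 follows, assuming the spam-site database can be reduced to a set of host names. The set SPAM_HOSTS, its placeholder hosts, and the function strip_spam_urls are hypothetical; the text does not describe how the database is stored or matched.

```python
import re

# Hypothetical spam-site database; the hosts are placeholders.
SPAM_HOSTS = {"lottery.example", "casino.example"}

# Matches http(s) URLs and captures the host part in group 1.
URL_RE = re.compile(r"https?://([^/\s\"'<>]+)[^\s\"'<>]*")


def strip_spam_urls(tag_data: str) -> str:
    """Remove every URL whose host is listed in the spam-site database."""
    def replace(match):
        host = match.group(1).lower()
        return "" if host in SPAM_HOSTS else match.group(0)
    return URL_RE.sub(replace, tag_data)


cleaned = strip_spam_urls(
    '<a href="http://lottery.example/win">win</a> '
    '<a href="https://news.example/a1">news</a>'
)
print(cleaned)
```

URLs on hosts that are not in the database pass through unchanged, so only the spam-site information is removed from the tag data.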
Step S450: using a web page extraction algorithm based on heuristic rules and unsupervised learning, extract the tag data corresponding to the main text from the tag data from which spam-site information has been filtered, and then execute step S460.
A web page extraction algorithm based on heuristic rules and unsupervised learning can be used to identify the main text. Such an algorithm may be MSS (Maximum Subsequence Segmentation).
Further, the MSS algorithm parses the web page into a token sequence and assigns a score to each token in the sequence: a tag is assigned -3.25 points and a text token is assigned 1 point. The subsequence with the maximum total score in the token sequence is then found, and this subsequence is the main text of the web page. Seen from another angle, the rule finds a subsequence in the HTML source string that contains as much text and as few tags as possible, because the algorithm gives tags a negative score with a large absolute value (-3.25) and gives text a small positive score (1).
Step S460: perform data extraction on the tag data corresponding to the main text, and encapsulate the extracted data as content data.
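The scoring rule described for MSS can be sketched with a standard maximum-subarray scan (Kadane's algorithm). The two scores are the ones given in the text; the regex tokenizer, which treats each tag and each whitespace-separated word as one token, is a simplifying assumption of this sketch, and the function names are hypothetical.

```python
import re

TAG_SCORE = -3.25  # score assigned to each tag token, as stated in the text
TEXT_SCORE = 1.0   # score assigned to each text token, as stated in the text


def tokenize(html: str):
    """Split HTML into (token, score) pairs: whole tags and single words."""
    tokens = []
    for piece in re.findall(r"<[^>]*>|[^<\s]+", html):
        score = TAG_SCORE if piece.startswith("<") else TEXT_SCORE
        tokens.append((piece, score))
    return tokens


def max_subsequence(tokens):
    """Kadane's algorithm: the contiguous run with the largest total score."""
    best_sum, best_start, best_end = float("-inf"), 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, (_, score) in enumerate(tokens):
        if cur_sum <= 0:
            cur_sum, cur_start = score, i  # start a new run at this token
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return tokens[best_start:best_end]


def extract_main_text(html: str) -> str:
    """Return the words of the best-scoring run, dropping any tags inside it."""
    words = [tok for tok, score in max_subsequence(tokenize(html)) if score > 0]
    return " ".join(words)


PAGE = (
    "<div><a>Home</a><a>About</a></div>"
    "<p>This is the long main article text with many words in it</p>"
    "<div><a>Contact</a></div>"
)
print(extract_main_text(PAGE))
```

Navigation links score poorly because each one costs two tags for a single word, so the scan settles on the paragraph, where many words sit between few tags.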
Embodiment four
This embodiment further describes the extraction of time data. When time data cannot be extracted using XPATH, it can be extracted in the manner of this embodiment.
In this embodiment, the text within a preset range of the HTML text is obtained and time data is extracted from it; alternatively, a full-text search is performed on the HTML text and time data is extracted from the HTML text.
A specific example of extracting time data within a preset range is given below:
Fig. 5 is a flowchart of the steps of time data extraction according to the fourth embodiment of the present invention.
Step S510: in the HTML text, locate the tag position corresponding to title data using XPATH.
Step S520: extract the title data at the tag position corresponding to the title data.
Step S530: obtain the text in the HTML text that lies within a preset range of the title data.
Because the position of time data in HTML text is relatively fixed, and the publication time of news in particular generally appears near the title data, time data can be extracted according to where it appears.
In this embodiment, the text of the N lines (N ≥ 1) above the title data and/or the text between the title and the main text is obtained, so that time data can be extracted from the text within these preset ranges.
Step S540: extract time data from the text within the preset range.
In addition, some time data does not appear near the title but at the end of the HTML text; in that case the preset range corresponding to time data can be set at the end of the main text.
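Steps S530 and S540 can be sketched as a window over the page's text lines followed by a pattern search. The sketch assumes the page has already been reduced to a list of text lines with known indices for the title and the start of the main text; the timestamp pattern TIME_RE covers only the "YYYY-MM-DD HH:MM:SS" style and is an assumption, as is the function name time_candidates.

```python
import re

# Assumed timestamp pattern: a date, optionally followed by a time of day.
TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?: \d{1,2}:\d{2}(?::\d{2})?)?")


def time_candidates(lines, title_idx, body_idx, n_above=1):
    """Return (line, timestamp) pairs found in the preset range.

    The range is the n_above lines above the title plus the lines
    between the title and the start of the main text.
    """
    start = max(0, title_idx - n_above)
    window = lines[start:title_idx] + lines[title_idx + 1:body_idx]
    found = []
    for line in window:
        for match in TIME_RE.finditer(line):
            found.append((line, match.group(0)))
    return found


LINES = [
    "Home > News",
    "Sample headline",
    "Source: Example Wire 2015-09-26 08:30:00",
    "Body text mentioning 2001-01-01 in passing.",
]
print(time_candidates(LINES, title_idx=1, body_idx=3))
```

Dates inside the main text are ignored because the main text lies outside the window, which is the point of restricting the search to a preset range.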
In this embodiment, if multiple pieces of time data satisfying a preset condition are extracted from the text within the preset range, each piece of time data is scored using preset time keywords, and the time data with the highest score is retained.
The preset condition may be that the text content within the preset range does not exceed a preset threshold. For example, if the text within the preset range exceeds 80 Chinese characters (or 200 English words), time data is not expected to appear in that text.
The scoring criteria for scoring time data with time keywords include:
1) If the text of the line containing the time data has a time zone marker, add 2 points;
2) If the text of the line containing the time data includes a time marker word, add 1 point;
3) If the text of the line containing the time data includes a common media suffix, add 2 points;
4) If the text of the line containing the time data includes other common words, add 2 points.
Fig. 6 shows common time zone markers, Fig. 7 shows common time marker words, Fig. 8 shows common media suffixes, and Fig. 9 shows other common words. The other common words are related to time and can be set as required.
In this embodiment, after the time data is extracted, time zone conversion can be performed on it.
1. The original time string in the HTML text is returned by getSrcPubtime() and stored in the format '2015-09-26 00:00:00 time zone';
2. getTransferedPubtime(String srcTimeZone, String desTimeZone): the first parameter, String srcTimeZone, may be null if the publication time is already annotated with a time zone; otherwise the time zone should be set. The second parameter, String desTimeZone, is the target time zone and can be set as required; if null is passed in, it defaults to the East Eight zone, GMT+8:00. The function getTransferedPubtime(String srcTimeZone, String desTimeZone) thus converts the time data to a different time zone.
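The conversion described above can be approximated in Python as follows. This is a sketch rather than the patent's implementation: the function get_transferred_pubtime mirrors the role of getTransferedPubtime but takes the original time string as an explicit argument, and the assumption that time zones are written as fixed-offset labels such as "GMT+8:00" goes beyond what the text specifies.

```python
import re
from datetime import datetime, timedelta, timezone

DEFAULT_TARGET = "GMT+8:00"  # default target zone when none is given


def parse_zone(label: str) -> timezone:
    """Parse an assumed fixed-offset label such as 'GMT+8:00' or 'GMT-05:30'."""
    m = re.fullmatch(r"GMT([+-])(\d{1,2}):(\d{2})", label)
    if m is None:
        raise ValueError(f"unsupported time zone label: {label!r}")
    sign = 1 if m.group(1) == "+" else -1
    offset = timedelta(hours=int(m.group(2)), minutes=int(m.group(3)))
    return timezone(sign * offset)


def get_transferred_pubtime(src_pubtime, src_time_zone=None, des_time_zone=None):
    """Convert 'YYYY-MM-DD HH:MM:SS [zone]' to the target time zone.

    src_time_zone may be None when the string already carries a zone label;
    des_time_zone defaults to GMT+8:00 when None.
    """
    parts = src_pubtime.split()
    stamp = datetime.strptime(" ".join(parts[:2]), "%Y-%m-%d %H:%M:%S")
    zone_label = src_time_zone or (parts[2] if len(parts) > 2 else None)
    if zone_label is None:
        raise ValueError("the source time zone is unknown")
    target = parse_zone(des_time_zone or DEFAULT_TARGET)
    converted = stamp.replace(tzinfo=parse_zone(zone_label)).astimezone(target)
    return converted.strftime("%Y-%m-%d %H:%M:%S")


print(get_transferred_pubtime("2015-09-26 00:00:00 GMT+0:00"))
```

Fixed offsets keep the sketch free of any time zone database; handling named zones or daylight saving time would need a different representation.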
When extracting time data, extraction by tag is tried first and extraction by preset range second; if neither yields time data, extraction falls back to full-text search.
The approach of this embodiment is not limited to time data: for other types of data, a preset range can likewise be set according to where the data usually appears in HTML text, so that the corresponding data can be extracted within that range.
For example, source data usually appears on the line below the title, and the source contains a related source keyword, for example "Source: www.xinhuanet.com". In this case the source data can be extracted directly from the line below the title.
Embodiment five
This embodiment further describes the extraction of source data.
Besides locating tags with XPATH, source data can also be extracted with other content extraction rules.
This embodiment illustrates extracting source data from META elements. Fig. 10 is a flowchart of the steps of source data extraction according to the fifth embodiment of the present invention.
Step S1010: extract the list of META element nodes from the HTML text.
Step S1020: in the list of META element nodes, query for the source description node and extract the source data from the source description node.
META refers to the element that provides meta-information about the page, such as descriptions and keywords for search engines and the update frequency.
All META element nodes in the HTML text are extracted to form the list of META element nodes.
Whether a node is a source description node is determined by its node name or attributes, and the source data is extracted from the source description node.
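Steps S1010 and S1020 can be sketched with Python's standard html.parser. The set SOURCE_NAMES is hypothetical: the text says only that the source description node is recognized by its node name or attributes, without listing which values qualify.

```python
from html.parser import HTMLParser

# Hypothetical attribute values that mark a source description node.
SOURCE_NAMES = {"source", "mediaid", "og:site_name"}


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> element into a list."""

    def __init__(self):
        super().__init__()
        self.meta_nodes = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_nodes.append(dict(attrs))


def source_from_meta(html_text: str):
    """Build the META node list, then return the content of the source node."""
    collector = MetaCollector()
    collector.feed(html_text)
    for node in collector.meta_nodes:
        key = (node.get("name") or node.get("property") or "").lower()
        if key in SOURCE_NAMES and node.get("content"):
            return node["content"]
    return None


HEAD = (
    '<html><head><meta charset="utf-8">'
    '<meta name="keywords" content="a,b">'
    '<meta name="source" content="Example Wire"></head></html>'
)
print(source_from_meta(HEAD))
```

Returning None when no source description node exists lets the caller fall back to another extraction rule.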
Embodiment six
This embodiment continues to describe the extraction of source data.
If source data cannot be extracted in the META manner, it can be extracted by searching for a source keyword. In the HTML text, a preset source keyword is searched for (for example, "source"), and the source data is extracted according to the position of the source keyword. Further, the source keyword is searched for at a preset position in the HTML text; if the source keyword is present, the source data is extracted at that preset position. The preset position may be a specified position or a preset range of positions.
Source data usually appears on the line below the title. For example, the line below the title contains "Source: www.xinhuanet.com". In this case "source" can be searched for directly in the line below the title and, if it is found, the source data "www.xinhuanet.com" is extracted.
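The keyword search on the line below the title can be sketched with a regular expression. The pattern SOURCE_RE is an assumption: it accepts the English keyword "Source" and, as a guess at the keyword in the original Chinese text, "来源", each followed by an ASCII or full-width colon, and it takes the next whitespace-free token as the source data.

```python
import re

# Assumed keyword pattern: "Source" or "来源" followed by ":" or ":".
SOURCE_RE = re.compile(r"(?:来源|Source)\s*[::]\s*(\S+)")


def source_after_title(lines, title_idx):
    """Search the line directly below the title for a source keyword."""
    if title_idx + 1 >= len(lines):
        return None
    match = SOURCE_RE.search(lines[title_idx + 1])
    return match.group(1) if match else None


print(source_after_title(["Sample headline", "Source: www.xinhuanet.com"], 0))
```

A None result signals that the keyword was not found at the preset position, in which case a wider search can be attempted.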
If the source keyword cannot be found at the preset position, so that no source data can be extracted there, the source data can also be extracted by full-text search.
A more specific embodiment is given below:
Fig. 11 is a flowchart of the steps of source data extraction according to the sixth embodiment of the present invention.
Step S1110: after the title data is extracted, search the first line of text below the title data for a preset source keyword; if one exists, execute step S1120; if not, execute step S1130.
Step S1120: extract the source data at the position of the source keyword.
Step S1130: determine whether the first line of text contains preset quasi-source data; if so, execute step S1140; if not, execute step S1160.
Source data that has already been extracted is stored in advance as quasi-source data. For example, www.xinhuanet.com, Phoenix Net, www.qq.com, and so on are stored in advance.
Based on the pre-stored quasi-source data, it is determined whether the first line of text contains quasi-source data.
The first line of text may first be denoised before determining whether it contains preset quasi-source data.
In the present invention, denoising may consist of removing preset words from the text, such as prepositions and adverbs (for example, "if").
Step S1140: identify whether the second line of text below the title data contains the quasi-source data; if so, execute step S1150; if not, execute step S1160.
Step S1150: return the quasi-source data that is identical in the first line of text and the second line of text.
The returned quasi-source data can serve as the source data of the HTML text.
If the first and second lines of text do not contain identical quasi-source data, step S1160 can be executed.
Step S1160: search the TEXT-type data in the HTML text and determine whether a source keyword exists; if so, execute step S1120; if not, execute step S1170.
Step S1170: according to the list of META element nodes, taking the title data as the anchor point, iterate upward through five levels of nodes; denoise the text of each level of nodes visited, compute the length of the denoised text, and return as source data the text whose length is less than a preset source data threshold.
If the five levels of nodes contain multiple texts whose length is less than the source data threshold, each text is compared with the pre-stored quasi-source data, and the text that matches quasi-source data is taken as the source data.
The data extraction method of this embodiment has high accuracy, is easy to extend, and supports highly concurrent extraction.
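Steps S1130 to S1150 can be sketched as a set intersection over the two lines below the title. The entries of KNOWN_SOURCES are the three examples named in the text; the stop-word list and all function names are hypothetical, and the remaining branches (S1110, S1120, S1160, S1170) are left out of this sketch.

```python
# Pre-stored quasi-source data: the three examples named in the text.
KNOWN_SOURCES = ("www.xinhuanet.com", "Phoenix Net", "www.qq.com")

# Hypothetical preset words removed during denoising.
STOP_WORDS = ("the", "of", "if")


def denoise(line: str) -> str:
    """Remove preset words from a line of text."""
    return " ".join(w for w in line.split() if w.lower() not in STOP_WORDS)


def known_sources_in(line: str) -> set:
    """Return the quasi-source entries that appear in the denoised line."""
    cleaned = denoise(line)
    return {source for source in KNOWN_SOURCES if source in cleaned}


def source_from_first_two_lines(first_line: str, second_line: str):
    """Return a quasi-source present in both lines, or None (go to S1160)."""
    common = known_sources_in(first_line) & known_sources_in(second_line)
    return sorted(common)[0] if common else None


print(source_from_first_two_lines(
    "www.xinhuanet.com 2015-09-26",
    "Reported by www.xinhuanet.com",
))
```

If several quasi-sources appear in both lines, this sketch returns the alphabetically first one; the text does not define that case.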
Embodiment seven
This embodiment provides a data extraction device. Fig. 12 is a structural diagram of the data extraction device according to the seventh embodiment of the present invention.
In this embodiment, the data extraction device 1200 includes, but is not limited to, a processor 1210 and a memory 1220.
The processor 1210 is configured to execute the data extraction program stored in the memory 1220, so as to implement the data extraction method described in embodiments one to six.
Specifically, the processor 1210 is configured to execute the data extraction program stored in the memory 1220 to implement the following steps: obtaining HTML text; extracting data of preset types from the HTML text according to preset content extraction rules; and generating structured data from the data of the preset types extracted from the HTML text.
Optionally, extracting data of preset types from the HTML text according to the preset content extraction rules includes: in the HTML text, locating the tag position corresponding to the data of a preset type using XPATH, and extracting the data of the preset type at that tag position; and/or, in the HTML text, obtaining the text within a preset range and extracting the data of the preset type from the text within the preset range; and/or, for the data of the preset type, performing a full-text search on the HTML text so as to extract the data of the preset type from the HTML text.
Optionally, the data of the preset types includes: title data, content data, time data, source data, and/or publisher data.
Optionally, extracting the data of the preset type from the text within the preset range includes: if multiple pieces of time data satisfying a preset condition are extracted from the text within the preset range, scoring each piece of time data using preset time keywords and retaining the time data with the highest score.
Optionally, before the structured data is generated, the method further includes: after the time data is extracted, performing time zone conversion on the time data.
Optionally, extracting data of preset types from the HTML text according to the preset content extraction rules includes: in the HTML text, extracting a list of META (meta-information) element nodes; and, in the list of META element nodes, querying for the source description node and extracting the source data from the source description node.
Optionally, extracting data of preset types from the HTML text according to the preset content extraction rules includes: in the HTML text, searching for a preset source keyword; and extracting the source data according to the position of the source keyword.
Optionally, searching for the preset source keyword in the HTML text includes: searching for the preset source keyword at a preset position in the HTML text.
Embodiment eight
An embodiment of the present invention further provides a computer-readable storage medium, which stores one or more programs. The computer-readable storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; it may also include a combination of the above kinds of memory.
When the one or more programs in the computer-readable storage medium are executed by one or more processors, the above data extraction method is implemented.
Specifically, the processor is configured to execute the data extraction program stored in the memory to implement the following steps: obtaining HTML text; extracting data of preset types from the HTML text according to preset content extraction rules; and generating structured data from the data of the preset types extracted from the HTML text.
It is optionally, described that the data of preset kind are extracted in the html text according to preset content extraction rule,
Including:In the html text, the corresponding label position of data for positioning the preset kind using XPATH, and described
The label position of html text, extracts the data of the preset kind;And/or in the html text, obtain default
Text in range extracts the data of the preset kind in the text in the preset range;And/or for described pre-
If the data of type, full-text search is carried out to the html text, to extract the preset kind in the html text
Data.
Optionally, the data of the preset kind include:Title data, content-data, time data, derived data and/
Or publisher's data.
Optionally, the data of preset kind are extracted in the text in the preset range, including:If default
In text in range, multiple time data for meeting preset condition are drawn into, then using preset time-critical word to each
The time data score, and retain the highest time data that score.
Optionally, before the generation structural data, the method also includes:Be drawn into the time data it
Afterwards, time zone conversion is carried out to the time data.
It is optionally, described that the data of preset kind are extracted in the html text according to preset content extraction rule,
Including:In the html text, metamessage META node element list is extracted;In the META node element list, look into
Ask come Source Description node, and it is described come Source Description node extract derived data.
It is optionally, described that the data of preset kind are extracted in the html text according to preset content extraction rule,
Including:In the html text, retrieve it is preset come source key;According to the position come where source key, extract
Derived data.
Optionally, retrieving the preset source keyword in the HTML text includes: retrieving the preset source keyword at a preset position of the HTML text.
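Keyword-based source extraction restricted to a preset position can be sketched as follows. The keyword list, the interpretation of "preset position" as the first `search_limit` characters of the page text, and the delimiters that end the source name are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical preset source keywords (English and Chinese variants).
SOURCE_KEYWORDS = ("Source:", "来源：", "来源:")


def extract_source_by_keyword(text: str, search_limit: int = 200) -> Optional[str]:
    """Look for a source keyword near the top of the text and read what follows it."""
    region = text[:search_limit]  # assumed preset position: the first N characters
    for keyword in SOURCE_KEYWORDS:
        index = region.find(keyword)
        if index == -1:
            continue
        tail = region[index + len(keyword):]
        # The source name is taken to run until a newline or a "|" separator.
        match = re.match(r"\s*([^\n|]+)", tail)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return None
```

Limiting the search to a preset position keeps the keyword from matching unrelated occurrences of "Source:" deep inside the article body.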
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions and substitutions are also possible; therefore, the scope of the present invention should not be limited to the above embodiments.
Claims (10)
1. A data extraction method, characterized by comprising:
obtaining HyperText Markup Language (HTML) text;
extracting data of a preset type from the HTML text according to a preset content extraction rule;
generating structured data according to the data of the preset type extracted from the HTML text.
2. The method according to claim 1, characterized in that extracting the data of the preset type from the HTML text according to the preset content extraction rule comprises:
locating, in the HTML text, the tag position corresponding to the data of the preset type using the XML Path Language (XPATH), and extracting the data of the preset type at that tag position of the HTML text; and/or
obtaining, in the HTML text, the text within a preset range, and extracting the data of the preset type from the text within the preset range; and/or
performing, for the data of the preset type, a full-text search on the HTML text so as to extract the data of the preset type from the HTML text.
3. The method according to claim 2, characterized in that the data of the preset type comprise: title data, content data, time data, source data and/or publisher data.
4. The method according to claim 3, characterized in that extracting the data of the preset type from the text within the preset range comprises:
if multiple pieces of time data satisfying a preset condition are extracted from the text within the preset range, scoring each piece of time data using preset time keywords, and retaining the time data with the highest score.
5. The method according to claim 3 or 4, characterized in that, before generating the structured data, the method further comprises:
performing time zone conversion on the time data after the time data is extracted.
6. The method according to claim 3, characterized in that extracting the data of the preset type from the HTML text according to the preset content extraction rule comprises:
extracting a list of meta-information (META) element nodes from the HTML text;
querying the META element node list for a source description node, and extracting the source data from the source description node.
7. The method according to claim 3, characterized in that extracting the data of the preset type from the HTML text according to the preset content extraction rule comprises:
retrieving a preset source keyword in the HTML text;
extracting the source data according to the position of the source keyword.
8. The method according to claim 3, characterized in that retrieving the preset source keyword in the HTML text comprises:
retrieving the preset source keyword at a preset position of the HTML text.
9. A data extraction device, characterized in that the data extraction device comprises a processor and a memory; the processor is configured to execute a data extraction program stored in the memory so as to implement the data extraction method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the data extraction method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810375770.2A CN108874870A (en) | 2018-04-24 | 2018-04-24 | A kind of data pick-up method, equipment and computer can storage mediums |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810375770.2A CN108874870A (en) | 2018-04-24 | 2018-04-24 | A kind of data pick-up method, equipment and computer can storage mediums |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108874870A (en) | 2018-11-23 |
Family
ID=64326715
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810375770.2A Pending CN108874870A (en) | 2018-04-24 | 2018-04-24 | A kind of data pick-up method, equipment and computer can storage mediums |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108874870A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101600A (en) * | 2007-07-10 | 2008-01-09 | 北京大学 | Metadata automatic extraction method based on multiple rule in network search |
CN101470728A (en) * | 2007-12-25 | 2009-07-01 | 北京大学 | Method and device for automatically abstracting text of Chinese news web page |
US20110302486A1 (en) * | 2010-06-03 | 2011-12-08 | Beijing Ruixin Online System Technology Co., Ltd | Method and apparatus for obtaining the effective contents of web page |
CN102750390A (en) * | 2012-07-05 | 2012-10-24 | 翁时锋 | Automatic news webpage element extracting method |
CN103064827A (en) * | 2013-01-16 | 2013-04-24 | 盘古文化传播有限公司 | Method and device for extracting webpage content |
CN106326314A (en) * | 2015-07-07 | 2017-01-11 | 腾讯科技(深圳)有限公司 | Web page information extraction method and device |
Non-Patent Citations (1)
Title |
---|
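Pei Donghui (裴东辉): "Research on Chinese News Event Extraction Methods" (中文新闻事件抽取方法研究), China Master's Theses Full-text Database, Information Science and Technology (monthly), Computer Software and Computer Applications * |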
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110874428A (en) * | 2019-11-11 | 2020-03-10 | 汉口北进出口服务有限公司 | Structured data extraction device and method for e-commerce page and readable storage medium |
CN111831460A (en) * | 2020-06-30 | 2020-10-27 | 江西科技学院 | Text copying and pasting method and system and readable storage medium |
CN111831460B (en) * | 2020-06-30 | 2023-06-16 | 江西科技学院 | Text copying and pasting method, system and readable storage medium |
CN112069775A (en) * | 2020-08-21 | 2020-12-11 | 完美世界控股集团有限公司 | Data conversion method and device, storage medium and electronic device |
CN116484831A (en) * | 2023-02-22 | 2023-07-25 | 北京麦克斯泰科技有限公司 | Multi-dimension-based release time identification method and device |
CN116484831B (en) * | 2023-02-22 | 2024-03-12 | 北京麦克斯泰科技有限公司 | Multi-dimension-based release time identification method and device |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Sun et al. | Dom based content extraction via text density | |
US7519621B2 (en) | Extracting information from Web pages | |
CN104268148B (en) | A kind of forum page Information Automatic Extraction method and system based on time string | |
CN102521248B (en) | Network user classification method and device | |
US20090319449A1 (en) | Providing context for web articles | |
US20090063538A1 (en) | Method for normalizing dynamic urls of web pages through hierarchical organization of urls from a web site | |
CN108874870A (en) | A kind of data pick-up method, equipment and computer can storage mediums | |
TWI695277B (en) | Automatic website data collection method | |
US20090248707A1 (en) | Site-specific information-type detection methods and systems | |
CN111190900B (en) | JSON data visualization optimization method in cloud computing mode | |
Peters et al. | Content extraction using diverse feature sets | |
CN108920434A (en) | A kind of general Web page subject method for extracting content and system | |
CN100444591C (en) | Method for acquiring front-page keyword and its application system | |
JP2008515049A (en) | Displaying search results based on document structure | |
CN103530429B (en) | Webpage content extracting method | |
CN111104801B (en) | Text word segmentation method, system, equipment and medium based on website domain name | |
CN103678412A (en) | Document retrieval method and device | |
CN109165373B (en) | Data processing method and device | |
CN112818200A (en) | Data crawling and event analyzing method and system based on static website | |
CN110377796A (en) | Text extracting method, device, equipment and storage medium based on dom tree | |
CN106372232B (en) | Information mining method and device based on artificial intelligence | |
CN107145591A (en) | A kind of effective content metadata extracting method of webpage based on title | |
WO2017000659A1 (en) | Enriched uniform resource locator (url) identification method and apparatus | |
CN104778232B (en) | Searching result optimizing method and device based on long query | |
US20090216739A1 (en) | Boosting extraction accuracy by handling training data bias |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181123 |