CN107480134A - Data processing method and system - Google Patents

Publication number
CN107480134A
Authority
CN
China
Prior art keywords
record
webpage
preset
web page
key field
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710630757.2A
Other languages
Chinese (zh)
Inventor
陈进宝
刘希
唐妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoxin Youe Data Co Ltd
Original Assignee
Guoxin Youe Data Co Ltd
Application filed by Guoxin Youe Data Co Ltd
Priority to CN201710630757.2A
Publication of CN107480134A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data processing method comprising the following steps: collecting web pages from a preset data source; determining the page category to which each collected web page belongs, where the page categories are divided according to the different objects described by the web pages that the preset data source contains; extracting effective information from the collected web page using the wrapper corresponding to its page category, where the wrapper is generated according to the attributes of the object described by the web pages of that category; and converting the extracted effective information into a preset standard format and storing it. The invention can effectively process verbose network data into usable data and thereby increase the value of network data.

Description

Data processing method and system
Technical field
The invention belongs to the field of data processing, and in particular relates to a data processing method and system.
Background
The basic purpose of data processing is to extract and derive, from large amounts of disordered and hard-to-understand data, data that is valuable and meaningful to particular people. Data processing is a basic link in systems engineering and automatic control, and it runs through every field of social production and social life. The development of data processing technology, and the breadth and depth of its application, have greatly influenced the course of human social development.
Data processing can detect and correct identifiable errors in data files in a timely manner; it mainly includes checking data consistency and handling invalid and missing values. The data in a data warehouse is a collection of data oriented to a particular subject, extracted from multiple business systems and including historical data. It is therefore unavoidable that some of the data is wrong and some of the data conflicts with other data; such erroneous or conflicting data is called "dirty data". If dirty data is not processed, it distorts the real value of the data and in turn reduces the value that can be obtained from it. Prior-art data processing methods mainly target structured data from databases. With the rapid development of computer network technology, however, a large amount of valuable network data has been generated, most of which is semi-structured or unstructured, and the prior art lacks an effective data processing method for network data.
Summary of the invention
In view of the above technical problem, the present invention provides a data processing method and system that can process network data in order to extract effective information.
The technical solution adopted by the present invention is as follows:
One aspect of the present invention provides a data processing method comprising the following steps: collecting web pages from a preset data source; determining the page category to which each collected web page belongs, where the page categories are divided according to the different objects described by the web pages that the preset data source contains; extracting effective information from the collected web page using the wrapper corresponding to its page category, where the wrapper is generated according to the attributes of the object described by the web pages of that category; and converting the extracted effective information into a preset standard format and storing it.
Optionally, before extracting effective information from the collected web page using the wrapper corresponding to the page category, the method further includes: extracting, according to the attributes of the object described by the web pages of the page category, the key fields corresponding to those attributes from the text information contained in the collected web page; and generating the wrapper corresponding to the page category based on the extracted key fields.
Optionally, the wrapper is used to define a semantic feature recognizer and a contextual feature recognizer for the corresponding page category. The semantic feature recognizer identifies text that matches the semantic features of a key field; the contextual feature recognizer identifies text that matches the contextual features of a key field. Extracting effective information from the collected web page using the wrapper corresponding to the page category specifically includes: for each extracted key field, using the corresponding semantic feature recognizer to determine, from the text information contained in the collected web page, the text that matches the semantic features of the key field; and using the corresponding contextual feature recognizer to identify, from the text that matches the semantic features, the text that matches the contextual features of the key field, and taking that text as the text value of the key field.
Optionally, the wrapper is further used to define the standard format of the text value of each key field. Converting the extracted effective information into a preset standard format and storing it specifically includes: for each key field, converting the text value of the key field into the corresponding preset standard format and storing it.
Optionally, before the extracted effective information is stored, the method further includes: performing data standardization on the key fields in the extracted effective information that describe the same object, in order to eliminate preset conflicts between key fields whose representations can be converted into one another; the preset conflicts include naming conflicts and format conflicts.
Optionally, before the extracted effective information is stored, the method further includes: for each record in the extracted effective information, determining whether the record is an incomplete record according to the degree to which the text values of the key fields it contains are missing, and handling each record determined to be incomplete according to preset processing rules; and performing duplicate record detection on the extracted effective information using a preset algorithm and, for the duplicate records detected, retaining one record for storage. Here, each group of text values corresponding to a set of key fields that describe the same object is called a record.
Optionally, performing duplicate record detection using a preset algorithm specifically includes: for any two records to be compared, determining the edit distance between the text values of each shared key field in the two records; if the edit distance between any pair of corresponding text values is greater than a preset field similarity threshold, determining that the two records are not duplicate records; if no edit distance between corresponding text values exceeds the preset field similarity threshold, computing a weighted sum of the edit distances using the preset weight of each key field, and judging whether the quotient of that weighted sum and the sum of the weights is less than a preset record similarity threshold; if so, determining that the two records are duplicate records.
Another aspect of the present invention provides a data processing system including: a data acquisition module for collecting web pages from a preset data source; a category determination module for determining the page category to which each collected web page belongs, where the page categories are divided according to the different objects described by the web pages that the preset data source contains; an information extraction module for extracting effective information from the collected web page using the wrapper corresponding to its page category, where the wrapper is generated according to the attributes of the object described by the web pages of that category; and an information processing module for converting the extracted effective information into a preset standard format and storing it.
Yet another aspect of the invention provides a data processing device including a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the steps of the above method when it executes the computer program.
Yet another aspect of the invention provides a computer-readable storage medium on which a computer program is stored; the steps of the above method are performed when the computer program is run by a processor.
In the data processing method and system provided by the invention, the method first collects web pages from a preset data source and determines the page category to which each collected web page belongs. It then extracts effective information from the collected web page using the wrapper corresponding to that page category, converts the extracted effective information into a preset standard format, and stores it. Verbose network data is thereby processed effectively into usable data, which increases the value of network data.
Brief description of the drawings
Fig. 1 is a flowchart of the data processing method provided by one embodiment of the invention;
Fig. 2 is a flowchart of the data processing method provided by another embodiment of the invention;
Fig. 3 is a block diagram of the data processing system provided by an embodiment of the invention;
Fig. 4 is a block diagram of the data processing device provided by an embodiment of the invention.
Detailed description
To make the technical problem to be solved by the invention, its technical solution and its advantages clearer, a detailed description is given below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of the data processing method provided by one embodiment of the invention. As shown in Fig. 1, the data processing method provided by this embodiment comprises the following steps:
S101: collect web pages from a preset data source.
S102: determine the page category to which each collected web page belongs, where the page categories are divided according to the different objects described by the web pages that the preset data source contains.
S103: extract effective information from the collected web page using the wrapper corresponding to its page category, where the wrapper is generated according to the attributes of the object described by the web pages of that category.
S104: convert the extracted effective information into a preset standard format and store it.
The data processing method provided by this embodiment first collects World Wide Web (web) pages from a preset data source and determines the page category to which each collected web page belongs. It then extracts effective information from the collected web page using the wrapper corresponding to that page category, converts the extracted effective information into a preset standard format, and stores it. Verbose network data is thereby processed effectively into usable data, which increases the value of network data.
In another embodiment, the invention provides a data processing method which, as shown in Fig. 2, comprises the following steps:
S201: collect web pages from a preset data source.
S202: determine the page category to which each web page collected in step S201 belongs, where the page categories are divided according to the different objects described by the web pages that the preset data source contains.
S203: according to the attributes of the object described by the web pages of the page category determined in step S202, extract the key fields corresponding to those attributes from the text information contained in the web page collected in S201.
S204: generate the wrapper corresponding to the page category based on the key fields extracted in step S203.
S205: extract effective information from the web page collected in S201 using the wrapper generated in step S204 for the page category.
S206: convert the effective information extracted in step S205 into a preset standard format and store it.
Before the extracted effective information is stored, the following steps can also be performed:
Step 1: perform data standardization on the key fields in the effective information extracted in step S205 that describe the same object, in order to eliminate preset conflicts between key fields whose representations can be converted into one another. The preset conflicts include naming conflicts and format conflicts.
Step 2: for each record in the effective information extracted in step S205, determine whether the record is an incomplete record according to the degree to which the text values of the key fields it contains are missing, and handle each record determined to be incomplete according to preset processing rules. Here, each group of text values corresponding to a set of key fields that describe the same object is called a record.
Step 3: perform duplicate record detection on the effective information extracted in step S205 using a preset algorithm and, for the duplicate records detected, retain one record for storage.
Steps 1 to 3 do not have to be performed in a strict order.
Further, in step S201, web pages can be collected from the preset data source with various collection tools; the preset data source can be a network data source specified according to business requirements. In an exemplary embodiment of the invention, a web information collector or a protocol processor can be used for collection.
When a web information collector is used, the collector starts from an initial set of uniform resource locators (URLs) and places all of these URLs into an ordered queue of pages to be collected. The collector takes URLs from this queue in order, obtains the pages they point to using web protocols, extracts new URLs from the pages it has obtained, and appends them to the queue. It repeats this process until it stops collecting according to its own strategy. In this way, the required web pages can be collected from the preset data source.
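The queue-driven collection loop described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `crawl` function, the `max_pages` stop strategy, the regular expression used to find links, and the in-memory `FAKE_WEB` dictionary that stands in for real page fetches are all assumptions made for the example.

```python
from collections import deque
import re

# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

HREF_RE = re.compile(r'href="([^"]+)"')  # simplistic link extraction (assumption)


def crawl(seed_urls, fetch, max_pages=100):
    """Collect pages starting from a seed URL set, using an ordered queue."""
    queue = deque(seed_urls)   # ordered queue of URLs waiting to be collected
    seen = set(seed_urls)      # URLs already queued, so none is collected twice
    pages = {}
    while queue and len(pages) < max_pages:  # stop strategy: a page budget
        url = queue.popleft()
        html = fetch(url)
        if html is None:       # page could not be obtained: skip it
            continue
        pages[url] = html
        for link in HREF_RE.findall(html):   # extract new URLs from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


pages = crawl(["http://example.com/"], FAKE_WEB.get)
```

Passing the fetch function as a parameter keeps the queue logic independent of the transport; a real collector would replace `FAKE_WEB.get` with a function that retrieves the page over the network.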
When a protocol processor is used, data collection is carried out mainly through various protocols. In general, the protocols may include the Hypertext Transfer Protocol (HTTP), the File Transfer Protocol (FTP), Gopher, bulletin board systems (BBS), and so on. Taking HTTP as an example, the collection steps may include:
(1) Extract the address and port number of the target site from the page URL; if no port number is given, use the default port 80. Check the connection mode configured for the site: if it is set to connect directly, establish a network connection to that address and port; if it is set to connect through a proxy server, establish a network connection to the specified proxy address and port.
(2) If the network connection cannot be established, the site is unreachable; stop fetching the page and discard it. Otherwise, continue to the next step to obtain the specified page.
(3) Assemble the HTTP request header for the page, inserting the user ID and password into the request header if the site requires them, and send the request to the target site. If no response message is received within a certain time, stop fetching the page and discard it; otherwise, continue to the next step and process the response message.
(4) Analyze the response header and check the returned status code. If the status code is 2xx, the correct page has been returned; go to step (5). If the status code is 301 or 302, the page has been redirected; extract the new target URL from the response header and go to step (3). If any other status code is returned, the page request has failed; stop fetching the page and discard it.
(5) Extract page information such as date, length and page type from the response header. If page fetching restrictions have been configured, perform the necessary checks and filtering and discard pages that do not meet the requirements.
(6) Read the content of the page.
The required web pages can be collected by steps (1) to (6).
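The decisions in steps (1) and (4) can be sketched as two small pure functions. This is an illustrative sketch under stated assumptions: the function names `target_of` and `next_action`, the tuple return values, and the plain `dict` used for response headers (with an exact-case `Location` key) are hypothetical simplifications; no network connection is opened.

```python
from urllib.parse import urlsplit, urljoin


def target_of(url, proxy=None):
    """Step (1): choose the host and port to connect to; default to port 80."""
    if proxy is not None:
        return proxy                       # (host, port) of the configured proxy
    parts = urlsplit(url)
    return parts.hostname, parts.port or 80


def next_action(url, status, headers):
    """Step (4): decide what to do with a response based on its status code."""
    if 200 <= status < 300:
        return ("read", url)               # correct page returned: go on to step (5)
    if status in (301, 302) and "Location" in headers:
        # Redirected: resolve the new target URL and request it again (step (3)).
        return ("redirect", urljoin(url, headers["Location"]))
    return ("discard", url)                # any other status: give up on the page
```

For example, `target_of("http://example.com/cars")` yields the host with port 80, while a URL that names port 8080 keeps that port. A real implementation would also need to treat header names case-insensitively and limit the number of redirects it follows.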
Further, in step S202, the page category of each collected web page is determined according to the object described by the web page collected in step S201. For example, if the object described by a crawled web page is used car information, the page can be assigned to the used car category; if the object described by a crawled web page is second-hand home furnishing information, the page can be assigned to the second-hand home furnishing category, and so on.
Further, in step S203, the key fields corresponding to the attributes of the object described by the web pages of the page category determined in step S202 (for example, the domain covered, the application, and the intended users) can be extracted from the text information contained in the collected web page. For a used car page, for example, the attributes may include brand, mileage, price, year of manufacture and color, so the key fields corresponding to these attributes can be extracted from the text information contained in the collected used car page. Any key field extraction method in current use can be applied; the invention places no particular limitation on this.
Further, in step S204, the wrapper corresponding to the page category can be generated based on the key fields extracted in step S203; that is, different page categories use different wrappers. Each wrapper defines an interface for its page category, which associates the wrapper with that category. In one example, the wrapper used by the invention defines a semantic feature recognizer and a contextual feature recognizer for the corresponding page category, and also defines the standard format of the text value of each key field. The semantic feature recognizer identifies text that matches the semantic features of a key field; the contextual feature recognizer identifies text that matches the contextual features of a key field.
Further, in step S205, effective information is extracted from the collected web page using the wrapper generated in step S204. The wrapper can apply the information extraction rules it defines to extract the information in the web page and convert it into information described in a specific, usable format. This may specifically include the following steps:
First step: for each key field extracted in step S203, use the corresponding semantic feature recognizer to determine, from the text information contained in the collected web page, the text that matches the semantic features of the key field. This step is a preliminary screening; it does not resolve matching conflicts, and the text it identifies serves as the candidate text values of the corresponding key field.
Second step: use the corresponding contextual feature recognizer to identify, from the text that matches the semantic features, the text that matches the contextual features of the key field, and take that text as the text value of the key field. This step screens the results of the previous step further in order to determine the correct text value among the candidate text values. The text values obtained in this step are the effective information extracted from the collected web page.
Take a collected car sales website as an example. Based on the attributes of the object described by the site (brand, mileage, price, year of manufacture and color), the key fields corresponding to these attributes are extracted from the text information contained in the collected web page. Suppose the web page to be extracted contains the following content: "Nissan SE-V6, 1997, red, camper shell, car cover, CD, cruise, AC, good tires, mileage only 117K, looks fine but runs rough, blue book: $7,415, charge: $5,900 obo."
Next, the wrapper corresponding to the page category is generated based on the extracted key fields. The wrapper can define the recognizers for all data values, including the semantic feature recognizers and contextual feature recognizers, using regular expressions. For example, a numeric value is a number of 3 to 6 digits that may include a comma before the last three digits and whose first digit cannot be 0. For the price attribute, a left context and a right context can also be defined: the left context can specify that a valid price value must follow a word boundary, that a currency symbol (such as $) may appear to the left of the value, and that whitespace may separate the symbol from the value; the right context specifies that the value must be followed by a word boundary. Keywords are also defined, that is, words or phrases that may appear near the value (not necessarily adjacent to it) and that indicate during extraction that the value is valid. To increase expressive power, dictionaries can also be embedded in the regular expressions of the semantic and contextual feature recognizers. For example, a dictionary "carmake.dict" containing the various brands can be embedded for brand recognition.
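The recognizers described above might be written as regular expressions along the following lines. This is a sketch only: the exact patterns, the recognizer names, the abbreviated sample text, and the inline `CAR_MAKES` list (a hypothetical stand-in for the contents of "carmake.dict") are assumptions rather than the patent's definitions.

```python
import re

# Semantic feature: a number of 3 to 6 digits, with an optional comma before
# the last three digits, whose first digit is not 0.
NUMBER = r"(?:[1-9]\d{0,2},?\d{3}|[1-9]\d{2})"
number_recognizer = re.compile(rf"(?<!\d){NUMBER}(?!\d)")

# Contextual feature for price: a "$" to the left (whitespace allowed between
# the symbol and the value) and a word boundary to the right.
price_recognizer = re.compile(rf"(?<!\w)\$\s*({NUMBER})\b")

# Dictionary embedded in a regular expression for brand recognition.
CAR_MAKES = ["Nissan", "Toyota", "Honda"]  # hypothetical dictionary entries
brand_recognizer = re.compile(
    r"\b(?:" + "|".join(map(re.escape, CAR_MAKES)) + r")\b"
)

text = ("Nissan SE-V6, 1997, red, mileage only 117K, "
        "blue book: $7,415, charge: $5,900 obo.")

numbers = number_recognizer.findall(text)  # candidate values for all numeric fields
prices = price_recognizer.findall(text)    # numbers that also satisfy the price context
brands = brand_recognizer.findall(text)
```

The semantic recognizer deliberately over-matches: it returns every plausible number, and the contextual recognizer narrows the candidates afterwards, which mirrors the two-step screening in the text.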
First, for each extracted key field, the corresponding semantic feature recognizer is used to determine, from the text information contained in the collected web page, the text that matches the semantic features of the key field. The result is shown in Table 1 below:
Table 1: Text identified by the semantic feature recognizers
Then, the corresponding contextual feature recognizer is used to identify, from the text that matches the semantic features, the text that matches the contextual features of the key field, and that text is taken as the text value of the key field. "Nissan" is the only value extracted by the brand recognizer, so the wrapper can take "Nissan" as the text value of brand; likewise, it takes "red" as the text value of color.
The problem in the recognition process is whether "1997" is the text value of the year of manufacture, the mileage or the price, and which of "117", "7415" and "5900" are text values of the price and which of the mileage. To resolve this, the wrapper can order the candidate text values according to precedence constraints derived from the extracted features. The first heuristic gives priority to the fields whose left and right context expressions match the text directly adjacent to a recognized value. This places price (because of "$") and mileage (because of "K") before year of manufacture. To decide between price and mileage, the second heuristic uses the keyword expressions matched near the candidate text values. This places price before mileage, because mileage has no keyword match whereas price has one, "charge".
Once the candidate text values have been ordered, the wrapper lets each key field in turn accept as many text values as its constraints allow, which is at most one value each for price, mileage and year of manufacture. To determine which text value belongs to price, the wrapper discards "1997" and "117" because they do not satisfy the left and right context constraints of price, whereas "7415" and "5900" are both preceded by "$". The wrapper then selects "5900" rather than "7415" as the text value of price, because the price keyword "charge" appears in the text next to "5900". The remaining candidates for mileage are "1997", "117" and "7415"; the wrapper takes "117" as the text value of mileage, because "117" is the only one that satisfies the mileage context constraint, the suffix "K". Finally, the wrapper takes "1997" as the text value of the year of manufacture, because it does not fall within the constraints of any other key field. In this way, the text that matches the contextual features of each key field is identified and taken as the text value of that key field.
Finally, the wrapper can normalize the text values determined for the key fields, for example normalizing 117K to 117000.
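The worked example above can be approximated in code as follows. This is a simplified sketch, not the wrapper the patent describes: the `candidates`, `to_int` and `assign` helpers, the 15-character keyword window, and the fixed rule order (price, then mileage, then year) are assumptions chosen so that the example stays short.

```python
import re

AD = ("Nissan SE-V6, 1997, red, camper shell, car cover, CD, cruise, AC, "
      "good tires, mileage only 117K, looks fine but runs rough, "
      "blue book: $7,415, charge: $5,900 obo.")

NUMBER = re.compile(r"(?<!\d)(?:[1-9]\d{0,2},?\d{3}|[1-9]\d{2})(?!\d)")


def candidates(text):
    """Semantic step: collect every number together with its local context."""
    found = []
    for m in NUMBER.finditer(text):
        found.append({
            "text": m.group(),
            "has_dollar": text[:m.start()].rstrip().endswith("$"),
            "has_k": text[m.end():m.end() + 1].upper() == "K",
            # Keyword window of 15 characters to the left (an assumption).
            "near_keyword": "charge" in text[max(0, m.start() - 15):m.start()],
        })
    return found


def to_int(candidate):
    """Normalize a text value, e.g. '117' followed by K -> 117000."""
    value = int(candidate["text"].replace(",", ""))
    return value * 1000 if candidate["has_k"] else value


def assign(text):
    """Contextual step: fields pick candidates in precedence order."""
    pool = candidates(text)
    rules = [  # (field, context constraint, tie-break rank: lower is better)
        ("price", lambda c: c["has_dollar"], lambda c: not c["near_keyword"]),
        ("mileage", lambda c: c["has_k"], lambda c: 0),
        ("year", lambda c: not c["has_dollar"] and not c["has_k"], lambda c: 0),
    ]
    result = {}
    for field, accepts, rank in rules:
        matching = sorted((c for c in pool if accepts(c)), key=rank)
        if matching:                      # each field accepts at most one value
            result[field] = to_int(matching[0])
            pool.remove(matching[0])
    return result


extracted = assign(AD)
```

Under these assumptions, `extracted` maps price to 5900, mileage to 117000 and year to 1997, matching the outcome described in the text.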
Further, in step S206, the effective information extracted in step S205 can be converted into a preset standard format, such as XML, and stored. Before the effective information converted into the preset standard format is stored, the following processing can be applied to it:
(1) Data standardization
Data standardization is performed on the key fields that describe the same object, in order to eliminate preset conflicts between key fields whose representations can be converted into one another; the preset conflicts include naming conflicts, format conflicts and so on. For example, field names such as "sex" and "gender" are unified to the standard name "sex". Field values can be converted as well: representations such as "man | female", "0 | 1" and "male | female" are unified to a single standard representation. XML is a self-describing language and can serve as a general data interface standard. The structure and content model of an XML document can be defined and described by an XML Schema; that is, the schema defines which elements exist in the XML document and the relationships between them, as well as the data types of elements and attributes. A schema document can represent both the structure of the corresponding XML and its semantic information. The XML documents corresponding to data collected from different data sources can each have their own XML Schema, and these different schemas are in turn mapped to one standard global XML Schema. The mapping can be implemented with a mapping table or with a mapping function. All XML documents are converted according to the standard XML Schema and then stored in the data warehouse or knowledge base, which completes the standardization of the data.
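A mapping table of the kind mentioned above can be sketched with two dictionaries, one for field names and one for field values. The table contents are assumptions for illustration; in particular, the reading of "0" as male and "1" as female is hypothetical, since the text does not say which code denotes which value.

```python
# Naming conflicts: source field name -> standard field name.
FIELD_NAME_MAP = {"sex": "sex", "gender": "sex"}

# Format conflicts: standard field name -> {source value -> standard value}.
# The 0/1 coding below is an assumption made for this example.
FIELD_VALUE_MAP = {
    "sex": {
        "0": "male", "man": "male", "male": "male",
        "1": "female", "woman": "female", "female": "female",
    }
}


def standardize(record):
    """Rename fields and convert values according to the mapping tables."""
    result = {}
    for name, value in record.items():
        std_name = FIELD_NAME_MAP.get(name.lower(), name)
        value_map = FIELD_VALUE_MAP.get(std_name, {})
        # Values with no mapping entry are kept unchanged.
        result[std_name] = value_map.get(str(value).lower(), value)
    return result
```

For instance, `standardize({"Gender": "0"})` and `standardize({"sex": "Male"})` both produce a record whose `sex` field is `"male"`. The same idea extends to mapping source XML schemas onto a global schema, with element paths as keys instead of flat field names.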
(2) Handling of incomplete data
If the value of one or more features in a record of the effective information is empty, the record is considered to have missing values and to be incomplete data. Incomplete data can be handled in the following ways:
a) Judge the usability of the data: assess how complete each record is. If too many feature values are missing from a record, or the value of a key field is missing, delete the record.
b) Ignore the value of a missing feature: a weight can be determined for each feature. For a feature with a low weight, for example a weight below 0.3, the feature can be ignored if its value is missing.
c) Fill in the value of a missing feature: for a feature with a high weight, for example a weight above 0.5, the value can be filled in if it is missing. Any of the following methods can generally be used: constant substitution (all missing feature values are filled with the same constant); statistical methods (statistics of the data set are derived by analyzing the data and are used to fill the missing values); estimation methods (an algorithm such as decision tree induction predicts the likely value of the missing feature, and the predicted value is used to fill it); and classification methods.
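Rules a) to c) can be combined into a single function, sketched below. The feature weights, the choice of `brand` as the key field, the limit of two missing values, and the use of constant substitution with the placeholder "UNKNOWN" are all hypothetical; only the 0.3 and 0.5 weight thresholds come from the text.

```python
# Hypothetical feature weights and settings for a used car record.
WEIGHTS = {"brand": 0.9, "price": 0.8, "mileage": 0.6, "color": 0.2}
KEY_FIELDS = {"brand"}      # a record is unusable if one of these is missing
MAX_MISSING = 2             # more missing values than this: delete the record
FILL_CONSTANT = "UNKNOWN"   # constant substitution for high-weight features


def handle_incomplete(record):
    """Return the cleaned record, or None if the record should be deleted."""
    missing = [f for f in WEIGHTS if record.get(f) in (None, "")]
    # a) usability check: too many gaps, or a key field is missing
    if len(missing) > MAX_MISSING or any(f in KEY_FIELDS for f in missing):
        return None
    cleaned = dict(record)
    for field in missing:
        if WEIGHTS[field] < 0.3:
            cleaned.pop(field, None)          # b) ignore a low-weight feature
        elif WEIGHTS[field] > 0.5:
            cleaned[field] = FILL_CONSTANT    # c) fill a high-weight feature
    return cleaned
```

The text leaves open what happens to features whose weight lies between 0.3 and 0.5; this sketch leaves such values untouched. A statistical or predictive fill would replace the `FILL_CONSTANT` assignment with a value derived from the rest of the data set.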
(3) Detection of duplicate records
A preset algorithm can be used to detect duplicate records, and for duplicate records that are detected, one record is retained and stored. In an illustrative example of the present invention, duplicate records can be detected with the following algorithm:
For any two records to be detected, determine the edit distance between the text values corresponding to each identical key field in the two records. An edit distance algorithm can be used for this. For example, for the key field values "red" and "red color" obtained from the above-mentioned car sales website, the edit distance between the two can be 1 (measured on the original Chinese text values, which differ by one character).
If the edit distance between any pair of corresponding text values is greater than the preset field similarity threshold, the two records to be detected are determined not to be duplicate records. The preset field similarity threshold can be determined based on actual conditions and is not specifically limited by the present invention.
If the edit distance between every pair of corresponding text values does not exceed the preset field similarity threshold, the edit distances are weighted and summed according to the preset weight information corresponding to each key field. It is then judged whether the quotient of the resulting sum and the sum of the weights is less than the preset record similarity threshold; if so, the two records to be detected are determined to be duplicate records. The preset weight information corresponding to each key field and the preset record similarity threshold can be determined based on actual conditions and are not specifically limited by the present invention.
When duplicate records are detected by the above algorithm, for example when the two records shown in Table 2 below are found to be similar duplicate records, only one record in the standard format is retained and stored.
Table 2: Detected duplicate records
Brand     Price    Color        Mileage
Nissan    5900     Red          117000
Nissan    5900     Red color    117000
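The duplicate check described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the equal weights, and the threshold values are assumptions, and the sketch reads a weighted average distance below the record threshold as meaning "duplicate". The sample color values assume the original Chinese forms 红 and 红色, which differ by one character.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def is_duplicate(rec_a, rec_b, weights, field_threshold, record_threshold):
    """Apply the per-field check first, then the weighted record-level check."""
    distances = {}
    for field in weights:
        d = edit_distance(str(rec_a[field]), str(rec_b[field]))
        if d > field_threshold:
            return False  # one field differs too much: not duplicates
        distances[field] = d
    weighted_sum = sum(weights[f] * distances[f] for f in weights)
    return weighted_sum / sum(weights.values()) < record_threshold

# Hypothetical weights and thresholds for the two records of Table 2.
weights = {"brand": 1.0, "price": 1.0, "color": 1.0, "mileage": 1.0}
a = {"brand": "Nissan", "price": 5900, "color": "红", "mileage": 117000}
b = {"brand": "Nissan", "price": 5900, "color": "红色", "mileage": 117000}
print(is_duplicate(a, b, weights, field_threshold=2, record_threshold=0.5))
```

Checking the per-field threshold first lets clearly different records be rejected before the weighted sum is computed.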
It should be noted that the effective information processed in (2) and (3) above can also be stored in a data warehouse or knowledge base. The data warehouse or knowledge base can provide a unified schema and a general access interface, so that users can easily obtain the information they need.
Based on the same inventive concept, an embodiment of the present invention further provides a data processing system. Because the principle by which the system solves the problem is similar to that of the foregoing data processing method, the implementation of the system may refer to the implementation of the foregoing method, and repeated details are not described again.
As shown in Figure 3, an embodiment of the present invention provides a data processing system, including:
a data acquisition module 301, configured to collect web pages from a preset data source;
a category determination module 302, configured to determine the web page category to which a collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages included in the preset data source;
an information extraction module 303, configured to extract effective information from the collected web page using the wrapper corresponding to the web page category, wherein the wrapper is generated according to the attributes of the object described by the web pages of that category;
an information processing module 304, configured to convert the extracted effective information into a preset standard format and store it.
Further, before extracting effective information from the collected web page using the wrapper corresponding to the web page category, the information extraction module 303 is further configured to:
extract, according to the attributes of the object described by the web pages of the web page category, the key fields corresponding to those attributes from the text information included in the collected web page; and
generate the wrapper corresponding to the web page category based on the extracted key fields.
Further, the wrapper is used to define a semantic feature recognizer and a context feature recognizer for the corresponding web page category;
the semantic feature recognizer is used to identify, according to the semantic feature of a key field, text that matches the semantic feature;
the context feature recognizer is used to identify, according to the context feature of a key field, text that matches the context feature;
extracting effective information from the collected web page using the wrapper corresponding to the web page category specifically includes:
for each extracted key field, using the corresponding semantic feature recognizer to determine, according to the semantic feature of the key field, the text that matches the semantic feature from the text information included in the collected web page; and
using the corresponding context feature recognizer to identify, from the text that matches the semantic feature, the text that matches the context feature of the key field, and taking that text as the text value corresponding to the key field.
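The two-stage extraction can be sketched as follows, under the assumption that both recognizers are regular expressions: the semantic recognizer describes what a value looks like, and the context recognizer describes the label that must immediately precede it. The `WRAPPER` table, its patterns, and the `extract` function are hypothetical; the patent does not specify how the recognizers are implemented.

```python
import re

# Hypothetical wrapper for a "used car listing" page category.
WRAPPER = {
    "brand": {
        "semantic": re.compile(r"[A-Z][a-z]+"),                    # a capitalized word
        "context": re.compile(r"brand\s*:\s*$", re.IGNORECASE),    # preceded by "Brand:"
    },
    "price": {
        "semantic": re.compile(r"\d[\d,]*"),                       # a number
        "context": re.compile(r"price\s*:\s*$", re.IGNORECASE),    # preceded by "Price:"
    },
    "mileage": {
        "semantic": re.compile(r"\d[\d,]*"),
        "context": re.compile(r"mileage\s*:\s*$", re.IGNORECASE),
    },
}

def extract(text, wrapper):
    """Return the first text value per key field that passes both recognizers."""
    result = {}
    for field, recognizers in wrapper.items():
        # Stage 1: candidates that match the semantic feature.
        for match in recognizers["semantic"].finditer(text):
            # Stage 2: keep the candidate only if its left context matches.
            if recognizers["context"].search(text[:match.start()]):
                result[field] = match.group()
                break
    return result

page_text = "Brand: Nissan Price: 5,900 Mileage: 117000 km"
print(extract(page_text, WRAPPER))
```

Both "5,900" and "117000" satisfy the numeric semantic pattern, so the context recognizer is what assigns each number to the right key field.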
Further, the wrapper is also used to define the standard format of the text value corresponding to each key field;
converting the extracted effective information into a preset standard format and storing it specifically includes:
for each key field, converting the text value corresponding to the key field into the corresponding preset standard format and storing it.
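A per-field standard format can be sketched as a table of converter functions. The `STANDARD_FORMATS` table and the `to_standard` function are hypothetical, and the chosen formats (integers for price and mileage, title case for brand) are assumptions for illustration.

```python
# Hypothetical converters from extracted text values to a standard format.
STANDARD_FORMATS = {
    "price": lambda s: int(s.replace(",", "")),    # "5,900" -> 5900
    "mileage": lambda s: int(s.replace(",", "")),
    "brand": lambda s: s.strip().title(),          # " nissan " -> "Nissan"
}

def to_standard(extracted):
    """Convert each text value with its field's converter, or just trim it."""
    return {
        field: STANDARD_FORMATS.get(field, str.strip)(value)
        for field, value in extracted.items()
    }

print(to_standard({"price": "5,900", "brand": " nissan ", "color": " red "}))
```

Fields without a registered converter are only stripped of surrounding whitespace, so unknown key fields are stored rather than rejected.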
Further, before storing the extracted effective information, the information processing module 304 is further configured to:
perform data normalization on the key fields describing the same object in the extracted effective information, so as to eliminate preset conflicts between key fields that can be converted into one another, the preset conflicts including naming conflicts and format conflicts.
Further, before storing the extracted effective information, the information processing module 304 is further configured to:
for every record in the extracted effective information, determine whether the record is an incomplete record according to the degree to which the text values corresponding to the key fields in the record are missing, and process each incomplete record so determined according to a preset processing rule;
perform duplicate record detection on the extracted effective information using a preset algorithm, and for duplicate records that are detected, retain one record for storage;
wherein each group of text values corresponding to a set of key fields describing the same object is referred to as a record.
Further, the information processing module 304 performing duplicate record detection using a preset algorithm specifically includes:
for any two records to be detected, determining the edit distance between the text values corresponding to each identical key field in the two records;
if the edit distance between any pair of corresponding text values is greater than the preset field similarity threshold, determining that the two records to be detected are not duplicate records;
if the edit distance between every pair of corresponding text values does not exceed the preset field similarity threshold, weighting and summing the edit distances according to the preset weight information corresponding to each key field, and judging whether the quotient of the resulting sum and the sum of the weights is less than the preset record similarity threshold; if so, determining that the two records to be detected are duplicate records.
An embodiment of the present invention further provides a data processing device. As shown in Figure 4, the device includes a memory 401, a processor 402, and a computer program stored in the memory 401 and executable on the processor 402, wherein the processor 402 implements the steps of the above data processing method when executing the computer program.
Specifically, the memory 401 and the processor 402 may be a general-purpose memory and processor, which are not specifically limited here. When the processor 402 runs the computer program stored in the memory 401, the above data processing method can be performed, thereby solving the problem in the related art that network data is not processed effectively.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, and the computer program performs the steps of the above data processing method when run by a processor.
Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the above data processing method can be performed, thereby solving the problem in the related art that network data is not processed effectively.
The functions of the above units may correspond to the respective processing steps in the flows shown in Figures 1 and 2, and are not described again here.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
Although preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present application. Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include them.

Claims (10)

1. A data processing method, characterized by comprising the following steps:
collecting web pages from a preset data source;
determining the web page category to which a collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages included in the preset data source;
extracting effective information from the collected web page using the wrapper corresponding to the web page category, wherein the wrapper is generated according to the attributes of the object described by the web pages of that category; and
converting the extracted effective information into a preset standard format and storing it.
2. The data processing method according to claim 1, characterized in that, before extracting effective information from the collected web page using the wrapper corresponding to the web page category, the method further comprises:
extracting, according to the attributes of the object described by the web pages of the web page category, the key fields corresponding to those attributes from the text information included in the collected web page; and
generating the wrapper corresponding to the web page category based on the extracted key fields.
3. The data processing method according to claim 2, characterized in that the wrapper is used to define a semantic feature recognizer and a context feature recognizer for the corresponding web page category;
the semantic feature recognizer is used to identify, according to the semantic feature of a key field, text that matches the semantic feature;
the context feature recognizer is used to identify, according to the context feature of a key field, text that matches the context feature;
extracting effective information from the collected web page using the wrapper corresponding to the web page category specifically comprises:
for each extracted key field, using the corresponding semantic feature recognizer to determine, according to the semantic feature of the key field, the text that matches the semantic feature from the text information included in the collected web page; and
using the corresponding context feature recognizer to identify, from the text that matches the semantic feature, the text that matches the context feature of the key field, and taking that text as the text value corresponding to the key field.
4. The data processing method according to claim 3, characterized in that the wrapper is also used to define the standard format of the text value corresponding to each key field;
converting the extracted effective information into a preset standard format and storing it specifically comprises:
for each key field, converting the text value corresponding to the key field into the corresponding preset standard format and storing it.
5. The data processing method according to any one of claims 2 to 4, characterized in that, before the extracted effective information is stored, the method further comprises:
performing data normalization on the key fields describing the same object in the extracted effective information, so as to eliminate preset conflicts between key fields that can be converted into one another, the preset conflicts including naming conflicts and format conflicts.
6. The data processing method according to any one of claims 2 to 4, characterized in that, before the extracted effective information is stored, the method further comprises:
for every record in the extracted effective information, determining whether the record is an incomplete record according to the degree to which the text values corresponding to the key fields in the record are missing, and processing each incomplete record so determined according to a preset processing rule;
performing duplicate record detection on the extracted effective information using a preset algorithm, and for duplicate records that are detected, retaining one record for storage;
wherein each group of text values corresponding to a set of key fields describing the same object is referred to as a record.
7. The data processing method according to claim 6, characterized in that performing duplicate record detection using a preset algorithm specifically comprises:
for any two records to be detected, determining the edit distance between the text values corresponding to each identical key field in the two records;
if the edit distance between any pair of corresponding text values is greater than the preset field similarity threshold, determining that the two records to be detected are not duplicate records;
if the edit distance between every pair of corresponding text values does not exceed the preset field similarity threshold, weighting and summing the edit distances according to the preset weight information corresponding to each key field, and judging whether the quotient of the resulting sum and the sum of the weights is less than the preset record similarity threshold; if so, determining that the two records to be detected are duplicate records.
8. A data processing system, characterized by comprising:
a data acquisition module, configured to collect web pages from a preset data source;
a category determination module, configured to determine the web page category to which a collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages included in the preset data source;
an information extraction module, configured to extract effective information from the collected web page using the wrapper corresponding to the web page category, wherein the wrapper is generated according to the attributes of the object described by the web pages of that category; and
an information processing module, configured to convert the extracted effective information into a preset standard format and store it.
9. A data processing device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program performs the steps of the method according to any one of claims 1 to 7 when run by a processor.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710630757.2A CN107480134A (en) 2017-07-28 2017-07-28 A kind of data processing method and system


Publications (1)

Publication Number Publication Date
CN107480134A true CN107480134A (en) 2017-12-15

Family

ID=60596830


Country Status (1)

Country Link
CN (1) CN107480134A (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100070, No. 101-8, building 1, 31, zone 188, South Fourth Ring Road, Beijing, Fengtai District

Applicant after: Guoxin Youyi Data Co., Ltd

Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing

Applicant before: SIC YOUE DATA Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20171215