CN107480134A - A data processing method and system
- Publication number: CN107480134A
- Application number: CN201710630757.2A
- Authority: CN (China)
- Prior art keywords: record, webpage, preset, web page, key field
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/205—Parsing
- G06F40/30—Semantic analysis
Abstract
The present invention provides a data processing method comprising the following steps: collecting web pages from a preset data source; determining the web page category to which each collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages that the preset data source contains; extracting effective information from the collected web page using the wrapper corresponding to its web page category, wherein the wrapper is generated according to the attributes of the object described by web pages of that category; and converting the extracted effective information into a preset standard format and storing it. The invention can effectively process voluminous, cluttered network data into usable data and thereby increase the value of network data.
Description
Technical field
The invention belongs to the field of data processing, and in particular relates to a data processing method and system.
Background
The basic purpose of data processing is to extract and derive, from large amounts of disordered and hard-to-understand data, the data that is valuable and meaningful to particular people. Data processing is a basic link in systems engineering and automatic control, and it runs through every field of social production and social life. The development of data processing technology, and the breadth and depth of its application, have greatly influenced the course of human social development.
Data processing can detect and correct identifiable errors in data files in a timely manner; it mainly includes checking data consistency and handling invalid values and missing values. The data in a data warehouse is a subject-oriented collection that is extracted from multiple business systems and includes historical data, so it is unavoidable that some of the data is wrong and some of it conflicts with other data. Such erroneous or conflicting data is called "dirty data". If dirty data is left unprocessed, it interferes with the real value of the data and therefore reduces its usefulness. Prior-art data processing methods mainly target structured data from databases. With the rapid development of computer network technology, however, large amounts of valuable network data are being generated, most of which is semi-structured or unstructured, and the prior art lacks an effective data processing method for network data.
Summary of the invention
In view of the above technical problem, the present invention provides a data processing method and system that can process network data in order to extract effective information.
The technical solution adopted by the present invention is as follows.
One aspect of the present invention provides a data processing method comprising the following steps: collecting web pages from a preset data source; determining the web page category to which each collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages that the preset data source contains; extracting effective information from the collected web page using the wrapper corresponding to its web page category, wherein the wrapper is generated according to the attributes of the object described by web pages of that category; and converting the extracted effective information into a preset standard format and storing it.
Optionally, before effective information is extracted from the collected web page using the wrapper corresponding to the web page category, the method further includes: extracting, according to the attributes of the object described by web pages of that category, the key fields corresponding to those attributes from the text contained in the collected web page; and generating the wrapper corresponding to the web page category based on the extracted key fields.
Optionally, the wrapper is used to define, for the corresponding web page category, a semantic feature identifier and a context feature identifier. The semantic feature identifier identifies text that matches the semantic features of a key field; the context feature identifier identifies text that matches the context features of a key field. Extracting effective information from the collected web page using the wrapper corresponding to the web page category specifically includes: for each extracted key field, using the corresponding semantic feature identifier to determine, from the text contained in the collected web page, the text that matches the semantic features of that key field; and using the corresponding context feature identifier to identify, among the text that matches the semantic features, the text that matches the context features of that key field, and taking that text as the text value of the key field.
Optionally, the wrapper is also used to define the standard format of the text value of each key field. Converting the extracted effective information into a preset standard format and storing it specifically includes: for each key field, converting the text value of that key field into the corresponding preset standard format and storing it.
Optionally, before the extracted effective information is stored, the method further includes: performing data standardization on the key fields in the extracted effective information that describe the same object, so as to eliminate preset conflicts between key fields that can be converted into one another; the preset conflicts include naming conflicts and format conflicts.
Optionally, before the extracted effective information is stored, the method further includes: for each record in the extracted effective information, determining whether the record is an incomplete record according to the degree to which the text values of its key fields are missing, and handling each incomplete record so determined according to preset processing rules; and performing duplicate record detection on the extracted effective information using a preset algorithm and, for the duplicate records detected, retaining one record for storage. Here, each group of text values corresponding to a set of key fields that describe the same object is called a record.
Optionally, performing duplicate record detection using a preset algorithm specifically includes: for any two records to be checked, determining, for each key field, the edit distance between the text values of that key field in the two records; if the edit distance between any pair of corresponding text values is greater than a preset field similarity threshold, determining that the two records are not duplicate records; if no edit distance between corresponding text values exceeds the preset field similarity threshold, computing a weighted sum of the edit distances using the preset weight of each key field, and judging whether the quotient of that weighted sum and the sum of the weights is less than a preset record similarity threshold; and if so, determining that the two records are duplicate records.
Another aspect of the present invention provides a data processing system comprising: a data acquisition module for collecting web pages from a preset data source; a category determination module for determining the web page category to which each collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages that the preset data source contains; an information extraction module for extracting effective information from the collected web page using the wrapper corresponding to its web page category, wherein the wrapper is generated according to the attributes of the object described by web pages of that category; and an information processing module for converting the extracted effective information into a preset standard format and storing it.
A further aspect of the present invention provides a data processing device comprising a memory, a processor, and a computer program that is stored in the memory and can run on the processor; the processor implements the steps of the above method when it executes the computer program.
A further aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; the steps of the above method are performed when the computer program is run by a processor.
In the data processing method and system provided by the present invention, web pages are first collected from a preset data source and the web page category of each collected web page is determined; effective information is then extracted from the collected web page using the wrapper corresponding to that category, and the extracted effective information is converted into a preset standard format and stored. Voluminous, cluttered network data is thereby effectively processed into usable data, which increases the value of network data.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the data processing method provided by one embodiment of the present invention;
Fig. 2 is a schematic flowchart of the data processing method provided by another embodiment of the present invention;
Fig. 3 is a structural block diagram of the data processing system provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of the data processing device provided by an embodiment of the present invention.
Detailed description
To make the technical problem to be solved by the present invention, its technical solution, and its advantages clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic flowchart of the data processing method provided by one embodiment of the present invention. As shown in Fig. 1, the data processing method provided by this embodiment comprises the following steps:
S101: collect web pages from a preset data source.
S102: determine the web page category to which each collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages that the preset data source contains.
S103: extract effective information from the collected web page using the wrapper corresponding to its web page category, wherein the wrapper is generated according to the attributes of the object described by web pages of that category.
S104: convert the extracted effective information into a preset standard format and store it.
In the data processing method provided by this embodiment, World Wide Web (web) pages are first collected from a preset data source and the web page category of each collected web page is determined; effective information is then extracted from the collected web page using the wrapper corresponding to that category, and the extracted effective information is converted into a preset standard format and stored. Voluminous, cluttered network data is thereby effectively processed into usable data, which increases the value of network data.
Another embodiment of the present invention provides a data processing method which, as shown in Fig. 2, comprises the following steps:
S201: collect web pages from a preset data source.
S202: determine the web page category to which each web page collected in step S201 belongs, wherein the web page categories are divided according to the different objects described by the web pages that the preset data source contains.
S203: according to the attributes of the object described by web pages of the category determined in step S202, extract the key fields corresponding to those attributes from the text contained in the web page collected in step S201.
S204: generate the wrapper corresponding to the web page category based on the key fields extracted in step S203.
S205: extract effective information from the web page collected in step S201 using the wrapper generated in step S204.
S206: convert the effective information extracted in step S205 into a preset standard format and store it.
Before the extracted effective information is stored, the following steps may also be performed:
Step 1: perform data standardization on the key fields that describe the same object in the effective information extracted in step S205, so as to eliminate preset conflicts between key fields that can be converted into one another. The preset conflicts include naming conflicts and format conflicts.
Step 2: for each record in the effective information extracted in step S205, determine whether the record is an incomplete record according to the degree to which the text values of its key fields are missing, and handle each incomplete record so determined according to preset processing rules. Here, each group of text values corresponding to a set of key fields that describe the same object is called a record.
Step 3: perform duplicate record detection on the effective information extracted in step S205 using a preset algorithm and, for the duplicate records detected, retain one record for storage.
Steps 1 to 3 above need not be performed in any strict order.
Further, in step S201 above, web pages can be collected from the preset data source with various collection tools; the preset data source can be a network data source specified according to business requirements. In one exemplary embodiment of the present invention, a web information collector or a protocol processor can be used for collection.
When a web information collector is used, the collector starts from an initial set of Uniform Resource Locators (URLs) and places all of these URLs into an ordered queue of pages to be collected. The collector takes URLs from this queue in order, obtains the pages they point to by means of the protocols used on the web, extracts new URLs from the pages it has obtained, and adds them to the queue. It then repeats this process until it stops collecting according to its own policy. In this way the required web pages can be collected from the preset data source.
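The queue-driven collection loop described above can be sketched as follows. This is only an illustrative sketch, not the collector used by the invention: the function name `crawl`, the caller-supplied `fetch` function, the `max_pages` stop policy, and the simple `href` pattern are all assumptions introduced for the example.

```python
import re
from collections import deque

# Hypothetical link pattern: only absolute http(s) URLs in href attributes.
HREF_RE = re.compile(r'href="(https?://[^"]+)"')

def crawl(seed_urls, fetch, max_pages=100):
    """Collect pages breadth-first, starting from an initial URL set.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that maps a URL to the page's HTML text, or None on failure.
    """
    queue = deque(seed_urls)   # ordered queue of URLs still to be collected
    seen = set(seed_urls)      # avoids queueing the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:   # stop policy: a page budget
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue           # unreachable page: skip it
        pages[url] = html
        for link in HREF_RE.findall(html):    # new URLs found in this page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client; in practice it would wrap whatever download routine the collector uses.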
When a protocol processor is used, data collection is carried out mainly through various protocols. In general, the protocols may include the Hypertext Transfer Protocol (HTTP), the File Transfer Protocol (FTP), Gopher, the Bulletin Board System (BBS), and so on. Taking HTTP as an example, the collection steps may include:
(1) Extract the target site address and port number from the page URL; if no port number is given, use the default port 80. Check the connection mode configured for the site: if it is set to direct connection, establish a network connection to that address and port; if it is set to connect through a proxy server (proxy), establish a network connection to the specified proxy address and port.
(2) If the network connection cannot be established, the site is unreachable; stop fetching the page and discard it. Otherwise, continue to the next step to obtain the specified page.
(3) Assemble the HTTP request header for the page, inserting the user ID and password into the request header if the site requires them, and send the request to the target site. If no response is received within a certain time, stop fetching the page and discard it; otherwise, continue to the next step to process the response.
(4) Analyze the response header and check the returned status code. If the status code is 2xx, the correct page has been returned; go to step (5). If the status code is 301 or 302, the page has been redirected; extract the new target URL from the response header and go to step (3). If any other status code is returned, the page connection has failed; stop fetching the page and discard it.
(5) Extract page information such as the date, length, and page type from the response header. If page fetch restrictions have been configured, apply the necessary checks and filtering and discard pages that do not meet the requirements.
(6) Read the content of the page.
The required web pages can be collected through steps (1) to (6) above.
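The address handling of step (1) and the status-code decision of step (4) can be sketched in a few lines. This is an illustration under stated assumptions rather than the protocol processor itself: the function names `target_address` and `next_action`, the tuple representation of a proxy, and the plain-dictionary response headers (with an exactly spelled `Location` key) are hypothetical.

```python
from urllib.parse import urlsplit

def target_address(page_url, proxy=None):
    """Step (1): return the (host, port) to connect to.

    `proxy` is an optional (host, port) tuple; when it is given, the
    connection goes to the proxy instead of the target site.
    """
    if proxy is not None:
        return proxy
    parts = urlsplit(page_url)
    return parts.hostname, parts.port or 80   # default port 80 if none given

def next_action(status, headers):
    """Step (4): decide what to do from the response status code."""
    if 200 <= status < 300:
        return ("read", None)                     # continue with steps (5) and (6)
    if status in (301, 302) and "Location" in headers:
        return ("redirect", headers["Location"])  # go back to step (3)
    return ("discard", None)                      # any other code: give up
```

A caller would loop on `next_action`, re-sending the request while the action is `"redirect"`, ideally with a cap on the number of redirects so that a redirect cycle cannot stall collection.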
Further, in step S202 above, the web page category of each collected web page is determined according to the object described by the web page collected in step S201. For example, if the object described by a crawled web page is used car information, the web page can be classified as a used car page; if the object described by a crawled web page is second-hand home furnishing information, the web page can be classified as a second-hand home furnishing page; and so on.
Further, in step S203 above, the key fields corresponding to the attributes of the object described by web pages of the category determined in step S202 (for example, the domain covered, the application, and the intended users) can be extracted from the text contained in the collected web page. For a used car page, for example, the attributes may include brand, mileage, price, production year, and color, so the key fields corresponding to these attributes can be extracted from the text contained in the collected used car page. Any key field extraction method in current use may be employed; the present invention places no particular limitation on this.
Further, in step S204 above, the wrapper corresponding to the web page category can be generated based on the key fields extracted in step S203; that is, different wrappers are used for different web page categories. Each wrapper provides an interface for its web page category so that it can be associated with that category. In one example, the wrapper used by the present invention defines, for the corresponding web page category, a semantic feature identifier and a context feature identifier, and also defines the standard format of the text value of each key field. The semantic feature identifier identifies text that matches the semantic features of a key field; the context feature identifier identifies text that matches the context features of a key field.
Further, in step S205 above, effective information is extracted from the collected web page using the wrapper generated in step S204. The wrapper applies its defined information extraction rules to pull information out of the web page and convert it into information described in a usable, specific format. This may include the following steps:
First step: for each key field extracted in step S203, use the corresponding semantic feature identifier to determine, from the text contained in the collected web page, the text that matches the semantic features of that key field. This step is a preliminary screening; it does not resolve matching conflicts, and the text it identifies serves as the candidate text values of the corresponding key field.
Second step: use the corresponding context feature identifier to identify, among the text that matches the semantic features, the text that matches the context features of the key field, and take that text as the text value of the key field. This step screens the result of the previous step further in order to determine the correct text value among the candidate text values obtained there. The text values obtained in this step are the effective information extracted from the collected web page.
Take a collected car sales website as an example. According to the attributes of the object described by the car sales website (brand, mileage, price, production year, and color), the key fields corresponding to these attributes are extracted from the text contained in the collected web page. Suppose the web page to be processed contains the following content: "Nissan SE-V6, 1997, red, camper shell, vehicle cover, CD, cruise, AC, tires good, mileage only 117K, looks fine but runs rough, blue book: $7,415, charge: $5,900 obo."
Further, the wrapper corresponding to the web page category is generated based on the extracted key fields. The wrapper can use regular expressions to define the identifiers of all data values, including the semantic feature identifiers and the context feature identifiers. For example, a numeric value is a number of 3 to 6 digits that may include a comma before the last three digits and whose first digit cannot be 0. For the price feature, a left context and a right context can also be defined. The left context can specify that a legal price value must immediately follow a word boundary, that a currency symbol (such as "$") may appear to the left of the value, and that whitespace may appear between the symbol and the value; the right context specifies that the value must be followed by a word boundary. Keywords can also be defined for a field: words or phrases that may appear near the value (not necessarily adjacent to it) and that indicate during data extraction that the value is valid. To improve expressive power, dictionaries can also be embedded in the regular expressions of the semantic feature identifiers and context feature identifiers. For example, a dictionary "carmake.dict" containing various brands can be embedded for brand recognition.
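As a rough illustration of such identifiers, the following sketch expresses the numeric value rule, the price context, and an embedded brand dictionary as Python regular expressions. The names `NUMBER_RE`, `PRICE_RE`, and `BRAND_RE`, the three-entry brand list standing in for "carmake.dict", and the sample text `AD` are assumptions made for the example; the patent does not give the actual expressions.

```python
import re

# Sample page text (the used car advertisement from the example).
AD = ("Nissan SE-V6, 1997, red, camper shell, vehicle cover, CD, cruise, AC, "
      "tires good, mileage only 117K, looks fine but runs rough, "
      "blue book: $7,415, charge: $5,900 obo.")

# Semantic feature: 3 to 6 digits, optional comma before the last three
# digits, first digit not 0.
NUMBER = r"(?:[1-9]\d{0,2},?\d{3}|[1-9]\d{2})"
NUMBER_RE = re.compile(rf"(?<!\d){NUMBER}(?!\d)")

# Price context: starts at a word boundary, optional "$" and blanks on the
# left, word boundary on the right. Group 1 records whether "$" was present.
PRICE_RE = re.compile(rf"(?<![\w,])(\$\s*)?({NUMBER})\b")

# Brand identifier with an embedded dictionary (a hypothetical stand-in
# for the contents of "carmake.dict").
BRANDS = ["Nissan", "Toyota", "Honda"]
BRAND_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, BRANDS)) + r")\b")

number_candidates = NUMBER_RE.findall(AD)
price_candidates = [m.group(2) for m in PRICE_RE.finditer(AD)]
brand_candidates = BRAND_RE.findall(AD)
```

On the sample text, the semantic pattern alone accepts all four numbers, while the price context already rejects "117" because the trailing "K" leaves no word boundary on its right.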
First, for each extracted key field, the corresponding semantic feature identifier is used to determine, from the text contained in the collected web page, the text that matches the semantic features of that key field. The result is shown in Table 1 below:
Table 1: Text identified by the semantic feature identifiers
Then, the corresponding context feature identifier is used to identify, among the text that matches the semantic features, the text that matches the context features of each key field, and that text is taken as the text value of the key field. "Nissan" is the only text extracted by the brand identifier, so the wrapper can take "Nissan" as the text value of brand; likewise, it takes "red" as the text value of color.
The difficulty in the identification process is deciding whether "1997" is the text value of the production year, the mileage, or the price, and which of "117", "7415", and "5900" are text values of the price and which are text values of the mileage. To resolve this, the wrapper can order the candidate text values according to the priority constraints that the extracted semantic features carry.
The first heuristic rule gives priority to a field whose left and right context expressions match the text immediately adjacent to a recognized candidate. By this rule, price (because of "$") and mileage (because of "K") are ranked ahead of production year. To decide further between price and mileage, the second heuristic rule checks whether a keyword expression of the field appears near a candidate text value. By this rule, price is ranked ahead of mileage, because mileage has no keyword match whereas price has one, "charge".
Once the ordering is fixed, the wrapper lets each key field in turn accept as many text values as its constraints allow, so that price, mileage, and production year each receive at most one value. To determine which text value belongs to price, the wrapper discards "1997" and "117" because they do not satisfy the left and right context constraints of price, whereas "7415" and "5900" are preceded by "$". The wrapper then selects "5900" rather than "7415" as the text value of price, because "5900" has the price keyword "charge" in its surrounding text. The remaining candidates for mileage are "1997", "117", and "7415"; the wrapper takes "117" as the text value of mileage because it is the only candidate that satisfies the mileage context constraint, the trailing "K". Finally, the wrapper takes "1997" as the text value of the production year, because it falls within no other field's keyword constraints. This processing identifies the text that matches the context features of each key field and assigns it as the text value of that key field.
Finally, the wrapper can normalize the text values determined for the key fields, for example normalizing 117K to 117000.
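The disambiguation and normalization walk-through above can be reproduced with a small sketch. It is a simplified illustration, not the wrapper's actual rule engine: the function name `extract`, the requirement that a price candidate carry "$", and the 15-character window used to look for the keyword "charge" are assumptions chosen so that the example stays short.

```python
import re

AD = ("Nissan SE-V6, 1997, red, camper shell, vehicle cover, CD, cruise, AC, "
      "tires good, mileage only 117K, looks fine but runs rough, "
      "blue book: $7,415, charge: $5,900 obo.")

# One pattern that records the context of every numeric candidate:
# group 1 = leading "$", group 2 = the number, group 3 = trailing "K".
CAND_RE = re.compile(r"(\$\s*)?(?<!\d)([1-9]\d{0,2},?\d{3}|[1-9]\d{2})(?!\d)(K)?")

def to_int(text):
    return int(text.replace(",", ""))

def extract(text, price_keyword="charge", window=15):
    cands = [
        {"value": m.group(2),
         "dollar": bool(m.group(1)),
         "k": bool(m.group(3)),
         # keyword heuristic: the keyword occurs shortly before the candidate
         "keyword": price_keyword in text[max(0, m.start() - window):m.start()]}
        for m in CAND_RE.finditer(text)
    ]
    record = {}
    # Price: needs "$"; a candidate with a nearby keyword is preferred.
    priced = sorted((c for c in cands if c["dollar"]), key=lambda c: not c["keyword"])
    if priced:
        record["price"] = to_int(priced[0]["value"])
        cands.remove(priced[0])
    # Mileage: the candidate with a trailing "K", normalized 117K -> 117000.
    miles = [c for c in cands if c["k"]]
    if miles:
        record["mileage"] = to_int(miles[0]["value"]) * 1000
        cands.remove(miles[0])
    # Production year: a remaining candidate with no "$" or "K" context.
    years = [c for c in cands if not c["dollar"] and not c["k"]]
    if years:
        record["year"] = to_int(years[0]["value"])
    return record

record = extract(AD)
```

Processing the fields in the order price, mileage, year mirrors the ranking produced by the two heuristic rules, and each field accepts at most one value.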
Further, in step S206, the effective information extracted in step S205 can be converted into a preset standard format, such as XML, and stored. Before the effective information converted into the preset standard format is stored, the following processing can be applied to it:
(1) Data standardization
Data standardization is applied to the key fields that describe the same object, so as to eliminate preset conflicts between key fields that can be converted into one another; the preset conflicts include naming conflicts, format conflicts, and the like. For example, field names such as "sex" and "gender" are unified to the standard name "sex". Field values can be converted as well: for example, the value sets "man | woman", "0 | 1", and "male | female" are unified to "male | female". XML is a self-describing language and can serve as a general data interface standard. The structure and content model of an XML document can be defined and described by an XML Schema; that is, the schema can define which elements exist in the XML document and the relationships between them, and can define the data types of elements and attributes. A Schema document can represent both the structure of the corresponding XML and its semantic information. The XML documents corresponding to data collected from different data sources can each have their own XML Schema, and the different XML Schemas can in turn be mapped to one standard global XML Schema. This mapping can be implemented with a mapping table or with a mapping function. All XML documents are then converted according to the standard XML Schema and stored in a data warehouse or knowledge base, which completes the standardization of the data.
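The mapping-table approach can be sketched as follows. The tables `FIELD_NAMES` and `VALUE_MAPS`, the function names, and the `<record>` root element are hypothetical, and the assignment of "0" to "male" and "1" to "female" is an assumption inferred from the order in which the values are listed above.

```python
import xml.etree.ElementTree as ET

# Mapping table for naming conflicts: source field name -> standard name.
FIELD_NAMES = {"sex": "sex", "gender": "sex"}

# Mapping tables for format conflicts, keyed by standard field name.
VALUE_MAPS = {
    "sex": {"man": "male", "woman": "female",
            "0": "male", "1": "female",       # assumed encoding
            "male": "male", "female": "female"},
}

def standardize(record):
    """Rename fields and convert values according to the mapping tables."""
    out = {}
    for name, value in record.items():
        std_name = FIELD_NAMES.get(name.lower(), name)
        std_value = VALUE_MAPS.get(std_name, {}).get(str(value).lower(), value)
        out[std_name] = std_value
    return out

def to_xml(record):
    """Serialize a standardized record as a flat XML element."""
    root = ET.Element("record")
    for name, value in record.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

For instance, `to_xml(standardize({"Gender": "0"}))` yields a `<record>` element containing a single `<sex>` child. Validation against a global XML Schema is outside the scope of this sketch.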
(2) Handling incomplete data
If the value of one or more features of a record in the effective information is empty, the record is considered to contain missing values and to be incomplete data. Incomplete data can be handled as follows:
A) Judge the usability of the data: assess how complete each record is. If too many feature values are missing from a record, or the value of a key field is missing, delete the record.
B) Ignore the value of a missing feature: a weight can be determined for each feature. For a feature with a low weight, for example a weight below 0.3, the feature can be ignored if its value is missing.
C) Fill in the value of a missing feature: for a feature with a high weight, for example a weight above 0.5, the value can be filled in if it is missing. Any of the following methods can generally be used: constant substitution (all missing feature values are filled with the same constant); statistical methods (the data is analyzed to obtain statistics of the data set, and the missing values are filled using these statistics); estimation methods (a suitable algorithm, such as decision tree induction, predicts the probable value of the missing feature, and the missing value is filled with the prediction); and classification methods.
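Rules A) to C) might be combined as in the sketch below, which uses constant substitution for filling. The weights, the choice of `brand` as the key field, the `max_missing` limit, and the fill constant are all hypothetical values chosen for illustration; the text does not specify how features with weights between 0.3 and 0.5 are treated, so the sketch leaves them untouched.

```python
# Hypothetical feature weights and key fields for the used car example.
WEIGHTS = {"brand": 0.9, "price": 0.8, "mileage": 0.6, "color": 0.2}
KEY_FIELDS = {"brand"}

def handle_incomplete(record, max_missing=2, fill_value="unknown"):
    """Return a cleaned copy of `record`, or None if it should be deleted."""
    missing = [f for f in WEIGHTS if record.get(f) in (None, "")]
    # A) too many missing values, or a key field is missing: delete.
    if len(missing) > max_missing or any(f in KEY_FIELDS for f in missing):
        return None
    cleaned = dict(record)
    for f in missing:
        if WEIGHTS[f] < 0.3:
            cleaned.pop(f, None)      # B) low weight: ignore the feature
        elif WEIGHTS[f] > 0.5:
            cleaned[f] = fill_value   # C) high weight: constant substitution
    return cleaned
```

A statistical or model-based fill would replace the `fill_value` assignment with a lookup into per-field statistics or a call to a trained predictor.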
(3) Duplicate record detection
Duplicate record detection can be performed using a preset algorithm, and for the duplicate records detected, one record is retained for storage. In an illustrative example of the present invention, duplicate records can be detected with the following algorithm:
For any two records to be checked, the edit distance between the text values of the same key field in the two records is determined for each key field. An edit distance algorithm can be used for this. For example, for the color values "red" and "red color" obtained from the car sales website above, the edit distance between the two can be 1 (in the original Chinese, the two values differ by a single character).
If the edit distance between any pair of corresponding text values is greater than the preset field similarity threshold, the two records are determined not to be duplicate records. The preset field similarity threshold can be determined according to the actual situation; the present invention places no particular limitation on it.
If the editing distance between arbitrarily corresponding textual value is no more than preset field similarity threshold, according to each key
Weight information is preset corresponding to word field, summation is weighted to each editing distance;Judgement obtains and value and each weight and value
Between business whether be less than preset recording similarity threshold;If, it is determined that this two records to be detected do not record for repetition.Respectively
Weight information and preset recording similarity threshold are preset corresponding to key field to be determined based on actual conditions, the present invention
And it is not specially limited.
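The two-stage decision described above can be sketched as follows. The sketch takes the per-field edit distances as input and reads the second stage as "a weighted average below the record threshold means duplicate"; that reading, the function name, and the parameter names are assumptions made for illustration.

```python
def is_duplicate(distances, weights, field_threshold, record_threshold):
    """Decide whether two records are duplicates from per-field edit distances.

    distances: {field: edit distance between the two records' text values}
    weights:   {field: preset weight of that field}
    """
    # Stage 1: any single field that differs too much rules out a duplicate.
    if any(d > field_threshold for d in distances.values()):
        return False
    # Stage 2: weighted average of the distances (weighted sum / sum of weights).
    total_weight = sum(weights[f] for f in distances)
    if total_weight == 0:
        raise ValueError("at least one field must have a non-zero weight")
    weighted = sum(weights[f] * d for f, d in distances.items())
    # Duplicate when the weighted average is below the record threshold.
    return weighted / total_weight < record_threshold
```

Dividing by the sum of the weights keeps the record threshold comparable across web page categories that use different numbers of key fields.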
When duplicate records are detected by the above algorithm, for example when the two records shown in Table 2 below are found to be similar duplicate records, only one record in the standard format is retained and stored.
Table 2: Detected duplicate records
Brand | Price | Color | Mileage |
---|---|---|---|
Nissan | 5900 | Red | 117000 |
Nissan | 5900 | Red color | 117000 |
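Retaining one record per group of duplicates can be sketched as below. The duplicate test is passed in as a caller-supplied predicate, and the sample predicate `same_car` is a deliberately simplified, hypothetical stand-in for the edit-distance algorithm; the rows are modeled on Table 2.

```python
def deduplicate(records, is_duplicate):
    """Keep one representative from every group of duplicate records.

    `is_duplicate(a, b)` is any caller-supplied predicate; the first record
    seen in each group is the one retained.
    """
    kept = []
    for record in records:
        if not any(is_duplicate(record, other) for other in kept):
            kept.append(record)
    return kept


def same_car(a, b):
    """Hypothetical predicate: records agree on every field except color."""
    return all(a[k] == b[k] for k in ("brand", "price", "mileage"))


rows = [
    {"brand": "Nissan", "price": "5900", "color": "Red", "mileage": "117000"},
    {"brand": "Nissan", "price": "5900", "color": "Red color", "mileage": "117000"},
]
print(deduplicate(rows, same_car))  # only the first row remains
```

The pairwise comparison is quadratic in the number of records, which is acceptable for a sketch but would likely need blocking or indexing on a large data set.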
It should be noted that the effective information processed in (2) and (3) above can also be stored in a data warehouse or a knowledge base. The data warehouse or knowledge base can provide a unified schema and a general access interface, so that users can easily obtain the information they need.
Based on the same inventive concept, an embodiment of the present invention further provides a data processing system. Since the principle by which the system solves the problem is similar to that of the foregoing data processing method, the implementation of the system may refer to the implementation of the foregoing method, and repeated description is omitted.
As shown in Figure 3, an embodiment of the present invention provides a data processing system, including:
Data acquisition module 301, configured to collect web pages from a preset data source;
Category determination module 302, configured to determine the web page category to which a collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages that the preset data source contains;
Information extraction module 303, configured to extract effective information from the collected web page using the wrapper corresponding to the web page category, wherein the wrapper is generated according to the attributes of the object described by the web pages of that category;
Information processing module 304, configured to convert the extracted effective information into a preset standard format and store it.
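One possible way to picture how the four modules cooperate is the small pipeline sketch below. Each module is modeled as a plain callable; the function name `run_pipeline`, the choice to skip pages whose category has no wrapper, and the use of a list as the store are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def run_pipeline(
    sources: Iterable[str],
    collect: Callable[[str], List[str]],           # module 301: source -> pages
    classify: Callable[[str], str],                # module 302: page -> category
    wrappers: Dict[str, Callable[[str], Record]],  # module 303: one wrapper per category
    process: Callable[[Record], Record],           # module 304: convert to standard format
) -> List[Record]:
    """Chain the four modules; pages whose category has no wrapper are skipped."""
    stored: List[Record] = []
    for source in sources:
        for page in collect(source):
            wrapper = wrappers.get(classify(page))
            if wrapper is None:
                continue
            stored.append(process(wrapper(page)))
    return stored
```

Keeping one wrapper per category in a dictionary mirrors the idea that the category decides which extraction rules apply to a page.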
Further, before extracting effective information from the collected web page using the wrapper corresponding to the web page category, the information extraction module 303 is further configured to:
Extract, according to the attributes of the object described by the web pages of the web page category, the key fields corresponding to the attributes from the text information contained in the collected web page; and
Generate the wrapper corresponding to the web page category based on the extracted key fields.
Further, the wrapper is used to define a semantic feature recognizer and a contextual feature recognizer for the corresponding web page category;
The semantic feature recognizer is used to recognize, according to the semantic feature of a key field, text that conforms to the semantic feature;
The contextual feature recognizer is used to recognize, according to the contextual feature of a key field, text that conforms to the contextual feature;
Extracting effective information from the collected web page using the wrapper corresponding to the web page category specifically includes:
For each extracted key field, determining, according to the semantic feature of the key field and using the corresponding semantic feature recognizer, the text that conforms to the semantic feature from the text information contained in the collected web page; and
Recognizing, using the corresponding contextual feature recognizer, the text that conforms to the contextual feature of the key field from the text that conforms to the semantic feature, and taking it as the text value corresponding to the key field.
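The two-step extraction can be illustrated with a simplified sketch. It assumes that a semantic feature can be expressed as a regular expression for the shape of the value, and that a contextual feature can be approximated by a label appearing shortly before the value; both simplifications, and all names below, are hypothetical.

```python
import re
from typing import Optional


def semantic_candidates(text: str, pattern: str):
    """Semantic recognizer: every span of `text` that has the shape expected
    for the field's value (here expressed as a regular expression)."""
    return list(re.finditer(pattern, text))


def contextual_pick(text: str, candidates, label: str, window: int = 20) -> Optional[str]:
    """Contextual recognizer: the first candidate whose preceding `window`
    characters contain the field's label."""
    for match in candidates:
        context = text[max(0, match.start() - window):match.start()]
        if label.lower() in context.lower():
            return match.group()
    return None


def extract_field(text: str, pattern: str, label: str) -> Optional[str]:
    """Apply the semantic recognizer, then the contextual recognizer."""
    return contextual_pick(text, semantic_candidates(text, pattern), label)
```

For a page text such as "Nissan Sentra 2012. Price: 5900 USD. Mileage: 117000 km.", the pattern `\d+` matches three numbers, and the label then selects which of them belongs to the price or the mileage field.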
Further, the wrapper is also used to define the standard format of the text value corresponding to each key field;
Converting the extracted effective information into a preset standard format and storing it specifically includes:
For each key field, converting the text value corresponding to the key field into the corresponding preset standard format and storing it.
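As a hedged sketch of per-field standard formats, the code below attaches one converter to each key field. The field names, the converters, and the table `STANDARD_FORMATS` are hypothetical examples, and the integer converter deliberately ignores decimal values.

```python
import re


def to_int(text: str) -> int:
    """Strip everything except digits, e.g. '5,900 USD' -> 5900.
    Decimal fractions are not handled in this sketch."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits in {text!r}")
    return int(digits)


# Hypothetical per-field converters that a wrapper might declare.
STANDARD_FORMATS = {
    "price": to_int,
    "mileage": to_int,
    "color": lambda text: text.strip().lower(),
}


def standardize(record: dict) -> dict:
    """Convert each text value with its field's converter; fields without a
    declared format are passed through unchanged."""
    return {
        field: STANDARD_FORMATS.get(field, lambda v: v)(value)
        for field, value in record.items()
    }
```

Declaring the converters next to the key fields keeps the format rules of a web page category in one place.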
Further, before storing the extracted effective information, the information processing module 304 is further configured to:
Perform data normalization on the key fields that describe the same object in the extracted effective information, so as to eliminate preset conflicts between key fields whose representations are mutually convertible; the preset conflicts include naming conflicts and format conflicts.
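The following sketch illustrates one way the two kinds of conflict could be resolved. It assumes that a naming conflict means different field names for the same attribute and that a format conflict means different units for the same quantity; the alias table, the `mileage_unit` field, and the function names are hypothetical.

```python
# Hypothetical alias table: variant field names -> canonical name.
FIELD_ALIASES = {"cost": "price", "colour": "color", "odometer": "mileage"}


def resolve_naming(record: dict) -> dict:
    """Naming conflicts: rename variant key fields to their canonical name."""
    return {FIELD_ALIASES.get(name, name): value for name, value in record.items()}


def resolve_format(record: dict) -> dict:
    """Format conflicts: express mileage in kilometres regardless of the
    unit the source page used (assumed to be given in 'mileage_unit')."""
    out = dict(record)
    unit = out.pop("mileage_unit", "km")
    if "mileage" in out and unit == "mi":
        out["mileage"] = round(out["mileage"] * 1.609344)
    return out


def normalize(record: dict) -> dict:
    """Resolve naming conflicts first so that format rules see canonical names."""
    return resolve_format(resolve_naming(record))
```

Running the naming step before the format step means the unit conversion only has to know the canonical field name.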
Further, before storing the extracted effective information, the information processing module 304 is further configured to:
For each record in the extracted effective information, determine whether the record is an incomplete record according to the degree to which the text values corresponding to the key fields in the record are missing; and, for each record determined to be incomplete, process the incomplete record according to a preset processing rule;
Perform duplicate record detection on the extracted effective information using a preset algorithm, and, for the duplicate records detected, retain one record and store it;
Wherein each group of text values corresponding to a set of key fields that describe the same object is called a record.
Further, the duplicate record detection performed by the information processing module 304 using the preset algorithm specifically includes:
For any two records to be detected, determining, for each key field, the edit distance between the text values corresponding to the same key field in the two records;
If the edit distance between any pair of corresponding text values is greater than the preset field similarity threshold, determining that the two records to be detected are not duplicate records;
If none of the edit distances between corresponding text values exceeds the preset field similarity threshold, summing the edit distances with weights according to the preset weight information corresponding to each key field; judging whether the quotient of the resulting weighted sum and the sum of the weights is less than the preset record similarity threshold; and if so, determining that the two records to be detected are duplicate records.
An embodiment of the present invention further provides a data processing device. As shown in Figure 4, the device includes a memory 401, a processor 402, and a computer program that is stored in the memory 401 and can run on the processor 402, wherein the processor 402 implements the steps of the above data processing method when executing the computer program.
Specifically, the memory 401 and the processor 402 can be a general-purpose memory and processor, which are not specifically limited here. When the processor 402 runs the computer program stored in the memory 401, it can execute the above data processing method, thereby solving the problem in the related art that network data is not processed effectively.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program performs the steps of the above data processing method.
Specifically, the storage medium can be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, it can execute the above data processing method, thereby solving the problem in the related art that network data is not processed effectively.
The functions of the above units may correspond to the respective processing steps in the flows shown in Figures 1 to 2 and are not repeated here.
Those skilled in the art should understand that the embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
Although preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application. Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from their spirit and scope. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
Claims (10)
1. A data processing method, characterized in that it comprises the following steps:
Collecting web pages from a preset data source;
Determining the web page category to which a collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages that the preset data source contains;
Extracting effective information from the collected web page using the wrapper corresponding to the web page category, wherein the wrapper is generated according to the attributes of the object described by the web pages of that category;
Converting the extracted effective information into a preset standard format and storing it.
2. The data processing method according to claim 1, characterized in that, before effective information is extracted from the collected web page using the wrapper corresponding to the web page category, the method further comprises:
Extracting, according to the attributes of the object described by the web pages of the web page category, the key fields corresponding to the attributes from the text information contained in the collected web page; and
Generating the wrapper corresponding to the web page category based on the extracted key fields.
3. The data processing method according to claim 2, characterized in that the wrapper is used to define a semantic feature recognizer and a contextual feature recognizer for the corresponding web page category;
The semantic feature recognizer is used to recognize, according to the semantic feature of a key field, text that conforms to the semantic feature;
The contextual feature recognizer is used to recognize, according to the contextual feature of a key field, text that conforms to the contextual feature;
Extracting effective information from the collected web page using the wrapper corresponding to the web page category specifically includes:
For each extracted key field, determining, according to the semantic feature of the key field and using the corresponding semantic feature recognizer, the text that conforms to the semantic feature from the text information contained in the collected web page; and
Recognizing, using the corresponding contextual feature recognizer, the text that conforms to the contextual feature of the key field from the text that conforms to the semantic feature, and taking it as the text value corresponding to the key field.
4. The data processing method according to claim 3, characterized in that the wrapper is also used to define the standard format of the text value corresponding to each key field;
Converting the extracted effective information into a preset standard format and storing it specifically includes:
For each key field, converting the text value corresponding to the key field into the corresponding preset standard format and storing it.
5. The data processing method according to any one of claims 2-4, characterized in that, before the extracted effective information is stored, the method further comprises:
Performing data normalization on the key fields that describe the same object in the extracted effective information, so as to eliminate preset conflicts between key fields whose representations are mutually convertible; the preset conflicts include naming conflicts and format conflicts.
6. The data processing method according to any one of claims 2-4, characterized in that, before the extracted effective information is stored, the method further comprises:
For each record in the extracted effective information, determining whether the record is an incomplete record according to the degree to which the text values corresponding to the key fields in the record are missing; and, for each record determined to be incomplete, processing the incomplete record according to a preset processing rule;
Performing duplicate record detection on the extracted effective information using a preset algorithm, and, for the duplicate records detected, retaining one record and storing it;
Wherein each group of text values corresponding to a set of key fields that describe the same object is called a record.
7. The data processing method according to claim 6, characterized in that performing duplicate record detection using the preset algorithm specifically includes:
For any two records to be detected, determining, for each key field, the edit distance between the text values corresponding to the same key field in the two records;
If the edit distance between any pair of corresponding text values is greater than the preset field similarity threshold, determining that the two records to be detected are not duplicate records;
If none of the edit distances between corresponding text values exceeds the preset field similarity threshold, summing the edit distances with weights according to the preset weight information corresponding to each key field; judging whether the quotient of the resulting weighted sum and the sum of the weights is less than the preset record similarity threshold; and if so, determining that the two records to be detected are duplicate records.
8. A data processing system, characterized in that it comprises:
A data acquisition module, configured to collect web pages from a preset data source;
A category determination module, configured to determine the web page category to which a collected web page belongs, wherein the web page categories are divided according to the different objects described by the web pages that the preset data source contains;
An information extraction module, configured to extract effective information from the collected web page using the wrapper corresponding to the web page category, wherein the wrapper is generated according to the attributes of the object described by the web pages of that category;
An information processing module, configured to convert the extracted effective information into a preset standard format and store it.
9. A data processing device, comprising a memory, a processor, and a computer program that is stored in the memory and can run on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when run by a processor, performs the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710630757.2A CN107480134A (en) | 2017-07-28 | 2017-07-28 | A kind of data processing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107480134A (en) | 2017-12-15 |
Family
ID=60596830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710630757.2A Pending CN107480134A (en) | 2017-07-28 | 2017-07-28 | A kind of data processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480134A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7444325B2 (en) * | 2005-01-14 | 2008-10-28 | Im2, Inc. | Method and system for information extraction |
CN101350019A (en) * | 2008-06-20 | 2009-01-21 | 浙江大学 | Method for abstracting web page information based on vector model between predefined slots |
CN101464905A (en) * | 2009-01-08 | 2009-06-24 | 中国科学院计算技术研究所 | Web page information extraction system and method |
CN104281703A (en) * | 2014-10-22 | 2015-01-14 | 小米科技有限责任公司 | Method and device for calculating similarity among uniform resource locators (URL) |
Non-Patent Citations (1)
Title |
---|
贺令亚,柳佳刚: "基于Web的包装器技术的现状与发展", 《电脑开发与应用》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109343993A (en) * | 2018-09-28 | 2019-02-15 | 郑州云海信息技术有限公司 | A kind of error message processing method and processing device of cloud platform |
CN109885532A (en) * | 2019-02-11 | 2019-06-14 | 中国银行股份有限公司 | A kind of transaction data standardized method and device |
CN110572435B (en) * | 2019-08-05 | 2022-02-11 | 慧镕电子系统工程股份有限公司 | Data processing method of cloud computing system |
CN110572435A (en) * | 2019-08-05 | 2019-12-13 | 慧镕电子系统工程股份有限公司 | Data processing method of cloud computing system |
CN110781655A (en) * | 2019-10-29 | 2020-02-11 | 深圳前海环融联易信息科技服务有限公司 | Data acquisition method and device for title column, computer equipment and storage medium |
CN110825944A (en) * | 2019-10-29 | 2020-02-21 | 深圳前海环融联易信息科技服务有限公司 | Webpage table data acquisition method and device, computer equipment and storage medium |
CN110795654A (en) * | 2019-10-29 | 2020-02-14 | 深圳前海环融联易信息科技服务有限公司 | Webpage data display method and device, computer equipment and storage medium |
CN110781655B (en) * | 2019-10-29 | 2023-10-27 | 深圳前海环融联易信息科技服务有限公司 | Data acquisition method and device for title column, computer equipment and storage medium |
CN111143554A (en) * | 2019-12-10 | 2020-05-12 | 中盈优创资讯科技有限公司 | Data sampling method and device based on big data platform |
CN111143554B (en) * | 2019-12-10 | 2024-03-12 | 中盈优创资讯科技有限公司 | Data sampling method and device based on big data platform |
CN113536754A (en) * | 2020-04-21 | 2021-10-22 | 阿里巴巴集团控股有限公司 | Text generation method and device and electronic equipment |
CN113536754B (en) * | 2020-04-21 | 2024-06-25 | 阿里巴巴集团控股有限公司 | Text generation method and device and electronic equipment |
CN111935231A (en) * | 2020-07-13 | 2020-11-13 | 支付宝(杭州)信息技术有限公司 | Information processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100070, No. 101-8, building 1, 31, zone 188, South Fourth Ring Road, Beijing, Fengtai District; Applicant after: Guoxin Youyi Data Co., Ltd; Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing; Applicant before: SIC YOUE DATA Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171215 |