CN101226548A - System and method for abstraction of Web data based on vision - Google Patents


Info

Publication number
CN101226548A
Authority
CN
China
Prior art keywords
module
data
vision
node
record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008100561034A
Other languages
Chinese (zh)
Other versions
CN100590623C (en)
Inventor
孟小峰
刘伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN200810056103A priority Critical patent/CN100590623C/en
Publication of CN101226548A publication Critical patent/CN101226548A/en
Application granted granted Critical
Publication of CN100590623C publication Critical patent/CN100590623C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a Web data extraction system and a method based on vision, wherein the system comprises an input module to input pages containing records; a pre-processing module to pre-process the input pages; a page displaying module to display pages by vision; a data recording and extracting module to extract a complete record from pages; a data item extraction module to decompose each extracted record into data item sequences and to align data sequences with same properties; an output module to output structural data forms.

Description

Vision-based Web data extraction system and method
Technical field
The present invention relates to the field of computer databases, and in particular to a vision-based Web data extraction system and method.
Background art
With the rapid development of the Web, it has come to hold a massive amount of information. By a conservative estimate, the Web as a whole currently exceeds 200,000 TB of information and is still growing quickly, and this information covers every field of the real world (commerce, entertainment, sports, and so on). The Web has therefore gradually become one of the most important channels through which people obtain useful information. At the same time, the sheer volume of information often prevents people from finding what they want quickly and accurately. Obtaining useful information efficiently from today's enormous Web has become a new challenge. To address it, many researchers are working on automated methods that help people obtain information from the Web effectively. One active research problem is Web data extraction, that is, automatically extracting structured data from web pages.
Information publishers on the Web release information mainly through web pages, so people obtain information from the Web by browsing pages, for example to read news, look up information, or shop online. Web pages are written in specific languages. Most pages today are written in HTML, the most popular of these, a hypertext markup language that is widely used, simple in form, and formats text with specific tags to achieve particular effects. Because it is easy to use and can produce rich and varied pages, it has been adopted by the great majority of web page authors. As the Web has developed and user demands have grown, the HTML language has been revised repeatedly, and new web page languages such as XHTML and XML have also appeared. The individual design choices of different page designers have further meant that information on the Web is not presented according to any unified standard. It is well known that web pages are designed to be browsed and read by people, not to be processed automatically by computers. In general, besides the data to be extracted, most pages also contain a large amount of irrelevant information such as advertisements and navigation links, which makes Web data extraction very difficult.
Existing Web data extraction methods mainly target HTML pages. Their basic idea is to find the data to be extracted by analyzing the HTML source file. Although these methods handled Web data extraction well when they were proposed, they depend on a specific page language. Their shortcomings have therefore become increasingly prominent as the Web has developed, mainly in two respects:
New versions of the HTML language keep appearing, and a method designed for the current version needs substantial changes to adapt to a new version;
The appearance of new web page languages can render existing methods entirely ineffective.
However, as a carrier for publishing information, a web page resembles a magazine or a television program: its presentation of information follows people's visual browsing habits. Against this background, we propose a Web data extraction method based on the visual information of web pages.
Summary of the invention
To solve the above problems of conventional methods, one object of the present invention is to provide a vision-based Web data extraction system and method.
In one aspect of the invention, a vision-based Web data extraction system comprises: an input module for inputting a page that contains records; a preprocessing module for preprocessing the input page; a page representation module for representing the page visually; a data record extraction module for extracting complete records from the page; a data item extraction module for decomposing each extracted record into a sequence of data items and aligning the data items that express the same attribute; and an output module for outputting a structured data table.
According to this aspect, the data record extraction module further comprises: a discovery module for finding the data region; a removal module for removing noise data; a classification module for classifying the visual blocks; and a regrouping module for combining the visual blocks that belong to the same record.
According to this aspect, the discovery module further comprises: a creation module for creating an initial set B and placing all child nodes of the root node of the visual tree into it; a scanning module for scanning each node in B; a judging module for judging, when a node b is scanned, whether it satisfies two conditions: first, that the vertical center line of the web page passes through it, and second, that the ratio of its area to the area of the whole web page is greater than 0.4; a deletion module for deleting b if the conditions are not satisfied; an adding module for, if the conditions are satisfied, adding b to a set Bs, deleting the parent node of b from Bs, and adding all child nodes of b to B; and an output module for outputting, once all nodes in B have been scanned, the node with the smallest area in Bs as the data region.
According to this aspect, the removal module further comprises: a topmost-block obtaining module for obtaining the topmost visual block b_top in the data region; a bottommost-block obtaining module for obtaining the bottommost visual block b_bottom in the data region; a top deletion module for deleting b_top if it is not aligned with the next block adjacent to it; and a bottom deletion module for deleting b_bottom if it is not aligned with the previous block adjacent to it.
According to this aspect, the visual block classification module further comprises: a creation module for creating a set B that contains all child nodes in the data region; a judging module for classifying the nodes in B by visual similarity; and an output module for outputting a set C, which contains several subsets, each corresponding to one class.
According to this aspect, the regrouping module further comprises: a selection module for selecting the first subset c_1 from the set C output by the visual block classification module as c_max; a creation module for taking each node b_i in c_max and creating an initial subset r_i for it; an insertion module for placing all initial subsets into a set R; a scanning module for scanning each subset c_i in C; a mapping module for assigning the nodes in c_i to the corresponding r_i in R according to their positions on the web page; and an output module for outputting R once all subsets in C have been scanned, each subset in R being combined into one record.
According to this aspect, the data item extraction module further comprises: a receiving module for receiving the record set; a segmentation module for segmenting a record into a sequence of data items in the order in which the attributes appear; and an alignment module for aligning the data items of each record by attribute.
According to this aspect, the alignment module further comprises: a scanning module for scanning each record; an insertion module for placing the first currently unaligned data item of each record into a set C; a classification module for classifying the data items in C; a selection module for selecting, from the current classes produced by the classification module, the class whose position is foremost and aligning the data items in it; and an output module for outputting the structured table once the data items of all records have been aligned.
In another aspect of the invention, a vision-based Web data extraction method comprises the steps of: A, inputting a page that contains records; B, preprocessing the input page; C, representing the page visually; D, extracting complete records from the page; E, decomposing each extracted record into a sequence of data items and aligning the data items that express the same attribute; and F, outputting a structured data table.
According to this aspect, step D further comprises the steps of: D1, finding the data region; D2, removing noise data; D3, classifying the visual blocks; and D4, combining the visual blocks that belong to the same record.
According to this aspect, step D1 further comprises the steps of: D1_1, creating an initial set B and placing all child nodes of the root node of the visual tree into it; D1_2, scanning each node in B; D1_3, when a node b is scanned, judging whether it satisfies two conditions: first, that the vertical center line of the web page passes through it, and second, that the ratio of its area to the area of the whole web page is greater than 0.4; D1_4, if the conditions of step D1_3 are not satisfied, deleting b; D1_5, if the conditions of step D1_3 are satisfied, adding b to a set Bs, deleting the parent node of b from Bs, and adding all child nodes of b to B; D1_6, once all nodes in B have been scanned, outputting the node with the smallest area in Bs as the data region.
According to this aspect, step D2 further comprises the steps of: D2_1, obtaining the topmost visual block b_top in the data region; D2_2, obtaining the bottommost visual block b_bottom in the data region; D2_3, if b_top is not aligned with the next block adjacent to it, deleting b_top; D2_4, if b_bottom is not aligned with the previous block adjacent to it, deleting b_bottom.
According to this aspect, step D3 further comprises the steps of: D3_1, creating a set B that contains all child nodes in the data region; D3_2, classifying the nodes in B by visual similarity; D3_3, outputting a set C, which contains several subsets, each corresponding to one class.
According to this aspect, step D4 further comprises the steps of: D4_1, selecting the first subset c_1 from the set C output by step D3 as c_max; D4_2, taking each node b_i in c_max and creating an initial subset r_i for it; D4_3, placing all initial subsets into a set R; D4_4, scanning each subset c_i in C; D4_5, assigning the nodes in c_i to the corresponding r_i in R according to their positions on the web page; D4_6, once all subsets in C have been scanned, outputting R, each subset in R being combined into one record.
According to this aspect, step E further comprises: E0, receiving the record set; E1, segmenting a record into a sequence of data items in the order in which the attributes appear; E2, aligning the data items of each record by attribute.
According to this aspect, step E2 further comprises: E2_1, scanning each record; E2_2, placing the first currently unaligned data item of each record into a set C; E2_3, classifying the data items in C; E2_4, selecting, from the current classes produced in step E2_3, the class whose position is foremost and aligning the data items in it; E2_5, if the data items of all records have been aligned, outputting the structured table, otherwise returning to step E2_1.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will become apparent from the following detailed description taken together with the accompanying drawings, in which:
Fig. 1 is a block diagram of the vision-based Web data extraction system according to the present invention;
Fig. 2 is an overall flowchart of the vision-based Web data extraction method according to the present invention;
Fig. 3 shows an example of the vision-based Web data extraction method according to the present invention;
Fig. 4 shows an example of a visual tree according to the present invention;
Fig. 5 is a block diagram of the data record extraction module according to the present invention;
Fig. 6 is a flowchart of the data record extraction module according to the present invention;
Fig. 7 is a detailed block diagram of the discovery module of the data record extraction module according to the present invention;
Fig. 8 is a flowchart of the discovery module of the data record extraction module according to the present invention;
Fig. 9 is a detailed block diagram of the removal module of the data record extraction module according to the present invention;
Fig. 10 is a flowchart of the removal module of the data record extraction module according to the present invention;
Fig. 11 is a detailed block diagram of the visual block classification module of the data record extraction module according to the present invention;
Fig. 12 is a flowchart of the visual block classification module of the data record extraction module according to the present invention;
Fig. 13 is a detailed block diagram of the regrouping module of the data record extraction module according to the present invention;
Fig. 14 is a flowchart of the regrouping module of the data record extraction module according to the present invention;
Fig. 15 is a detailed block diagram of the data item extraction module according to the present invention;
Fig. 16 is a flowchart of the data item extraction module according to the present invention;
Fig. 17 is a detailed block diagram of the alignment module of the data item extraction module according to the present invention; and
Fig. 18 is a flowchart of the alignment module of the data item extraction module according to the present invention.
Detailed description of the embodiments
First, the overall structure of the Web data extraction system according to the present invention and the flow it performs are described with reference to Figs. 1 to 3.
Fig. 1 is a block diagram of the vision-based Web data extraction system according to the present invention. As shown in Fig. 1, the system comprises an input module, a preprocessing module, a page representation module, a data record extraction module, a data item extraction module, and an output module.
The input module inputs a page that contains records.
The preprocessing module preprocesses the input page. More specifically, the function of preprocessing step B is to obtain the visual information of the text and images in the web page. This visual information is obtained by calling the API provided by a web browser; the present system calls the API provided by Internet Explorer. The visual information obtained for text and images is listed in Table 1.
Table 1
[Table 1 appears as an image in the original publication: Figure A20081005610300101]
The page representation module represents the page visually, that is, it converts the web page into a visual tree, as shown in Fig. 4. Specifically, using the visual information obtained, the web page is converted in computer memory into a visual tree (Visual Block tree), and the subsequent extraction of data records and data items is carried out on this tree. Fig. 4 is an example of a visual tree, in which (a) is the layout structure of a web page with its content removed and (b) is the visual tree built for that page. Each node of the visual tree corresponds to a rectangular region of the web page. The region of a parent node contains the regions of its child nodes, and the regions of nodes on the same level do not overlap. The root node of the tree corresponds to the whole web page. For the method of constructing the visual tree, see the technical report:
(http://research.microsoft.com/research/pubs/view.aspx?tr_id=690).
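The structural properties stated above (each node is a rectangle, a parent's region contains its children's regions, and siblings do not overlap) can be illustrated with a minimal Python sketch. The class name `VisualBlock`, its fields, and the helper `is_well_formed` are hypothetical names chosen for illustration; they are not taken from the patent or from the cited technical report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    """One node of the visual tree: a rectangle on the rendered page."""
    left: float
    top: float
    width: float
    height: float
    children: List["VisualBlock"] = field(default_factory=list)

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    def contains(self, other: "VisualBlock") -> bool:
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)

    def overlaps(self, other: "VisualBlock") -> bool:
        # Rectangles that only touch along an edge do not count as overlapping.
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


def is_well_formed(node: VisualBlock) -> bool:
    """Check the two tree invariants: containment and non-overlapping siblings."""
    kids = node.children
    for i, child in enumerate(kids):
        if not node.contains(child):
            return False
        if any(child.overlaps(other) for other in kids[i + 1:]):
            return False
    return all(is_well_formed(child) for child in kids)


# Hypothetical page: a header above a main area split into sidebar and content.
page = VisualBlock(0, 0, 1000, 2000, [
    VisualBlock(0, 0, 1000, 100),
    VisualBlock(0, 100, 1000, 1900, [
        VisualBlock(0, 100, 200, 1900),
        VisualBlock(200, 100, 800, 1900),
    ]),
])
```

In a real system the rectangles would come from the browser's layout information rather than being typed in by hand; the sketch only makes the invariants checkable.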
The data record extraction module extracts complete records from the page; it is described in detail below.
The data item extraction module decomposes each extracted record into a sequence of data items and aligns the data items that express the same attribute; it is described in detail below.
The output module outputs a structured data table.
Fig. 2 is an overall flowchart of the vision-based Web data extraction method according to the present invention. The method comprises the steps of: A, inputting a page that contains records; B, preprocessing the input page; C, representing the page visually, that is, using the visual information obtained to convert the web page in computer memory into a visual tree (Visual Block tree) as shown in Fig. 4, on which the subsequent extraction of data records and data items is carried out; D, extracting complete records from the page; E, decomposing each extracted record into a sequence of data items and aligning the data items that express the same attribute; and F, outputting a structured data table.
Fig. 3 shows a typical example of the method according to the present invention, in which (a) is a page containing several book records, used as the input; (b) is the visual representation of this page; (c) is the set of book records extracted from the page; (d) shows each book record decomposed into a sequence of data items (title, author, publisher, ...) and aligned, so that the data items expressing the title fall in one column, those expressing the author in another, and so on; and (e) is the final output table, in which each row represents one book record and each column represents one attribute.
The block diagram of the data record extraction module according to the present invention and the flow it performs are described in detail below with reference to Figs. 5 and 6.
Fig. 5 is a block diagram of the data record extraction module according to the present invention. As shown in Fig. 5, the data record extraction module comprises a discovery module, a removal module, a classification module, and a regrouping module. The discovery module finds the data region, denoted here by b_region. Taking Fig. 4 as an example, if b_3_2, b_3_3, b_3_4, and b_3_5 are the data records we want to extract, then b_3 is the data region we are looking for. This is described in detail below.
The removal module removes noise data, where noise means information that does not belong to any record, such as the statistics of a query result ("about 1,040,000 matching results") and page-turning links ("1 2 3 ... Next"). This is described in detail below.
The classification module classifies the visual blocks; specifically, the visual blocks of the data region in the visual tree are classified according to visual similarity. This is described in detail below.
The regrouping module combines the visual blocks that belong to the same record. This is described in detail below.
Fig. 6 is a flowchart of the data record extraction module according to the present invention. The flow comprises the steps of: D1, finding the data region; D2, removing noise data; D3, classifying the visual blocks; and D4, combining the visual blocks that belong to the same record.
The block diagram of the discovery module of the data record extraction module according to the present invention and the flow it performs are described in detail below with reference to Figs. 7 and 8.
Fig. 7 is a detailed block diagram of the discovery module of the data record extraction module according to the present invention. As shown in Fig. 7, the discovery module comprises: a creation module for creating an initial set B and placing all child nodes of the root node of the visual tree into it; a scanning module for scanning each node in B; a judging module for judging, when a node b is scanned, whether it satisfies two conditions: first, that the vertical center line of the web page passes through it, and second, that the ratio of its area to the area of the whole web page is greater than 0.4; a deletion module for deleting b if the conditions are not satisfied; an adding module for, if the conditions are satisfied, adding b to a set Bs, deleting the parent node of b from Bs, and adding all child nodes of b to B; and an output module for outputting, once all nodes in B have been scanned, the node with the smallest area in Bs as the data region.
Fig. 8 is a detailed flowchart of the discovery module of the data record extraction module according to the present invention. The flow comprises the following steps. In step D1_1, an initial set B is created and all child nodes of the root node of the visual tree are placed into it. In step D1_2, each node in B is scanned. In step D1_3, when a node b is scanned, it is judged whether b satisfies two conditions: first, that the vertical center line of the web page passes through it, and second, that the ratio of its area to the area of the whole web page is greater than 0.4. If the conditions are not satisfied, the flow enters step D1_4 and b is deleted; otherwise it enters step D1_5, where b is added to the set Bs, the parent node of b is deleted from Bs, and all child nodes of b are added to B. In step D1_6, once all nodes in B have been scanned, the node with the smallest area in Bs is output as the data region.
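Steps D1_1 to D1_6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Block` class, its fields, and the function name `find_data_region` are hypothetical, the set B is modeled as a work queue, and "passed through by the vertical center line" is read as the block's horizontal extent strictly containing the page's center x-coordinate.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison, so membership tests are per node
class Block:
    name: str
    left: float
    top: float
    width: float
    height: float
    children: List["Block"] = field(default_factory=list, repr=False)
    parent: Optional["Block"] = field(default=None, repr=False)

    def add(self, child: "Block") -> "Block":
        child.parent = self
        self.children.append(child)
        return child

    @property
    def area(self) -> float:
        return self.width * self.height


def find_data_region(root: Block, min_ratio: float = 0.4) -> Optional[Block]:
    midline = root.left + root.width / 2
    queue = deque(root.children)      # D1_1: set B starts with the root's children
    candidates: List[Block] = []      # set Bs
    while queue:                      # D1_2: scan every node in B
        b = queue.popleft()
        crosses = b.left < midline < b.left + b.width   # D1_3, condition 1
        large = b.area / root.area > min_ratio          # D1_3, condition 2
        if not (crosses and large):
            continue                  # D1_4: drop b
        candidates.append(b)          # D1_5: keep b ...
        if b.parent in candidates:
            candidates.remove(b.parent)   # ... drop its parent from Bs ...
        queue.extend(b.children)      # ... and scan its children later
    # D1_6: the smallest surviving candidate is the data region
    return min(candidates, key=lambda n: n.area) if candidates else None


# Hypothetical 1000 x 2000 page: header, main area, sidebar, content, records.
page = Block("page", 0, 0, 1000, 2000)
page.add(Block("header", 0, 0, 1000, 100))
main = page.add(Block("main", 0, 100, 1000, 1800))
main.add(Block("sidebar", 0, 100, 200, 1800))
content = main.add(Block("content", 200, 100, 800, 1800))
for i in range(6):
    content.add(Block(f"record{i}", 200, 100 + 300 * i, 800, 300))

region = find_data_region(page)
```

In this example `main` first qualifies, then `content` qualifies and replaces it, while the individual records are too small to pass the 0.4 area threshold, so `content` is returned.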
The block diagram of the removal module of the data record extraction module according to the present invention and the flow it performs are described in detail below with reference to Figs. 9 and 10.
Fig. 9 is a detailed block diagram of the removal module of the data record extraction module according to the present invention. As shown in Fig. 9, the removal module comprises: a topmost-block obtaining module for obtaining the topmost visual block b_top in the data region; a bottommost-block obtaining module for obtaining the bottommost visual block b_bottom in the data region; a top deletion module for deleting b_top if it is not aligned with the next block adjacent to it; and a bottom deletion module for deleting b_bottom if it is not aligned with the previous block adjacent to it.
Fig. 10 is a detailed flowchart of the removal module of the data record extraction module according to the present invention. The flow comprises the following steps. In step D2_1, the topmost visual block b_top in the data region is obtained. In step D2_2, the bottommost visual block b_bottom in the data region is obtained. In step D2_3, if b_top is not aligned with the next block adjacent to it, b_top is deleted. In step D2_4, if b_bottom is not aligned with the previous block adjacent to it, b_bottom is deleted.
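The noise-removal steps D2_1 to D2_4 can be sketched as follows. The text does not define "aligned", so this sketch assumes that two blocks are aligned when their left edges coincide within a small tolerance; the `Block` tuple, the `tol` parameter, and the function name `remove_noise` are hypothetical.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    label: str
    left: float
    top: float


def remove_noise(blocks: List[Block], tol: float = 2.0) -> List[Block]:
    """Drop the topmost and bottommost block if misaligned with its neighbor."""
    ordered = sorted(blocks, key=lambda b: b.top)
    # D2_1 / D2_3: compare the topmost block with the block just below it.
    if len(ordered) >= 2 and abs(ordered[0].left - ordered[1].left) > tol:
        ordered = ordered[1:]
    # D2_2 / D2_4: compare the bottommost block with the block just above it.
    if len(ordered) >= 2 and abs(ordered[-1].left - ordered[-2].left) > tol:
        ordered = ordered[:-1]
    return ordered


# Hypothetical data region: a result-count line, three records, a pager.
region = [
    Block("about 1,040,000 results", 40, 0),
    Block("record 1", 10, 30),
    Block("record 2", 10, 130),
    Block("record 3", 10, 230),
    Block("1 2 3 ... Next", 300, 330),
]
cleaned = remove_noise(region)
```

The result-count line and the pager are indented differently from the records, so both are dropped and only the three record blocks remain.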
The block diagram of the classification module of the data record extraction module according to the present invention and the flow it performs are described in detail below with reference to Figs. 11 and 12.
Fig. 11 is a detailed block diagram of the visual block classification module of the data record extraction module according to the present invention. As shown in Fig. 11, the visual block classification module comprises: a creation module for creating a set B that contains all child nodes in the data region; a judging module for classifying the nodes in B by visual similarity, that is, visually similar nodes form one class, where the similarity of two nodes is computed by the formula $\mathrm{sim}(b_1, b_2) = w_i \cdot \mathrm{simIMG}(b_1, b_2) + w_{pt} \cdot \mathrm{simPT}(b_1, b_2) + w_{lt} \cdot \mathrm{simLT}(b_1, b_2)$, each term of which is explained in Table 2 below; and an output module for outputting a set C, which contains several subsets, each corresponding to one class.
Fig. 12 is a detailed flowchart of the visual block classification module of the data record extraction module according to the present invention. The flow comprises the following steps. In step D3_1, a set B is created that contains all child nodes in the data region. In step D3_2, the nodes in B are classified by visual similarity, that is, visually similar nodes form one class. In step D3_3, a set C is output, which contains several subsets, each corresponding to one class.
[Table 2 appears as images in the original publication: Figures A20081005610300131 and A20081005610300141]
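The weighted similarity and the classification step D3_2 can be illustrated with the sketch below. Because Table 2 is only available as an image, the definitions of the three component similarities and of the weights are not reproduced here. The sketch therefore makes loud assumptions: each block is summarized by three hypothetical features (image area, plain-text length, link-text length), each component similarity is taken to be a min/max ratio of the corresponding feature, the weights are fixed constants, and classes are formed by a simple greedy threshold rule. None of these choices is confirmed by the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Features:
    """Hypothetical per-block summary; the real features are defined in Table 2."""
    image_area: float
    plain_text_len: float
    link_text_len: float


def ratio(a: float, b: float) -> float:
    # Assumed component similarity: 1.0 for equal values, 0.0 if only one is zero.
    if a == 0 and b == 0:
        return 1.0
    return min(a, b) / max(a, b)


def sim(b1: Features, b2: Features,
        w_i: float = 0.2, w_pt: float = 0.5, w_lt: float = 0.3) -> float:
    """sim = w_i * simIMG + w_pt * simPT + w_lt * simLT (weights are assumptions)."""
    return (w_i * ratio(b1.image_area, b2.image_area)
            + w_pt * ratio(b1.plain_text_len, b2.plain_text_len)
            + w_lt * ratio(b1.link_text_len, b2.link_text_len))


def classify(blocks: List[Features], threshold: float = 0.8) -> List[List[Features]]:
    """Greedy grouping: a block joins the first class whose first member is similar."""
    classes: List[List[Features]] = []
    for block in blocks:
        for members in classes:
            if sim(block, members[0]) >= threshold:
                members.append(block)
                break
        else:
            classes.append([block])
    return classes


# Two image-like blocks and two text-like blocks.
blocks = [
    Features(100, 0, 0),
    Features(0, 50, 10),
    Features(110, 0, 0),
    Features(0, 55, 10),
]
C = classify(blocks)
```

With these inputs the two image blocks fall into one class and the two text blocks into another, which mirrors the set C of subsets described in step D3_3.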
The block diagram of the regrouping module of the data record extraction module according to the present invention and the flow it performs are described in detail below with reference to Figs. 13 and 14.
Fig. 13 is a detailed block diagram of the regrouping module of the data record extraction module according to the present invention. As shown in Fig. 13, the regrouping module comprises: a selection module for selecting the first subset c_1 from the set C output by the visual block classification module as c_max; a creation module for taking each node b_i in c_max and creating an initial subset r_i for it; an insertion module for placing all initial subsets into a set R; a scanning module for scanning each subset c_i in C; a mapping module for assigning the nodes in c_i to the corresponding r_i in R according to their positions on the web page; and an output module for outputting R once all subsets in C have been scanned, each subset in R being combined into one record.
Fig. 14 is a detailed flowchart of the regrouping module of the data record extraction module according to the present invention. The flow comprises the following steps. In step D4_1, the first subset c_1 is selected from the set C output by step D3_3 as c_max. In step D4_2, each node b_i in c_max is taken and an initial subset r_i is created for it. In step D4_3, all initial subsets are placed into a set R. In step D4_4, each subset c_i in C is processed as follows: in step D4_5, the nodes in c_i are assigned to the corresponding r_i in R according to their positions on the web page. In step D4_6, once all subsets in C have been scanned, R is output, each subset in R being combined into one record.
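Steps D4_1 to D4_6 can be sketched as below. The text says only that nodes are matched to records "according to their positions on the web page", so the sketch assumes a simple rule: each block joins the record whose seed block is vertically nearest. The `Block` tuple and the function name `regroup` are hypothetical, and the seed subset c_max is assumed to come first in C, as the text states.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    label: str
    top: float


def regroup(classes: List[List[Block]]) -> List[List[Block]]:
    c_max = classes[0]                          # D4_1: first subset is the seed
    records = [[seed] for seed in c_max]        # D4_2 / D4_3: one r_i per seed
    for subset in classes[1:]:                  # D4_4: scan remaining subsets
        for block in subset:                    # D4_5: place by position
            nearest = min(records, key=lambda r: abs(r[0].top - block.top))
            nearest.append(block)
    return records                              # D4_6: each r_i is one record


# Three titles seed three records; the third record has no cover image.
titles = [Block("title0", 0), Block("title1", 100), Block("title2", 200)]
prices = [Block("price0", 30), Block("price1", 130), Block("price2", 230)]
covers = [Block("cover0", 5), Block("cover1", 105)]
R = regroup([titles, prices, covers])
```

A record with a missing block, like the third one here, simply ends up shorter, which is why the later data item alignment step is needed.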
The detailed block diagram of the data item extraction module according to the present invention and the flow it performs are described in detail below with reference to Figs. 15 and 16.
Fig. 15 is a detailed block diagram of the data item extraction module according to the present invention. As shown in Fig. 15, the data item extraction module comprises: a receiving module for receiving the record set; a segmentation module for segmenting a record into a sequence of data items in the order in which the attributes appear; and an alignment module for aligning the data items of each record by attribute, that is, placing data items of the same attribute in the same column of the table.
Fig. 16 is a flowchart of the data item extraction module according to the present invention. The flow comprises: E0, receiving the record set; E1, segmenting a record into a sequence of data items in the order in which the attributes appear; E2, aligning the data items of each record by attribute, that is, placing data items of the same attribute in the same column of the table.
The block diagram of the alignment module of the data item extraction module according to the present invention and the flow it performs are described in detail below with reference to Figs. 17 and 18.
Fig. 17 is a detailed block diagram of the alignment module of the data item extraction module according to the present invention. As shown in Fig. 17, the alignment module comprises: a scanning module for scanning each record; an insertion module for placing the first currently unaligned data item of each record into a set C; a classification module for classifying the data items in C, the classification being based on whether two data items are similar in font and position; a selection module for selecting, from the current classes produced by the classification module, the class whose position is foremost and treating the data items in it as aligned; and an output module for outputting the structured table once the data items of all records have been aligned.
Fig. 18 is a detailed flowchart of the alignment module of the data item extraction module according to the present invention. The flow comprises the following steps. In step E2_1, each record is scanned. In step E2_2, the first currently unaligned data item of each record is placed into a set C. In step E2_3, the data items in C are classified, the classification being based on whether two data items are similar in font and position. In step E2_4, from the current classes produced in step E2_3, the class whose position is foremost is selected and the data items in it are treated as aligned. In step E2_5, if the data items of all records have been aligned, the structured table is output; otherwise the flow returns to step E2_1.
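The alignment loop E2_1 to E2_5 can be sketched as follows. Several simplifications are assumptions of this sketch rather than details from the patent: two data items are put in the same class when they share a font, the "foremost" class is the one with the smallest offset inside its record, ties are broken in favor of the larger class, and records that do not contribute to the selected class receive an empty cell (`None`) in that column. The `Item` tuple and the function name `align` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Item(NamedTuple):
    text: str
    font: str
    offset: float   # assumed position of the item inside its record


def align(records: List[List[Item]]) -> List[List[Optional[str]]]:
    cursors = [0] * len(records)            # index of first unaligned item
    rows: List[List[Optional[str]]] = [[] for _ in records]
    while any(c < len(r) for c, r in zip(cursors, records)):
        # E2_1 / E2_2: collect the first unaligned item of every record.
        # E2_3: classify them (here: by font only).
        classes: Dict[str, List[int]] = defaultdict(list)
        for idx, (c, r) in enumerate(zip(cursors, records)):
            if c < len(r):
                classes[r[c].font].append(idx)

        def rank(font: str):
            members = classes[font]
            front = min(records[i][cursors[i]].offset for i in members)
            return (front, -len(members))   # foremost first, then larger class

        # E2_4: the foremost class becomes the next column.
        chosen = set(classes[min(classes, key=rank)])
        for idx in range(len(records)):
            if idx in chosen:
                rows[idx].append(records[idx][cursors[idx]].text)
                cursors[idx] += 1
            else:
                rows[idx].append(None)      # this record lacks the attribute
    return rows                             # E2_5: the structured table


# Three book records; the second one has no author.
records = [
    [Item("T1", "bold", 0), Item("A1", "italic", 20), Item("P1", "red", 40)],
    [Item("T2", "bold", 0), Item("P2", "red", 20)],
    [Item("T3", "bold", 0), Item("A3", "italic", 20), Item("P3", "red", 40)],
]
table = align(records)
```

Each pass of the loop emits one column, so the missing author in the second record becomes an empty cell instead of shifting its price into the author column.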
As described above, a vision-based Web data extraction method and system are proposed. The input of the system is a web page containing a group of records. The output of the system is a structured table in which each row represents one record in the web page and each column represents one attribute of the records. The greatest difference between our method and previous methods is that it relies entirely on the visual information of the page.
Other advantages and modifications will be readily apparent to those of ordinary skill in the art. The invention in its broader aspects is therefore not limited to the specific details and exemplary embodiments shown and described here. Accordingly, various modifications may be made without departing from the spirit and scope of the general inventive concept as defined by the appended claims and their equivalents.

Claims (10)

1. A vision-based Web data extraction system, comprising:
an input module for inputting a page that contains records;
a preprocessing module for preprocessing the input page;
a page representation module for representing the page visually;
a data record extraction module for extracting complete records from the page;
a data item extraction module for decomposing each extracted record into a sequence of data items and aligning the data items that express the same attribute;
an output module for outputting a structured data table.
2. The system according to claim 1, wherein the data record extraction module further comprises:
a discovery module for discovering the data region;
a removal module for removing noise data;
a classification module for classifying the visual blocks;
a regrouping module for combining the visual blocks that belong to the same record.
3. The system according to claim 2, wherein the discovery module further comprises:
a creation module for creating an initial set B and putting all child nodes of the root node of the visual tree into it;
a scanning module for scanning each node in B;
a judging module for judging, when a node b is scanned, whether it satisfies two conditions: first, it is crossed by the vertical midline of the web page; second, the ratio of its area to the area of the whole web page is greater than 0.4;
a deletion module for deleting the node if the conditions are not satisfied;
an adding module for, if the conditions are satisfied, adding b to a set Bs, deleting the parent node of b from Bs, and adding all child nodes of b to B;
an output module for, once all nodes in B have been scanned, outputting the node with the smallest area in Bs as the data region.
4. The system according to claim 2, wherein the removal module further comprises:
a topmost acquisition module for obtaining the topmost visual block b_top in the data region;
a bottommost acquisition module for obtaining the bottommost visual block b_bottom in the data region;
a top deletion module for deleting b_top if it is not aligned with the next block adjacent to it;
a bottom deletion module for deleting b_bottom if it is not aligned with the previous block adjacent to it.
5. The system according to claim 2, wherein the visual block classification module further comprises:
a creation module for creating a set B that contains all child nodes in the data region;
a judging module for classifying the nodes in B according to their visual features;
an output module for outputting a set C that contains several subsets, each subset corresponding to one class.
6. A vision-based Web data extraction method, comprising the steps of:
A. inputting a page that contains records;
B. preprocessing the input page;
C. representing the page visually;
D. extracting complete records from the page;
E. decomposing each extracted record into a sequence of data items and aligning the data items that express the same attribute; and
F. outputting a structured data table.
7. according to the method for claim 6, wherein step D further comprises step:
D1, discovery data area;
D2, removing noise data;
D3, the vision piece is classified; And
D4, the vision piece that belongs to same record is combined.
8. according to the method for claim 6, wherein step D1 further comprises step:
D1_1, set up an initial sets B, put into all child nodes of Visual tree root node;
D1_2, each node among the B is scanned;
D1_3, when scanning one of them node b, judge whether it meets two conditions: the one, passed through by the perpendicular bisector of webpage; The 2nd, with the area of whole webpage than greater than value 0.4;
If D1_4 does not meet the condition among the step D1_3, then with its deletion;
If D1_5 meets the condition among the step D1_3, b is added among the set B s, father's node of deletion b from Bs adds B to all child nodes of b;
If all nodes have all scanned among the D1_6 B, just the node of area minimum among the Bs is exported as the data area.
9. according to the method for claim 6, wherein step D2 further comprises step:
D2_1, obtain in the data area vision piece b topmost Top
D2_2, obtain in the data area vision piece b bottom Bottom
If D2_3 is b TopNext piece adjacent with it does not align, then with its deletion;
If D2_4 is b TopNext piece adjacent with it does not align, then with its deletion.
10. according to the method for claim 6, wherein step D3 further comprises step:
D3_1, set up a set B, this set B comprises all child nodes in the data area;
D3_2, node among the B is classified by vision;
D3_3, a set of output C comprise the plurality of sub set among the C, the corresponding classification of each subclass.
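As an illustration of the data region discovery recited in claims 3 and 8 and the noise removal recited in claims 4 and 9, the following Python sketch walks a small visual tree. It is only a sketch under stated assumptions: the `Block` type and its pixel bounding box are hypothetical, and "aligned" is assumed here to mean that two blocks share the same left edge within a tolerance, which the claims do not define.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare blocks by identity, not by field values
class Block:
    """A node of the visual tree (hypothetical layout: pixel bounding box)."""
    name: str
    left: float
    top: float
    width: float
    height: float
    children: List["Block"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def discover_data_region(root: Block, page_width: float, page_height: float,
                         min_ratio: float = 0.4) -> Optional[Block]:
    page_area = page_width * page_height
    midline = page_width / 2
    # Set B starts with the children of the root; each entry remembers its parent.
    queue = deque((child, root) for child in root.children)
    candidates: List[Block] = []  # set Bs
    while queue:
        node, parent = queue.popleft()
        crosses = node.left <= midline <= node.left + node.width
        large = node.area / page_area > min_ratio
        if not (crosses and large):
            continue  # the node fails the two conditions and is dropped
        candidates.append(node)
        if parent in candidates:
            candidates.remove(parent)  # the child supersedes its parent in Bs
        queue.extend((child, node) for child in node.children)
    return min(candidates, key=lambda b: b.area) if candidates else None


def remove_noise(blocks: List[Block], tol: float = 2.0) -> List[Block]:
    """Drop a top or bottom block whose left edge differs from its neighbour's (assumed test)."""
    blocks = sorted(blocks, key=lambda b: b.top)
    if len(blocks) >= 2 and abs(blocks[0].left - blocks[1].left) > tol:
        blocks = blocks[1:]
    if len(blocks) >= 2 and abs(blocks[-1].left - blocks[-2].left) > tol:
        blocks = blocks[:-1]
    return blocks


# Hypothetical 1000 x 2000 pixel result page.
records = [Block(f"record{i}", 100, 200 + 200 * i, 800, 180) for i in range(3)]
stats = Block("stats", 300, 150, 300, 40)
pager = Block("pager", 400, 1600, 200, 40)
results = Block("results", 100, 150, 800, 1500, [stats, *records, pager])
main = Block("main", 100, 100, 800, 1700, [results])
header = Block("header", 0, 0, 1000, 100)
body = Block("body", 0, 0, 1000, 2000, [header, main])

region = discover_data_region(body, page_width=1000, page_height=2000)
cleaned = remove_noise(region.children)
```

In this example both `main` and `results` satisfy the two conditions, but `results` replaces its parent in Bs and is returned as the data region; the statistics line and the pager are then discarded because their left edges do not match the neighbouring record blocks.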
CN200810056103A 2008-01-11 2008-01-11 System and method for abstraction of Web data based on vision Expired - Fee Related CN100590623C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810056103A CN100590623C (en) 2008-01-11 2008-01-11 System and method for abstraction of Web data based on vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810056103A CN100590623C (en) 2008-01-11 2008-01-11 System and method for abstraction of Web data based on vision

Publications (2)

Publication Number Publication Date
CN101226548A true CN101226548A (en) 2008-07-23
CN100590623C CN100590623C (en) 2010-02-17

Family

ID=39858543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810056103A Expired - Fee Related CN100590623C (en) 2008-01-11 2008-01-11 System and method for abstraction of Web data based on vision

Country Status (1)

Country Link
CN (1) CN100590623C (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944109A (en) * 2010-09-06 2011-01-12 华南理工大学 System and method for extracting picture abstract based on page partitioning
CN101944109B (en) * 2010-09-06 2012-06-27 华南理工大学 System and method for extracting picture abstract based on page partitioning
CN102929930A (en) * 2012-09-24 2013-02-13 南京大学 Automatic Web text data extraction template generating and extracting method for small samples
CN103218420A (en) * 2013-04-01 2013-07-24 北京鹏宇成软件技术有限公司 Method and device for extracting page titles
CN103218420B (en) * 2013-04-01 2016-12-28 北京创世泰克科技股份有限公司 A kind of web page title extracting method and device
CN105045769A (en) * 2015-06-01 2015-11-11 中国人民解放军装备学院 Structure recognition based Web table information extraction method

Also Published As

Publication number Publication date
CN100590623C (en) 2010-02-17

Similar Documents

Publication Publication Date Title
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
Liu et al. Vide: A vision-based approach for deep web data extraction
Peters et al. Content extraction using diverse feature sets
CN102279851B (en) Intelligent navigation method, device and system
CN101957816B (en) Webpage metadata automatic extraction method and system based on multi-page comparison
CN101794311B (en) Fuzzy data mining based automatic classification method of Chinese web pages
CN102663023B (en) Implementation method for extracting web content
CN101908071B (en) Method and device thereof for improving search efficiency of search engine
CN102156737B (en) Method for extracting subject content of Chinese webpage
CN103559234B (en) System and method for automated semantic annotation of RESTful Web services
CN102096717A (en) Search method and search engine
CN101788988B (en) Information extraction method
CN103544178A (en) Method and equipment for providing reconstruction page corresponding to target page
CN103324622A (en) Method and device for automatic generating of front page abstract
CN108021715B (en) Heterogeneous label fusion system based on semantic structure feature analysis
CN102306201B (en) Method and system for analyzing webpage title
CN106446072A (en) Webpage content processing method and apparatus
CN102693304A (en) Search engine feedback information processing method and search engine
CN103049536A (en) Webpage main text content extracting method and webpage text content extracting system
CN103440233A (en) Automatic sScientific paper standardization automatic detecting and editing system
CN100590623C (en) System and method for abstraction of Web data based on vision
CN102073654A (en) Methods and equipment for generating and maintaining web content extraction template
CN100477593C (en) Method and device for selecting correlative discussion zone in network community
CN103778141A (en) Mixed PDF book catalogue automatic extracting algorithm
CN110020312A (en) The method and apparatus for extracting Web page text

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100217

Termination date: 20130111