CN101178708A - Automatic template information locating method for structured web pages
- Publication number: CN101178708A (also listed as CNA2006101378554A, CN200610137855A)
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses an automatic template information locating method for structured web pages. Existing locating methods match imprecisely and have difficulty judging which content is reasonable. To solve these problems, the invention locates an attribute keyword with a regular expression, determines the distance between the attribute keyword and the attribute value, and finally locates all attribute values from the attribute keyword and that distance. The invention locates the searched-for information accurately and efficiently, and is suitable for various web information search engines.
Description
Technical field
The present invention relates to an automatic template information locating method for structured web pages.
Background art
Web information extraction is an important topic in the field of Internet information mining. The problem it addresses is how to extract specified information from web pages, for example extracting the company and the position from every job posting page of a recruitment website. Earlier techniques use regular expressions to match the specified information in a web page and then pick the most reasonable content from the matches. This approach has many defects. The biggest is that a regular expression can only match information enumerated in advance. For example, if the values "teacher" and "secretary" have been enumerated for the attribute "position vacant", a regular expression search finds them, but it cannot find values such as "sales manager" or "network engineer" that were not enumerated, and in practice positions cannot be enumerated exhaustively. In addition, a page may give no position title at all and describe the position only in a passage of text, which a regular expression cannot find. Because matching is imprecise and reasonable content cannot be judged, searches miss information.
Regular expression: a regular expression uses a specific syntax to describe a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract from a string the substrings that satisfy a certain condition.
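The three uses named in this definition can be illustrated with a minimal Python sketch. The sample string and the patterns are assumptions chosen for illustration; they are not taken from the patent figures.

```python
import re

# Hypothetical sample text, not taken from the patent.
text = "Position vacant: teacher; Work place: Beijing"

# 1. Check whether the string contains a substring matching the pattern.
has_keyword = re.search(r"Position vacant:", text) is not None

# 2. Replace every matched substring.
masked = re.sub(r"teacher|secretary", "<POSITION>", text)

# 3. Extract the substrings that satisfy a condition
#    (here: the word that follows "Work place: ").
places = re.findall(r"Work place: (\w+)", text)

print(has_keyword, masked, places)
```

Note that the replacement pattern only covers the two enumerated positions, which is exactly the limitation the background section describes.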
Separator: a user-defined set of specific HTML tags. These tags are closely tied to the structure of the web page, and in pages of the same structure the number of times they occur and the order in which they occur are fairly stable. Examples are <table>, </table>, <td>, </td>, <tr>, </tr>, and so on.
Distance: in web pages of the same structure, the number of separators between the keyword zone and the attribute value zone, as shown in Fig. 2:
If <td>, </td>, <span>, </span> are defined as separators, then 4 separators lie between "Position vacant:" and "sales manager", that is, the distance is 4.
Zone: the content between two separators in a web page. For example, in Fig. 1, "sales manager" lies between the separators <span id="lb_office" style="font-size:12px;"> and </span>.
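The definitions of separator, zone and distance can be sketched in a few lines of Python. Fig. 2 is not reproduced here, so the HTML fragment below is a hypothetical one constructed so that four separators lie between the keyword and the value; the names `DELIMITER`, `split_zones` and `distance` are likewise illustrative assumptions.

```python
import re

# Assumed separator set: opening and closing <td> and <span> tags.
DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)

def split_zones(html):
    """Return the text between consecutive separators, in page order.

    Empty zones are kept, so the difference between two list indices
    equals the number of separators between the two zones.
    """
    return [zone.strip() for zone in DELIMITER.split(html)]

def distance(zones, keyword, value):
    """Number of separators between the keyword zone and the value zone."""
    return abs(zones.index(value) - zones.index(keyword))

# Hypothetical markup standing in for Fig. 2.
html = ('<td><span>Position vacant:</span></td>'
        '<td><span id="lb_office" style="font-size:12px;">sales manager</span></td>')

zones = split_zones(html)
print(distance(zones, "Position vacant:", "sales manager"))
```

Keeping the empty zones is a deliberate simplification: it lets a list index double as a separator count, so no separate counter is needed.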
Summary of the invention
In view of the defects and deficiencies of the prior art, the invention provides an automatic template information locating method for structured web pages. It exploits the relatively fixed format of structured web pages, obtains the template of a web page by statistics, and then uses this template to extract specific information directly.
To achieve the above object, the automatic template information locating method for structured web pages locates an attribute keyword with a regular expression, determines the distance from the attribute keyword to the attribute value, and finally locates all attribute values from the attribute keyword and that distance.
The above method may specifically comprise the following steps:
(1A) locating, with regular expressions, the position of the attribute keyword of an attribute and the positions of some of the attribute values corresponding to this attribute keyword;
(2A) determining the distance between the attribute keyword and the positions of those attribute values;
(3A) determining, from the attribute keyword and the distance determined in (2A), the positions of all attribute values corresponding to this attribute keyword.
Alternatively, the above method may specifically comprise the following steps:
(1B) locating, with a regular expression, the position of the attribute keyword of an attribute, and searching a plurality of structured web pages for the zone nearest to that keyword position whose content changes between pages;
(2B) determining the distance between the attribute keyword and that zone;
(3B) determining, from the attribute keyword and the distance determined in (2B), the positions of all attribute values corresponding to this attribute keyword.
Alternatively, the above method may specifically comprise the following steps:
(1C) locating, with regular expressions, the positions of some of the attribute values of an attribute, and the position of the attribute keyword of another attribute nearest to those attribute value positions;
(2C) determining the distance between those attribute values and the nearest attribute keyword of the other attribute;
(3C) determining, from that attribute keyword of the other attribute and the distance determined in (2C), the positions of all attribute values of the first attribute.
In the above method, the distance is the number of separators.
The invention can locate the information to be searched accurately and efficiently.
Description of drawings
Fig. 1 is a schematic diagram of the web page as displayed (front end) for the first template learning and locating strategy;
Fig. 2 shows the web page source (back end) for the first template learning and locating strategy;
Fig. 3 is a schematic diagram of the web page as displayed (front end) for the second template learning and locating strategy;
Fig. 4 shows the web page source (back end) for the second template learning and locating strategy;
Fig. 5 is a schematic diagram of the web page as displayed (front end) for the third template learning and locating strategy;
Fig. 6 shows the web page source (back end) for the third template learning and locating strategy;
Fig. 7 is a schematic diagram of the marks used to distinguish keywords.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings:
Today, most information publishing websites use programs to publish information to web pages automatically. The format of such pages is generally fairly fixed, so the fixed structure of a page can be extracted, the positions of the attributes of interest identified, and the content of interest then extracted accurately.
In general, every attribute to be extracted has corresponding keywords. In a job posting page, for example, the attribute "work place" has corresponding keywords such as "work place" and "work city". For a given attribute the number of corresponding keywords is small, which follows from the characteristics of natural language. The attribute's values, by contrast, may be "Beijing", "Guangzhou", "Shanghai" and so on, and can be very numerous. Usually, within one publishing website, the keyword used for an attribute is fixed while the attribute value changes. The "distance" between the keyword and the attribute value is also normally fixed, and so is the order in which the keywords of the attributes appear in the site's information pages.
Based on these characteristics, we have devised an automatic template information locating technique based on statistical information. For a given attribute, we have formulated three template learning and locating strategies. Before locating information, the keywords and values of the attribute to be located must first be defined, using regular expressions. Because an attribute has few keywords, they can basically all be defined with regular expressions. Attribute values vary widely, and it may be difficult to define them all with regular expressions. Defining the attribute keywords is therefore required, whereas defining attribute values is optional. After the attribute keywords and some attribute values have been defined with regular expressions, the strategies specified below can be used to locate the specific information, including all attribute values. The three template learning and locating strategies are described below:
First, for the case where both the attribute keyword and some of the corresponding attribute values have been defined: if both are matched, the matched attribute value and the "distance" between the keyword's "zone" and the attribute value's "zone" are recorded, indexed by the matched keyword string. After a number of pages have been scanned, if the set of attribute values recorded for one matched keyword string at the same "distance" contains more than a certain number of elements, then from that point on, whenever the matched keyword string is encountered in any page, the attribute value can be extracted directly at that "distance", regardless of whether the value matches the attribute values we defined.
As shown in Fig. 1 and Fig. 2, the keyword "Position vacant:" and the attribute value "sales manager" can both be described by regular expressions. After a number of pages have been scanned, the set of attribute values may contain elements such as "software engineer", "project manager" and "mechanical engineer", while the "distance" between keyword and attribute value never changes. When an attribute value that cannot be matched appears in a new page, the "zone" containing it can be computed from this fixed distance and the attribute value extracted.
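A minimal Python sketch of this first strategy follows, assuming the separator set of <td> and <span> tags. The page markup, the threshold value and all function names are hypothetical; the patent does not prescribe an implementation.

```python
import re
from collections import defaultdict

DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)
KEYWORD = re.compile(r"Position vacant:")
KNOWN_VALUES = re.compile(r"teacher|secretary")  # only part of the values

def split_zones(html):
    """Text between consecutive separators; empty zones are kept."""
    return [zone.strip() for zone in DELIMITER.split(html)]

def learn_distance(pages, threshold=1):
    """Return (keyword string, distance) once more than `threshold`
    distinct known values were seen at the same distance, else None."""
    seen = defaultdict(set)  # (keyword string, distance) -> matched values
    for html in pages:
        zones = split_zones(html)
        for i, zone in enumerate(zones):
            if not KEYWORD.fullmatch(zone):
                continue
            # Record the nearest following zone holding a known value.
            for j in range(i + 1, len(zones)):
                if KNOWN_VALUES.fullmatch(zones[j]):
                    seen[(zone, j - i)].add(zones[j])
                    break
    for key, values in seen.items():
        if len(values) > threshold:
            return key
    return None

def extract(html, keyword, dist):
    """Read the zone `dist` separators after the keyword, matched or not."""
    zones = split_zones(html)
    for i, zone in enumerate(zones):
        if zone == keyword and i + dist < len(zones):
            return zones[i + dist]
    return None

def page(value):  # hypothetical page layout
    return ('<td><span>Position vacant:</span></td>'
            f'<td><span id="lb_office">{value}</span></td>')

template = learn_distance([page("teacher"), page("secretary")])
print(template, extract(page("network engineer"), *template))
```

The last line shows the point of the strategy: "network engineer" is not among the enumerated values, yet it is extracted because it sits at the learned distance from the keyword.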
Second, for the case where the attribute keyword has been defined but the attribute values have not: if the attribute keyword is matched, the matched string is used as the index, and the content of the zones near this string at "distance" 0 through n is recorded for a number of pages. After these pages have been scanned, the distances are checked from nearest to farthest; if the number of elements in the content set of the zone at some "distance" exceeds a certain value, that "distance" is taken as the distance between the attribute keyword and the attribute value.
As shown in Fig. 3 and Fig. 4, the keyword "Recruitment content:" can be described by a regular expression, but the attribute value is a passage of text that is difficult to describe with one. After a number of pages have been scanned, the page source shows that the "zone" at "distance" 4 is the first whose content varies, so this "zone" is taken to hold the attribute value of the keyword.
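The second strategy can be sketched as follows: collect the distinct contents seen at each distance from the keyword and pick the first distance whose contents vary. The markup, the values of `max_distance` and `threshold`, and the function names are assumptions made for illustration only.

```python
import re
from collections import defaultdict

DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)
KEYWORD = re.compile(r"Recruitment content:")

def split_zones(html):
    """Text between consecutive separators; empty zones are kept."""
    return [zone.strip() for zone in DELIMITER.split(html)]

def learn_distance(pages, max_distance=8, threshold=1):
    """Return the smallest distance whose zone shows more than
    `threshold` distinct contents across the pages, else None."""
    contents = defaultdict(set)  # distance -> distinct zone contents
    for html in pages:
        zones = split_zones(html)
        for i, zone in enumerate(zones):
            if KEYWORD.fullmatch(zone):
                for d in range(max_distance + 1):
                    if i + d < len(zones):
                        contents[d].add(zones[i + d])
                break  # use the first keyword occurrence on each page
    # Check from the nearest distance to the farthest.
    for d in range(max_distance + 1):
        if len(contents[d]) > threshold:
            return d
    return None

def page(text):  # hypothetical page layout
    return ('<td><span>Recruitment content:</span></td>'
            f'<td><span>{text}</span></td>')

pages = [page("We are hiring two experienced testers."),
         page("Sales staff wanted for the Beijing office.")]
learned = learn_distance(pages)
print(learned)
```

Distance 0 holds the keyword itself and distances 1 to 3 hold empty zones, so none of them varies; the first zone that does is the one at distance 4.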
Third, for the case where attribute values have been defined but the attribute keyword has not: if an attribute value is matched, the nearest matched keywords of other attributes before and after it are sought, and the two distances to them are computed. Using the matched strings of those preceding and following keywords as indexes, the "distance" and the attribute value are recorded. After a number of pages have been scanned, if the "distance" between the attribute value and one of the other attribute keywords proves fairly fixed, the position of the attribute value can be determined from that other keyword and the fixed "distance".
As shown in Fig. 5 and Fig. 6, no keyword for the company name appears in the page; only the attribute value "Beijing Hua Taidian stone information consultation company limited" appears. After a number of pages have been scanned, the "distance" between this attribute value and the keyword "cut-off date:" of another attribute is found to be fairly fixed. When the attribute value cannot be matched, the "zone" containing it is obtained directly from the keyword "cut-off date:" and this fixed "distance", and the content of that "zone" is taken as the company name.
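The third strategy can be sketched in Python as below. The company-name pattern, the page layout, the `min_pages` threshold and the company names are all hypothetical; the sketch records a signed offset (negative when the value precedes the anchoring keyword), which is one possible way to keep track of direction.

```python
import re
from collections import Counter

DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)
VALUE = re.compile(r".+(?:Co\., Ltd\.|Inc\.)")  # partial value definition
OTHER_KEYWORDS = re.compile(r"Cut-off date:|Work place:")

def split_zones(html):
    """Text between consecutive separators; empty zones are kept."""
    return [zone.strip() for zone in DELIMITER.split(html)]

def nearest(zones, start, step):
    """Index of the nearest keyword of another attribute, or None."""
    j = start + step
    while 0 <= j < len(zones):
        if OTHER_KEYWORDS.fullmatch(zones[j]):
            return j
        j += step
    return None

def learn_anchor(pages, min_pages=2):
    """Return (other keyword, signed offset to the value) if that pair
    was observed on at least `min_pages` pages, else None."""
    votes = Counter()
    for html in pages:
        zones = split_zones(html)
        for i, zone in enumerate(zones):
            if VALUE.fullmatch(zone):
                for j in (nearest(zones, i, -1), nearest(zones, i, +1)):
                    if j is not None:
                        votes[(zones[j], i - j)] += 1
                break
    for key, count in votes.most_common():
        if count >= min_pages:
            return key
    return None

def extract(html, keyword, offset):
    zones = split_zones(html)
    for j, zone in enumerate(zones):
        if zone == keyword and 0 <= j + offset < len(zones):
            return zones[j + offset]
    return None

def page(company, date):  # hypothetical page layout
    return (f"<td><span>{company}</span></td>"
            f"<td><span>Cut-off date:</span></td><td><span>{date}</span></td>")

anchor = learn_anchor([page("Example Trading Co., Ltd.", "2006-12-01"),
                       page("Sample Software Inc.", "2006-12-15")])
print(anchor, extract(page("Acme Studio", "2007-01-10"), *anchor))
```

"Acme Studio" does not match the value pattern, but it is still recovered because it lies at the learned offset from the "Cut-off date:" keyword.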
Finally, to distinguish web pages of different structures, the same attribute keyword records the attribute value history in separate groups, each group indexed by the other keywords that appeared before and after this keyword when the page was scanned.
As shown in Fig. 7, the keyword "Position type:" has several attribute value record sets, and the sets are distinguished by the other keywords that precede and follow it, such as "Position vacant:" and "Minimum education:".
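One way to picture this grouping is a nested mapping from keyword to neighbouring-keyword pair to recorded values. The sketch below works on zone lists that are already split, and assumes a fixed distance of 1 for brevity; the layouts, keyword set and function names are hypothetical, since Fig. 7 is not reproduced here.

```python
import re
from collections import defaultdict

KEYWORDS = re.compile(
    r"Position vacant:|Position type:|Minimum education:|Work place:")

def neighbour(zones, start, step):
    """Nearest other keyword before (step=-1) or after (step=+1), or None."""
    j = start + step
    while 0 <= j < len(zones):
        if KEYWORDS.fullmatch(zones[j]):
            return zones[j]
        j += step
    return None

def record(history, zones, keyword, distance=1):
    """Store the value of `keyword` in the group named by its neighbours."""
    for i, zone in enumerate(zones):
        if zone == keyword and i + distance < len(zones):
            group = (neighbour(zones, i, -1), neighbour(zones, i, +1))
            history[keyword][group].add(zones[i + distance])

# history[keyword][(previous keyword, next keyword)] -> recorded values
history = defaultdict(lambda: defaultdict(set))

# Two hypothetical page structures that share the keyword "Position type:".
layout_a = ["Position vacant:", "sales manager", "Position type:", "full-time",
            "Minimum education:", "bachelor"]
layout_b = ["Work place:", "Beijing", "Position type:", "part-time"]

record(history, layout_a, "Position type:")
record(history, layout_b, "Position type:")
groups = {g: sorted(v) for g, v in history["Position type:"].items()}
print(groups)
```

Because each page structure lands in its own group, statistics gathered on one layout do not distort the distance learned for another.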
Claims (5)
1. An automatic template information locating method for structured web pages, characterized in that: an attribute keyword is located with a regular expression, the distance from the attribute keyword to the attribute value is determined, and all attribute values are finally located from the attribute keyword and the distance from the attribute keyword to the attribute value.
2. The automatic template information locating method for structured web pages according to claim 1, characterized in that it specifically comprises the following steps:
(1A) locating, with regular expressions, the position of the attribute keyword of an attribute and the positions of some of the attribute values corresponding to this attribute keyword;
(2A) determining the distance between the attribute keyword and the positions of those attribute values;
(3A) determining, from the attribute keyword and the distance determined in (2A), the positions of all attribute values corresponding to this attribute keyword.
3. The automatic template information locating method for structured web pages according to claim 1, characterized in that it specifically comprises the following steps:
(1B) locating, with a regular expression, the position of the attribute keyword of an attribute, and searching a plurality of structured web pages for the zone nearest to that keyword position whose content changes between pages;
(2B) determining the distance between the attribute keyword and that zone;
(3B) determining, from the attribute keyword and the distance determined in (2B), the positions of all attribute values corresponding to this attribute keyword.
4. The automatic template information locating method for structured web pages according to claim 1, characterized in that it specifically comprises the following steps:
(1C) locating, with regular expressions, the positions of some of the attribute values of an attribute, and the position of the attribute keyword of another attribute nearest to those attribute value positions;
(2C) determining the distance between those attribute values and the nearest attribute keyword of the other attribute;
(3C) determining, from that attribute keyword of the other attribute and the distance determined in (2C), the positions of all attribute values of the first attribute.
5. The automatic template information locating method for structured web pages according to claim 1, characterized in that: the distance is the number of separators.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2006101378554A CN100562872C (en) | 2006-11-07 | 2006-11-07 | Automatic template information locating method for structured web pages |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101178708A true CN101178708A (en) | 2008-05-14 |
CN100562872C CN100562872C (en) | 2009-11-25 |
Family ID: 39404966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2006101378554A Expired - Fee Related CN100562872C (en) | 2006-11-07 | 2006-11-07 | Automatic template information locating method for structured web pages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100562872C (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013107308A1 (en) * | 2012-01-20 | 2013-07-25 | 华为终端有限公司 | Method and apparatus for aggregating information |
CN105574084A (en) * | 2015-12-10 | 2016-05-11 | 天津海量信息技术有限公司 | Extraction method of case information in webpage |
CN108664535A (en) * | 2017-04-01 | 2018-10-16 | 北京京东尚科信息技术有限公司 | Information output method and device |
CN105760290B (en) * | 2014-12-17 | 2018-11-13 | 阿里巴巴集团控股有限公司 | The problem of being tested based on webpage front-end localization method and relevant apparatus, system |
CN109344355A (en) * | 2018-09-26 | 2019-02-15 | 北京因特睿软件有限公司 | Automatic returning detection and Block- matching adaptive approach and device for Web evolution |
CN110019084A (en) * | 2017-10-12 | 2019-07-16 | 航天信息股份有限公司 | Split layer index method and apparatus towards HDFS |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013107308A1 (en) * | 2012-01-20 | 2013-07-25 | 华为终端有限公司 | Method and apparatus for aggregating information |
CN105760290B (en) * | 2014-12-17 | 2018-11-13 | 阿里巴巴集团控股有限公司 | The problem of being tested based on webpage front-end localization method and relevant apparatus, system |
CN105574084A (en) * | 2015-12-10 | 2016-05-11 | 天津海量信息技术有限公司 | Extraction method of case information in webpage |
CN108664535A (en) * | 2017-04-01 | 2018-10-16 | 北京京东尚科信息技术有限公司 | Information output method and device |
CN108664535B (en) * | 2017-04-01 | 2022-08-12 | 北京京东尚科信息技术有限公司 | Information output method and device |
CN110019084A (en) * | 2017-10-12 | 2019-07-16 | 航天信息股份有限公司 | Split layer index method and apparatus towards HDFS |
CN110019084B (en) * | 2017-10-12 | 2022-01-14 | 航天信息股份有限公司 | HDFS (Hadoop distributed File System) -oriented split layer indexing method and device |
CN109344355A (en) * | 2018-09-26 | 2019-02-15 | 北京因特睿软件有限公司 | Automatic returning detection and Block- matching adaptive approach and device for Web evolution |
CN109344355B (en) * | 2018-09-26 | 2022-03-15 | 北京因特睿软件有限公司 | Automatic regression detection and block matching self-adaption method and device for webpage change |
Also Published As
Publication number | Publication date |
---|---|
CN100562872C (en) | 2009-11-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1955963B (en) | System and method for searching dates in electronic documents | |
CN104731941B (en) | A kind of method based on XBRL technologies from unstructured financial report crawl data | |
CN102279894B (en) | Method for searching, integrating and providing comment information based on semantics and searching system | |
CN102982076B (en) | Based on the various dimensions content mask method in semantic label storehouse | |
CN100573520C (en) | For retrieval is carried out pretreated method and apparatus to a plurality of documents | |
CN101908071B (en) | Method and device thereof for improving search efficiency of search engine | |
CN102722498B (en) | Search engine and implementation method thereof | |
CN100514323C (en) | System and method for automatically extracting by-line information | |
CN100444591C (en) | Method for acquiring front-page keyword and its application system | |
CN100562872C (en) | Automatic moulding plate information locating method at the structuring webpage | |
CN105468605A (en) | Entity information map generation method and device | |
CN102567494B (en) | Website classification method and device | |
CN104375992A (en) | Address matching method and device | |
CN101641674A (en) | Time series search engine | |
CN103617174A (en) | Distributed searching method based on cloud computing | |
US20070088743A1 (en) | Information processing device and information processing method | |
CN103324622A (en) | Method and device for automatic generating of front page abstract | |
CN105550189A (en) | Ontology-based intelligent retrieval system for information security event | |
CN101393565A (en) | Facing virtual museum searching method based on noumenon | |
CN102681994A (en) | Webpage information extracting method and system | |
Fu et al. | Web content extraction based on webpage layout analysis | |
CN102073641A (en) | Method, device and program for processing consumer-generated media information | |
CN106055618A (en) | Data processing method based on web crawlers and structural storage | |
CN101751439A (en) | Image retrieval method based on hierarchical clustering | |
CN105550375A (en) | Heterogeneous data integrating method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20091125 |