CN101178708A - Automatic template information locating method for structured web pages

Info

Publication number
CN101178708A
Authority
CN
China
Prior art keywords
attribute
key word
attribute key
property value
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2006101378554A
Other languages
Chinese (zh)
Other versions
CN100562872C (en)
Inventor
陈华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuxun Technology Co Ltd
Original Assignee
Beijing Kuxun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuxun Technology Co Ltd
Priority to CNB2006101378554A (patent CN100562872C)
Publication of CN101178708A
Application granted
Publication of CN100562872C
Expired - Fee Related (current legal status)
Anticipated expiration

Abstract

The invention discloses an automatic template information locating method for structured web pages. Existing locating methods match inaccurately and have difficulty judging which matched content is reasonable. To solve these problems, the invention locates an attribute keyword with a regular expression, determines the distance between the attribute keyword and the attribute value, and finally locates all attribute values from the attribute keyword and that distance. The invention can locate the searched-for information accurately and efficiently, and is suitable for various web information search engines.

Description

Automatic template information locating method for structured web pages
Technical field
The present invention relates to an automatic template information locating method for structured web pages.
Background art
Web page information extraction is an important part of Internet information mining. The problem it addresses is how to extract specified information from web pages, for example extracting the company and the position from every job posting page of a recruitment website. The prior technique uses regular expressions to match the specified information in a web page and then picks the most reasonable content from the matches. This approach has many defects. The biggest is that a regular expression can only match information enumerated in advance. For example, a regular expression can find "position vacant" values that were enumerated beforehand, such as "teacher" or "secretary", but it cannot find values such as "sales manager" or "network engineer" that were not enumerated, and in practice it is impossible to enumerate every vacant position. In addition, when searching for position information, the actual page may contain no position title at all and describe the position only in a passage of text, which a regular expression cannot find. Because the matching is not accurate enough and reasonable content cannot be judged, search results are missed.
Regular expression: a pattern for matching strings, described with a specific syntax. It can be used to check whether a string contains a given substring, to replace matched substrings, or to extract from a string the substrings that satisfy a given condition.
Delimiter: a user-defined set of specific HTML tags. These tags are closely related to the structure of the web page, and their number of occurrences and order of appearance are fairly stable across web pages of the same structure. Examples are <table>, </table>, <td>, </td>, <tr>, </tr>, and so on.
Distance: in web pages of the same structure, the number of delimiters between the keyword region and the attribute value region. As shown in Fig. 2, if <td>, </td>, <span>, and </span> are defined as delimiters, then "position vacant:" and "sales manager" are separated by 4 delimiters, that is, the distance is 4.
Region: the content of a web page between two delimiters, for example "sales manager" between the delimiters <span id="lb_office" style="font-size:12px;"> and </span> in Fig. 1.
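The definitions of delimiter, region, and distance can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the HTML fragment is a hypothetical stand-in for the page source of Fig. 2 (which is not reproduced here), and the names DELIMITER, split_regions, find_region, and distance are invented for this example.

```python
import re

# Assumed delimiter set: opening and closing <td> and <span> tags.
DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)

def split_regions(html):
    """Return the regions of a page: the text between consecutive delimiters."""
    return DELIMITER.split(html)

def find_region(regions, pattern):
    """Index of the first region matching the regular expression, or None."""
    for i, text in enumerate(regions):
        if re.search(pattern, text):
            return i
    return None

def distance(regions, keyword_pattern, value_pattern):
    """Number of delimiters between the keyword region and the value region.

    A negative result means the value region precedes the keyword region.
    """
    k = find_region(regions, keyword_pattern)
    v = find_region(regions, value_pattern)
    if k is None or v is None:
        return None
    return v - k

# Hypothetical page fragment, not the actual source shown in Fig. 2.
page = ('<tr><td><span>Position vacant:</span></td>'
        '<td><span id="lb_office" style="font-size:12px;">Sales manager</span></td></tr>')
regions = split_regions(page)
print(distance(regions, r"Position vacant:", r"Sales manager"))  # prints 4
```

Because consecutive regions are separated by exactly one delimiter, the distance is simply the difference of the two region indexes. Empty regions between adjacent tags are kept on purpose so that this count stays correct.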
Summary of the invention
In view of the defects and deficiencies of the prior art, the invention provides an automatic template information locating method for structured web pages. It exploits the relatively fixed format of structured web pages: a template of the web page is obtained by statistics, and the template is then used to extract specific information directly.
To achieve this object, the method locates the attribute keyword with a regular expression, determines the distance from the attribute keyword to the attribute value, and finally locates all attribute values from the attribute keyword and that distance.
The above method may specifically comprise the following steps:
(1A) locate, by regular expression, the position of the attribute keyword of an attribute and the position of the partial attribute values corresponding to that attribute keyword;
(2A) determine the distance between the attribute keyword and the position of the partial attribute values corresponding to that attribute keyword;
(3A) from the attribute keyword and the distance determined in step (2A), determine the position of all attribute values corresponding to that attribute keyword.
Alternatively, the above method may specifically comprise the following steps:
(1B) locate, by regular expression, the position of the attribute keyword of an attribute, and search across multiple structured web pages for the region nearest to that keyword position whose content varies;
(2B) determine the distance between the attribute keyword and the nearest region whose content varies;
(3B) from the attribute keyword and the distance determined in step (2B), determine the position of all attribute values corresponding to that attribute keyword.
Alternatively, the above method may specifically comprise the following steps:
(1C) locate, by regular expression, the position of the partial attribute values of an attribute, and the position of the attribute keyword of another attribute nearest to that position;
(2C) determine the distance between the partial attribute values and the position of the nearest attribute keyword of the other attribute;
(3C) from the nearest attribute keyword of the other attribute and the distance determined in step (2C), determine the position of all attribute values of the attribute.
In the above method, the distance is the number of delimiters.
The invention can locate the information to be searched accurately and efficiently.
Description of drawings
Fig. 1 is a schematic front-end view of the web page for the first template learning and locating strategy;
Fig. 2 is the back-end (source) view of the web page for the first template learning and locating strategy;
Fig. 3 is a schematic front-end view of the web page for the second template learning and locating strategy;
Fig. 4 is the back-end (source) view of the web page for the second template learning and locating strategy;
Fig. 5 is a schematic front-end view of the web page for the third template learning and locating strategy;
Fig. 6 is the back-end (source) view of the web page for the third template learning and locating strategy;
Fig. 7 is a schematic diagram of keyword differentiation.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings:
Most information publishing websites today use programs to publish information to web pages automatically. The format of such pages is generally fairly fixed, so it is possible to extract the fixed structure of the page, identify where the attributes of interest are located, and then accurately extract the content of interest.
In general, each attribute to be extracted has corresponding keywords. In a job posting page, for example, the attribute "work place" has keywords such as "work place", "work city", and so on. For a given attribute the number of keywords is small, which is a property of natural language. The values of the attribute, such as "Beijing", "Guangzhou", or "Shanghai", may be of many kinds. Usually, within one publishing website the keyword used for an attribute is fixed while the attribute value changes. The "distance" between the keyword and the attribute value is also normally fixed. Furthermore, in the information pages of one website, the order in which the keywords of the attributes appear is usually fixed.
Based on these characteristics, we devised an automatic template information locating technique based on statistical information. For a given attribute, we formulated three template learning and locating strategies. Before locating information, the keywords and values of the attribute to be located must be defined, using regular expressions. Because attribute keywords are few, they can basically all be defined with regular expressions. Attribute values vary widely, and it may be difficult to define all of them with regular expressions. Defining the attribute keyword is therefore required, while defining the attribute values is optional. Once the attribute keyword and part of the attribute values have been defined with regular expressions, the specified strategies can be used to locate the specific information, including all attribute values. The three template learning and locating strategies are described below:
First, when both the attribute keyword and part of the corresponding attribute values have been defined: if both are matched, then, using the matched keyword string as the index, record the matched attribute value and the "distance" between the keyword's "region" and the value's "region". After multiple web pages have been scanned, if the set of attribute values recorded at the same "distance" for the matched string of an attribute keyword has more than a certain number of elements, then from then on, whenever that matched keyword string is encountered in any web page, the attribute value can be extracted directly at the same "distance", regardless of whether the value matches the attribute values we defined.
As shown in Fig. 1 and Fig. 2, the keyword "position vacant:" and the attribute value "sales manager" can both be described by regular expressions. After multiple web pages have been scanned, the attribute value set may contain elements such as "software engineer", "project manager", and "mechanical engineer", while the "distance" between the keyword and the attribute value never changes. When an attribute value that cannot be matched appears in a new web page, the "region" containing the value can be computed from this fixed distance, and the value extracted.
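The first strategy can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the delimiter set, the page layout produced by make_page, the threshold value, and the names learn_template and extract are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed delimiter set and attribute definitions (all hypothetical).
DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)
KEYWORD = re.compile(r"Position vacant:")
KNOWN_VALUES = re.compile(r"Software engineer|Project manager|Mechanical engineer")

def learn_template(pages, threshold=2):
    """Map each matched keyword string to a distance that is supported by
    more than `threshold` distinct matched attribute values."""
    seen = defaultdict(set)  # (keyword string, distance) -> matched values
    for html in pages:
        regions = DELIMITER.split(html)
        for k, text in enumerate(regions):
            kw = KEYWORD.search(text)
            if kw is None:
                continue
            for v, candidate in enumerate(regions):
                val = KNOWN_VALUES.search(candidate)
                if val is not None:
                    seen[(kw.group(), v - k)].add(val.group())
    return {key: dist for (key, dist), values in seen.items()
            if len(values) > threshold}

def extract(html, template):
    """Read the region at the learned distance from the keyword region,
    whether or not its content matches KNOWN_VALUES."""
    regions = DELIMITER.split(html)
    for k, text in enumerate(regions):
        kw = KEYWORD.search(text)
        if kw is not None and kw.group() in template:
            target = k + template[kw.group()]
            if 0 <= target < len(regions):
                return regions[target].strip()
    return None

def make_page(value):
    # Hypothetical page layout shared by all pages of one site.
    return ("<tr><td><span>Position vacant:</span></td>"
            f"<td><span>{value}</span></td></tr>")

training = [make_page(v) for v in
            ("Software engineer", "Project manager", "Mechanical engineer")]
template = learn_template(training)
print(template)                                      # {'Position vacant:': 4}
print(extract(make_page("Sales manager"), template))  # Sales manager
```

The value "Sales manager" is not covered by KNOWN_VALUES, yet it is extracted because the learned distance alone identifies its region.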
Second, when the attribute keyword has been defined but the attribute value has not: if the attribute keyword is matched, then, using the matched string as the index, record for multiple web pages the content of the regions near this string at "distances" 0 through n. After multiple web pages have been scanned, check the distances in order from front to back; if the number of elements in the content set of the region at some "distance" exceeds a certain value, that "distance" is taken as the distance between the attribute keyword and the attribute value.
As shown in Fig. 3 and Fig. 4, the keyword "recruitment content:" can be described by a regular expression, but the attribute value is a passage of text that is hard to describe with a regular expression. After multiple web pages have been scanned, the page source shows that the "region" at "distance" 4 is the first one whose content varies, so this varying "region" is taken as the attribute value of the keyword.
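A sketch of the second strategy under similar assumptions: only the keyword is defined, and the distance is learned as the nearest region whose content differs between pages. The page layout, the max_distance and threshold parameters, and the function name learn_varying_distance are hypothetical.

```python
import re
from collections import defaultdict

# Assumed delimiter set and keyword definition (hypothetical).
DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)
KEYWORD = re.compile(r"Recruitment content:")

def learn_varying_distance(pages, max_distance=8, threshold=1):
    """Return the smallest distance at which the region content takes more
    than `threshold` distinct values across the scanned pages."""
    contents = defaultdict(set)  # distance -> distinct region contents
    for html in pages:
        regions = DELIMITER.split(html)
        for k, text in enumerate(regions):
            if KEYWORD.search(text):
                for d in range(max_distance + 1):
                    if k + d < len(regions):
                        contents[d].add(regions[k + d].strip())
                break  # only the first keyword occurrence per page is used
    for d in range(max_distance + 1):  # check from nearest to farthest
        if len(contents[d]) > threshold:
            return d
    return None

def make_page(description):
    # Hypothetical page layout shared by all pages of one site.
    return ("<tr><td><span>Recruitment content:</span></td>"
            f"<td><span>{description}</span></td></tr>")

pages = [make_page("Maintain the company web site and internal tools."),
         make_page("Visit customers and prepare quarterly sales reports.")]
print(learn_varying_distance(pages))  # prints 4
```

Regions at distances 0 to 3 hold the keyword itself or empty text and are identical on every page, so the first distance with varying content is 4.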
Third, when the attribute value has been defined but the attribute keyword has not: if the attribute value is matched, find the nearest matched keywords of other attributes before and after it, and compute the two distances. Using the matched strings of those other attribute keywords as indexes, record the "distance" and the attribute value. After multiple web pages have been scanned, if the "distance" between the attribute value and some other attribute keyword is found to be fairly fixed, the position of the attribute value can be determined from that other attribute keyword and this fixed "distance".
As shown in Fig. 5 and Fig. 6, the keyword for the company name does not appear in the web page; only the attribute value "Beijing Hua Taidian stone information consultation company limited" appears. After multiple web pages have been scanned, the "distance" between this attribute value and the keyword "cut-off date:" of another attribute is found to be fairly fixed. When the attribute value cannot be matched, the "region" containing it is obtained directly from the keyword "cut-off date:" and this fixed "distance", and the content of that "region" is taken as the company name.
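The third strategy can be sketched in the same style. Everything specific here is an assumption for illustration: the partial value pattern COMPANY_VALUE, the company names, the page layout, the min_pages parameter, and the names learn_anchor and extract.

```python
import re
from collections import defaultdict

# Assumed delimiter set and definitions (hypothetical).
DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)
OTHER_KEYWORDS = re.compile(r"Cut-off date:|Work place:")  # other attributes
COMPANY_VALUE = re.compile(r"Co\., Ltd\.$")  # partial definition of the value

def learn_anchor(pages, min_pages=2):
    """Find another attribute's keyword whose offset to the matched value is
    the same on every page; return (keyword string, offset) or None."""
    offsets = defaultdict(list)  # other keyword string -> observed offsets
    for html in pages:
        regions = [r.strip() for r in DELIMITER.split(html)]
        value_idx = next((i for i, r in enumerate(regions)
                          if COMPANY_VALUE.search(r)), None)
        if value_idx is None:
            continue
        hits = []
        for i, r in enumerate(regions):
            m = OTHER_KEYWORDS.search(r)
            if m is not None:
                hits.append((i, m.group()))
        before = [h for h in hits if h[0] < value_idx]
        after = [h for h in hits if h[0] > value_idx]
        for idx, keyword in before[-1:] + after[:1]:  # nearest on each side
            offsets[keyword].append(value_idx - idx)
    for keyword, seen in offsets.items():
        if len(seen) >= min_pages and len(set(seen)) == 1:
            return keyword, seen[0]
    return None

def extract(html, anchor):
    """Read the region at the fixed offset from the anchor keyword."""
    keyword, offset = anchor
    regions = [r.strip() for r in DELIMITER.split(html)]
    for i, r in enumerate(regions):
        if keyword in r and 0 <= i + offset < len(regions):
            return regions[i + offset]
    return None

def make_page(company):
    # Hypothetical layout: the company name has no keyword of its own.
    return (f"<tr><td><span>{company}</span></td>"
            "<td><span>Cut-off date:</span></td>"
            "<td><span>2006-12-31</span></td></tr>")

training = [make_page("Alpha Consulting Co., Ltd."),
            make_page("Beta Software Co., Ltd.")]
anchor = learn_anchor(training)
print(anchor)                                    # ('Cut-off date:', -4)
print(extract(make_page("Gamma Studio"), anchor))  # Gamma Studio
```

The negative offset records that the value region lies before the anchor keyword. "Gamma Studio" does not match COMPANY_VALUE but is still located through the anchor.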
Finally, to distinguish web pages of different structures, the same attribute keyword keeps its historical attribute value records in separate groups, and each group is indexed by the other keywords that appear before and after this keyword when the web page is scanned.
As shown in Fig. 7, the keyword "position type:" has several attribute value record sets, and the different record sets are distinguished by the other keywords before and after it, "position vacant:" and "minimum educational background:".
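One possible way to keep the grouped history is sketched below. The grouping key (keyword, previous keyword, next keyword), the two page layouts, and the names record and row are assumptions chosen for illustration; the source only states that groups are indexed by the neighbouring keywords.

```python
import re
from collections import defaultdict

# Assumed delimiter set and keyword definitions (hypothetical).
DELIMITER = re.compile(r"</?(?:td|span)\b[^>]*>", re.IGNORECASE)
KEYWORDS = re.compile(
    r"Position vacant:|Position type:|Minimum educational background:|Work place:")

def record(history, html, max_distance=4):
    """Store region contents per distance, grouped by the keyword together
    with the keywords found immediately before and after it on the page."""
    regions = [r.strip() for r in DELIMITER.split(html)]
    hits = []
    for i, r in enumerate(regions):
        m = KEYWORDS.search(r)
        if m is not None:
            hits.append((i, m.group()))
    for n, (i, keyword) in enumerate(hits):
        prev_kw = hits[n - 1][1] if n > 0 else None
        next_kw = hits[n + 1][1] if n + 1 < len(hits) else None
        group = history[(keyword, prev_kw, next_kw)]
        for d in range(1, max_distance + 1):
            if i + d < len(regions):
                group[d].add(regions[i + d])

def row(keyword, value):
    # Hypothetical markup for one keyword/value pair.
    return f"<td><span>{keyword}</span></td><td><span>{value}</span></td>"

# Two hypothetical page structures that both contain "Position type:".
layout_a = (row("Position vacant:", "Sales manager")
            + row("Position type:", "Full-time")
            + row("Minimum educational background:", "Bachelor"))
layout_b = row("Position type:", "Part-time") + row("Work place:", "Beijing")

history = defaultdict(lambda: defaultdict(set))
record(history, layout_a)
record(history, layout_b)
groups = [key for key in history if key[0] == "Position type:"]
print(len(groups))  # prints 2
```

Statistics gathered from one page structure therefore do not mix with those from another, even though the same keyword string appears in both.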

Claims (5)

1. An automatic template information locating method for structured web pages, characterized in that: an attribute keyword is located by a regular expression, the distance from the attribute keyword to the attribute value is determined, and finally all attribute values are located from the attribute keyword and the distance from the attribute keyword to the attribute value.
2. The automatic template information locating method for structured web pages according to claim 1, characterized in that it specifically comprises the following steps:
(1A) locate, by regular expression, the position of the attribute keyword of an attribute and the position of the partial attribute values corresponding to that attribute keyword;
(2A) determine the distance between the attribute keyword and the position of the partial attribute values corresponding to that attribute keyword;
(3A) from the attribute keyword and the distance determined in step (2A), determine the position of all attribute values corresponding to that attribute keyword.
3. The automatic template information locating method for structured web pages according to claim 1, characterized in that it specifically comprises the following steps:
(1B) locate, by regular expression, the position of the attribute keyword of an attribute, and search across multiple structured web pages for the region nearest to that keyword position whose content varies;
(2B) determine the distance between the attribute keyword and the nearest region whose content varies;
(3B) from the attribute keyword and the distance determined in step (2B), determine the position of all attribute values corresponding to that attribute keyword.
4. The automatic template information locating method for structured web pages according to claim 1, characterized in that it specifically comprises the following steps:
(1C) locate, by regular expression, the position of the partial attribute values of an attribute, and the position of the attribute keyword of another attribute nearest to that position;
(2C) determine the distance between the partial attribute values and the position of the nearest attribute keyword of the other attribute;
(3C) from the nearest attribute keyword of the other attribute and the distance determined in step (2C), determine the position of all attribute values of the attribute.
5. The automatic template information locating method for structured web pages according to claim 1, characterized in that: the distance is the number of delimiters.
CNB2006101378554A 2006-11-07 2006-11-07 Automatic template information locating method for structured web pages Expired - Fee Related CN100562872C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006101378554A CN100562872C (en) 2006-11-07 2006-11-07 Automatic template information locating method for structured web pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006101378554A CN100562872C (en) 2006-11-07 2006-11-07 Automatic template information locating method for structured web pages

Publications (2)

Publication Number Publication Date
CN101178708A true CN101178708A (en) 2008-05-14
CN100562872C CN100562872C (en) 2009-11-25

Family

ID=39404966

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006101378554A Expired - Fee Related CN100562872C (en) 2006-11-07 2006-11-07 Automatic template information locating method for structured web pages

Country Status (1)

Country Link
CN (1) CN100562872C (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013107308A1 (en) * 2012-01-20 2013-07-25 华为终端有限公司 Method and apparatus for aggregating information
CN105760290B (en) * 2014-12-17 2018-11-13 阿里巴巴集团控股有限公司 The problem of being tested based on webpage front-end localization method and relevant apparatus, system
CN105574084A (en) * 2015-12-10 2016-05-11 天津海量信息技术有限公司 Extraction method of case information in webpage
CN108664535A (en) * 2017-04-01 2018-10-16 北京京东尚科信息技术有限公司 Information output method and device
CN108664535B (en) * 2017-04-01 2022-08-12 北京京东尚科信息技术有限公司 Information output method and device
CN110019084A (en) * 2017-10-12 2019-07-16 航天信息股份有限公司 Split layer index method and apparatus towards HDFS
CN110019084B (en) * 2017-10-12 2022-01-14 航天信息股份有限公司 HDFS (Hadoop distributed File System) -oriented split layer indexing method and device
CN109344355A (en) * 2018-09-26 2019-02-15 北京因特睿软件有限公司 Automatic returning detection and Block- matching adaptive approach and device for Web evolution
CN109344355B (en) * 2018-09-26 2022-03-15 北京因特睿软件有限公司 Automatic regression detection and block matching self-adaption method and device for webpage change

Also Published As

Publication number Publication date
CN100562872C (en) 2009-11-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20091125