CN102722489B - System and method for extracting an object identifier from a web page - Google Patents

Publication number
CN102722489B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110078361.4A
Other languages
Chinese (zh)
Other versions
CN102722489A (en)
Inventor
姜珊珊
谢宣松
孙军
郑继川
赵立军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Application filed by Ricoh Co Ltd
Priority to CN201110078361.4A
Publication of CN102722489A
Application granted
Publication of CN102722489B


Abstract

A system and method for extracting an object identifier from a web page are disclosed. The system comprises: an identifier block identification module for identifying an identifier block in a web page, where the web page contains object-identifier-related information representing the various pieces of information of the object identifier, and the identifier block is a piece of text containing that object-identifier-related information; an identifier fragment extraction module, connected to the identifier block identification module, for removing useless information from the identifier block according to at least one of the position information and the content information of each word in the identifier block identified by the identifier block identification module, so as to obtain an identifier fragment; and an identifier unit labeling module, connected to the identifier fragment extraction module, for labeling the identifier fragment extracted by the identifier fragment extraction module as an object identifier suitable for building an object database.

Description

System and method for extracting an object identifier from a web page
Technical field
The present invention relates generally to information processing and information extraction technology, and more specifically to a system and method for identifying and extracting an object identifier from a web page.
Background art
In the current field of information processing, it is often necessary to build an object database. This involves providing object identifiers for object generation and for object mapping with a hierarchical structure, representing the objects, and building an index.
Here, the objects to be processed usually involve web pages on the Internet. An object in the real world has its own unique object identifier (that is, its name); of course, the object identifier may also be expressed by an alias or a conventional abbreviation. For example, it is common for the same object to be named differently on different web pages, whereas within a single web page the same object is usually referred to consistently. To build an object database and perform object mapping, an object should have a unique and consistent object identifier, although this object identifier may be incomplete.
The name of an object can identify a product object, but representing the object by name alone may introduce ambiguity, because the names of several products are sometimes very similar; in that case some auxiliary information is needed to help identify the product object. The name used to represent a product object, together with this kind of auxiliary information, is therefore generally called object-identifier-related information, and this object-identifier-related information may be contained in a web page.
A technique for extracting web page titles and applying them is disclosed in "Web page title extraction and its application; Yewei Xue, Yunhua Hu. Information Processing & Management, Vol. 43, No. 5, September 2007, pp. 1332-1347". In that document, the supervised learning methods SVM and CRF are used to extract the web page title from an HTML document, where the features used for title extraction are based on the DOM tree and on visual information.
In addition, a computer-implemented method and system for part-of-speech tagging of incomplete sentences is disclosed in "US6910004B2: Method and computer system for part-of-speech tagging of incomplete sentences, Xerox". That document provides a method that uses an identifier and artificial context information to perform part-of-speech tagging on a phrase, where a phrase is a set of words.
In addition, a technique for extracting product names is disclosed in "Name It: Extraction of product names; Gerhard Friedrich, Kostyantyn Shchekotykhin. Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06)". That document mainly provides a method for obtaining product names from web pages: it first extracts product name information from the content of the "TITLE" tag and the "A" tags of a web page, then removes two kinds of noise, namely website-related noise and product-feature-related noise, and finally integrates the results into product names by clustering.
However, the prior art described above mainly has the following shortcomings. First, in the methods for extracting a web page title or a product name disclosed in the above documents, using only DOM-tree and visual features may not achieve the precision required for extracting object-identifier-related information. Second, the extracted object-identifier-related information still requires further noise cleaning and identifier unit labeling before an object identifier is obtained.
In summary, extracting an object identifier from a web page actually involves two problems: how the object identifier is defined and, once it is defined, how it is identified. Ideally, the title of a web page is the identifier of the object the page discusses. In practice, however, the title may contain only part of the object-identifier-related information, and subsequent processing such as noise cleaning and unit labeling is needed to obtain the object identifier. Furthermore, the object-identifier-related information provided by a single web page is not comprehensive enough, so the information from multiple web pages needs to be integrated to obtain the object identifier.
Therefore, in view of the above problems in the prior art, there is a need for a system and method for extracting an object identifier from a web page that can improve the precision of extracting object-identifier-related information from the web page and can effectively perform subsequent processing, such as noise cleaning and unit labeling, on the extracted information, so as to obtain the object identifier needed to build an object database.
Summary of the invention
Accordingly, an object of the present invention is to solve one or more of the above problems and shortcomings of the prior art.
An object of the present invention is to provide a system and method for extracting an object identifier from a web page that can perform noise cleaning on the extracted object-identifier-related information based on the position information of each word in that information, its content information, or both, so as to obtain an identifier fragment that meets the needs of building an object database.
Another object of the present invention is to provide a system and method for extracting an object identifier from a web page that can label the extracted identifier fragment as a 4-unit chain using a method based on word frequency and mutual information, so as to meet the needs of building an object database.
A further object of the present invention is to provide a system and method for extracting an object identifier from a web page that use not only DOM-tree and visual features but also a judgment of content information to extract, from the web page, the identifier block containing object-identifier-related information, thereby improving the extraction precision of the object-identifier-related information.
According to one aspect of the present invention, a system for extracting an object identifier from a web page is provided, where the web page contains object-identifier-related information representing the various pieces of information of the object identifier. The system comprises: an identifier block identification module for identifying an identifier block in the web page, the identifier block being a piece of text containing the object-identifier-related information; an identifier fragment extraction module, connected to the identifier block identification module, for removing useless information from the identifier block according to at least one of the position information and the content information of each word in the identifier block identified by the identifier block identification module, so as to obtain an identifier fragment; and an identifier unit labeling module, connected to the identifier fragment extraction module, for labeling the identifier fragment extracted by the identifier fragment extraction module as an object identifier suitable for building an object database.
The above system further comprises: a missing unit supplementing module, connected to the identifier fragment extraction module and the identifier unit labeling module, for integrating the identifier fragments extracted from multiple web pages by the identifier fragment extraction module to form an integrated identifier fragment, and for outputting the integrated identifier fragment to the identifier unit labeling module, which labels it as an object identifier suitable for building an object database.
The above system further comprises: an identifier matching module, connected to the identifier unit labeling module, for identifying, from the object identifiers labeled by the identifier unit labeling module, the object identifiers that represent the same product object.
In the above system, the identifier block identification module comprises: a web page processing unit for processing the web page to obtain a DOM tree and visual information; a visual information calculation unit, connected to the web page processing unit, for calculating the weight of each node in the DOM tree according to the node's visual information; a structural information calculation unit, connected to the web page processing unit, for calculating the weight of each node in the DOM tree according to the node's structural information; a content information calculation unit, connected to the web page processing unit, for calculating the weight of each node in the DOM tree according to the node's content information; and a weight selection unit, connected to the visual information calculation unit, the structural information calculation unit, and the content information calculation unit, for selecting, according to the weights calculated for each node in the DOM tree by those three units, the node with the higher weight as the identifier block.
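The selection step can be sketched in Python as follows. The source does not say how the three per-node weights are combined, so the weighted sum, the `NodeWeights` record, and the `coefficients` parameter below are illustrative assumptions rather than the patented procedure.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class NodeWeights:
    """Hypothetical record holding the three weights computed for one DOM text node."""
    text: str
    visual: float      # from the visual information calculation unit
    structural: float  # from the structural information calculation unit
    content: float     # from the content information calculation unit


def select_identifier_block(
    nodes: Iterable[NodeWeights],
    coefficients: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> str:
    """Return the text of the node with the highest combined weight.

    Assumption: the combined weight is a linear sum of the three weights.
    """
    a, b, c = coefficients

    def combined(node: NodeWeights) -> float:
        return a * node.visual + b * node.structural + c * node.content

    return max(nodes, key=combined).text


nodes = [
    NodeWeights("Home | Cameras | Reviews", visual=0.2, structural=0.0, content=0.1),
    NodeWeights("Ricoh GR DIGITAL III Review", visual=0.9, structural=1.0, content=0.8),
]
print(select_identifier_block(nodes))
```

A linear combination keeps the three calculation units independent of one another, so each can be tuned or replaced separately.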
In the above system, the visual information calculation unit is configured to: give a smaller weight to nodes that share the same abscissa or the same ordinate and therefore cannot be identifier blocks; evaluate the position of a text node in the web page with a two-dimensional Gaussian function of (u, v), where u denotes the abscissa and v denotes the ordinate, the value of the function is used as the position weight, and the constants (u_0, v_0, σ) are adjusted according to the specific task; give a higher weight to a larger font; and give a higher weight if the text is bold.
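As a rough illustration of the visual weighting, the Python sketch below uses an un-normalized isotropic Gaussian centered at (u_0, v_0). The exact form of the Gaussian is not reproduced in the text, so this form, the default constants, the use of page coordinates normalized to [0, 1], and the additive font and bold bonuses are all assumptions.

```python
import math


def position_weight(u: float, v: float,
                    u0: float = 0.5, v0: float = 0.2, sigma: float = 0.25) -> float:
    """Position weight of a text node at normalized page coordinates (u, v).

    Assumed form: exp(-((u - u0)^2 + (v - v0)^2) / (2 * sigma^2)).
    The constants (u0, v0, sigma) are placeholders to be tuned per task.
    """
    squared_distance = (u - u0) ** 2 + (v - v0) ** 2
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def visual_weight(u: float, v: float, font_size: float, bold: bool,
                  base_font_size: float = 12.0, bold_bonus: float = 0.5) -> float:
    """Combine position, font size, and boldness into one visual weight.

    Assumption: the three contributions are simply added together.
    """
    weight = position_weight(u, v)
    weight += font_size / base_font_size  # larger fonts score higher
    if bold:
        weight += bold_bonus              # bold text scores higher
    return weight


# A large bold heading near the top center outranks small body text lower down.
print(visual_weight(0.5, 0.2, font_size=24.0, bold=True))
print(visual_weight(0.5, 0.8, font_size=12.0, bold=False))
```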
In the above system, the structural information calculation unit is configured to: when calculating the weight from structural information, increase the weight of heading tags such as the "H1", "H2", and "H3" tags.
In the above system, the content information calculation unit is configured to: calculate the similarity between the content of a node and the text content of the "TITLE" tag with the following formula:

$$\mathrm{sim}(e, e_{\text{title}}) = \frac{\left|\{\, w_k \mid w_k \in e \ \wedge\ w_k \in e_{\text{title}} \,\}\right|}{\log(|e|) + \log(|e_{\text{title}}|)}$$

where e denotes the content of the node, e_title denotes the content of the "TITLE" tag, and w_k denotes a word in the node; if the web page is a product specification page, match each word in the node against the following regular expressions: "([0-9]+[A-z]+)+[0-9]*", "([A-z]+[0-9]+)+[A-z]*", "([0-9]+[-]{0,1}[A-z]+[-]{0,1})+[0-9]*", and "([A-z]+[-]{0,1}[0-9]+[-]{0,1})+[A-z]*", where each regular expression represents a general pattern followed by the product names of electronic products; and give a higher weight to the words in the node that have a higher word frequency.
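A minimal Python sketch of the title similarity and the regular-expression check follows. The natural logarithm, the whole-word (`fullmatch`) matching, and the fallback used when the denominator is zero (both texts contain a single word) are assumptions; the text does not specify them.

```python
import math
import re
from typing import List

# The four product-name patterns from the text. Note that the class [A-z]
# also covers the few punctuation characters between "Z" and "a" in ASCII.
PRODUCT_NAME_PATTERNS = [re.compile(p) for p in (
    r"([0-9]+[A-z]+)+[0-9]*",
    r"([A-z]+[0-9]+)+[A-z]*",
    r"([0-9]+[-]{0,1}[A-z]+[-]{0,1})+[0-9]*",
    r"([A-z]+[-]{0,1}[0-9]+[-]{0,1})+[A-z]*",
)]


def title_similarity(node_words: List[str], title_words: List[str]) -> float:
    """Shared-word count divided by log|e| + log|e_title|."""
    if not node_words or not title_words:
        return 0.0
    shared = set(node_words) & set(title_words)
    denominator = math.log(len(node_words)) + math.log(len(title_words))
    if denominator == 0.0:
        # Assumption: with two one-word texts, fall back to the raw overlap.
        return float(len(shared))
    return len(shared) / denominator


def looks_like_product_name(word: str) -> bool:
    """True if the whole word matches any of the product-name patterns."""
    return any(p.fullmatch(word) for p in PRODUCT_NAME_PATTERNS)


node = ["Ricoh", "GR", "DIGITAL", "III"]
title = ["Ricoh", "GR", "DIGITAL", "III", "Review"]
print(title_similarity(node, title))
print(looks_like_product_name("CX3"), looks_like_product_name("Review"))
```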
In the above system, the identifier fragment extraction module is configured to: determine whether each word in the identifier block falls within a window of size 5 at the beginning of the sentence, scoring the word 1 if so and 0 otherwise; determine whether each word in the identifier block can be found in a general dictionary, scoring the word 0 if so and 1 otherwise; if the web page is a product specification page, determine whether each word in the identifier block matches a specific regular expression, scoring the word 1 if so and 0 otherwise, where the regular expression represents a general pattern followed by the product names of electronic products; and remove from the identifier block the words that receive more than two scores of 0; where the groups of consecutive words in the identifier block that are separated by the removed words form a series of identifier units, which make up the identifier fragment.
In the above system, the identifier unit labeling module is configured to: if the web page is a product specification page, label the identifier fragment as a 4-unit chain, that is, {category, manufacturer, product name, attribute}.
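The word scoring and unit grouping can be sketched in Python as below. Only one of the product-name patterns is used, the dictionary is a plain set of lowercase words, and the removal threshold is exposed as the hypothetical parameter `max_zero_scores` because the wording of the removal rule is ambiguous; all of these are assumptions made for illustration.

```python
import re
from typing import List, Set

# One product-name pattern from the text; it matches words such as "CX3".
PRODUCT_NAME = re.compile(r"([A-z]+[0-9]+)+[A-z]*")


def extract_identifier_fragment(
    words: List[str],
    dictionary: Set[str],
    product_page: bool = True,
    window: int = 5,
    max_zero_scores: int = 2,
) -> List[List[str]]:
    """Score each word, drop noisy words, and group the rest into units.

    A word is removed when it has more than `max_zero_scores` scores of 0.
    Runs of kept words separated by removed words become identifier units.
    """
    units: List[List[str]] = []
    current: List[str] = []
    for index, word in enumerate(words):
        scores = [
            1 if index < window else 0,              # position score
            0 if word.lower() in dictionary else 1,  # dictionary score
        ]
        if product_page:
            scores.append(1 if PRODUCT_NAME.fullmatch(word) else 0)
        if scores.count(0) > max_zero_scores:
            if current:  # a removed word closes the current unit
                units.append(current)
                current = []
        else:
            current.append(word)
    if current:
        units.append(current)
    return units


dictionary = {"review", "the", "of"}
words = ["Ricoh", "review", "CX3", "the", "of", "the", "of"]
print(extract_identifier_fragment(words, dictionary, max_zero_scores=1))
```

With the stricter setting `max_zero_scores=1`, the dictionary words are dropped and the result is two units, one for "Ricoh" and one for "CX3".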
In the above system, the identifier fragment consists of a series of identifier units, unit 0, unit 1, ..., unit n, and the identifier unit labeling module is configured to: assume that unit 0 is the manufacturer unit and calculate the mutual information between unit 0 and unit 1; if the mutual information is 0, determine that the manufacturer unit is unit 0, and if the mutual information is 1, determine that the manufacturer unit is (unit 0, unit 1); assume that the word with the highest word frequency belongs to the product name unit and calculate the mutual information between unit 1 and unit k+1; if the mutual information is less than a certain threshold, determine that the product name unit is (unit 1, unit 2, ..., unit k+1), where 0 < k < 5; and determine that the attribute unit is (unit k+2, ..., unit n).
In the above system, the missing unit supplementing module is configured to: when a new hierarchy chain is merged into an existing hierarchy chain, calculate the similarity between each unit in the new chain and the existing units; if an identical unit exists in the existing chain, move on to the next unit; if a similar unit exists in the existing chain, determine the inclusion relationship between the two units and attach the unit of the new chain after the similar existing unit; if the new chain is unrelated to the units in the existing chain, attach the units of the new chain after the virtual root node of the existing chain; and select the integrated identifier fragment from the existing chain according to the number of occurrences of each unit.
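In the above system, the identifier matching module uses the following formula to match the object identifiers that represent the same product object:

$$\mathrm{Similarity}(\mathit{link}_i, \mathit{link}_j) = \sum_{k} w_k \, \mathrm{sim}(\mathit{Unit}[k]_i, \mathit{Unit}[k]_j),$$

where $0 \le k \le 3$, $\sum_k w_k = 1$, and

$$\mathrm{sim}(u_i, u_j) = \frac{\left|\{\, \mathit{word} \mid \mathit{word} \in u_i \ \wedge\ \mathit{word} \in u_j \,\}\right|}{\log(|u_i|) + \log(|u_j|)}.$$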
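The matching formula can be sketched in Python as follows. The unit weights, the natural logarithm, and the fallback for a zero denominator (two one-word units, which is common for manufacturer units) are assumptions; the text only requires that the four weights sum to 1.

```python
import math
from typing import Sequence

Unit = Sequence[str]  # a unit is a collection of words


def unit_similarity(u_i: Unit, u_j: Unit) -> float:
    """Shared-word count divided by log|u_i| + log|u_j|."""
    if not u_i or not u_j:
        return 0.0
    shared = set(u_i) & set(u_j)
    denominator = math.log(len(u_i)) + math.log(len(u_j))
    if denominator == 0.0:
        # Assumption: for two one-word units, use the raw overlap (0 or 1).
        return float(len(shared))
    return len(shared) / denominator


def chain_similarity(
    chain_i: Sequence[Unit],
    chain_j: Sequence[Unit],
    weights: Sequence[float] = (0.1, 0.3, 0.5, 0.1),  # hypothetical values
) -> float:
    """Weighted sum of unit similarities over the four units (k = 0..3)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("unit weights must sum to 1")
    return sum(w * unit_similarity(a, b)
               for w, a, b in zip(weights, chain_i, chain_j))


gr3 = (["Digital", "Camera"], ["Ricoh"], ["GR", "DIGITAL", "III"], ["10.1MP"])
cx3 = (["Digital", "Camera"], ["Ricoh"], ["CX3"], ["10.1MP"])
print(chain_similarity(gr3, gr3), chain_similarity(gr3, cx3))
```

Two chains could then be treated as the same product object when their similarity exceeds a task-specific threshold, which the text leaves open.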
In the above system, when the system is used to process a web page containing the object-identifier-related information of multiple objects, the system further comprises: an identifier block classification unit, connected to the identifier block identification module and the identifier fragment extraction module, for classifying by object the identifier blocks of the multiple objects identified by the identifier block identification module, so that the identifier fragment extraction module extracts an identifier fragment from the identifier block corresponding to each object.
In the above system, when the system is used to process a web page containing the object-identifier-related information of multiple objects, the system further comprises: an identifier fragment classification unit, connected to the identifier fragment extraction module and the identifier unit labeling module, for classifying by object the identifier fragments of the multiple objects extracted by the identifier fragment extraction module, so that the identifier unit labeling module labels the identifier fragment corresponding to each object to form, for each object, an object identifier suitable for building an object database.
According to another aspect of the present invention, a method for extracting an object identifier from a web page is provided, comprising the steps of: identifying an identifier block in a web page, where the web page contains object-identifier-related information representing the various pieces of information of the object identifier, and the identifier block is a piece of text containing the object-identifier-related information; removing useless information from the identifier block according to at least one of the position information and the content information of each word in the identified identifier block, so as to obtain an identifier fragment; and labeling the identifier fragment as an object identifier suitable for building an object database.
The above method further comprises the step of: integrating the identifier fragments extracted from multiple web pages to form an integrated identifier fragment, and labeling the integrated identifier fragment as an object identifier suitable for building an object database.
The above method further comprises the step of: identifying, from the labeled object identifiers, the object identifiers that represent the same product object.
The above method further comprises the step of: when the method is used to process a web page containing the object-identifier-related information of multiple objects, classifying by object the identified identifier blocks of the multiple objects, so that the identifier fragment of each object is extracted from the identifier block corresponding to that object.
The above method further comprises the step of: when the method is used to process a web page containing the object-identifier-related information of multiple objects, classifying by object the extracted identifier fragments of the multiple objects, so that the identifier fragment corresponding to each object is labeled to form, for each object, an object identifier suitable for building an object database.
In the system and method for extracting an object identifier from a web page according to the embodiments of the present invention, not only DOM-tree and visual features are used, but content information is also introduced to calculate the weight of each text node of the DOM tree. The identifier block containing object-identifier-related information can therefore be identified in the web page more accurately, which improves the precision of the object identifiers used to build the object database.
In the system and method for extracting an object identifier from a web page according to the embodiments of the present invention, useless information is removed from the identifier block based on the position information of each word in the identifier block, its content information, or both. Useless words can therefore be removed from the identifier block more accurately, which improves the precision of the resulting identifier fragment and further improves the precision of the object identifiers used to build the object database.
In the system and method for extracting an object identifier from a web page according to the embodiments of the present invention, a method based on word frequency and mutual information is used to label scattered, unordered identifier fragments as a 4-unit chain, in which the unit is the basic element of labeling and each unit is in turn a set of words. An object identifier with this structure is convenient for building an object database, which eases subsequent object database construction.
The above and other objects, features, advantages, and technical and industrial significance of the present invention will be better understood by reading the following detailed description of the preferred embodiments of the present invention in conjunction with the accompanying drawings.
Brief description of the drawings
Fig. 1 shows an example of target web pages input as the objects of an embodiment of the present invention;
Fig. 2 shows an example of the 4-unit chain in the system and method for extracting an object identifier from a web page according to an embodiment of the present invention;
Fig. 3 shows an example of the identifier block in the system and method for extracting an object identifier from a web page according to an embodiment of the present invention;
Fig. 4 shows an example of the intermediate result in the system and method for extracting an object identifier from a web page according to an embodiment of the present invention;
Fig. 5 shows an example of the object identifier in the system and method for extracting an object identifier from a web page according to an embodiment of the present invention;
Fig. 6 shows a block diagram of an exemplary configuration of the system for extracting an object identifier from a web page according to an embodiment of the present invention;
Fig. 7 shows a flowchart of an example of the processing performed in the identifier block identification module of the system according to an embodiment of the present invention;
Fig. 8 shows a flowchart of an example of the processing performed in the identifier fragment extraction module of the system according to an embodiment of the present invention;
Fig. 9 shows a flowchart of an example of the processing performed in the missing unit supplementing module of the system according to an embodiment of the present invention;
Fig. 10 shows a flowchart of an example of the processing performed in the identifier unit labeling module of the system according to an embodiment of the present invention;
Fig. 11 shows a flowchart of an exemplary method for extracting an object identifier from a web page according to an embodiment of the present invention;
Fig. 12 shows a hardware block diagram of a computer implementation of the system for extracting an object identifier from a web page according to an embodiment of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
First, the principle of the system for extracting an object identifier from a web page according to an embodiment of the present invention is described.
As described in the background art, building an object database usually requires extracting object identifiers from web pages. An object identifier can be extracted from a web page that describes a single product object, and the respective object identifiers of multiple objects can also be extracted from a web page that describes multiple product objects. The object mentioned here usually refers to a product in the real world, such as a digital camera. Fig. 1 shows an example of target web pages input as the objects of an embodiment of the present invention. As shown in Fig. 1, the objects of the embodiment may be product specification pages for the digital camera Ricoh GR DIGITAL III, taken from different websites. A product specification page is an ideal data source for an object database, because it covers the key information needed to build an object, such as its category, identifier, attributes, and values. Of course, those skilled in the art will understand that the target web pages input in the embodiment of the present invention are not limited to product specification pages and may also be web pages of other types.
To meet the needs of building an object database, the final output of the system and method for extracting an object identifier from a web page according to the embodiment of the present invention is the integrated object identifier of the product object, for example {Digital Camera, Ricoh, GR DIGITAL III, 10.1MP}. The system and method of the embodiment therefore mainly need to solve the following problems: 1) how to define a unique object identifier for an object; 2) how to extract the object-identifier-related information of an object from a web page; 3) how to supplement the object identifier into a complete one using the object-identifier-related information from multiple web pages; and 4) how to identify the object identifiers that represent the same object.
A product name can obviously identify a product object, but representing the object by product name alone may introduce ambiguity. The product names of two completely unrelated categories are sometimes very similar, in which case auxiliary information such as category information, manufacturer information, or attribute information may be needed to help with identification. In the embodiment of the present invention, the product name and auxiliary information such as category information, manufacturer information, and attribute information are collectively called object-identifier-related information. In addition, there are many generic or conventional naming practices; for example, people habitually shorten the currently popular "Apple iPhone 4 16GB/32GB" to "iPhone". Such a name is easily recognized in context, but to a machine it is incomplete and not unique.
Therefore, to make it easier to build an object database, in the system and method for extracting an object identifier from a web page according to the embodiment of the present invention, the extracted object identifier is labeled, according to the object-identifier-related information extracted from the web page, as a 4-unit chain that serves as the object identifier for building the object database. This 4-unit chain is {category, manufacturer, product name, attribute}. For example, the chain may take the form {Smartphone/mobile, Apple, iPhone 4, 16GB/32GB}, {Digital Camera, Ricoh, GR DIGITAL III, 10.1MP}, and so on. Fig. 2 shows an example of the 4-unit chain in the system and method according to the embodiment of the present invention.
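One simple way to represent such a 4-unit chain in code is a named tuple with one field per unit. The Python sketch below is illustrative only; the class name `ObjectIdentifier`, its field names, and the `as_chain` helper are hypothetical and do not come from the source.

```python
from typing import NamedTuple


class ObjectIdentifier(NamedTuple):
    """A 4-unit chain: {category, manufacturer, product name, attribute}."""
    category: str
    manufacturer: str
    product_name: str
    attribute: str

    def as_chain(self) -> str:
        """Render the identifier in the brace notation used in the text."""
        return "{" + ", ".join(self) + "}"


camera = ObjectIdentifier("Digital Camera", "Ricoh", "GR DIGITAL III", "10.1MP")
print(camera.as_chain())  # {Digital Camera, Ricoh, GR DIGITAL III, 10.1MP}
```

Keeping the four units in a fixed order lets two identifiers be compared unit by unit, which is what the identifier matching step relies on.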
In the system and method for extracting an object identifier from a web page according to the embodiment of the present invention, possible identifier blocks are first identified in the web page. Here, an identifier block is a string of text that may contain object-identifier-related information, and it is usually the title of the web page. Those skilled in the art will understand, however, that the identifier block in the embodiment of the present invention may be any piece of text in the web page and is not limited to the title. Fig. 3 shows an example of the identifier block in the system and method according to the embodiment of the present invention. As shown in Fig. 3, the content enclosed in a box is the identifier block: "GR DIGITAL III" in the left-hand page and "Ricoh GR DIGITAL III Review" in the right-hand page. In "Ricoh GR DIGITAL III Review", the word "Review" is not part of the object identifier and should be removed in the subsequent noise cleaning.
Fig. 4 shows an example of the intermediate result in the system and method for extracting an object identifier from a web page according to the embodiment of the present invention. As shown in Fig. 4, the intermediate result is the identifier fragment, which is the structure that remains after irrelevant words are removed from the identifier block. Every unit in the identifier fragment is part of the object identifier. For example, after "Review" is removed from "Ricoh GR DIGITAL III Review", the remaining unit "Ricoh GR DIGITAL III" is the identifier fragment of the embodiment.
The identifier fragment obtained from a single web page is usually incomplete; for example, "Ricoh GR DIGITAL III" has no attribute information, and "CX3" has no manufacturer information. The missing information can be supplemented with information from multiple web pages.
Finally, as shown in Fig. 5, the identifier fragment is labeled with identifier units, which yields the object identifier. Fig. 5 shows an example of the object identifier in the system and method according to the embodiment of the present invention. In the object identifier shown in Fig. 5, the category unit is obtained by another method that is outside the scope of the embodiment of the present invention. For example, the category unit may be obtained as follows: convert the HTML Document Object Model (DOM) tree of each single web page into an Extensible Markup Language (XML) tree that contains the rendering result; extract an object identifier from each XML tree; extract a single hierarchy chain from each XML tree; integrate the single hierarchy chains from different web pages into one directed tree; and obtain the category information of the object from the directed tree according to the object identifier and the single hierarchy chain of each web page.
The system for extracting an object identifier from a webpage according to the embodiment of the present invention is described in detail below with reference to Fig. 6, which is a block diagram of an exemplary configuration of the system.
As shown in Fig. 6, the system for extracting an object identifier from a webpage according to the embodiment of the present invention comprises: an identifier block identification module 21 for identifying possible identifier blocks from a webpage; an identifier fragment abstraction module 22, connected to the identifier block identification module 21, for removing useless information from the identifier block according to the position information or the content information of each word in the identifier block identified by the identifier block identification module 21, so as to obtain an identifier fragment; a disappearance unit complementary module 23, connected to the identifier fragment abstraction module 22, for integrating the identifier fragments that the identifier fragment abstraction module 22 obtains from multiple webpages; an identifier element labeling module 24, connected to the disappearance unit complementary module 23, for labeling the identifier fragment integrated by the disappearance unit complementary module 23 with individual identifier elements, so as to form an object identifier suitable for building an object database; and an identifier match module 25, connected to the identifier element labeling module 24, for identifying, from the object identifiers composed of the identifier elements labeled by the identifier element labeling module 24, the object identifiers that represent the same product object. In this system, the input is a webpage that contains the object identifier related information of an object, for example a product specification page describing a single product; the output is the object identifier of the product object; and the intermediate result is the identifier fragment obtained from the webpage.
Each of the above modules of the system for extracting an object identifier from a webpage according to the embodiment of the present invention is described in detail below.
In the system for extracting an object identifier from a webpage according to the embodiment of the present invention, the identifier block identification module comprises: a webpage processing unit for processing the webpage to obtain a DOM tree and visual information; a visual information computing unit, connected to the webpage processing unit, for calculating the weight of each node in the DOM tree according to the visual information of the node; a structural information computing unit, connected to the webpage processing unit, for calculating the weight of each node in the DOM tree according to the structural information of the node; a content information computing unit, connected to the webpage processing unit, for calculating the weight of each node in the DOM tree according to the content information of the node; and a weight selection unit, connected to the visual information computing unit, the structural information computing unit, and the content information computing unit, for selecting the nodes with higher weights as identifier blocks according to the weights that these three units calculate for each node in the DOM tree.
In the above system, the visual information computing unit is configured to: give a lower weight to nodes that share the same abscissa or the same ordinate, since such nodes are unlikely to be identifier blocks; and evaluate the position of a text node in the webpage with a two-dimensional Gaussian function:
$$H(u,v) = e^{-D(u,v)^2 / 2\sigma^2}, \qquad D(u,v) = \sqrt{(u-u_0)^2 + (v-v_0)^2},$$
where u is the abscissa and v is the ordinate, the value of the function is used as the position weight, and the constants $(u_0, v_0, \sigma)$ are adjusted according to the specific task. In addition, the larger the font, the higher the weight, and bold text is given a higher weight.
In the above system, the structural information computing unit is configured to: when calculating the weight according to structural information, increase the weight of heading tags such as the "H1", "H2", and "H3" tags.
In the above system, the content information computing unit is configured to: calculate the similarity between the node content and the text content of the "TITLE" tag with the formula $\mathrm{sim}(e, e_{\mathrm{title}}) = |\{w_k \mid w_k \in e \wedge w_k \in e_{\mathrm{title}}\}| \,/\, (\log|e| + \log|e_{\mathrm{title}}|)$, where e is the content of the node, $e_{\mathrm{title}}$ is the content of the "TITLE" tag, and w is a word in the node; if the webpage is a product specification page, match each word in the node against the following regular expressions: "([0-9]+[A-z]+)+[0-9]*", "([A-z]+[0-9]+)+[A-z]*", "([0-9]+[-]{0,1}[A-z]+[-]{0,1})+[0-9]*", "([A-z]+[-]{0,1}[0-9]+[-]{0,1})+[A-z]*", where each regular expression represents a general pattern followed by the product names of electronic products; and give a higher weight to words with a higher word frequency in the node.
First, the identifier block identification module 21 is described. For a webpage that is input to the system of the embodiment of the present invention and contains the object identifier related information of a product object, the identifier block must first be located in the webpage. As mentioned above, an identifier block is a piece of text or an incomplete sentence in the webpage that contains the object identifier related information of the product object; in a single-object product specification page, the identifier block is usually the web page title. Hereinafter, for convenience, the web page title of a single-object product specification page is used as the example of an identifier block in the system and method of the embodiment of the present invention. However, those skilled in the art will appreciate that the identifier block is not limited to the web page title of a single-object product specification page and may be any text in any webpage that contains the object identifier related information of a product object.
The web page title of a single-object product specification page may be a title in form or a title in content. A title in form is the content of the "TITLE" tag; however, many webpages have no "TITLE" tag, or the content of the "TITLE" tag is unrelated to the object identifier. A browser kernel (for example that of Mozilla Firefox, Windows Internet Explorer, Google Chrome, or Apple Safari) can parse the HTML code of a webpage and analyze its syntax, thereby building a DOM tree that is displayed in the browser window. The DOM tree and the visual information can be obtained together through XULRunner, which is provided by Mozilla. Here, each node of the DOM tree is a piece of text that may contain object identifier related information. Therefore, the identifier block identification module 21 of the system according to the embodiment of the present invention uses three kinds of information to calculate a weight for the text content of each node and thereby identify the identifier block:
1) Visual information, comprising:
The position in the webpage (x coordinate and y coordinate);
Font features (font size, whether the text is bold).
2) Structural information
If the HTML tag is a heading tag (such as the "H1", "H2", or "H3" tag), the node is more likely to contain object identifier related information.
Methods that use the above types of information to extract the topic of a webpage have been disclosed in the prior art and are therefore not repeated in the description of the system and method of the embodiment of the present invention.
3) Content information, comprising:
A) The degree of similarity to the text content of the "TITLE" tag;
B) Regular expressions, which help identify the product names of electronic products;
C) The word frequency of each word in the text.
With the above information, each node in the DOM tree can be scored to obtain the identifier block. Fig. 7 is a flowchart of an example of the processing performed in the identifier block identification module of the system for extracting an object identifier from a webpage according to the embodiment of the present invention. As shown in Fig. 7, in step S41 the input webpage is processed, and the DOM tree and the visual information are obtained from the webpage with the third-party tool mentioned above. In step S42, a for loop is executed to calculate a weight for each node in the DOM tree; the calculation itself is carried out by the processing shown in steps S43, S44, and S45.
In step S43, the weight is calculated according to visual information. The identifier block identification module of the system of the embodiment of the present invention considers two kinds of visual information:
1) The position of the text node in the webpage. The weight is calculated with a two-dimensional Gaussian function, given by Formula 1:
$$H(u,v) = e^{-D(u,v)^2 / 2\sigma^2}, \qquad D(u,v) = \sqrt{(u-u_0)^2 + (v-v_0)^2} \qquad \text{(Formula 1)}$$
where u is the abscissa of the node, v is the ordinate of the node, and the constants $(u_0, v_0, \sigma)$ are adjusted according to the specific task. Preferably, the best results are obtained with $u_0 = 200$, $v_0 = 200$, and $\sigma = 200$. With respect to this visual information, nodes that share the same abscissa or the same ordinate are less likely to be identifier blocks, so such nodes are given a lower weight.
2) Font features. The larger the font, the higher the weight; in addition, bold text is given a higher weight.
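As a minimal sketch of the visual weights in step S43, the following Python computes the position weight of Formula 1 with the preferred constants as defaults. The function `font_weight` and its parameters `base_size` and `bold_bonus` are hypothetical, since the text states only that larger and bold fonts receive higher weights; the rule that lowers the weight of nodes sharing an abscissa or ordinate is not modeled here.

```python
import math


def position_weight(u, v, u0=200.0, v0=200.0, sigma=200.0):
    """Two-dimensional Gaussian weight H(u, v) of a text node at (u, v)."""
    d_squared = (u - u0) ** 2 + (v - v0) ** 2  # D(u, v) squared
    return math.exp(-d_squared / (2.0 * sigma ** 2))


def font_weight(font_size, is_bold, base_size=12.0, bold_bonus=0.5):
    """Hypothetical font weight: grows with font size, plus a bonus for bold text."""
    return font_size / base_size + (bold_bonus if is_bold else 0.0)


if __name__ == "__main__":
    # A node at the centre (u0, v0) gets the maximum position weight.
    print(position_weight(200, 200))
    # The weight decays as the node moves away from the centre.
    print(position_weight(400, 200))
    print(font_weight(24, True), font_weight(12, False))
```

The weight is 1 at $(u_0, v_0)$ and falls off smoothly with distance, so text near the expected title position is favored without a hard cutoff.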
In step S44, the weight is calculated according to structural information, so as to increase the weight of heading tags.
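The structural weighting of step S44 could look roughly like the sketch below, which uses Python's standard `html.parser` instead of a browser kernel and therefore provides no visual information. The class name `HeadingWeightParser` and the bonus values in `HEADING_BONUS` are assumptions made for illustration; the source says only that heading tags receive a higher weight.

```python
from html.parser import HTMLParser

# Hypothetical bonuses; the source does not give numeric values.
HEADING_BONUS = {"h1": 3.0, "h2": 2.0, "h3": 1.0}


class HeadingWeightParser(HTMLParser):
    """Collects (text, structural weight) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags currently open around the text being read
        self.nodes = []  # collected (text, weight) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            bonus = max((HEADING_BONUS.get(t, 0.0) for t in self.stack), default=0.0)
            self.nodes.append((text, bonus))


def structural_weights(html):
    """Return the text nodes of an HTML string with their heading-based weights."""
    parser = HeadingWeightParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


if __name__ == "__main__":
    print(structural_weights("<body><h1>Ricoh GR</h1><p>text</p></body>"))
```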
In step S45, the weight is calculated according to content information. The identifier block identification module 21 of the system of the embodiment of the present invention considers three kinds of content information:
1) The similarity to the content of the "TITLE" tag is used as a weight, given by Formula 2:
$$\mathrm{sim}(e, e_{\mathrm{title}}) = \frac{|\{w_k \mid w_k \in e \wedge w_k \in e_{\mathrm{title}}\}|}{\log|e| + \log|e_{\mathrm{title}}|} \qquad \text{(Formula 2)}$$
where e is the content of the node, $e_{\mathrm{title}}$ is the content of the "TITLE" tag, and w is a word in the text.
2) Whether the word matches a regular expression: the weight is 1 if it matches and 0 otherwise. These regular expressions capture patterns that the product names of electronic products generally follow, for example "([A-z]+[0-9]+)+[A-z]*", "([0-9]+[-]{0,1}[A-z]+[-]{0,1})+[0-9]*", "([A-z]+[-]{0,1}[0-9]+[-]{0,1})+[A-z]*", "([0-9]+[A-z]+)+[0-9]*", and so on.
3) The word frequency is used as a weight: the higher the word frequency of a word, the more likely the text is an identifier block.
Of course, those skilled in the art will appreciate that, although the example processing of the identifier block identification module 21 described above uses three kinds of content information (the degree of similarity to the text content of the "TITLE" tag, regular expressions, and word frequency) to identify the identifier block, embodiments of the present invention are not limited to this. Different information may be used to identify the identifier block, depending on the particular type of identifier block that the system extracts. For example, when the identifier block to be identified is not limited to the title of the webpage, the weight of the position in the webpage may be reduced among the visual information described above. In addition, if the identifier block is extracted from a type of webpage other than a product specification page, only the degree of similarity to the text content of the "TITLE" tag and the word frequency may be used, without considering whether a regular expression is matched. Those skilled in the art will understand that embodiments of the present invention are not intended to impose any limitation in this respect.
Finally, in step S46, the nodes with higher weights in the DOM tree are selected as identifier blocks according to the weights assigned from the above information. Table 1 below shows an evaluation of the results of identifying the identifier blocks of objects in the processing performed by the identifier block identification module of the system of the embodiment of the present invention.
[Table 1] Evaluation of the results of identifying the identifier blocks of objects
In addition, those skilled in the art will understand that, for convenience, the identifier block identification module 21 has been described above in terms of the method steps of the processing that it performs. However, each step of that processing may also be implemented by a corresponding hardware unit. For example, the identifier block identification module 21 may comprise a webpage processing unit for processing the webpage to obtain the DOM tree and the visual information, a visual information computing unit for calculating the weight of a node according to visual information, a structural information computing unit for calculating the weight of a node according to structural information, a content information computing unit for calculating the weight of a node according to content information, and a weight selection unit for selecting the identifier block according to the weights. Each unit performs the processing described above, which is not repeated here.
In the system for extracting an object identifier from a webpage according to the embodiment of the present invention, the identifier fragment abstraction module is configured to: judge whether each word in the identifier block appears within a window of size 5 at the beginning of the sentence, and score the word 1 if so and 0 otherwise; judge whether each word in the identifier block can be found in a general dictionary, and score the word 0 if so and 1 otherwise; if the webpage is a product specification page, judge whether each word in the identifier block matches a specific regular expression, and score the word 1 if so and 0 otherwise, where the regular expression represents a general pattern followed by the product names of electronic products; and remove from the identifier block the words that have two or more scores of 0. The groups of consecutive words in the identifier block that are separated by the removed words form a series of identifier elements, which make up the identifier fragment.
Next, the identifier fragment abstraction module 22 is described. As Table 1 above shows, the identified identifier block is usually not exactly the object identifier: information may be missing or redundant. The processing performed by the identifier fragment abstraction module 22 removes irrelevant information from the identifier block and retains the object identifier related information, thereby obtaining the identifier fragment. As shown in Fig. 4 described above, given an identifier block (a sequence of words), some irrelevant words are removed with reference to certain features. The consecutive words separated by the removed words then form multiple units (each unit is a sequence of words), that is, the identifier fragment.
Fig. 8 is a flowchart of an example of the processing performed in the identifier fragment abstraction module of the system for extracting an object identifier from a webpage according to the embodiment of the present invention. As shown in Fig. 8, in step S51 a for loop is executed to judge whether each word in the identifier block should be retained:
1) In step S52, whether the word is at the start of the sentence is judged from the position information of the word. For example, if the word lies within a window of size 5 at the beginning of the sentence, it is scored 1; otherwise it is scored 0. The reason is that the identifier block (i.e. the web page title) is usually a phrase or an incomplete sentence whose key words are placed at the beginning in order to convey information efficiently.
2) In step S53, whether the word can be found in a dictionary is judged from the content information of the word. For example, a domain-independent general dictionary is used to judge whether the word is a special word: if the word cannot be found in the dictionary, it is a special word and is scored 1; otherwise it is scored 0.
3) In step S54, whether the word matches a regular expression is judged from the content information of the word. The regular expressions may be the same as those used by the identifier block identification module 21 of the system of the embodiment of the present invention in step S45 described above. If the word matches, it is scored 1; otherwise it is scored 0.
Subsequently, in step S55, whether a word is irrelevant to the identifier is judged by checking whether at most one of its scores from steps S52, S53, and S54 is 0: a word with at most one score of 0 is relevant, and any other word is irrelevant. For "Ricoh GR DIGITAL III Digital Camera Reviews & Photography Tips", the scores produced by steps S52 to S55 are shown in Table 2.
In step S56, the words irrelevant to the identifier are removed. In the example shown in Table 2, the remaining words are "Ricoh", "GR", "DIGITAL", and "III".
In step S57, consecutive words among the remaining words are combined, for example into "Ricoh GR DIGITAL III", which serves as the identifier fragment in the system of the embodiment of the present invention.
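[Table 2] Example of scoring words by feature

| Step | Ricoh | GR | DIGITAL | III | Digital | Camera | Reviews | & | Photography | Tips |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S52 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| S53 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| S54 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| S55 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |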
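Steps S55 to S57 can be sketched as follows, taking the per-feature scores of Table 2 as given input rather than recomputing them. The function name `extract_fragment` and the list-of-rows representation are illustrative choices, not part of the source.

```python
WORDS = ["Ricoh", "GR", "DIGITAL", "III", "Digital",
         "Camera", "Reviews", "&", "Photography", "Tips"]
S52 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # within the 5-word window at the start
S53 = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]  # not found in the general dictionary
S54 = [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # matches a product-name pattern


def extract_fragment(words, *score_rows):
    """Return the identifier fragment as a list of units (lists of words)."""
    units, current = [], []
    for i, word in enumerate(words):
        zeros = sum(1 for row in score_rows if row[i] == 0)
        if zeros <= 1:            # S55: at most one score of 0, keep the word
            current.append(word)
        elif current:             # S56: a removed word closes the current unit
            units.append(current)
            current = []
    if current:                   # S57: flush the last run of kept words
        units.append(current)
    return units


if __name__ == "__main__":
    for unit in extract_fragment(WORDS, S52, S53, S54):
        print(" ".join(unit))
```

With the scores of Table 2, the only surviving run of consecutive words is "Ricoh GR DIGITAL III", which matches the result stated for step S57.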
The example processing of the identifier fragment abstraction module of the system for extracting an object identifier from a webpage according to the embodiment of the present invention has been described above with reference to Fig. 8: useless information is removed from the identifier block according to the position information and the content information of each word in the identifier block, so as to obtain the identifier fragment. Those skilled in the art will understand that useless information may also be removed according to the position information alone or the content information alone. For example, the word frequency of each word in the identifier block can simply be calculated from its content information, and the 4 words with the highest frequency retained, where the word frequency is calculated over the set of identifier blocks extracted from all webpages. For "Ricoh GR DIGITAL III Digital Camera Reviews & Photography Tips", the identifier fragment obtained by calculating word frequency may be "Ricoh GR DIGITAL III Digital".
Of course, as discussed above, the specific position information and content information consulted for each word in the identifier block may vary with the webpage being processed. For example, if the identifier block is extracted from a type of webpage other than a product specification page, the regular expression check may be skipped and only the words with high word frequency retained. The number of retained words may also be chosen as required; for example, the 5 words with the highest frequency may be retained. Those skilled in the art will understand that embodiments of the present invention are not intended to impose any limitation in this respect.
In the system for extracting an object identifier from a webpage according to the embodiment of the present invention, the disappearance unit complementary module is configured to: when a new hierarchy chain is merged into an existing hierarchy chain, calculate the similarity between each unit in the new hierarchy chain and the existing units; if an identical unit already exists in the existing chain, proceed to the next unit; if a similar unit exists in the existing chain, judge the containment relation between the two units and attach the unit of the new hierarchy chain after the existing similar unit; if the unit of the new hierarchy chain is unrelated to the units in the existing hierarchy chain, attach it after the virtual root node of the existing chain; and select the integrated identifier fragment from the existing chain according to the number of occurrences of each unit.
Next, the disappearance unit complementary module 23 is described. The object identifier related information obtained from a single webpage is usually incomplete; for example, it may contain only product name information and lack manufacturer information, attribute information, and so on. Therefore, to build the object database better, the object identifier related information missing from a single webpage needs to be supplemented from multiple webpage sources. The identifier fragment extracted by the identifier fragment abstraction module 22 described above consists of ordered units and can be regarded as a hierarchy chain. For the disappearance unit complementary module 23, supplementing missing units is therefore a problem of integrating hierarchy chains.
Fig. 9 is a flowchart of an example of the processing performed in the disappearance unit complementary module of the system for extracting an object identifier from a webpage according to the embodiment of the present invention. As shown in Fig. 9, assume that there is an existing chain with a virtual root node and that a new hierarchy chain is to be integrated into it. In step S61, a for loop is executed to decide how each unit in the new hierarchy chain is integrated:
In step S62, the similarity between the units in the existing chain and the unit in the new hierarchy chain is calculated; this can be done by similarity matching, word segmentation, dictionary lookup, and so on.
In step S63, whether an identical unit already exists in the existing chain is judged from the similarity; if so, the current iteration ends and the next unit is processed.
In step S64, whether a similar unit exists in the existing chain is judged against a certain threshold. Suppose unit A contains unit B; then A is attached after B (as a child or successor node).
In step S65, if the judgment in step S64 is YES, the unit of the new hierarchy chain is attached after the similar unit in the existing chain.
If the judgment in step S64 is NO, then in step S66 the unit of the new hierarchy chain is attached after the root node of the existing chain.
Finally, in step S67, the integrated identifier fragment is obtained according to the number of occurrences of each unit in the existing chain.
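The following Python is one possible reading of steps S61 to S67, offered as a sketch under several assumptions that the source leaves open: two units count as similar when the word set of one contains the word set of the other, the existing chain is stored as a tree under a virtual root, and the final selection in S67 keeps the units seen at least `min_count` times. The names `Node`, `merge_chain`, and `integrated_fragment`, and the sample chains, are illustrative only.

```python
class Node:
    """One unit in the existing chain; the virtual root has an empty unit."""

    def __init__(self, unit=()):
        self.unit = tuple(unit)
        self.count = 0
        self.children = []


def walk(node):
    """Yield every non-root node below `node` in depth-first order."""
    for child in node.children:
        yield child
        yield from walk(child)


def contains(a, b):
    """Assumed similarity test: one unit's word set contains the other's."""
    sa, sb = set(a), set(b)
    return sa <= sb or sb <= sa


def merge_chain(root, chain):
    """Merge one new hierarchy chain (a list of units) into the existing chain."""
    for unit in map(tuple, chain):                       # S61: loop over units
        same = next((n for n in walk(root) if n.unit == unit), None)
        if same is not None:                             # S63: identical unit
            same.count += 1
            continue
        near = next((n for n in walk(root) if contains(n.unit, unit)), None)
        parent = near if near is not None else root      # S65, else S66
        child = Node(unit)
        child.count = 1
        parent.children.append(child)


def integrated_fragment(root, min_count=2):
    """S67 (assumed rule): keep the units seen at least `min_count` times."""
    return [n.unit for n in walk(root) if n.count >= min_count]


if __name__ == "__main__":
    root = Node()
    merge_chain(root, [("Ricoh",), ("GR", "DIGITAL", "III")])
    merge_chain(root, [("GR", "DIGITAL", "III"), ("10MP",)])
    merge_chain(root, [("Ricoh",), ("GR", "DIGITAL")])
    print(integrated_fragment(root))
```

In this made-up example, the manufacturer unit missing from the second page is supplied by the first and third pages, while the unit seen only once is dropped by the count threshold.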
Those skilled in the art will understand that the disappearance unit complementary module described above is optional in the system for extracting an object identifier from a webpage according to the embodiment of the present invention. That is, when the requirements on the object identifiers used to build the database are low, in other words when the object identifier related information contained in an object identifier need not be very complete, the system according to the embodiment of the present invention may omit the disappearance unit complementary module. Likewise, when the system is applied only to extracting an object identifier from a single webpage, it does not need to process multiple webpages and therefore does not need the disappearance unit complementary module either. In this way, when the processing demands on the system are modest or the system is applied to such a special scenario, the configuration of the system can be simplified and costs reduced.
In the system for extracting an object identifier from a webpage according to the embodiment of the present invention, the identifier element labeling module is configured to: if the webpage is a product specification page, label the identifier fragment as a 4-unit chain, namely {category, manufacturer, product name, attribute}.
In the above system, the identifier fragment consists of a series of identifier elements: unit 0, unit 1, ..., unit n, and the identifier element labeling module is configured to: assume that unit 0 is the manufacturer unit and calculate the mutual information between unit 0 and unit 1; if the mutual information is 0, determine that the manufacturer unit is unit 0, and if the mutual information is 1, determine that the manufacturer unit is (unit 0, unit 1); assume that the word with the highest word frequency belongs to the product name unit and calculate the mutual information from unit 1 to unit k+1; if the mutual information is less than a certain threshold, determine that the product name unit is (unit 1, unit 2, ..., unit k+1), where 0 < k < 5; and determine that the attribute unit is (unit k+2, ..., unit n).
Next, the identifier element labeling module 24 is described. After the identifier fragment abstraction module 22 has extracted the identifier fragments and the disappearance unit complementary module 23 has supplemented the object identifier related information, the resulting integrated identifier fragment consists of multiple identifier elements, each formed by several words. To build the database, this identifier fragment is preferably labeled with predefined identifier elements, which according to the embodiment of the present invention are the manufacturer unit, the product name unit, and the attribute unit. In the system for extracting an object identifier from a webpage according to the embodiment of the present invention, the labeling performed by the identifier element labeling module 24 mainly uses mutual information and word frequency to discover the associations between identifier elements.
Fig. 10 is a flowchart of an example of the processing performed in the identifier element labeling module of the system for extracting an object identifier from a webpage according to the embodiment of the present invention. As shown in Fig. 10, the manufacturer unit is labeled first, in step S71. Observation shows that the first unit (unit 0) of an identifier fragment usually belongs to the manufacturer unit and that the manufacturer unit usually does not contain many words. Therefore, whether unit 1 also belongs to the manufacturer unit is decided by calculating the mutual information between unit 0 and unit 1. Mutual information is an information measure that quantifies the correlation between two sets of events and is given by Formula 3:
$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} P(x,y) \log \frac{P(x,y)}{P_1(x)\,P_2(y)} \qquad \text{(Formula 3)}$$
where $P(x,y)$ is the joint probability distribution function of X and Y, and $P_1(x)$ and $P_2(y)$ are the marginal probability distribution functions of X and Y respectively. If X and Y are independent, knowing X provides no information about Y and knowing Y provides no information about X, so the mutual information is 0. Therefore, if the mutual information is 0, the manufacturer unit is unit 0; otherwise, unit 0 and unit 1 together form the manufacturer unit, and so on.
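To make Formula 3 concrete, the sketch below estimates the mutual information from a list of observed (x, y) pairs by using relative frequencies as probabilities. How the pairs are derived from unit 0 and unit 1 is not specified in the source, so the input format, the function name `mutual_information`, and the choice of the natural logarithm are all assumptions.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate I(X;Y) from observed (x, y) pairs using relative frequencies."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    marginal_y = Counter(y for _, y in pairs)
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = marginal_x[x] / n
        p_y = marginal_y[y] / n
        total += p_xy * math.log(p_xy / (p_x * p_y))
    return total


if __name__ == "__main__":
    # Independent observations: every combination is equally frequent.
    print(mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)]))
    # Fully dependent observations: y always equals x.
    print(mutual_information([(0, 0), (1, 1)]))
```

Pairs that never occur contribute nothing to the sum, which is why the loop runs only over observed combinations.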
In step S72, when the manufacturer unit is unit 0, the unit with the highest word frequency is first assumed to belong to the product name unit, and the mutual information from unit 1 to unit k+1 is calculated. Since a product name usually does not contain many words, 0 < k < 5 is preferred. If the mutual information from unit 1 to unit k+1 is less than a certain threshold, the product name unit is (unit 1, ..., unit k+1) and the attribute unit is (unit k+2, ..., unit n). Those skilled in the art will understand that when the manufacturer unit contains more than unit 0, the units that make up the product name unit and the attribute unit shift backward accordingly; the labeling method is the same as described above and is not repeated here.
In this way, together with the category unit obtained beforehand by another method, an object identifier is formed as a 4-unit chain of the form {category unit, manufacturer unit, product name unit, attribute unit}.
It should be noted that, in the embodiment of the present invention described above, identifier elements are by default labeled in the order manufacturer unit, product name unit, attribute unit, and the possibility that one of these units is actually missing is not considered. For example, when the identifier fragment lacks a product name unit but contains an attribute unit, the manufacturer unit is still labeled first and the product name unit next according to the above method, so the attribute unit may be mislabeled as the product name unit. Therefore, in a further embodiment, each identifier element in the identifier fragment may first undergo a preliminary judgment based on its content, so that the identifier elements are correctly labeled as a 4-unit chain of the form {category unit, manufacturer unit, product name unit, attribute unit}.
Of course, those skilled in the art will understand that the labeling of identifier elements above has been described for the case where the input is a product specification page. When the input is a webpage other than a product specification page, the identifier elements may also be labeled as an object identifier of another form. For example, the labeled object identifier may take the form {category unit, object name unit, attribute unit}. Those skilled in the art will understand that embodiments of the present invention are not intended to impose any limitation in this respect.
In the above system, the identifier match module uses the following formula to match the object identifiers that represent the same product object:
$$\mathrm{Similarity}(link_i, link_j) = \sum_{k} w_k \,\mathrm{sim}(Unit[k]_i, Unit[k]_j),$$
where $0 \le k \le 3$, $\mathrm{sim}(u_i, u_j) = |\{word \mid word \in u_i \wedge word \in u_j\}| \,/\, (\log|u_i| + \log|u_j|)$, and $\sum_k w_k = 1$.
Next, the identifier match module 25 is described. Identifier matching is the basis for building indexes and object mappings in the object database; its purpose is to find the object identifiers of different forms that describe the same product object. For example, the object identifier in the embodiment of the present invention is a 4-unit chain, and its similarity is calculated with Formula 4:
$$\mathrm{Similarity}(link_i, link_j) = \sum_{k} w_k \,\mathrm{sim}(Unit[k]_i, Unit[k]_j) \qquad \text{(Formula 4)}$$
where $0 \le k \le 3$, Unit[] = {category unit, manufacturer unit, product name unit, attribute unit}, $\mathrm{sim}(u_i, u_j) = |\{word \mid word \in u_i \wedge word \in u_j\}| \,/\, (\log|u_i| + \log|u_j|)$, and $\sum_k w_k = 1$; preferably $w_0 = 0.1$, $w_1 = 0.2$, $w_2 = 0.6$, and $w_3 = 0.1$. Furthermore, identifier matching can also discover series relations between products; for example, under a certain similarity threshold, the products "Ricoh GR DIGITAL II" and "Ricoh GR DIGITAL III" of one series are found to be similar.
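A minimal Python sketch of Formula 4 with the preferred weights follows. Two guards are assumptions added here because the formula as written is undefined in those cases: an empty unit contributes 0, and when both units consist of a single word (so that both logarithms are 0) the shared-word count is returned directly. The category and attribute values in the sample identifiers are made up for illustration.

```python
import math

WEIGHTS = (0.1, 0.2, 0.6, 0.1)  # w0..w3: category, manufacturer, product name, attribute


def unit_sim(u_i, u_j):
    """Shared distinct words divided by log|u_i| + log|u_j|."""
    if not u_i or not u_j:
        return 0.0                    # assumption: a missing unit contributes nothing
    shared = len(set(u_i) & set(u_j))
    denom = math.log(len(u_i)) + math.log(len(u_j))
    if denom == 0.0:                  # both units are single words
        return float(shared)          # assumption: 1.0 on a match, else 0.0
    return shared / denom


def similarity(link_i, link_j, weights=WEIGHTS):
    """Weighted sum of unit similarities over the units of two identifiers."""
    return sum(w * unit_sim(a, b) for w, a, b in zip(weights, link_i, link_j))


# Hypothetical identifiers; only the product names come from the text.
LINK_A = (("camera",), ("Ricoh",), ("GR", "DIGITAL", "II"), ("10MP",))
LINK_B = (("camera",), ("Ricoh",), ("GR", "DIGITAL", "III"), ("10MP",))

if __name__ == "__main__":
    print(similarity(LINK_A, LINK_B))
```

Because the product name unit carries the largest weight, two identifiers that differ only in the last word of the product name still score highly, which is consistent with the series relation described in the text.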
Here, those skilled in the art will understand that the identifier match module is also optional in the system for extracting object identifiers from webpages according to the embodiment of the present invention. For example, when the system is applied only to extracting an object identifier from a single webpage, and that webpage contains a description of only a single object, the extracted object identifier uniquely identifies that object. That is, when there are no object identifiers of different forms for the same product object, the system may omit the identifier match module, which simplifies the system configuration.
Further, the matching performed by the identifier match module 25 can be modified according to the concrete form of the object identifier labeled by the identifier element labeling module 24. For example, when the object identifier labeled by the identifier element labeling module 24 has the form {category unit, object-name unit, attribute unit}, k in the identifier match module 25 takes the range 0≤k≤2. Those skilled in the art will understand that embodiments of the present invention are not intended to impose any limitation in this respect.
At the same time, those skilled in the art will understand that the above description of the system covers only the case where a webpage contains a single object, but embodiments of the present invention are also applicable to webpages containing multiple objects. In that case, the extracted identifier blocks can be classified according to the object identifier relevant information of the multiple objects, so that each class of identifier blocks corresponds to one object. In addition, the identifier fragments obtained after noise cleaning can be classified according to the object identifier relevant information of the multiple objects, so that each class of identifier fragments corresponds to one object, and identifier element labeling is then performed on each class of identifier fragments to form the object identifier of each object. Those skilled in the art will understand that embodiments of the present invention are not intended to impose any limitation in this respect.
In the system for extracting object identifiers from webpages according to the embodiment of the present invention, when the system is used to process a webpage containing the object identifier relevant information of multiple objects, the system further comprises: an identifier block classification unit, connected with the identifier block identification module and the identifier fragment abstraction module, for classifying the identifier blocks of the multiple objects identified by the identifier block identification module by object, so that the identifier fragment abstraction module extracts an identifier fragment from the identifier block corresponding to each object.
In the system for extracting object identifiers from webpages according to the embodiment of the present invention, when the system is used to process a webpage containing the object identifier relevant information of multiple objects, the system further comprises: an identifier fragment classification unit, connected with the identifier fragment abstraction module and the identifier element labeling module, for classifying the identifier fragments of the multiple objects extracted by the identifier fragment abstraction module by object, so that the identifier element labeling module labels the identifier fragment corresponding to each object to form, for each object, an object identifier suitable for building an object database.
Here, as noted above, each module of the system for extracting object identifiers from webpages according to the embodiment of the present invention has been described in terms of the method steps that the module performs. However, those skilled in the art will understand that each step can be realized by a specific hardware unit, with the hardware units connected to one another according to the execution order of the method steps; embodiments of the present invention are not intended to impose any limitation in this respect.
According to another aspect of the present invention, a method for extracting an object identifier from a webpage is provided, comprising the steps of: identifying an identifier block from a webpage, wherein the webpage comprises object identifier relevant information representing various information of the object identifier, and the identifier block is a piece of text comprising the object identifier relevant information; removing useless information from the identifier block according to at least one of the position information and the content information of each word in the identified identifier block, to obtain an identifier fragment; and labeling the identifier fragment as an object identifier suitable for building an object database.
The above method further comprises the step of: integrating the identifier fragments extracted from multiple webpages to form an integrated identifier fragment, and labeling the integrated identifier fragment as an object identifier suitable for building an object database.
The above method further comprises the step of: identifying, according to the labeled object identifiers, the object identifiers that represent the same product object.
The above method further comprises the step of: when the method is used to process a webpage containing the object identifier relevant information of multiple objects, classifying the identified identifier blocks of the multiple objects by object, so that the identifier fragment of each object is extracted from the identifier block corresponding to that object.
The above method further comprises the step of: when the method is used to process a webpage containing the object identifier relevant information of multiple objects, classifying the extracted identifier fragments of the multiple objects by object, so that the identifier fragment corresponding to each object is labeled to form, for each object, an object identifier suitable for building an object database.
Figure 11 shows a flowchart of an exemplary method for extracting an object identifier from a webpage according to the embodiment of the present invention. As shown in Figure 11, the example method mainly comprises two steps: object identifier extraction and object identifier processing. In step S11, object identifier extraction is performed to extract the object identifier relevant information from the webpage and form identifier fragments as in the example above. Subsequently, in step S12, object identifier processing is performed: the extracted identifier fragments further undergo missing-unit supplementing, identifier element labeling, and object identifier matching, so that the extracted object identifier information is processed into object identifiers suitable for building a database.
As shown in Figure 11, step S11 (identifier extraction) may further include the following steps:
In step S21, the identifier block is identified, that is, the identifier block is extracted from the webpage. For example, this step can be performed by the identifier block identification module 21 of the system described above with reference to Figure 6. The input of this step is a webpage, and the output is an identifier block.
In step S22, the identifier fragment is extracted: irrelevant information is removed from the identifier block, which divides the identifier block into multiple identifier elements that together form the identifier fragment. For example, this step can be performed by the identifier fragment abstraction module 22 of the system described above with reference to Figure 6. The input of this step is an identifier block, and the output is an identifier fragment.
As shown in Figure 11, step S12 (identifier processing) may further include the following steps:
In step S31, missing units are supplemented, that is, the incomplete object identifier relevant information from multiple webpages is integrated. For example, this step can be performed by the missing-unit supplementing module 23 of the system described above with reference to Figure 6. The input of this step is a series of identifier fragments, each obtained from a single webpage, and the output is one integrated identifier fragment.
In step S32, identifier element labeling is performed, so that the identifier fragment is labeled with a manufacturer unit, a product-name unit, and an attribute unit. For example, this step can be performed by the identifier element labeling module 24 of the system described above with reference to Figure 6. The input of this step is an identifier fragment, and the output is an object identifier in the form of a 4-unit chain.
In step S33, identifiers are matched, that is, the object identifiers that represent the same object are identified. For example, this step can be performed by the identifier match module 25 of the system described above with reference to Figure 6. The input of this step is a series of object identifiers in the form of 4-unit chains, and the output is a mapping that indicates which object identifiers describe the same object.
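The output of step S33, a mapping from object identifiers to the objects they describe, can be sketched as a grouping over pairwise similarities. The text does not say how pairwise matches are turned into a mapping, so the following is only one possible reading: the function name `group_identifiers`, the injected `similarity` callable, the `threshold` parameter, and the decision to merge matches transitively with a union-find structure are all assumptions of this sketch.

```python
def group_identifiers(identifiers, similarity, threshold):
    """Map each identifier index to a group id; equal ids mean "same object".

    `similarity` is any callable returning a score for two identifiers,
    for example an implementation of Formula 4 (assumed to be supplied).
    """
    parent = list(range(len(identifiers)))

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(identifiers)):
        for j in range(i + 1, len(identifiers)):
            if similarity(identifiers[i], identifiers[j]) >= threshold:
                # Assumption: matches are merged transitively.
                parent[find(j)] = find(i)

    return {i: find(i) for i in range(len(identifiers))}
```

With transitive merging, two identifiers can end up in the same group without being directly similar to each other, which may or may not be desirable for a given object database; a stricter variant would require every pair in a group to exceed the threshold.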
Figure 12 shows a hardware block diagram of a computer implementation of the system for extracting object identifiers from webpages according to the embodiment of the present invention. As shown in Figure 12, the system can be implemented on a PC system: the input and output are stored in a storage device (13) such as a hard disk, the functional modules and intermediate results are stored in RAM (11), and the functional modules are executed by the central processing unit, CPU (10).
In the system and method for extracting object identifiers from webpages according to the embodiment of the present invention, not only the DOM tree features and the visual information are used, but content information is also introduced to calculate the weight of each text node of the DOM tree. The identifier block containing the object identifier relevant information can therefore be identified from the webpage more accurately, which improves the precision of the object identifiers used to build the object database.
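A minimal sketch of the node selection described above is given below. This section does not state how the visual, structural, and content weights are computed or combined, so the `TextNode` class, the weight scales, the linear combination, and the `coeffs` parameter are all hypothetical; only the idea of choosing the text node with the highest weight comes from the text.

```python
from dataclasses import dataclass


@dataclass
class TextNode:
    text: str
    visual: float      # weight from visual information (assumed scale)
    structural: float  # weight from DOM structure information (assumed scale)
    content: float     # weight from content information (assumed scale)


def select_identifier_block(nodes, coeffs=(1.0, 1.0, 1.0)):
    """Return the text of the node with the highest combined weight.

    Assumption: the three weights are combined as a linear sum.
    """
    cv, cs, cc = coeffs

    def combined(node):
        return cv * node.visual + cs * node.structural + cc * node.content

    return max(nodes, key=combined).text


# Hypothetical nodes with made-up weights.
nodes = [
    TextNode("Home | Login | Cart", visual=0.2, structural=0.1, content=0.0),
    TextNode("Ricoh GX200 digital camera", visual=0.9, structural=0.8, content=0.7),
]
block = select_identifier_block(nodes)
```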
In the system and method for extracting object identifiers from webpages according to the embodiment of the present invention, useless information is removed from the identifier block based on the position information of each word in the identifier block, its content information, or both. Useless words can therefore be removed from the identifier block more accurately, which improves the precision of the resulting identifier fragment and further improves the precision of the object identifiers used to build the object database.
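The three scoring rules recited in claim 1 (position within a window of size 5 at the beginning of a sentence, absence from a general dictionary, and a regular-expression match on product specification pages) can be sketched as follows. The dictionary contents, the regular expression, and the treatment of the whole identifier block as one sentence are hypothetical stand-ins, because this section gives none of them. The removal condition is also exposed as a parameter, `min_zeros_to_remove`, because the translated wording does not make clear whether a word is removed at two scores of 0 or only at more than two.

```python
import re

# Hypothetical stand-ins for the "general dictionary" and the product-name
# regular expression, neither of which is given in this section.
GENERAL_DICTIONARY = {"the", "new", "buy", "now", "with", "free", "camera"}
MODEL_PATTERN = re.compile(r"^(?=.*\d)[A-Za-z0-9-]+$")  # token containing a digit


def score_word(word, index_in_sentence, is_spec_page=True, window=5):
    """Return the list of 0/1 scores for one word."""
    scores = [
        1 if index_in_sentence < window else 0,          # position rule
        0 if word.lower() in GENERAL_DICTIONARY else 1,  # dictionary rule
    ]
    if is_spec_page:
        scores.append(1 if MODEL_PATTERN.match(word) else 0)  # regex rule
    return scores


def extract_fragment(words, is_spec_page=True, min_zeros_to_remove=2):
    """Drop low-scoring words; the remaining runs become identifier elements.

    Assumption: `words` is a single sentence, so the list index is the
    position of the word counted from the beginning of the sentence.
    """
    elements, current = [], []
    for idx, word in enumerate(words):
        zeros = score_word(word, idx, is_spec_page).count(0)
        if zeros >= min_zeros_to_remove:
            if current:  # a removed word closes the current element
                elements.append(current)
                current = []
        else:
            current.append(word)
    if current:
        elements.append(current)
    return elements


fragment = extract_fragment(["Ricoh", "GX200", "with", "VF-1", "viewfinder"])
```

In this example the word "with" scores 0 on both the dictionary rule and the regex rule, so it is removed and splits the remaining words into two identifier elements.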
In the system and method for extracting object identifiers from webpages according to the embodiment of the present invention, a method based on word frequency and mutual information is used to label the scattered, unordered identifier fragment as a 4-unit chain. The object is labeled in terms of units, and each unit is in turn a set of words. An object identifier with this structure is convenient for building an object database and thus simplifies subsequent processing of the object database.
The series of operations described in this specification can be performed by hardware, by software, or by a combination of the two. When the series of operations is performed by software, the computer program can be installed in a memory of a computer built into dedicated hardware, so that the computer executes the program. Alternatively, the computer program can be installed in a general-purpose computer capable of performing various types of processing, so that the computer executes the program.
For example, the computer program can be stored in advance on a hard disk or in a ROM (read-only memory) serving as a recording medium. Alternatively, the computer program can be stored (recorded) temporarily or permanently on a removable recording medium, such as a floppy disk, a CD-ROM (compact disc read-only memory), an MO (magneto-optical) disk, a DVD (digital versatile disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium can be provided as packaged software.
The present invention has been described in detail with reference to specific embodiments. It is evident, however, that those skilled in the art can modify and substitute the embodiments without departing from the spirit of the present invention. In other words, the present invention has been disclosed by way of illustration and should not be construed restrictively. To determine the gist of the present invention, the appended claims should be considered.

Claims (8)

1. A system for extracting an object identifier from a webpage, wherein the webpage comprises object identifier relevant information representing various information of the object identifier, the system comprising:
An identifier block identification module, for identifying an identifier block from the webpage, the identifier block being a piece of text comprising the object identifier relevant information;
An identifier fragment abstraction module, connected with the identifier block identification module, for removing useless information from the identifier block according to at least one of the position information and the content information of each word in the identifier block identified by the identifier block identification module, to obtain an identifier fragment; and
An identifier element labeling module, connected with the identifier fragment abstraction module, for labeling the identifier fragment extracted by the identifier fragment abstraction module as an object identifier suitable for building an object database;
Wherein the identifier fragment abstraction module is configured to:
Judge whether each word in the identifier block appears within a window of size 5 at the beginning of a sentence, and if so score the word 1, otherwise score it 0;
Judge whether each word in the identifier block can be found in a general dictionary, and if so score the word 0, otherwise score it 1;
If the webpage is a product specification page, judge whether each word in the identifier block matches a regular expression, and if so score the word 1, otherwise score it 0, wherein the regular expression represents a general pattern shared by the product names of electronic products; and
Remove the words in the identifier block that have more than two scores of 0;
Wherein the groups of consecutive words in the identifier block that are separated by the removed words form a series of identifier elements, which form the identifier fragment.
2. The system according to claim 1, further comprising:
A missing-unit supplementing module, connected with the identifier fragment abstraction module and the identifier element labeling module, for integrating the identifier fragments extracted from multiple webpages by the identifier fragment abstraction module to form an integrated identifier fragment, and for outputting the integrated identifier fragment to the identifier element labeling module, which labels it as an object identifier suitable for building an object database.
3. The system according to claim 1, further comprising:
An identifier match module, connected with the identifier element labeling module, for identifying, according to the object identifiers labeled by the identifier element labeling module, the object identifiers that represent the same product object.
4. The system according to claim 1, wherein the identifier block identification module comprises:
A webpage processing unit, for processing the webpage to obtain a DOM tree and visual information;
A visual information computing unit, connected with the webpage processing unit, for calculating the weight of each node in the DOM tree according to the visual information of the node;
A structural information computing unit, connected with the webpage processing unit, for calculating the weight of each node in the DOM tree according to the structural information of the node;
A content information computing unit, connected with the webpage processing unit, for calculating the weight of each node in the DOM tree according to the content information of the node; and
A weight selection unit, connected with the visual information computing unit, the structural information computing unit, and the content information computing unit, for selecting, according to the weights calculated for each node in the DOM tree by the visual information computing unit, the structural information computing unit, and the content information computing unit, a node with a higher weight as the identifier block.
5. The system according to claim 1, wherein the identifier element labeling module is configured to:
If the webpage is a product specification page, label the identifier fragment as a 4-unit chain, that is, {category, manufacturer, product name, attribute}.
6. The system according to claim 5, wherein the identifier fragment consists of a series of identifier elements: unit 0, unit 1, ..., unit n, and the identifier element labeling module is configured to:
Assume that unit 0 is the manufacturer unit, and calculate the mutual information between unit 0 and unit 1: if the mutual information is 0, determine that the manufacturer unit is unit 0; if the mutual information is 1, determine that the manufacturer unit is (unit 0, unit 1);
Assume that the word with the highest word frequency belongs to the product-name unit, and calculate the mutual information between unit 1 and unit k+1: if the mutual information is less than a certain threshold, determine that the product-name unit is (unit 1, unit 2, ..., unit k+1), where 0<k<5; and
Determine that the attribute unit is (unit k+2, ..., unit n).
7. The system according to claim 1, wherein, when the system is used to process a webpage containing the object identifier relevant information of multiple objects, the system further comprises:
An identifier block classification unit, connected with the identifier block identification module and the identifier fragment abstraction module, for classifying the identifier blocks of the multiple objects identified by the identifier block identification module by object, so that the identifier fragment abstraction module extracts an identifier fragment from the identifier block corresponding to each object.
8. A method for extracting an object identifier from a webpage, comprising the steps of:
Identifying an identifier block from a webpage, wherein the webpage comprises object identifier relevant information representing various information of the object identifier, and the identifier block is a piece of text comprising the object identifier relevant information;
Removing useless information from the identifier block according to at least one of the position information and the content information of each word in the identified identifier block, to obtain an identifier fragment; and
Labeling the identifier fragment as an object identifier suitable for building an object database;
Wherein the step of obtaining the identifier fragment specifically comprises:
Judging whether each word in the identifier block appears within a window of size 5 at the beginning of a sentence, and if so scoring the word 1, otherwise scoring it 0;
Judging whether each word in the identifier block can be found in a general dictionary, and if so scoring the word 0, otherwise scoring it 1;
If the webpage is a product specification page, judging whether each word in the identifier block matches a regular expression, and if so scoring the word 1, otherwise scoring it 0, wherein the regular expression represents a general pattern shared by the product names of electronic products; and
Removing the words in the identifier block that have more than two scores of 0;
Wherein the groups of consecutive words in the identifier block that are separated by the removed words form a series of identifier elements, which form the identifier fragment.
CN201110078361.4A 2011-03-30 2011-03-30 The system and method for extracting object identifier from webpage Active CN102722489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110078361.4A CN102722489B (en) 2011-03-30 2011-03-30 The system and method for extracting object identifier from webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110078361.4A CN102722489B (en) 2011-03-30 2011-03-30 The system and method for extracting object identifier from webpage

Publications (2)

Publication Number Publication Date
CN102722489A CN102722489A (en) 2012-10-10
CN102722489B true CN102722489B (en) 2015-12-02

Family

ID=46948256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110078361.4A Active CN102722489B (en) 2011-03-30 2011-03-30 The system and method for extracting object identifier from webpage

Country Status (1)

Country Link
CN (1) CN102722489B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902571B (en) * 2012-12-27 2017-09-01 腾讯科技(深圳)有限公司 Preserve method, system and the corresponding client and server of webpage complete content
CN105512107A (en) * 2015-12-10 2016-04-20 天津海量信息技术有限公司 Internet regular text page title identification method based on vision

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101436186A (en) * 2007-11-12 2009-05-20 北京搜狗科技发展有限公司 Method and system for providing related searches
CN101576910A (en) * 2009-05-31 2009-11-11 北京学之途网络科技有限公司 Method and device for identifying product naming entity automatically
CN101615193A (en) * 2009-07-07 2009-12-30 北京大学 A kind of based on the integrated inquiry system of encyclopaedia data extract
CN101944094A (en) * 2009-07-06 2011-01-12 富士通株式会社 Webpage information extraction method and device thereof

Also Published As

Publication number Publication date
CN102722489A (en) 2012-10-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant