CN101866342A - Method and device for generating or displaying webpage label and information sharing system - Google Patents

Info

Publication number
CN101866342A
Authority
CN
China
Prior art keywords
mark
webpage
web page
page element
marked object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910133976A
Other languages
Chinese (zh)
Other versions
CN101866342B (en)
Inventor
郝宇
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN 200910133976 priority Critical patent/CN101866342B/en
Publication of CN101866342A publication Critical patent/CN101866342A/en
Application granted granted Critical
Publication of CN101866342B publication Critical patent/CN101866342B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for generating or displaying a webpage label, and an information sharing system based on the webpage label. The method for generating webpage label information comprises the following steps: in response to a user selecting a target webpage element on a current webpage loaded in a client Web browser as a labelled object, extracting the XPath path of the labelled object in the document object model (DOM) tree of the current webpage; generating a feature code CF of the labelled object based on the contents of the labelled object and of the context webpage elements immediately before and after the labelled object in the current webpage; and generating the webpage label information based on the XPath path and the feature code CF of the labelled object and a label input by the user, wherein the webpage label information is stored in a label database of a remote label server, and the feature code CF of the labelled object consists of a content-based feature (CBF) of the labelled object and the CBFs of the context webpage elements of the labelled object.

Description

Method and device for generating or displaying a webpage mark, and information sharing system
Technical field
The present invention relates generally to webpage labeling technology, and in particular to a technique for generating or displaying a webpage mark that takes into account the content of the target web page element serving as the marked object on the webpage, and to a technique for realizing information sharing based on such webpage marks.
Background technology
A mark is a technique for adding information to a document. The concept originated in paper media and includes highlighting keywords, adding side notes, and so on. With the rapid development and growing popularity of computer and network technology, network media have become one of the important channels through which people obtain information. Against this background, webpage labeling technology has also received attention and developed, and webpage labeling has increasingly become a hot topic in many fields, including digital libraries, computer-supported cooperative work, and information sharing and management.
Traditional Web systems provide content or information providers with very convenient publishing platforms, such as webpage authoring platforms. However, this mode of information exchange is essentially one-way: the interaction available to a person browsing a webpage is limited to clicking links, adding bookmarks, and the like. The currently popular Web 2.0 concept emphasizes participation and information sharing by large numbers of Web users, so that information flows in a two-way or even multi-way manner. Information sharing techniques in common use at present include:
-RSS (Really Simple Syndication): a server aggregates the content to be published, and the user then selects the content to be obtained. In this mode the user can only passively receive the content published by the RSS source, so the information flow is still asymmetric;
-Interactive Web publishing platforms (for example, wikis and blogs): through such platforms users can publish their own articles and opinions in order to share information. However, this kind of sharing has to take place in specifically structured webpages; users cannot share opinions on any webpage they happen to view, anywhere and at any time.
A webpage labeling system differs from the two information sharing modes above. It provides an annotation device that helps the user mark the webpages being browsed. This annotation device can be a standalone software tool that includes a browser, a standalone software tool that is independent of the browser, or an extension module integrated into the browser. For example, Annotea, a standard webpage annotation tool provided by the World Wide Web Consortium (W3C), uses RDF (Resource Description Framework) and XPointer to describe marked webpages. As a W3C recommendation, Annotea provides a standard framework and implementation method for the representation and storage of webpage marks. In the Annotea system, an RDF database server stores all webpage label information, and the user marks webpages with a specific software client. On the basis of Annotea, a number of webpage labeling systems with their own distinguishing features have also appeared, such as Annoty, Crit, e-Marked and YAWAS.
In general, the basic framework of an existing webpage labeling system is as shown in Fig. 1. As shown in Fig. 1, a prior-art webpage labeling system mainly comprises a user command processing unit 110, a mark query unit 120, a webpage obtaining unit 130 and a webpage label synthesis unit 140. The user command processing unit 110 receives the user's input information (including the webpage URL, display options, user information, etc.) and sends this information to the mark query unit 120 and the webpage obtaining unit 130. The mark query unit 120 obtains the markup information of the webpage by querying a remote mark server over a network such as the Internet, according to the webpage URL entered by the user. The webpage obtaining unit 130 fetches the desired webpage over the Internet based on the webpage URL provided by the user. The webpage label synthesis unit 140 combines the fetched webpage with the related markup information and presents the result to the user, so that the user sees the related webpage label information together with the requested webpage.
Although existing webpage labeling systems can add marks to webpages, they still suffer from a number of problems, as described below:
-They cannot handle the case in which the marked object is transferred to another page. In many websites, a particular element of a page is often automatically moved onto another page as the content rolls over, and traditional webpage labeling methods cannot display the marks in such cases;
-When the format of the marked object in the webpage undergoes a tolerable change (for example, the font of the marked object becomes italic or bold), the mark cannot be displayed correctly;
-In many cases the content of the marked object is modified to some extent. Traditional webpage labeling systems regard a marked object whose content has been modified as no longer being the originally marked content, and therefore no longer display its mark.
Therefore, there is still a need for a method and a device that can generate or display webpage marks while taking the content of the marked object into account, and for a system that can realize information sharing between users more effectively based on webpage marks, so as to overcome one or more of the above defects of the prior art.
Summary of the invention
A brief overview of the present invention is given below in order to provide a basic understanding of some aspects of the invention. It should be understood that this overview is not an exhaustive overview of the invention. It is not intended to identify key or essential parts of the invention, nor is it intended to limit the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description given later.
In order to solve the above problems of the prior art, one object of the present invention is to provide a method and a device that generate or display webpage marks while taking into account the content of the marked object on the webpage, in which the webpage label information is associated with the content of the marked object and of the context web page elements immediately before and after the marked object on the webpage, so that changes in the marked object can be tracked dynamically.
Another object of the present invention is to provide a webpage labeling method and device with which the webpage that the user wishes to load and display, together with the existing marks previously made on that webpage and stored on a remote mark server, can be displayed in the client browser, and with which new marks can be added to and displayed on the webpage.
A further object of the present invention is to provide an information sharing system that uses the above webpage labeling method and device to realize information sharing based on webpage marks.
To achieve these objects, according to one aspect of the present invention, a method for generating webpage label information is provided. The method comprises: in response to a user selecting a target web page element as a marked object on a current webpage loaded in a client Web browser, extracting the XPath path of the marked object in the Document Object Model (DOM) tree of the current webpage; generating a feature code CF of the marked object based on the contents of the marked object and of the context web page elements immediately before and after the marked object in the current webpage; and generating the webpage label information based on the XPath path of the marked object, the feature code CF and the mark input by the user. The webpage label information is stored in a mark database of a remote mark server. The feature code CF of the marked object is made up of the content-based feature (CBF) of the marked object and the CBFs of its context web page elements, and the CBF of a web page element is made up of the alphabetical projection vector and the lexicographic order vector of that web page element, wherein the alphabetical projection vector is composed of the occurrence counts of all letters in the web page element over the alphabet Λ={a, b, c, d, ..., z}, and the lexicographic order vector is composed of the reverse-order counts of all letters in the web page element over the alphabet Λ.
According to another aspect of the present invention, a device for generating webpage label information is also provided. The device comprises: a user interface for receiving the user's selection of a target web page element as a marked object on a current webpage loaded in a client Web browser, and the mark input by the user; an XPath maker for extracting the XPath path of the user-selected marked object in the Document Object Model (DOM) tree of the current webpage; a content-based feature (CBF) maker for generating the content-based feature (CBF) of a web page element based on the content of that web page element; and a mark maker for generating the webpage label information based on the XPath path of the marked object, the feature code CF of the marked object and the mark input by the user. The feature code CF of the marked object is generated by the CBF maker and is made up of the CBF of the marked object and the CBFs of the context web page elements immediately before and after the marked object in the current webpage. The webpage label information is stored in a mark database of a remote mark server. The CBF of a web page element is made up of the alphabetical projection vector and the lexicographic order vector of that web page element, wherein the alphabetical projection vector is composed of the occurrence counts of all letters in the web page element over the alphabet Λ={a, b, c, d, ..., z}, and the lexicographic order vector is composed of the reverse-order counts of all letters in the web page element over the alphabet Λ.
According to another aspect of the present invention, a method for displaying a webpage and the marks on the webpage in a client Web browser is also provided. The method comprises: a) in response to the user entering in the browser the uniform resource locator (URL) of a webpage to be loaded and displayed, analyzing the entered URL to obtain an effective URL; b) according to the effective URL, querying a remote mark server for all marks related to the effective URL, thereby obtaining a mark candidate set and the webpage label information of these marks; c) for each mark in the mark candidate set, determining, according to the webpage label information of the mark, whether the mark has marked a web page element in the webpage to be loaded, that is, whether the mark should be present in the webpage to be loaded, and if so, further determining the position in the webpage to be loaded of the web page element that it marks, that is, the labeling position; and d) according to the webpage label information and labeling positions of the marks determined to be present in the webpage to be loaded, combining these marks with the webpage to be loaded, and displaying the combined webpage to the user via the browser. The webpage label information of a mark comprises the XPath path of the marked object to which the mark corresponds, the feature code CF of the marked object, the content and format of the mark, the URL of the webpage on which the mark is located, and the content feature code of that webpage. The feature code CF of the marked object is made up of the content-based feature (CBF) of the marked object and the CBFs of the context web page elements immediately before and after the marked object. The CBF of a web page element is made up of the alphabetical projection vector and the lexicographic order vector of that web page element, wherein the alphabetical projection vector is composed of the occurrence counts of all letters in the web page element over the alphabet Λ={a, b, c, d, ..., z}, and the lexicographic order vector is composed of the reverse-order counts of all letters in the web page element over the alphabet Λ.
According to another aspect of the present invention, a device that facilitates displaying a webpage and the marks on the webpage via a client Web browser is also provided. The device comprises: a URL analyzer for analyzing, in response to the user entering the uniform resource locator (URL) of a webpage to be loaded and displayed in the browser, the entered URL to obtain an effective URL; a mark requestor for querying a remote mark server, according to the effective URL, for all marks related to the effective URL, thereby obtaining a mark candidate set and the webpage label information of these marks; a labeling position determining unit for determining, for each mark in the mark candidate set and according to the webpage label information of the mark, whether the mark has marked a web page element in the webpage to be loaded, that is, whether the mark should be present in the webpage to be loaded, and if so, further determining the position in the webpage to be loaded of the web page element that it marks, that is, the labeling position; and a synthesis unit for combining, according to the webpage label information and labeling positions of the marks determined to be present in the webpage to be loaded, these marks with the webpage to be loaded, the combined webpage being displayed to the user via the browser. The webpage label information of a mark comprises the XPath path of the marked object to which the mark corresponds, the feature code CF of the marked object, the content and format of the mark, the URL of the webpage on which the mark is located, and the content feature code of that webpage. The feature code CF of the marked object is made up of the content-based feature (CBF) of the marked object and the CBFs of the context web page elements immediately before and after the marked object. The CBF of a web page element is made up of the alphabetical projection vector and the lexicographic order vector of that web page element, wherein the alphabetical projection vector is composed of the occurrence counts of all letters in the web page element over the alphabet Λ={a, b, c, d, ..., z}, and the lexicographic order vector is composed of the reverse-order counts of all letters in the web page element over the alphabet Λ.
In addition, according to a further aspect of the present invention, a webpage labeling method is also provided. The method comprises: in response to the user entering the URL of a webpage to be loaded and displayed in a client Web browser, displaying in the browser, by carrying out the above method for displaying a webpage and the marks on the webpage in a client Web browser, the webpage together with the existing marks previously made on the webpage and stored on a remote mark server; adding a new mark to the webpage by carrying out the above method for generating webpage label information, the webpage label information of the new mark being stored on the remote mark server; and displaying the newly added mark on the webpage via the browser.
According to a further aspect of the present invention, a webpage labeling device is also provided. The device comprises: the above device for generating webpage label information; and the above device that facilitates displaying a webpage and the marks on the webpage via a client Web browser.
According to a further aspect of the present invention, an information sharing system based on webpage marks is also provided. It comprises a client and a remote mark server, wherein the client comprises the above webpage labeling device, and the remote mark server comprises a mark database for storing webpage label information and a markup information memory access for controlling access to the mark database.
According to other aspects of the present invention, a corresponding computer-readable storage medium and computer program are also provided.
An advantage of the invention is that, in the method, device and system described above, the XPath path of the marked object and the contents of the marked object and of its context web page elements are all taken into account when the webpage label information is generated. This makes it possible for a mark to track its marked object dynamically, so that the related markup information can follow the marked object when it moves. Moreover, even if the format of the marked object changes, the mark can still be displayed correctly. Even when the content of the marked object itself changes, the change in content can be assessed in order to decide whether the corresponding mark should be displayed.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the invention in conjunction with the accompanying drawings.
Description of drawings
The present invention can be better understood by referring to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to denote identical or similar parts. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate the preferred embodiments of the present invention and to explain its principles and advantages. In the drawings:
Fig. 1 is a schematic diagram of the general framework of a prior-art webpage labeling system;
Fig. 2 is a schematic diagram of the structure of a system that uses webpage marks to realize information sharing according to an embodiment of the invention;
Fig. 3 is an exemplary flowchart of the processing performed when the system shown in Fig. 2 is used to add a new mark to a webpage according to an embodiment of the invention;
Fig. 4 is a schematic diagram showing in detail an exemplary structure of the CBF maker shown in Fig. 2 and its processing;
Fig. 5 is a block diagram showing in detail an exemplary structure of the mark analyzer shown in Fig. 2;
Fig. 6 is a flowchart of the processing in which, according to an embodiment of the invention, the user uses the system shown in Fig. 2 to enter the URL (uniform resource locator) of a webpage to be loaded, so that the webpage and the existing marks on it are displayed in the client browser;
Fig. 7 is a flowchart of the process, in one embodiment of the invention, of obtaining alternative URLs based on the URL entered by the user and judging whether the webpage corresponding to each alternative URL is identical or similar to the webpage currently loaded by the browser, so as to obtain an effective URL (that is, the detailed processing of step S610 shown in Fig. 6);
Fig. 8 is a flowchart of the process, in one embodiment of the invention, of determining whether each possible mark is present in the currently loaded webpage and, if so, its labeling position therein (that is, the detailed processing of step S630 in Fig. 6); and
Fig. 9 is a schematic diagram of the feature code CF of a mark used in the processing shown in Fig. 8 (shown in (a) of Fig. 9) and of the structure of the DOM tree of the corresponding current webpage (shown in (b) of Fig. 9).
Those skilled in the art will appreciate that the elements in the drawings are shown for simplicity and clarity only and are not necessarily drawn to scale. For example, the dimensions of some elements in the drawings may be exaggerated relative to other elements in order to help improve understanding of the embodiments of the invention.
Embodiment
Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in the course of developing any such actual implementation, many implementation-specific decisions must be made in order to achieve the developer's specific goals, for example, compliance with system-related and business-related constraints, and these constraints may vary from one implementation to another. It should also be understood that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, in order to avoid obscuring the present invention with unnecessary detail, the drawings show only the device structures and/or processing steps that are closely related to the solution according to the invention, and other details of little relevance to the invention are omitted.
Fig. 2 is a schematic diagram of the structure of a system that uses webpage marks to realize information sharing according to an embodiment of the invention. The system can be divided into two main parts connected by a network (not shown): a client and a server end (that is, the mark server).
As shown in Figure 2, at client part, webpage label device 200 mainly comprises user interface 210, XPath maker 220, content-based feature (CBF) maker 230, mark maker 240, mark analyzer 250 and XML converter 260, and mainly comprises markup information memory access 270 and mark database 280 in the server end part.
In one specific implementation example of the system shown in Fig. 2, the webpage label device 200 of the client can be implemented as a browser plug-in, and the mark server can be implemented with Java Servlets. Specifically, the markup information memory access 270 of the server end can be implemented as a Java Servlet, and the mark database 280 can be implemented with an existing database management system (DBMS). However, those skilled in the art will appreciate that the principle of the present invention is not limited to this, and that these devices or components can be implemented in other ways as required.
On the client, the user can use the webpage label device 200 to add new webpage marks to the webpage loaded in the browser and display them, and to display correctly the existing webpage marks previously added to that webpage. In the webpage label device 200, the user interface 210 is responsible for receiving the inputs of the whole device. It can receive any one or more of the following kinds of input information: (1) input information related to the configuration parameters of the system; (2) input information related to the marked object selected by the user on the webpage; (3) input information related to the content of the mark; (4) input information related to the display mode of the mark; and so on.
The XPath maker 220 extracts the XPath path of the marked object in the DOM (Document Object Model) tree of the webpage. XPath is a W3C-recommended way of addressing any element in a webpage: each element in a webpage corresponds to an XPath path, and any element in the webpage can be located through its XPath path. Each node in the DOM tree of a webpage corresponds to one of the elements contained in the webpage. In other words, the marked object in the webpage and the web page elements immediately before and after it can be represented as nodes of the DOM tree. For convenience of explanation, the web page elements immediately before and after the marked object in the webpage are called the preceding and following context elements. They correspond to the immediately adjacent sibling nodes of the marked object's node in the DOM tree, and are therefore also referred to as context nodes or context web page elements.
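The path extraction described above can be illustrated with a minimal sketch. It assumes a well-formed XHTML page parsed with Python's standard `xml.dom.minidom`; the function names `xpath_of` and `context_elements` and the index-based path format are illustrative assumptions, not taken from the patent, and a real browser plug-in would work on the browser's own DOM.

```python
from xml.dom import minidom

def xpath_of(node):
    """Build an absolute, index-based XPath for an element node (assumed format)."""
    steps = []
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        # 1-based position among preceding siblings that share the tag name
        index = 1
        sib = node.previousSibling
        while sib is not None:
            if sib.nodeType == sib.ELEMENT_NODE and sib.tagName == node.tagName:
                index += 1
            sib = sib.previousSibling
        steps.append(f"{node.tagName}[{index}]")
        node = node.parentNode
    return "/" + "/".join(reversed(steps))

def context_elements(node):
    """Return the nearest element siblings before and after the node (or None)."""
    prev = node.previousSibling
    while prev is not None and prev.nodeType != prev.ELEMENT_NODE:
        prev = prev.previousSibling
    nxt = node.nextSibling
    while nxt is not None and nxt.nodeType != nxt.ELEMENT_NODE:
        nxt = nxt.nextSibling
    return prev, nxt

# Hypothetical page: the second <p> plays the role of the marked object.
doc = minidom.parseString(
    "<html><body><p>intro</p><p>target text</p><div>after</div></body></html>"
)
target = doc.getElementsByTagName("p")[1]
left, right = context_elements(target)
print(xpath_of(target), left.tagName, right.tagName)
```

Skipping non-element siblings (such as whitespace text nodes) is a design choice made for this sketch; the source only says that the context nodes are the immediately adjacent sibling nodes.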
CBF (content-based feature) maker 230 generates the CBF of marked object according to the content of marked object.The CBF of marked object is made up of the alphabetical projection vector (CPF) and the lexicographic order vector (CSF) of marked object, that is: CBF=CPF+CSF.
Here, the alphabetical projection vector (CPF) is composed of the occurrence counts of all letters in the marked object over the alphabet Λ={a, b, c, d, ..., z}, and the length of the vector is the length of the alphabet Λ. For example, suppose the marked object is a passage of English text on the webpage. The numbers of occurrences Num(a), Num(b), ..., Num(z) of each letter a, b, ..., z in this passage can then be counted, giving the alphabetical projection vector CPF: [Num(a), Num(b), ..., Num(z)]. A change in the CPF reflects, to a certain extent, deletion, insertion and replacement operations on the content of the marked object.
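A minimal sketch of the CPF computation follows. The function name `letter_projection_vector` is hypothetical, and folding upper case to lower case and ignoring non-letter characters are assumptions; the source only defines the counts over the alphabet a to z.

```python
import string
from collections import Counter

ALPHABET = string.ascii_lowercase  # the alphabet Λ = {a, b, ..., z}

def letter_projection_vector(text):
    """Return [Num(a), Num(b), ..., Num(z)] for the given text."""
    # Assumption: case is folded and characters outside Λ are skipped.
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    return [counts[a] for a in ALPHABET]

cpf = letter_projection_vector("Bad dab")
print(cpf[:4])  # counts for a, b, c, d
```

Because the CPF ignores letter positions, texts that are rearrangements of one another (such as bad and dab) yield the same vector, which is why the CBF also includes an order-sensitive component.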
The lexicographic order vector (CSF) is composed of the reverse-order counts of all letters in the marked object over the alphabet Λ, and the length of the vector is the length of the alphabet. Suppose that a partial order a<b<c<...<z exists on the alphabet Λ. Then the reverse-order count of the letters in a marked object x on the letter a is the number of letters that are greater than a (that is, b, c, ..., z) and are placed closely before the letter a; the reverse-order count on the letter b is the number of letters that are greater than b (that is, c, d, ..., z) and are placed closely before the letter b; and so on, which yields the reverse-order counts of all letters in the marked object x over the whole alphabet. A change in the CSF reflects, to a certain extent, transposition changes in the marked object. For example, bad and dab have the same CPF but different CSFs, which reflects the difference in letter order between them.
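The sketch below shows one possible reading of the CSF. The wording of the definition is ambiguous: it could mean greater letters anywhere before a given occurrence (a classic inversion count) or only the letter immediately before it. The default here is the inversion-count reading, chosen as an assumption because it is the one under which bad and dab receive different vectors, as the example in the text requires; the `adjacent_only` flag and the function name are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase  # the alphabet Λ, ordered a < b < ... < z

def letter_order_vector(text, adjacent_only=False):
    """Reverse-order counts per letter of Λ (assumed interpretation, see lead-in)."""
    letters = [ch for ch in text.lower() if ch in ALPHABET]
    counts = [0] * len(ALPHABET)
    for i, ch in enumerate(letters):
        # Either every earlier letter, or only the one directly before.
        preceding = letters[i - 1:i] if adjacent_only else letters[:i]
        counts[ALPHABET.index(ch)] += sum(1 for p in preceding if p > ch)
    return counts

csf_bad = letter_order_vector("bad")
csf_dab = letter_order_vector("dab")
print(csf_bad[:4], csf_dab[:4])
```

Under the adjacent-only reading the two words happen to produce identical vectors, so that variant would not distinguish the pair used in the text's own example.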
In order to track effectively whether the context of the marked object has changed, the CBF maker 230 generates, in addition to the CBF of the marked object, the CBFs of the context nodes of the marked object. The context nodes of the marked object can be determined from the XPath path of the marked object generated by the XPath maker 220. The feature code CF of the marked object is made up of the CBF of the marked object (represented by the DOM tree node x) and the CBFs of its context nodes (represented by the nodes x_left and x_right respectively), that is, CF(x)=CBF(x_left)+CBF(x)+CBF(x_right).
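The composition CF(x)=CBF(x_left)+CBF(x)+CBF(x_right) can be sketched as follows. The names `cbf` and `feature_code` are hypothetical, the "+" is read here as keeping the three CBFs side by side, the reverse-order counts use the inversion-count reading, and a missing context node is treated as empty text; all of these are assumptions rather than details given in the source.

```python
import string
from typing import List, Optional, Tuple

ALPHABET = string.ascii_lowercase

def cbf(text: str) -> List[int]:
    """CBF = CPF followed by CSF, as one 52-element list (assumed layout)."""
    letters = [ch for ch in text.lower() if ch in ALPHABET]
    cpf = [letters.count(a) for a in ALPHABET]
    csf = [0] * len(ALPHABET)
    for i, ch in enumerate(letters):
        # Assumed reading: count greater letters anywhere before this one.
        csf[ALPHABET.index(ch)] += sum(1 for p in letters[:i] if p > ch)
    return cpf + csf

def feature_code(left: Optional[str], target: str,
                 right: Optional[str]) -> Tuple[List[int], List[int], List[int]]:
    """CF(x) as the triple (CBF(x_left), CBF(x), CBF(x_right))."""
    return (cbf(left or ""), cbf(target), cbf(right or ""))

cf = feature_code("intro", "bad", None)
print(len(cf), len(cf[1]))
```

Keeping the three parts separate lets a later comparison weigh a change in the marked object differently from a change in its context, although how the parts are compared is not specified in this section.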
The concrete structure of CBF maker 230 and processing procedure thereof, and how to utilize the webpage label device in webpage, to add the process of new mark, will be described below with reference to Fig. 3 and Fig. 4.
The mark maker 240 generates the webpage label information from the information relating to the marked object (for example, the feature code of the marked object) and from the content, format, etc. of the mark that has been input. The XML converter 260 converts the generated webpage label information into an XML message format suitable for communicating with the server end over the network, so that the webpage label information is transmitted to the server end and stored in the mark database 280 via the markup information memory access 270. The webpage label information comprises the URL of the mark (that is, the URL of the webpage on which the mark is located), the position of the mark on the webpage (that is, the XPath path information of the corresponding marked object), the relevant features of the corresponding marked object (for example, the feature code CF information), the content feature code of the webpage on which the mark is located, the content and format of the mark, and so on. Here, the content feature code of a webpage is a feature code that represents the content of the webpage: if the content feature codes of two webpages are identical, the contents of the two webpages are identical. The content feature code of a webpage can be obtained with a conventional coding scheme, for example hash coding (MD5).
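As an illustration of the record described above, the following sketch assembles the listed fields, computes an MD5 content feature code for the page, and serializes the record to XML with Python's standard library. The class name `LabelInfo`, the field names and the XML element names are all hypothetical; the source does not specify a message schema.

```python
import hashlib
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

@dataclass
class LabelInfo:
    url: str                 # URL of the webpage on which the mark is located
    xpath: str               # XPath path of the marked object
    feature_code: List[int]  # feature code CF, flattened here for simplicity
    page_code: str           # content feature code of the webpage
    content: str             # content of the mark
    fmt: str                 # format (display style) of the mark

def page_content_code(html: str) -> str:
    """Content feature code of a page as an MD5 hex digest."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

def to_xml(info: LabelInfo) -> str:
    """Serialize the record to an XML message (assumed element names)."""
    root = ET.Element("label")
    ET.SubElement(root, "url").text = info.url
    ET.SubElement(root, "xpath").text = info.xpath
    ET.SubElement(root, "featureCode").text = " ".join(map(str, info.feature_code))
    ET.SubElement(root, "pageCode").text = info.page_code
    ET.SubElement(root, "content").text = info.content
    ET.SubElement(root, "format").text = info.fmt
    return ET.tostring(root, encoding="unicode")

html = "<html><body><p>bad</p></body></html>"
info = LabelInfo("http://example.com/page", "/html[1]/body[1]/p[1]",
                 [1, 1, 0, 1], page_content_code(html), "check this", "highlight")
message = to_xml(info)
print(message)
```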
Based on the URL of the current webpage, mark analyzer 250 determines as effective URLs those URLs stored in mark database 280 that belong to the same website as the current webpage and whose corresponding webpages are identical or similar to the current webpage. It queries the mark database for all marks related to the effective URLs and matches all marks obtained by the query against the current webpage, in order to judge which of them mark web page elements in the currently loaded webpage (that is, which marks should be present in the current webpage) and to determine at which positions in the current webpage these marks should be displayed. Mark analyzer 250 can also handle the case in which the content of a marked object has been moved from one page to another. The specific processing and structure of mark analyzer 250 are described below with reference to Fig. 5 to Fig. 9.
XML converter 260 converts the information exchanged between the client and the server end into an XML message format, so that webpage label device 200 on the client can communicate with the server end. However, those skilled in the art will appreciate that XML messages are used here only to facilitate communication between the client and a server end implemented as a Java Servlet. The principle of the present invention is not limited to conversion into XML; other message formats may be chosen for communication between the client and the server end, depending on how the server-end part shown in Fig. 2 is implemented.
As shown in Fig. 2, at the server end, markup information memory access 270 accesses mark database 280 in response to requests from the client. Mark database 280 stores the webpage label information of every mark collected by the information sharing system. As described above, this information can include the URL of the mark (that is, the URL of the webpage on which the mark is located), the position of the mark on the webpage, the feature code of the corresponding marked object, and the content and form of the mark.
The following description refers to Fig. 3 and Fig. 4. Fig. 3 shows an exemplary flowchart of processing procedure 300, which is performed when a new mark is added to a webpage with the system shown in Fig. 2 according to an embodiment of the present invention. Fig. 4 shows in detail an exemplary structure of the CBF maker shown in Fig. 2 and a schematic diagram of its processing.
As shown in Fig. 3, in step S310 the XPath path of the marked object in the DOM tree of the current webpage is extracted according to the marked object that the user selected on the current webpage. In step S320, the CBFs of the marked object and of its context nodes (which can be determined from the XPath path generated in step S310) are generated from their content as described above, which yields the feature code CF of the marked object. Next, in step S330, webpage label information is generated from the information about the marked object, the content of the entered mark and related information. In step S340, the webpage label information generated in step S330 is converted into an XML message suitable for communicating with the server end. Then, in step S350, the server end stores the webpage label information generated by the client in mark database 280 via markup information memory access 270.
Fig. 4 shows in detail CBF maker 230 of Fig. 2. As shown in Fig. 4, CBF maker 230 can comprise HTML (Hypertext Markup Language) cleaning unit 410, HTML alphabetization unit 420, alphabetical projection vector (CPF) generation unit 430 and lexicographic order vector (CSF) generation unit 440. The following description takes the generation of the CBF of a marked object by CBF maker 230 as an example.
HTML cleaning unit 410 removes from the user-selected marked object, according to HTML cleaning rules stored in advance (for example, in HTML dictionary 450 shown in Fig. 4), HTML tags that have no effect on content (for example, formatting tags such as <b></b> and <u></u>), in order to reduce HTML noise and to reduce the influence of webpage format changes on the marked object.
HTML alphabetization unit 420 alphabetizes the marked object after HTML cleaning, that is, it converts the marked object, based on its content, into a letter string consisting of the letters a to z. For a marked object that contains Chinese text, HTML alphabetization unit 420 first converts the Chinese text into pinyin with reference to Chinese dictionary 460 (which can be omitted when the marked object contains no Chinese text) and then obtains the letter string. For a polyphonic character, the HTML alphabetization unit can take the first pinyin reading of the character, although the principle of the present invention is obviously not limited to this.
Alphabetical projection vector (CPF) generation unit 430 and lexicographic order vector (CSF) generation unit 440 generate the alphabetical projection vector and the lexicographic order vector of the marked object, respectively, from the letter string obtained by HTML alphabetization, according to the definitions of the alphabetical projection vector (CPF) and the lexicographic order vector (CSF) given above. The alphabetical projection vector (CPF) and the lexicographic order vector (CSF) are then concatenated to obtain the content-based feature CBF of the marked object.
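The four units of Fig. 4 can be sketched as a short pipeline. This is an illustrative sketch only, under several assumptions: the tag list `FORMAT_TAGS` and the tiny `PINYIN` table stand in for HTML dictionary 450 and Chinese dictionary 460, tags other than the listed formatting tags are left untouched, and the order vector is assumed to be a per-letter inversion count (for each letter, the number of earlier letters that sort after it). The authoritative definitions of CPF and CSF are the ones given earlier in the specification.

```python
import re
import string

ALPHABET = string.ascii_lowercase  # the alphabet {a, b, ..., z}

# Hypothetical stand-ins for HTML dictionary 450 and Chinese dictionary 460.
FORMAT_TAGS = re.compile(r"</?(?:b|u|i|em|strong)\s*>", re.IGNORECASE)
PINYIN = {"网": "wang", "页": "ye"}


def clean_html(fragment: str) -> str:
    """Drop formatting-only tags so that format changes do not alter the CBF."""
    return FORMAT_TAGS.sub("", fragment)


def alphabetize(text: str) -> str:
    """Map Chinese characters to pinyin, then keep only the letters a to z."""
    converted = "".join(PINYIN.get(ch, ch) for ch in text)
    return "".join(ch for ch in converted.lower() if ch in ALPHABET)


def projection_vector(letters: str) -> list[int]:
    """CPF: number of occurrences of each letter of the alphabet."""
    return [letters.count(ch) for ch in ALPHABET]


def order_vector(letters: str) -> list[int]:
    """CSF (assumed definition): per-letter inversion count."""
    result = [0] * len(ALPHABET)
    seen = [0] * len(ALPHABET)  # letters seen so far, by alphabet index
    for ch in letters:
        idx = ALPHABET.index(ch)
        result[idx] += sum(seen[idx + 1:])  # earlier letters sorting after ch
        seen[idx] += 1
    return result


def cbf(fragment: str) -> list[int]:
    """Concatenate CPF and CSF into one fixed-length vector of 52 entries."""
    letters = alphabetize(clean_html(fragment))
    return projection_vector(letters) + order_vector(letters)
```

Because formatting tags are removed before counting, `cbf("<b>cab</b>")` and `cbf("cab")` give the same vector, which reflects the stated goal of making the feature insensitive to format changes.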
Returning to Fig. 2: when the user enters the URL of a webpage in the client browser in order to browse the webpage and the marks on it, the client browser loads the requested webpage and sends the URL and the DOM tree structure of the webpage to mark analyzer 250.
Fig. 5 shows an exemplary structure of mark analyzer 250 according to an embodiment of the invention. As shown in Fig. 5, mark analyzer 250 comprises URL analyzer 510, mark requestor 520 and webpage label compositor 530.
URL analyzer 510 analyzes the URL entered by the user and (via XML converter 260 and markup information memory access 270) retrieves from mark database 280 all URLs that belong to the same website as the URL of the webpage currently to be loaded (that is, the webpage corresponding to the currently entered URL, also referred to simply as the current webpage), forming an alternative URL set. The webpage corresponding to each URL in the set (hereinafter, alternative URL) is subjected to an identical-page judgment and a similar-page judgment against the current webpage, and every alternative URL whose webpage is identical or similar to the current webpage is determined to be an effective URL.
According to the effective URLs determined by URL analyzer 510, mark requestor 520 (via XML converter 260 and markup information memory access 270) queries mark database 280 for all marks related to the effective URLs (that is, all marks on the webpages corresponding to the effective URLs). In other words, it queries mark database 280 for every mark that may be related to the current webpage, thereby obtaining a mark candidate set, and retrieves the webpage label information of all these possible marks from mark database 280.
Webpage label compositor 530 matches all possible marks against the current webpage to judge which marks most probably mark which elements or objects in the currently loaded webpage, that is, to determine whether each possible mark is present in the current webpage and at which position, and combines the marks with the webpage for display to the user via the browser. As shown in Fig. 5, webpage label compositor 530 may further comprise labeling position determining unit 532 and synthesis unit 534.
For each possible mark in the mark candidate set, labeling position determining unit 532 determines, according to the webpage label information of the mark (for example, the XPath path and the feature code CF of the corresponding marked object), whether the possible mark marks a web page element in the current webpage (that is, whether the possible mark is present in the current webpage). If the possible mark is determined to be present, the unit further determines the position in the current webpage of the web page element that it marks (that is, the labeling position).
Synthesis unit 534 combines the marks with the current webpage, according to the webpage label information of the possible marks that were determined to be present in the current webpage and the labeling positions determined for them, and displays the combined webpage to the user via the browser.
Fig. 6 shows a flowchart of processing procedure 600 according to an embodiment of the present invention, in which the user enters the URL of a webpage to be loaded in the client browser and the above information sharing system displays the webpage together with its existing marks.
As shown in Fig. 6, in step S610 the URL entered by the user is analyzed as described above to obtain the alternative URL set, and the webpages corresponding to all alternative URLs are subjected to the identical-page and similar-page judgments against the webpage to be loaded (the current webpage), thereby determining the effective URLs. The specific processing of step S610 is described below with reference to Fig. 7.
In step S620, according to the determined effective URLs, the mark database is queried for all marks that may be related to the current webpage, thereby obtaining the mark candidate set. Then, in step S630, it is determined which of the possible marks are present in the current webpage, and the labeling positions of those marks in the current webpage are determined. The specific processing of step S630 is described below with reference to Fig. 8 and Fig. 9.
Then, in step S640, the marks are combined with the current webpage based on the webpage label information of the marks determined to be present in step S630 and on their determined labeling positions, and in step S650 the combined webpage is displayed to the user via the browser. Here, each mark can first be converted into HTML form, and the resulting HTML fragment can then be inserted into the webpage code by dynamically modifying the DOM code of the current webpage and displayed in the browser.
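The combining step can be illustrated by inserting an HTML fragment for the mark at the node addressed by the labeling position. The sketch below is a simplification under stated assumptions: the page is assumed to be well-formed XHTML so that Python's `xml.etree.ElementTree` can parse it, the XPath path is assumed to use only child steps with positional predicates, and the `span` element with class `mark` is a hypothetical rendering of a mark. A browser plug-in would modify the live DOM instead.

```python
import xml.etree.ElementTree as ET


def insert_mark(page: str, xpath: str, mark_text: str) -> str:
    """Append a <span class="mark"> to the element addressed by xpath."""
    root = ET.fromstring(page)
    # ElementTree paths are relative to the root element, so the leading
    # "/<root tag>" of the absolute XPath path is replaced by ".".
    prefix = "/" + root.tag
    if not xpath.startswith(prefix):
        raise ValueError("XPath path does not start at the document root")
    target = root.find("." + xpath[len(prefix):])
    if target is None:
        raise LookupError("no element at the given labeling position")
    note = ET.SubElement(target, "span", {"class": "mark"})
    note.text = mark_text
    return ET.tostring(root, encoding="unicode")
```

For example, calling `insert_mark` with the path `/html/body/div[1]/p[2]` appends the mark to the second paragraph of the first `div` in the body.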
Fig. 7 shows an exemplary flowchart of the process, in one embodiment of the present invention, of obtaining the alternative URLs from the URL entered by the user and performing the identical-page and similar-page judgments between their corresponding webpages and the webpage currently loaded by the browser (the current webpage), that is, the specific processing of step S610 shown in Fig. 6.
As shown in Fig. 7, in step S710 the set of all alternative URLs that belong to the same website as the entered URL, that is, the alternative URL set, is obtained from the URL entered by the user as described above. Then, in step S720, it is determined whether the webpage corresponding to an alternative URL is identical to the current webpage. If the content feature code of the webpage corresponding to the alternative URL is identical to the content feature code of the current webpage, the two webpages are determined to be identical; otherwise they are not identical. Because this judgment relies on the content feature code of the webpage, the content feature code can, as noted above, be obtained with an existing encoding such as MD5. This judgment mainly addresses the case in which webpages have different URLs but unchanged content.
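The identical-page judgment of step S720 reduces to comparing two digests. A minimal sketch, assuming the page content is available as a string and that MD5 over its UTF-8 encoding is used as the content feature code (the function names are illustrative):

```python
import hashlib


def page_content_code(page_text: str) -> str:
    """Content feature code of a webpage, here the MD5 digest of its text."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()


def is_same_page(candidate_text: str, current_text: str) -> bool:
    """Step S720: two pages are identical if their content codes match."""
    return page_content_code(candidate_text) == page_content_code(current_text)
```

In practice the code of the annotated page would be read from the stored webpage label information rather than recomputed from the page text.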
If it is determined in step S720 that the two webpages are not identical, step S730 determines whether they are similar pages. The two webpages are determined to be similar when both of the following conditions are met, and not similar otherwise:
(1) the titles of the two webpages are identical; and
(2) one of the following holds:
(a) parameters are passed between the two webpages, a numeric parameter is missing from the URL, and everything else is identical;
(b) parameters are passed between the two webpages, the numeric parameters in the URLs differ, the numeric parameter for the webpage corresponding to the alternative URL is smaller than that for the webpage corresponding to the current URL, and everything else is identical; or
(c) no parameters are passed between the two webpages, the last address segment of the URLs differs, and everything else is identical.
Obviously, the principle of the present invention is not limited to these similar-page conditions; those skilled in the art can define other similar-page conditions as required.
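One possible reading of the similar-page conditions can be sketched with Python's `urllib.parse`. The interpretation is an assumption: "parameters are passed" is taken to mean that at least one URL has a query string, a "missing" numeric parameter may be missing from either URL, and "everything else is identical" is checked on scheme, host, path and the remaining query parameters. The function name and signature are illustrative.

```python
from urllib.parse import urlsplit, parse_qsl


def is_similar_page(candidate_url: str, candidate_title: str,
                    current_url: str, current_title: str) -> bool:
    """Step S730 under the assumptions stated in the lead-in."""
    if candidate_title != current_title:            # condition (1)
        return False
    cand, cur = urlsplit(candidate_url), urlsplit(current_url)
    if (cand.scheme, cand.netloc) != (cur.scheme, cur.netloc):
        return False
    cand_q, cur_q = dict(parse_qsl(cand.query)), dict(parse_qsl(cur.query))
    if cand_q or cur_q:                             # parameters are passed
        if cand.path != cur.path:
            return False
        differing = [k for k in cand_q.keys() | cur_q.keys()
                     if cand_q.get(k) != cur_q.get(k)]
        for key in differing:
            a, b = cand_q.get(key), cur_q.get(key)
            if not all(v.isdecimal() for v in (a, b) if v is not None):
                return False                        # only numeric parameters may differ
            if a is not None and b is not None and int(a) >= int(b):
                return False                        # (b): candidate value must be smaller
        return bool(differing)                      # (a) or (b)
    cand_segs = cand.path.rstrip("/").split("/")    # (c): no parameters
    cur_segs = cur.path.rstrip("/").split("/")
    return (len(cand_segs) == len(cur_segs)
            and cand_segs[:-1] == cur_segs[:-1]
            and cand_segs[-1] != cur_segs[-1])
```

Other similarity rules can be substituted here without affecting the rest of the procedure.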
When the result of the determination in step S720 or step S730 is affirmative, processing proceeds to step S740, in which the current alternative URL is determined to be an effective URL.
If the judgments in step S720 and step S730 determine that the two webpages are neither identical nor similar, processing proceeds to step S750, which determines whether the alternative URL set still contains a URL that has not yet undergone the identical-page and similar-page judgments. If so, the next alternative URL is taken from the alternative URL set in step S760, and processing returns to step S720 so that the webpage corresponding to that alternative URL is judged against the current webpage. Steps S720 to S760 are repeated until step S750 determines that all alternative URLs in the set have undergone the identical-page and similar-page judgments, whereby all effective URLs in the alternative URL set are determined.
Fig. 8 is a flowchart showing in detail the processing of step S630 in Fig. 6 (that is, determining whether each possible mark is present in the current webpage and, if so, its labeling position in the current webpage). Fig. 9 is a schematic diagram showing the feature code CF of a mark used in the processing of Fig. 8 (Fig. 9(a)) and the structure of the DOM tree of the corresponding current webpage (Fig. 9(b)).
As shown in Fig. 8, step S810 uses the webpage label information of the possible mark currently to be determined, for example the feature code CF and the XPath path of the marked object corresponding to the mark. Starting from the node of the DOM tree of the current webpage that is determined by the XPath path, the nodes of the DOM tree are examined in turn, upward and downward, to find the nodes that are identical or closest to the marked object of the mark and to its context nodes (here, "closest" means that the differences in node content and context are within an allowable range). These nodes are taken as the DOM tree nodes in the current webpage that correspond to the mark.
For example, take the feature code CF of a possible mark to be determined as shown in Fig. 9(a), where A, B and C denote the marked object of the mark, its preceding context node and its following context node, respectively. Starting from the node determined by the XPath path of A, the nodes of the DOM tree are examined in turn to find the nodes A', B' and C' in the current DOM tree, shown in Fig. 9(b), that are closest to A, B and C, respectively. These are referred to here as the DOM tree nodes corresponding to the mark to be determined.
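The search around the XPath-located node can be sketched as follows. This is a simplification with assumed details: the DOM nodes are flattened into a list in document order, each with a precomputed CBF vector; the search looks at most `max_offset` positions before and after the starting index; and closeness is measured with an L1 distance between CBF vectors. None of these choices is fixed by the specification.

```python
def l1_distance(u: list[int], v: list[int]) -> int:
    """Sum of absolute component differences between two CBF vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_node(stored_cbf: list[int], page_cbfs: list[list[int]],
                 start: int, max_offset: int = 3) -> tuple[int, int]:
    """Return (index, distance) of the node closest to the stored CBF.

    The search starts at the node located by the XPath path and widens one
    step at a time in both directions, so on ties the node nearest to the
    starting position wins.
    """
    best_index, best_dist = -1, -1
    for offset in range(max_offset + 1):
        for idx in dict.fromkeys((start - offset, start + offset)):
            if 0 <= idx < len(page_cbfs):
                dist = l1_distance(stored_cbf, page_cbfs[idx])
                if best_index < 0 or dist < best_dist:
                    best_index, best_dist = idx, dist
    return best_index, best_dist
```

The same routine would be applied to A, B and C in turn to obtain A', B' and C'.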
Then, in step S820, based on the DOM tree nodes determined to correspond to the possible mark to be determined, the distance $D(A, A')$ between the mark and the DOM tree is calculated as follows:
$$D(A, A') = d(A, A') + \alpha \left( d(B, B') + d(C, C') \right) + \beta d_s$$
where
$$d(A, A') = |CBF(A) - CBF(A')|,$$
$$d(B, B') = |CBF(B) - CBF(B')|,$$
$$d(C, C') = |CBF(C) - CBF(C')|,$$
$d_s$ is the tree-structure distance, and $\alpha$ and $\beta$ are constants. $\alpha$ expresses how strongly a difference in the context of the marked object contributes to the difference of the marked object, $\beta$ expresses how strongly a difference in DOM tree structure contributes to the similarity difference of the mark, and $d_s$ expresses the difference between the context node structure in the current DOM tree and the CF structure of the mark (that is, the original context node structure).
Suppose that the lowest common node P of nodes A', B' and C' can be found in the DOM tree, and that $l_{A'}$, $l_{B'}$ and $l_{C'}$ denote the number of nodes passed through on the way from nodes A', B' and C', respectively, to node P. Then $d_s$ is calculated as follows:
$$d_s = l_{A'} + l_{B'} + l_{C'}$$
In the situation shown in Fig. 9(b), $d_s = 1$.
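The tree-structure distance can be sketched on a DOM tree given as a child-to-parent map. Two assumptions are made here: the number of nodes "passed through" is taken to be the number of nodes strictly between a node and P, and the example layout (A' and B' directly under P, C' one level deeper under an intermediate node X) is hypothetical, chosen only because it yields the value 1; Fig. 9(b) itself is not reproduced in this text.

```python
def path_to_root(node: str, parent: dict[str, str]) -> list[str]:
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(nodes: list[str], parent: dict[str, str]) -> int:
    """d_s: sum over the given nodes of the number of nodes strictly
    between each node and their lowest common node P (assumed reading)."""
    paths = [path_to_root(n, parent) for n in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    lowest = next(n for n in paths[0] if n in common)  # lowest common node P
    return sum(max(path.index(lowest) - 1, 0) for path in paths)


# Hypothetical layout: P has children A', B' and X; C' is a child of X.
EXAMPLE_PARENT = {"A'": "P", "B'": "P", "X": "P", "C'": "X", "P": "root"}
```

With this layout, `tree_distance(["A'", "B'", "C'"], EXAMPLE_PARENT)` counts only the intermediate node X, and three sibling nodes give a distance of 0.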
Returning to Fig. 8: in step S830 it is judged whether the distance D calculated in step S820 for the mark to be determined is less than a predetermined threshold. If so, it is determined in step S840 that the mark should be present on the current webpage, and its position on the current webpage is determined. For example, if the calculated $D(A, A')$ is less than the predetermined threshold, the mark to be determined is judged to still mark an element or object in the current webpage and should therefore be displayed on the current webpage; the position of node A' in the DOM tree determines the position at which the mark should be displayed on the current webpage.
If it is determined in step S830 that the distance D of the mark to be determined is not less than the predetermined threshold, the mark is discarded in step S840, that is, it is determined that the mark should not be displayed on the current webpage.
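The distance calculation of step S820 and the threshold test of step S830 can be sketched together. The sketch assumes that $|CBF(\cdot) - CBF(\cdot)|$ is an L1 distance between vectors; the default values of `alpha` and `beta` and any threshold passed in are placeholders, since the specification only states that they are constants.

```python
def vector_distance(u: list[int], v: list[int]) -> int:
    """Assumed reading of |CBF(u) - CBF(v)|: L1 distance between vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def mark_distance(cbf_a, cbf_a2, cbf_b, cbf_b2, cbf_c, cbf_c2,
                  d_s: int, alpha: float = 0.5, beta: float = 1.0) -> float:
    """D(A, A') = d(A, A') + alpha * (d(B, B') + d(C, C')) + beta * d_s."""
    return (vector_distance(cbf_a, cbf_a2)
            + alpha * (vector_distance(cbf_b, cbf_b2)
                       + vector_distance(cbf_c, cbf_c2))
            + beta * d_s)


def should_display(distance: float, threshold: float) -> bool:
    """Step S830: the mark is shown only if D is below the threshold."""
    return distance < threshold
```

A smaller `alpha` makes the decision depend more on the marked object itself and less on its context, while a larger `beta` penalizes structural rearrangement of the page more heavily.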
From the above definitions of the content-based feature CBF and the feature code CF of a marked object, it can be seen that the CBF is in general unique to the marked object (especially when the marked object is web page content in English text) and has a uniform length, which facilitates data transmission and storage; that a change in the CBF truly reflects a change in the content of the marked object; and that the distance between the CFs of marked objects is a measure of how much the object has changed.
In the information sharing system according to the embodiment of the invention described above, the marked object is identified not only by its XPath path but also by its feature code CF. Marks in dynamic webpages can therefore track their marked objects dynamically, which is impossible in conventional webpage labeling systems. The reason is that conventional webpage labeling systems generally construct the feature of the marked object with a hash function (such as MD5 encoding). Although such a feature is in general unique and of uniform length, which facilitates data transmission and storage, it cannot reflect the degree to which the marked content has changed. With hash encoding, a small change in the marked object causes a large change in the feature, so the degree of change of the marked object cannot be measured by the distance between features.
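The contrast between a hash and a count-based feature can be illustrated with a small experiment. This sketch uses only the letter-count part of the feature as a simplified stand-in for the full CBF, and the two sample strings are invented for illustration.

```python
import hashlib
import string


def letter_counts(text: str) -> list[int]:
    """Occurrences of each letter a to z (a simplified stand-in for CBF)."""
    lowered = text.lower()
    return [lowered.count(ch) for ch in string.ascii_lowercase]


def count_distance(old: str, new: str) -> int:
    """L1 distance between the letter-count vectors of two texts."""
    return sum(abs(a - b) for a, b in zip(letter_counts(old), letter_counts(new)))


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


OLD = "dynamic tracking of marks"
NEW = "dynamic tracking of marked"
```

Here `count_distance(OLD, NEW)` is small because only three letter counts change by one, whereas `md5_hex(OLD)` and `md5_hex(NEW)` are unrelated digests that give no indication of how close the two texts are.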
In the webpage-label-based information sharing method and system according to the embodiments of the invention described above with reference to the drawings, the feature code of a marked object is generated from the content of the marked object and its context. When all possible marks are matched against the currently loaded webpage, the change of a mark can therefore be measured, so that whether the mark is displayed can be decided according to the degree of change, which realizes dynamic tracking. In addition, the mark matching process uses a lightweight DOM tree search method based on context features to measure the content change of the marked object and the change of its context.
As the above description shows, the method and system according to the embodiments of the invention use a dynamic tracking technique. Even if a marked object in a webpage changes to some extent, the corresponding mark can still be displayed correctly at the changed position on the webpage, whereas the mark corresponding to content that has disappeared from the webpage is not displayed. When a marked object in a webpage has been moved there from another webpage, its corresponding mark can likewise be displayed at the correct position on the webpage. Moreover, when the current webpage has been marked under different URLs, all of those marks can be displayed correctly. Finally, when the format of a marked object changes, for example through bolding, italics or quotation, its mark is still displayed correctly; such format changes are very common when pages are refreshed or forum content is reposted. The purpose of sharing information between users by means of webpage labels can therefore be achieved.
In addition, each operation of the above method according to the present invention can obviously also be implemented as a computer-executable program stored in various machine-readable storage media.
The object of the present invention can also be achieved in the following manner: a storage medium storing the above executable program code is provided directly or indirectly to a system or device, and a computer or central processing unit (CPU) in the system or device reads and executes the program code.
In this case, as long as the system or device has the function of executing a program, the embodiments of the present invention are not limited to a particular program, and the program can take any form, for example an object program, a program executed by an interpreter, or a script program provided to an operating system.
The above machine-readable storage media include but are not limited to: various memories and storage units, semiconductor devices, disk units such as optical, magnetic and magneto-optical disks, and other media suitable for storing information.
In addition, the present invention can also be implemented by a computer that connects to a corresponding website on the Internet, downloads and installs the computer program code according to the present invention, and then executes the program.
It should also be pointed out that the steps of the above series of processes can naturally be performed in chronological order in the sequence described, but they do not necessarily have to be performed in chronological order. Some steps can be performed in parallel or independently of one another.
Finally, it should be noted that the terms "comprise" and "include", and any other variants thereof, are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article or device. Unless further limited, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.
Although the embodiments of the present invention have been described above with reference to the drawings, it should be understood that the embodiments described above merely illustrate the present invention and do not limit it. Various changes, substitutions and modifications can be made without departing from the spirit and scope of the present invention as defined by the appended claims. Moreover, the scope of the present application is not limited to the specific embodiments of the processes, devices, means, methods and steps described in the specification. Those of ordinary skill in the art will readily understand from the disclosure of the present invention that processes, devices, means, methods or steps, whether existing now or developed in the future, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein can be used according to the present invention. The appended claims are therefore intended to include such processes, devices, means, methods or steps within their scope.

Claims (25)

1. A method for generating webpage label information, comprising the steps of:
in response to a user selecting a target web page element as a marked object on a current webpage loaded in a client Web browser, extracting the XPath path of the marked object in the Document Object Model (DOM) tree of the current webpage;
generating a feature code CF of the marked object based on the content of the marked object and of the context web page elements immediately before and after the marked object in the current webpage; and
generating webpage label information based on the XPath path of the marked object, the feature code CF and the mark entered by the user,
wherein the webpage label information is stored in a mark database of a remote mark server,
the feature code CF of the marked object consists of the content-based feature (CBF) of the marked object and the CBFs of its context web page elements, and
the CBF of a web page element consists of the alphabetical projection vector and the lexicographic order vector of the web page element, wherein the alphabetical projection vector consists of the counts of all letters in the web page element over the alphabet Λ = {a, b, c, d, ..., z}, and the lexicographic order vector consists of the reverse-order counts of all letters in the web page element over the alphabet Λ.
2. The method according to claim 1, wherein the step of generating the feature code CF of the marked object further comprises:
generating the CBFs of the marked object and of its context web page elements in the following manner:
removing meaningless HTML tags from the web page element with reference to HTML cleaning rules stored in advance;
performing HTML alphabetization on the web page element after HTML cleaning, thereby converting the web page element, based on its content, into a letter string consisting of the letters a to z;
counting the numbers and the reverse-order numbers of all letters in the letter string over the alphabet Λ = {a, b, c, d, ..., z}, so as to generate the alphabetical projection vector and the lexicographic order vector of the web page element;
concatenating the alphabetical projection vector and the lexicographic order vector of the web page element, thereby obtaining the CBF of the web page element, and
obtaining the feature code CF of the marked object as follows: CBF(preceding web page element) + CBF(marked object) + CBF(following web page element).
3. The method according to claim 2, wherein, when the marked object and its context web page elements contain Chinese text, the Chinese text is converted into pinyin with reference to a Chinese dictionary before HTML alphabetization is performed on the web page element after HTML cleaning.
4. The method according to any one of claims 1 to 3, wherein the webpage label information comprises, in addition to the XPath path of the marked object, the feature code CF and the content and form of the mark, the URL of the webpage on which the mark is located and the content feature code of the webpage on which the mark is located.
5. The method according to any one of claims 1 to 4, wherein the remote mark server is implemented as a Java Servlet, and
the method further comprises the step of converting the generated webpage label information into an XML format suitable for communicating with the remote mark server, so as to transmit it to the remote mark server.
6. A device for generating webpage label information, comprising:
a user interface for receiving a user's selection of a target web page element as a marked object on a current webpage loaded in a client Web browser, and the mark entered by the user;
an XPath maker for extracting the XPath path of the user-selected marked object in the Document Object Model (DOM) tree of the current webpage;
a content-based feature (CBF) maker for generating the content-based feature (CBF) of a web page element based on the content of the web page element; and
a mark maker for generating webpage label information based on the XPath path of the marked object, the feature code CF of the marked object and the mark entered by the user, wherein the feature code CF of the marked object consists of the CBF of the marked object and the CBFs of the context web page elements immediately before and after the marked object in the current webpage, as generated by the CBF maker,
wherein the webpage label information is stored in a mark database of a remote mark server, and
the CBF of a web page element consists of the alphabetical projection vector and the lexicographic order vector of the web page element, wherein the alphabetical projection vector consists of the counts of all letters in the web page element over the alphabet Λ = {a, b, c, d, ..., z}, and the lexicographic order vector consists of the reverse-order counts of all letters in the web page element over the alphabet Λ.
7. The device according to claim 6, wherein the CBF maker further comprises:
an HTML cleaning unit for removing meaningless HTML tags from the web page element with reference to HTML cleaning rules stored in advance;
an HTML alphabetization unit for performing HTML alphabetization on the web page element after HTML cleaning, thereby converting the web page element, based on its content, into a letter string consisting of the letters a to z;
an alphabetical projection vector generation unit for counting the numbers of all letters in the letter string over the alphabet Λ = {a, b, c, d, ..., z}, so as to generate the alphabetical projection vector of the web page element;
a lexicographic order vector generation unit for counting the reverse-order numbers of all letters in the letter string over the alphabet Λ = {a, b, c, d, ..., z}, so as to generate the lexicographic order vector of the web page element; and
a unit for concatenating the alphabetical projection vector and the lexicographic order vector of the web page element, thereby obtaining the CBF of the web page element,
wherein the feature code CF of the marked object is obtained as follows: CBF(preceding web page element) + CBF(marked object) + CBF(following web page element).
8. device according to claim 7, wherein, comprise under the situation of Chinese text explanation at marked object and context web page element thereof, the alphabetized elements reference Chinese dictionary of described HTML will be converted to the Chinese phonetic alphabet through the Chinese text explanation of the web page element after the HTML cleaning, and it is alphabetized then it to be carried out HTML.
9. The device according to any one of claims 6 to 8, wherein the webpage label information comprises, in addition to the XPath path and feature code CF of the marked object and the content and format of the mark, the URL of the webpage on which the mark is located and the content feature code of that webpage.
10. The device according to any one of claims 6 to 9, wherein the device is implemented as a browser plug-in and the remote mark server is implemented as a Java Servlet,
the device further comprising an XML converter, configured to convert the generated webpage label information into an XML format suitable for communication with the remote mark server.
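A rough sketch of such an XML converter is shown below, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`annotation`, `url`, `xpath`, `featureCode`, `pageFeatureCode`, `mark`, `format`) and the dictionary keys are assumptions made for illustration; the claims do not define an XML schema.

```python
import xml.etree.ElementTree as ET


def _join(numbers) -> str:
    return ",".join(str(n) for n in numbers)


def _split(text) -> list[int]:
    return [int(x) for x in text.split(",")] if text else []


def label_info_to_xml(info: dict) -> str:
    """Serialize one piece of webpage label information (hypothetical schema)."""
    root = ET.Element("annotation")
    ET.SubElement(root, "url").text = info["url"]
    ET.SubElement(root, "xpath").text = info["xpath"]
    ET.SubElement(root, "featureCode").text = _join(info["feature_code"])
    ET.SubElement(root, "pageFeatureCode").text = _join(info["page_feature_code"])
    mark = ET.SubElement(root, "mark", {"format": info["mark_format"]})
    mark.text = info["mark_content"]
    return ET.tostring(root, encoding="unicode")


def xml_to_label_info(xml_text: str) -> dict:
    """Parse the XML produced by label_info_to_xml back into a dictionary."""
    root = ET.fromstring(xml_text)
    mark = root.find("mark")
    return {
        "url": root.findtext("url"),
        "xpath": root.findtext("xpath"),
        "feature_code": _split(root.findtext("featureCode")),
        "page_feature_code": _split(root.findtext("pageFeatureCode")),
        "mark_format": mark.get("format"),
        "mark_content": mark.text,
    }
```

The same pair of functions would serve both directions of the exchange between the plug-in and the server, since `ElementTree` escapes special characters in text and attribute values on output and restores them on parsing.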
11. A method for displaying a webpage and the marks on the webpage in a client Web browser, comprising the following steps:
a) in response to a user inputting the uniform resource locator (URL) of a webpage to be loaded and displayed in the browser, analyzing the input URL to obtain an effective URL;
b) according to the effective URL, querying a remote mark server for all marks related to the effective URL, thereby obtaining a mark candidate set and the webpage label information of these marks;
c) for each mark in the mark candidate set, determining, according to the webpage label information of the mark, whether the mark annotates a web page element in the webpage to be loaded, that is, whether the mark should be present in the webpage to be loaded, and if so, further determining the position in the webpage to be loaded of the web page element it annotates, i.e. the mark position; and
d) according to the webpage label information and mark positions of the marks determined to be present in the webpage to be loaded, combining these marks with the webpage to be loaded, and displaying the combined webpage to the user via the browser,
wherein the webpage label information of a mark comprises the XPath path of the marked object corresponding to the mark, the feature code CF of the marked object, the content and format of the mark, the URL of the webpage on which the mark is located, and the content feature code of that webpage,
the feature code CF of the marked object consists of the content-based feature (CBF) of the marked object and the CBFs of the context web page elements immediately before and after the marked object, and
the CBF of a web page element consists of the alphabetical projection vector and the lexicographic order vector of that element, wherein the alphabetical projection vector consists of the counts of all letters of the element over the alphabet Λ = {a, b, c, d, ..., z}, and the lexicographic order vector consists of the reverse-order counts of all letters of the element over the alphabet Λ.
12. The method according to claim 11, wherein step a) further comprises:
based on the input URL, retrieving from the remote mark server, as candidate URLs, all URLs that belong to the same website as the webpage to be loaded, performing an identical-or-similar page judgment between the webpage corresponding to each candidate URL and the webpage to be loaded, and determining as effective URLs those candidate URLs whose corresponding webpages are identical or similar to the webpage to be loaded.
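One way to picture this URL analysis step is sketched below. The claim does not say how the identical-or-similar page judgment is made; this sketch assumes each stored page is represented by a numeric content feature vector and treats pages as similar when the cosine similarity of their vectors reaches a threshold. The function names, the `stored_pages` mapping and the 0.9 threshold are all hypothetical.

```python
from math import sqrt
from urllib.parse import urlsplit


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def effective_urls(input_url, page_vector, stored_pages, threshold=0.9):
    """Return stored URLs on the same site whose pages look identical or similar.

    stored_pages maps each URL known to the mark server to the content
    feature vector recorded for its page (assumed representation).
    """
    host = urlsplit(input_url).netloc.lower()
    result = []
    for url, vector in stored_pages.items():
        if urlsplit(url).netloc.lower() != host:
            continue  # only URLs in the same website are candidates
        if cosine(page_vector, vector) >= threshold:
            result.append(url)
    return result
```

The point of the step is that marks stored under a slightly different URL (for instance with extra query parameters) can still be found when the page content is essentially the same.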
13. The method according to claim 11 or 12, wherein step c) further comprises:
for each mark in the mark candidate set:
based on the feature code CF and the XPath path of the marked object corresponding to the mark, starting from the node determined by the XPath path in the document object model (DOM) tree of the webpage to be loaded, examining the nodes of the DOM tree successively upward and downward, to determine the nodes in the DOM tree that are identical or closest to the marked object and its context web page elements, as the DOM tree nodes corresponding to the mark;
based on the feature code of the mark and its corresponding DOM tree nodes, calculating the distance D between the mark and the DOM tree;
determining whether the calculated distance D is less than a predetermined threshold; and
when the distance D of the mark is less than the predetermined threshold, determining that the mark should be present in the webpage to be loaded, and determining the mark position of the mark in the webpage to be loaded based on the determined DOM tree node that is identical or closest to the marked object corresponding to the mark.
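The search-and-threshold logic of this claim can be sketched as follows. The sketch makes several simplifying assumptions that are not in the claim: the DOM tree is flattened into a list of nodes in document order, "upward and downward" is read as alternately stepping backward and forward from the XPath-determined index, the search is limited to a fixed radius, and the distance to the stored mark is supplied as a callable. All names are hypothetical.

```python
from typing import Callable, Iterator, Optional, Sequence


def search_order(start: int, size: int, radius: int) -> Iterator[int]:
    """Yield start, start-1, start+1, start-2, ... while staying inside the list."""
    if 0 <= start < size:
        yield start
    for step in range(1, radius + 1):
        for idx in (start - step, start + step):
            if 0 <= idx < size:
                yield idx


def locate_mark(
    nodes: Sequence,
    start: int,
    distance: Callable[[object], float],
    threshold: float,
    radius: int = 5,
) -> Optional[int]:
    """Return the index of the closest node, or None if the mark should not be shown."""
    best_idx, best_d = None, float("inf")
    for idx in search_order(start, len(nodes), radius):
        d = distance(nodes[idx])
        if d < best_d:
            best_idx, best_d = idx, d
            if d == 0:
                break  # an identical node cannot be improved on
    # The mark is kept only when the best distance is below the threshold.
    return best_idx if best_d < threshold else None
```

Starting at the XPath position means that an unchanged page is matched on the first probe, while small edits that shift the element by a few nodes are still recovered by the outward search.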
14. The method according to claim 13, wherein the distance D between the mark and the DOM tree is calculated as follows:
supposing that the marked object corresponding to the mark and its context web page elements are A, B and C, and that the DOM tree nodes identical or closest to them are A', B' and C' respectively, then:
$D(A, A') = d(A, A') + \alpha\,\bigl(d(B, B') + d(C, C')\bigr) + \beta\, d_s$
where
$d(A, A') = |\mathrm{CBF}(A) - \mathrm{CBF}(A')|$,
$d(B, B') = |\mathrm{CBF}(B) - \mathrm{CBF}(B')|$,
$d(C, C') = |\mathrm{CBF}(C) - \mathrm{CBF}(C')|$,
α and β are constants; α represents the degree to which differences in the context of the marked object influence the difference of the marked object; β represents the degree to which differences in DOM tree structure influence the similarity difference of the mark; and $d_s$ represents the difference between the structure of the DOM tree and the feature code CF of the mark.
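The distance formula can be worked through numerically with a short sketch. The claim does not state which norm |·| denotes, how d_s is computed, or what values α and β take; here the L1 (sum of absolute differences) norm is assumed, d_s is passed in as a precomputed number, and the default constants are arbitrary placeholders.

```python
def l1(u, v) -> float:
    """Sum of absolute component differences (assumed reading of |CBF(X) - CBF(X')|)."""
    if len(u) != len(v):
        raise ValueError("CBF vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def mark_distance(cbf_a, cbf_a2, cbf_b, cbf_b2, cbf_c, cbf_c2,
                  d_s, alpha=0.5, beta=0.5) -> float:
    """D(A, A') = d(A, A') + alpha * (d(B, B') + d(C, C')) + beta * d_s."""
    d_a = l1(cbf_a, cbf_a2)   # marked object vs. its candidate node
    d_b = l1(cbf_b, cbf_b2)   # preceding context element
    d_c = l1(cbf_c, cbf_c2)   # following context element
    return d_a + alpha * (d_b + d_c) + beta * d_s
```

As a worked instance with toy two-component vectors: d(A, A') = 2, d(B, B') = 1, d(C, C') = 0, d_s = 2, α = 0.5 and β = 0.25 give D = 2 + 0.5 · 1 + 0.25 · 2 = 3.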
15. The method according to any one of claims 11 to 14, wherein the CBF of a web page element is generated as follows:
removing meaningless HTML tags from the web page element by referring to pre-stored HTML cleaning rules;
alphabetizing the cleaned web page element, thereby converting it, based on its content, into a letter string composed of the letters a to z;
counting, for all letters of the letter string, their occurrences and their reverse-order numbers over the alphabet Λ = {a, b, c, d, ..., z}, so as to generate the alphabetical projection vector and the lexicographic order vector of the web page element; and
concatenating the alphabetical projection vector and the lexicographic order vector of the web page element, thereby obtaining the CBF of the web page element.
16. The method according to any one of claims 11 to 15, wherein the remote mark server is implemented as a Java Servlet, and
the method further comprises the step of converting information transmitted between the client and the remote mark server into an XML format before it is sent or received.
17. A device for facilitating the display of a webpage and the marks on the webpage via a client Web browser, comprising:
a URL analyzer, configured to analyze, in response to a user inputting the uniform resource locator (URL) of a webpage to be loaded and displayed in the browser, the input URL to obtain an effective URL;
a mark query unit, configured to query, according to the effective URL, a remote mark server for all marks related to the effective URL, thereby obtaining a mark candidate set and the webpage label information of these marks;
a mark position determining unit, configured to determine, for each mark in the mark candidate set and according to the webpage label information of the mark, whether the mark annotates a web page element in the webpage to be loaded, that is, whether the mark should be present in the webpage to be loaded, and if so, to further determine the position in the webpage to be loaded of the web page element it annotates, i.e. the mark position; and
a synthesis unit, configured to combine, according to the webpage label information and mark positions of the marks determined to be present in the webpage to be loaded, these marks with the webpage to be loaded,
wherein the combined webpage is displayed to the user via the browser,
the webpage label information of a mark comprises the XPath path of the marked object corresponding to the mark, the feature code CF of the marked object, the content and format of the mark, the URL of the webpage on which the mark is located, and the content feature code of that webpage,
the feature code CF of the marked object consists of the content-based feature (CBF) of the marked object and the CBFs of the context web page elements immediately before and after the marked object, and
the CBF of a web page element consists of the alphabetical projection vector and the lexicographic order vector of that element, wherein the alphabetical projection vector consists of the counts of all letters of the element over the alphabet Λ = {a, b, c, d, ..., z}, and the lexicographic order vector consists of the reverse-order counts of all letters of the element over the alphabet Λ.
18. The device according to claim 17, wherein the URL analyzer, based on the input URL, retrieves from the remote mark server, as candidate URLs, all URLs that belong to the same website as the webpage to be loaded, performs an identical-or-similar page judgment between the webpage corresponding to each candidate URL and the webpage to be loaded, and determines as effective URLs those candidate URLs whose corresponding webpages are identical or similar to the webpage to be loaded.
19. The device according to claim 17 or 18, wherein the mark position determining unit performs the following processing for each mark in the mark candidate set:
based on the feature code CF and the XPath path of the marked object corresponding to the mark, starting from the node determined by the XPath path in the document object model (DOM) tree of the webpage to be loaded, examining the nodes of the DOM tree successively upward and downward, to determine the nodes in the DOM tree that are identical or closest to the marked object and its context web page elements, as the DOM tree nodes corresponding to the mark;
based on the feature code of the mark and its corresponding DOM tree nodes, calculating the distance D between the mark and the DOM tree;
determining whether the calculated distance D is less than a predetermined threshold; and
when the distance D of the mark is less than the predetermined threshold, determining that the mark should be present in the webpage to be loaded, and determining the mark position of the mark in the webpage to be loaded based on the determined DOM tree node that is identical or closest to the marked object corresponding to the mark.
20. The device according to claim 19, wherein the mark position determining unit calculates the distance D between the mark and the DOM tree as follows:
supposing that the marked object corresponding to the mark and its context web page elements are A, B and C, and that the DOM tree nodes identical or closest to them are A', B' and C' respectively, then:
$D(A, A') = d(A, A') + \alpha\,\bigl(d(B, B') + d(C, C')\bigr) + \beta\, d_s$
where
$d(A, A') = |\mathrm{CBF}(A) - \mathrm{CBF}(A')|$,
$d(B, B') = |\mathrm{CBF}(B) - \mathrm{CBF}(B')|$,
$d(C, C') = |\mathrm{CBF}(C) - \mathrm{CBF}(C')|$,
α and β are constants; α represents the degree to which differences in the context of the marked object influence the difference of the marked object; β represents the degree to which differences in DOM tree structure influence the similarity difference of the mark; and $d_s$ represents the difference between the structure of the DOM tree and the feature code CF of the mark.
21. The device according to any one of claims 17 to 20, further comprising a content-based feature (CBF) generator configured to generate the content-based feature (CBF) of a web page element,
the CBF generator further comprising:
an HTML cleaning unit, configured to remove meaningless HTML tags from a web page element by referring to pre-stored HTML cleaning rules;
an HTML alphabetization unit, configured to alphabetize the cleaned web page element, thereby converting it, based on its content, into a letter string composed of the letters a to z;
an alphabetical projection vector generation unit, configured to count the occurrences of all letters of the letter string over the alphabet Λ = {a, b, c, d, ..., z} to generate the alphabetical projection vector of the web page element;
a lexicographic order vector generation unit, configured to count the reverse-order numbers of all letters of the letter string over the alphabet Λ = {a, b, c, d, ..., z} to generate the lexicographic order vector of the web page element; and
a unit configured to concatenate the alphabetical projection vector and the lexicographic order vector of the web page element, thereby obtaining the CBF of the web page element.
22. The device according to any one of claims 17 to 21, wherein the device is implemented as a browser plug-in and the remote mark server is implemented as a Java Servlet,
the device further comprising an XML converter, configured to convert information transmitted between the client and the remote mark server into an XML format before it is sent or received.
23. A webpage labeling method, comprising:
in response to a user inputting the URL of a webpage to be loaded and displayed in a client Web browser, displaying in the browser, by performing the method according to any one of claims 11 to 16, the webpage together with the existing marks that were previously made on the webpage and are stored on the remote mark server;
adding a new mark to the webpage by performing the method according to any one of claims 1 to 5, the webpage label information of the new mark being stored on the remote mark server; and
displaying the added new mark on the webpage via the browser.
24. A webpage labeling device, comprising:
the device for generating webpage label information according to any one of claims 6 to 10; and
the device for facilitating the display of a webpage and the marks on the webpage via a client Web browser according to any one of claims 17 to 22.
25. An information sharing system based on webpage labels, comprising a client and a remote mark server, wherein
the client comprises the webpage labeling device according to claim 24, and
the remote mark server comprises a mark database for storing webpage label information and a mark information accessor for controlling access to the mark database.
CN 200910133976 2009-04-16 2009-04-16 Method and device for generating or displaying webpage label and information sharing system Expired - Fee Related CN101866342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910133976 CN101866342B (en) 2009-04-16 2009-04-16 Method and device for generating or displaying webpage label and information sharing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200910133976 CN101866342B (en) 2009-04-16 2009-04-16 Method and device for generating or displaying webpage label and information sharing system

Publications (2)

Publication Number Publication Date
CN101866342A true CN101866342A (en) 2010-10-20
CN101866342B CN101866342B (en) 2013-09-11

Family

ID=42958073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910133976 Expired - Fee Related CN101866342B (en) 2009-04-16 2009-04-16 Method and device for generating or displaying webpage label and information sharing system

Country Status (1)

Country Link
CN (1) CN101866342B (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637172A (en) * 2011-02-10 2012-08-15 北京百度网讯科技有限公司 Webpage blocking marking method and system
CN103942224A (en) * 2013-01-23 2014-07-23 百度在线网络技术(北京)有限公司 Method and device for acquiring annotation rule of webpage blocks
CN104035916A (en) * 2013-03-07 2014-09-10 富士通株式会社 Method and device for standardizing annotation tool
CN104794174A (en) * 2015-04-01 2015-07-22 百度在线网络技术(北京)有限公司 Webpage marking information display method and device
CN104811351A (en) * 2015-04-21 2015-07-29 中国电子科技集团公司第四十一研究所 Distributed communication network testing method and system based on XML
CN105095432A (en) * 2015-07-22 2015-11-25 腾讯科技(北京)有限公司 Display method and device for webpage annotations
CN105117498A (en) * 2015-09-28 2015-12-02 北京奇虎科技有限公司 Webpage data processing method and device
CN105824925A (en) * 2016-03-17 2016-08-03 四川长虹电器股份有限公司 Dynamic annotation method based on browser webpage elements
CN105930383A (en) * 2016-04-14 2016-09-07 青岛海信移动通信技术股份有限公司 Method and device for implementing electronic bookmarks
US9449073B2 (en) 2013-01-31 2016-09-20 International Business Machines Corporation Measuring and displaying facets in context-based conformed dimensional data gravity wells
US9477844B2 (en) 2012-11-19 2016-10-25 International Business Machines Corporation Context-based security screening for accessing data
CN106250394A (en) * 2016-07-15 2016-12-21 北京邮电大学 Network resource content sees clearly system and method
US9607048B2 (en) 2013-01-31 2017-03-28 International Business Machines Corporation Generation of synthetic context frameworks for dimensionally constrained hierarchical synthetic context-based objects
US9619580B2 (en) 2012-09-11 2017-04-11 International Business Machines Corporation Generation of synthetic context objects
CN106610994A (en) * 2015-10-23 2017-05-03 北京国双科技有限公司 Method and device for counting click paths
US9741138B2 (en) 2012-10-10 2017-08-22 International Business Machines Corporation Node cluster relationships in a graph database
CN107808000A (en) * 2017-11-13 2018-03-16 哈尔滨工业大学(威海) A kind of hidden web data collection and extraction system and method
WO2018053620A1 (en) * 2016-09-23 2018-03-29 Hvr Technologies Inc. Digital communications platform for webpage overlay
CN108874373A (en) * 2017-05-12 2018-11-23 腾讯科技(深圳)有限公司 Method and device, display terminal and the storage medium of information are inserted into webpage
US10152526B2 (en) 2013-04-11 2018-12-11 International Business Machines Corporation Generation of synthetic context objects using bounded context objects
CN110619100A (en) * 2019-06-18 2019-12-27 北京无限光场科技有限公司 Method and apparatus for acquiring data
US10521434B2 (en) 2013-05-17 2019-12-31 International Business Machines Corporation Population of context-based data gravity wells
CN113688597A (en) * 2020-05-18 2021-11-23 北京字节跳动网络技术有限公司 Display method, device, equipment and storage medium of labeled file

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7958115B2 (en) * 2004-07-29 2011-06-07 Yahoo! Inc. Search systems and methods using in-line contextual queries
CN101251855B (en) * 2008-03-27 2010-12-22 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637172B (en) * 2011-02-10 2013-11-27 北京百度网讯科技有限公司 Webpage blocking marking method and system
CN102637172A (en) * 2011-02-10 2012-08-15 北京百度网讯科技有限公司 Webpage blocking marking method and system
US9619580B2 (en) 2012-09-11 2017-04-11 International Business Machines Corporation Generation of synthetic context objects
CN103678455B (en) * 2012-09-11 2017-04-12 国际商业机器公司 Method and system for generation of synthetic context objects
US9741138B2 (en) 2012-10-10 2017-08-22 International Business Machines Corporation Node cluster relationships in a graph database
US9477844B2 (en) 2012-11-19 2016-10-25 International Business Machines Corporation Context-based security screening for accessing data
US9811683B2 (en) 2012-11-19 2017-11-07 International Business Machines Corporation Context-based security screening for accessing data
CN103942224A (en) * 2013-01-23 2014-07-23 百度在线网络技术(北京)有限公司 Method and device for acquiring annotation rule of webpage blocks
CN103942224B (en) * 2013-01-23 2018-12-14 百度在线网络技术(北京)有限公司 A kind of method and device for the mark rule obtaining web page release
US9607048B2 (en) 2013-01-31 2017-03-28 International Business Machines Corporation Generation of synthetic context frameworks for dimensionally constrained hierarchical synthetic context-based objects
US9619468B2 (en) 2013-01-31 2017-04-11 International Business Machines Corporation Generation of synthetic context frameworks for dimensionally constrained hierarchical synthetic context-based objects
US10127303B2 (en) 2013-01-31 2018-11-13 International Business Machines Corporation Measuring and displaying facets in context-based conformed dimensional data gravity wells
US9449073B2 (en) 2013-01-31 2016-09-20 International Business Machines Corporation Measuring and displaying facets in context-based conformed dimensional data gravity wells
CN104035916B (en) * 2013-03-07 2017-05-24 富士通株式会社 Method and device for standardizing annotation tool
CN104035916A (en) * 2013-03-07 2014-09-10 富士通株式会社 Method and device for standardizing annotation tool
US10152526B2 (en) 2013-04-11 2018-12-11 International Business Machines Corporation Generation of synthetic context objects using bounded context objects
US11151154B2 (en) 2013-04-11 2021-10-19 International Business Machines Corporation Generation of synthetic context objects using bounded context objects
US10521434B2 (en) 2013-05-17 2019-12-31 International Business Machines Corporation Population of context-based data gravity wells
CN104794174A (en) * 2015-04-01 2015-07-22 百度在线网络技术(北京)有限公司 Webpage marking information display method and device
WO2016155299A1 (en) * 2015-04-01 2016-10-06 百度在线网络技术(北京)有限公司 Method and device for displaying webpage marking information
CN104811351A (en) * 2015-04-21 2015-07-29 中国电子科技集团公司第四十一研究所 Distributed communication network testing method and system based on XML
CN105095432A (en) * 2015-07-22 2015-11-25 腾讯科技(北京)有限公司 Display method and device for webpage annotations
US11200295B2 (en) 2015-07-22 2021-12-14 Tencent Technology (Shenzhen) Company Limited Web page annotation displaying method and apparatus, and mobile terminal
CN105095432B (en) * 2015-07-22 2019-04-16 腾讯科技(北京)有限公司 Web page annotation display methods and device
CN105117498A (en) * 2015-09-28 2015-12-02 北京奇虎科技有限公司 Webpage data processing method and device
CN106610994A (en) * 2015-10-23 2017-05-03 北京国双科技有限公司 Method and device for counting click paths
CN105824925B (en) * 2016-03-17 2019-09-10 四川长虹电器股份有限公司 Dynamic label placement method based on browsing device net page element
CN105824925A (en) * 2016-03-17 2016-08-03 四川长虹电器股份有限公司 Dynamic annotation method based on browser webpage elements
CN105930383A (en) * 2016-04-14 2016-09-07 青岛海信移动通信技术股份有限公司 Method and device for implementing electronic bookmarks
CN106250394A (en) * 2016-07-15 2016-12-21 北京邮电大学 Network resource content sees clearly system and method
US10331758B2 (en) 2016-09-23 2019-06-25 Hvr Technologies Inc. Digital communications platform for webpage overlay
US10776447B2 (en) 2016-09-23 2020-09-15 Hvr Technologies Inc. Digital communications platform for webpage overlay
WO2018053620A1 (en) * 2016-09-23 2018-03-29 Hvr Technologies Inc. Digital communications platform for webpage overlay
CN108874373A (en) * 2017-05-12 2018-11-23 腾讯科技(深圳)有限公司 Method and device, display terminal and the storage medium of information are inserted into webpage
CN108874373B (en) * 2017-05-12 2023-05-30 深圳市雅阅科技有限公司 Method and device for inserting information into webpage, display terminal and storage medium
CN107808000B (en) * 2017-11-13 2020-05-22 哈尔滨工业大学(威海) System and method for collecting and extracting data of dark net
CN107808000A (en) * 2017-11-13 2018-03-16 哈尔滨工业大学(威海) A kind of hidden web data collection and extraction system and method
CN110619100A (en) * 2019-06-18 2019-12-27 北京无限光场科技有限公司 Method and apparatus for acquiring data
CN113688597A (en) * 2020-05-18 2021-11-23 北京字节跳动网络技术有限公司 Display method, device, equipment and storage medium of labeled file

Also Published As

Publication number Publication date
CN101866342B (en) 2013-09-11

Similar Documents

Publication Publication Date Title
CN101866342B (en) Method and device for generating or displaying webpage label and information sharing system
CN101551800B (en) Marked information generation device, inquiry unit and sharing system
CN101452453B (en) A kind of method of input method Web side navigation and a kind of input method system
US20100037130A1 (en) Site mining stylesheet generator
Hyvönen Semantic portals for cultural heritage
CN100573520C (en) For retrieval is carried out pretreated method and apparatus to a plurality of documents
US8554800B2 (en) System, methods and applications for structured document indexing
US20060167869A1 (en) Multi-path simultaneous Xpath evaluation over data streams
Lehmann et al. Deqa: deep web extraction for question answering
US9311303B2 (en) Interpreted language translation system and method
CN1408093A (en) Electronic shopping agent which is capable of operating with vendor sites having disparate formats
CN1979484A (en) Document-based information and uniform resource locator (URL) management method and device
US20090019015A1 (en) Mathematical expression structured language object search system and search method
Wang et al. Website browsing aid: A navigation graph-based recommendation system
Khalili et al. Wysiwym authoring of structured content based on schema. org
Haq et al. A Comprehensive analysis of XML and JSON web technologies
Huang et al. An SVG-based method to support spatial analysis in XML/GML/SVG-based WebGIS
US20100082594A1 (en) Building a topic based webpage based on algorithmic and community interactions
Ferrández et al. A framework for enriching Data Warehouse analysis with Question Answering systems
Thuy et al. Transforming valid XML documents into RDF via RDF schema
Bernardi et al. Web applications design recovery and evolution with RE‐UWA
Chang et al. Supporting unified interface to wrapper generator in Integrated Information Retrieval
CN113392070B (en) Online document management method, device, system, equipment and storage medium
CN1326078C (en) Forming method for package device
KR20110074423A (en) Egf file searching system service and method therefor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130911

Termination date: 20180416
