WO2013078829A1

WO2013078829A1 - Method and device for processing webpage content on the basis of content block identification

Info

Publication number: WO2013078829A1
Application number: PCT/CN2012/075044
Authority: WO
Inventors: 钱海祥; 辛昕
Original assignee: 百度在线网络技术（北京）有限公司
Priority date: 2011-11-30
Filing date: 2012-05-03
Publication date: 2013-06-06
Also published as: CN103136259A; CN103136259B

Abstract

The present invention is directed to a method and a device for processing webpage content on the basis of content block identification. The method comprises: acquiring an original webpage to be processed; extracting block identification information from a markup language file of the original webpage, the block identification information being used for identifying content blocks in the markup language file; performing a matching query in a processing rule base according to the block identification information, so as to acquire a content block processing rule corresponding to the block identification information; and, according to the content block processing rule, performing corresponding processing on the content block identified by the block identification information, so as to acquire a target webpage. Compared with the prior art, the present invention implements fast processing of webpage content, thus improving webpage conversion efficiency and quality and improving the user experience. Meanwhile, as the markup language file of the webpage only needs to comprise the block identification information and does not need to comprise the corresponding processing rule, the webpage maintenance load on the website is reduced.

Description

Method and device for processing webpage content based on content block identification

The present invention relates to the field of Internet technologies, and in particular to a technology for processing webpage content based on content block identification.

Background

In the prior art, when webpage content is processed, for example when a webpage displayed on a desktop computer is converted into a webpage suitable for display on a mobile terminal, the theme content is usually extracted from the parsed Internet webpage, and a new webpage is generated from the extracted theme content, so as to convert the original webpage suitable for desktop display into a target webpage suitable for mobile display. However, this way of converting webpages is inefficient and time-consuming, which slows the response to page access requests from mobile terminal users and degrades the user experience.

Therefore, how to process page content quickly and effectively has become one of the problems that need to be solved.

Summary of the invention

It is an object of the present invention to provide a method and apparatus for processing web content based on content block identification.

According to an aspect of the present invention, a computer-implemented method for processing web content based on content block identification is provided, the method comprising the steps of:

a. obtaining the original webpage to be processed;

b. extracting block identification information from the markup language file of the original webpage, where the block identification information is used to identify each content block in the markup language file;

c. performing a matching query in the processing rule base according to the block identification information, to obtain a content block processing rule corresponding to the block identification information;

d. processing, according to the content block processing rule, the content block identified by the block identification information, to obtain a target webpage.

According to another aspect of the present invention, there is also provided an apparatus for processing webpage content based on content block identification, the apparatus comprising:

an original webpage obtaining device, configured to obtain an original webpage to be processed;

an identification information extracting device, configured to extract block identification information from the markup language file of the original webpage, where the block identification information is used to identify each content block in the markup language file;

a processing rule obtaining device, configured to perform a matching query in the processing rule base according to the block identification information, to obtain a content block processing rule corresponding to the block identification information;

a target webpage obtaining device, configured to perform corresponding processing on the content block identified by the block identification information according to the content block processing rule, to obtain a target webpage.

Compared with the prior art, the present invention performs a matching query in the processing rule base according to the block identification information in the markup language file of the original webpage, such as the block identification information corresponding to each content block of an HTML or XHTML file, to obtain the content block processing rules corresponding to that block identification information, and then performs the corresponding processing, such as folding, deleting, or formatting, on each content block. Page content is thereby processed quickly, which improves page conversion efficiency and quality and, in turn, the user experience. At the same time, because the markup language file of the page only needs to include the block identification information and not the corresponding processing rules, the website's burden of maintaining the webpage is reduced.

Drawings

Other features, objects, and advantages of the present invention will become more apparent from the detailed description below, read with reference to the following drawings:

FIG. 1 shows a schematic diagram of a device for processing webpage content based on content block identification, in accordance with an aspect of the present invention;

FIG. 2 shows a schematic diagram of a device for processing webpage content based on content block identification, in accordance with a preferred embodiment of the present invention;

FIG. 3 shows a flow chart of a method for processing webpage content based on content block identification, in accordance with another aspect of the present invention;

FIG. 4 shows a flow chart of a method for processing webpage content based on content block identification, in accordance with a preferred embodiment of the present invention.

The same or similar reference numerals in the drawings denote the same or similar components.

Detailed description

The invention is further described in detail below with reference to the accompanying drawings.

FIG. 1 shows a schematic diagram of a device for processing webpage content based on content block identification in accordance with an aspect of the present invention. The processing device 1 includes an original webpage obtaining device 11, an identification information extracting device 12, a processing rule obtaining device 13, and a target webpage obtaining device 14.

Here, the processing device 1 may be a network device, including but not limited to a computer, a network host, a single network server, a set of two or more network servers, or a cloud composed of two or more servers. The cloud is made up of a large number of computers or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The processing device 1 may also be a mobile terminal. A mobile terminal is a computer device that can be used on the move, including but not limited to mobile phones, notebooks, POS machines, and on-board computers, and its display is typically much smaller than a desktop monitor.

The process by which the processing device 1 processes webpage content is described in detail below with reference to FIG. 1.

Specifically, the original webpage obtaining device 11 acquires the original webpage to be processed.

Here, the manner of obtaining the original webpage to be processed includes, but is not limited to, the following situations:

1) According to a page access request from a mobile terminal, obtaining the corresponding original webpage from the website server pointed to by the uniform resource locator (URL) in the page access request.

In an example, the user first interacts with the browser software or client software of the mobile terminal by means of an interaction device of the mobile terminal, including but not limited to a keyboard, a mouse, a remote controller, a touch pad, or a handwriting device. Taking the keyboard as an example, when the user types in the address bar of the browser software of the mobile terminal, the mobile terminal acquires in real time the key sequence input by the user, for example a URL, and records the page access request corresponding to the input operation, where the page access request includes the URL; the page access request is then sent through an agreed communication method. Next, the original webpage obtaining device 11 receives the page access request in real time, extracts the page URL from it, and sends a request for the webpage to the web server where the webpage is located. For example, the request may be encapsulated as a request message, such as an http request message, and sent to the web server through a corresponding communication protocol, such as http or https. The original webpage obtaining device 11 then receives the webpage that the web server feeds back in response to the request and uses it as the original webpage to be processed.

2) Obtaining the original webpage to be processed from a third-party device.

In another example, the processing device 1 is a network device. The original webpage obtaining device 11, triggered by a predetermined condition or event or acting periodically, sends a request for the original webpage to be processed to the third-party device through an application programming interface (API) provided by the third-party device, and receives the original webpage to be processed that the third-party device returns in response to the request. Alternatively, the third-party device actively pushes the original webpage to be processed to the processing device 1, and the original webpage obtaining device 11 receives it.

Those skilled in the art should understand that the above manners of obtaining the original webpage to be processed are only examples, and other existing or future ways of obtaining the original webpage to be processed, if applicable to the present invention, are also included within the scope of protection of the present invention and are incorporated herein by reference.
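The first manner of obtaining the original webpage can be illustrated with a minimal Python sketch. It is not the claimed implementation: the dictionary form of the page access request, its `url` field, and the function names are assumptions made for illustration, as is the choice to default to http when the URL carries no scheme.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def build_page_request(page_access_request: dict) -> Request:
    """Extract the URL from a page access request and wrap it in an HTTP GET."""
    url = page_access_request["url"]  # hypothetical field name
    if "://" not in url:
        url = "http://" + url  # assumption: default to http when no scheme is given
    if urlsplit(url).scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme in {url!r}")
    return Request(url, method="GET")


def fetch_original_page(page_access_request: dict, timeout: float = 10.0) -> str:
    """Send the request to the website server and return the page it feeds back."""
    with urlopen(build_page_request(page_access_request), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Separating request construction from the network call keeps the URL handling checkable without contacting any server.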

Then, the identification information extracting device 12 extracts the block identification information from the markup language file of the original webpage acquired by the original webpage obtaining device 11, for example by string matching, wherein the block identification information is used to identify each content block in the markup language file.

Here, the markup language file includes but is not limited to:

1) HTML (Hypertext Markup Language) file, which is a standard universal markup language used to describe web page documents;

2) XML (Extensible Markup Language) file, which is a simple standard universal markup language for data storage;

3) XHTML (Extensible Hypertext Markup Language) file, which is an XML-based markup language with strict syntax;

4) WML (Wireless Markup Language) file, which is a descriptive markup language used to create pages that can be displayed in a WAP browser.

Those skilled in the art should understand that the above markup language files are only examples, and other existing or future markup language files, if applicable to the present invention, are also included within the scope of protection of the present invention and are incorporated herein by reference.

Here, the block identification information includes, but is not limited to, an identification name, an identification ID, and the like; the identification name may be chosen according to the type of the content block it identifies, such as title, navigation, body, picture, or embedded object (such as a Java applet, ActiveX, or Flash).

Here, a content block means a content area composed of at least one tag in the markup language file, which corresponds to specific content displayed in the webpage, such as a title content block, a body content block, a navigation content block, a picture content block, or an embedded object (such as a Java applet, ActiveX, or Flash) content block.

Here, the storage manner of the block identification information in the markup language file includes but is not limited to:

1) Comments in the markup language file; for example, using the JSON format, the identification information can be stored in HTML file comments, such as <!-- tc block-begin: {type: "context"} -->, where JSON is a lightweight data exchange format that generally represents data as "name/value" pairs, with the name and the value separated by ":";

2) Custom tags in the markup language file; for example, in an HTML file, the custom tag can be <tc></tc>, and the identification information can be stored in the custom tag;

3) Tag attributes in the markup language file; for example, in an XHTML file, the identification information can be stored in an attribute of the content block tag, such as <div markName="title">, where the value of the attribute markName is the identification information of the content block corresponding to this div tag.

Those skilled in the art will appreciate that the above-described storage methods are merely examples, and other existing or future storage methods, such as those applicable to the present invention, are also included in the scope of the present invention and are incorporated herein by reference.
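As an illustration of the comment-based storage manner, the following Python sketch reads identification information from begin markers. It assumes the marker is written as `<!-- tc block-begin: {name: "value"} -->`; because the name inside the braces is unquoted, the sketch reads the name/value pairs with a regular expression rather than a strict JSON parser. The function and pattern names are hypothetical.

```python
import re

# Matches a begin marker such as <!-- tc block-begin: {markName: "title"} -->
BEGIN_RE = re.compile(r"<!--\s*tc block-begin:\s*\{(.*?)\}\s*-->", re.S)
# Matches one name: "value" pair inside the braces
PAIR_RE = re.compile(r'(\w+)\s*:\s*"([^"]*)"')


def comment_block_ids(markup: str) -> list[dict[str, str]]:
    """Return the name/value pairs of every block-begin comment, in document order."""
    return [dict(PAIR_RE.findall(body)) for body in BEGIN_RE.findall(markup)]
```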

In an example, the markup language file of the original webpage acquired by the identification information extracting device 12 is an XHTML file, such as:

<body>
<div markName="title">
<h2>News headline</h2>
<p>
The largest and most influential and authoritative program in China's Internet
</p>
</div>
</body>

wherein the XHTML file is predefined to store content block identification information in a tag attribute named markName. Accordingly, the identification information extracting device 12 parses the XHTML file and then performs string matching on the keyword "markName" to obtain the markName attribute of the div tag and its attribute value "title", which is the identification name of the content block corresponding to the div tag. In the same way, for an img tag carrying the markName attribute (not shown in the excerpt above), it obtains the attribute value "picture", which is the identification name of the content block corresponding to the img tag.

Those skilled in the art should understand that the above manner of extracting block identification information is only an example, and other existing or future ways of extracting block identification information, if applicable to the present invention, are also included within the scope of protection of the present invention and are incorporated herein by reference.
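The attribute-based extraction can be sketched in Python as follows. The text describes string matching on the keyword "markName"; this sketch swaps in the standard library's `html.parser` instead, which is one possible alternative and not the method described. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MarkNameExtractor(HTMLParser):
    """Collects (tag, identification name) pairs from markName attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.found: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # HTMLParser lowercases attribute names, so compare against "markname"
            if name == "markname" and value:
                self.found.append((tag, value))


def extract_block_ids(markup: str) -> list[tuple[str, str]]:
    """Return the block identification names found in the markup, in document order."""
    parser = MarkNameExtractor()
    parser.feed(markup)
    parser.close()
    return parser.found
```

Self-closing tags such as `<img ... />` are also covered, because the parser's default handling of such tags calls `handle_starttag`.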

Then, the processing rule obtaining device 13 performs a matching query in the processing rule base according to the block identification information acquired by the identification information extracting device 12, to obtain a content block processing rule corresponding to the block identification information.

Specifically, the processing rule obtaining device 13 performs the matching query in the processing rule base of the local device or of a third-party device according to the block identification information, to obtain the content block processing rule corresponding to the block identification information.

Here, the processing rule includes but is not limited to:

1) formatting the content in the content block; wherein the formatting includes but is not limited to:

i) changing the text attributes in the content block, such as the font, size, color, and background color of the content;

ii) reducing the pictures included in the content block by a predetermined ratio;

2) display the content block;

3) delete the content block;

4) folding the content block; wherein folding means that the content of the content block is hidden by default but can be expanded through a specific triggering manner;

5) Adjust the display position of the content block.

It should be understood by those skilled in the art that the above-mentioned processing rules are only examples, and other existing or future processing rules may be applied to the present invention, and are also included in the scope of the present invention and are incorporated herein by reference.

Here, the processing rule base stores each item of block identification information together with its corresponding processing rule, and may take the form of, without limitation, a relational database, a Key-Value storage system, a file system, and the like.

In an example, the block identification information is "title". The processing rule obtaining device 13 performs a matching query in the local processing rule base according to the block identification information, through an application programming interface (API) provided by the processing device 1, and obtains "show" as the content block processing rule corresponding to the "title" block identification information; that is, the content block identified by the block identification information is to be displayed.

In another example, the block identification information is "picture". The processing rule obtaining device 13 sends a processing rule acquisition request, which includes the block identification information, to a third-party device; for example, the request may be encapsulated into a request message, such as an http request message, and sent to the third-party device through a corresponding communication protocol, such as http or https. The third-party device, listening in real time, receives and parses the request, and then performs a matching query in the processing rule base according to the extracted block identification information, obtaining "zoomin" as the content block processing rule corresponding to the block identification information; that is, the pictures in the content block identified by the block identification information are to be reduced by a predetermined ratio.
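A local matching query of the kind described above can be sketched with a relational rule base. The sketch uses Python's built-in `sqlite3`; the table name `rules`, its column names, and the function names are assumptions for illustration, while the rule names "show" and "zoomin" are taken from the examples.

```python
import sqlite3
from typing import Optional


def open_rule_base(path: str = ":memory:") -> sqlite3.Connection:
    """Open the processing rule base, creating its table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rules ("
        " mark_name TEXT PRIMARY KEY,"  # block identification information
        " rule TEXT NOT NULL)"          # content block processing rule
    )
    return conn


def lookup_rule(conn: sqlite3.Connection, mark_name: str) -> Optional[str]:
    """Matching query: return the rule for a block identifier, or None if absent."""
    row = conn.execute(
        "SELECT rule FROM rules WHERE mark_name = ?", (mark_name,)
    ).fetchone()
    return row[0] if row else None
```

With rows such as `("title", "show")` and `("picture", "zoomin")` inserted, `lookup_rule(conn, "title")` returns the stored rule, and an identifier with no row yields `None`, which is the case in which a rule has to be determined some other way.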

Those skilled in the art should understand that the above manner of obtaining the processing rule is only an example, and other existing or future ways of obtaining the processing rule, if applicable to the present invention, are also included within the scope of protection of the present invention and are incorporated herein by reference.

Preferably, the processing rule obtaining device 13 performs a matching query in the processing rule base according to the block identification information and the identification information of the website to which the original webpage belongs, to obtain a content block processing rule customized for the webpages of that website. Here, the identification information of the website to which the original webpage belongs includes, but is not limited to, a website domain name, a website IP address, a website name, and the like.

Specifically, the processing rule obtaining device 13 determines the identification information of the website to which the webpage belongs, such as the website domain name or the website IP address, for example from the URL of the original webpage obtained by the original webpage obtaining device 11. It then performs a matching query in the processing rule base according to the block identification information acquired by the identification information extracting device 12 and the identification information of the website. If the query yields a processing rule predetermined for the webpages of that website, the predetermined processing rule is used as the content block processing rule for this webpage.

In an example, the block identification information is "embedded object" and the URL of the original webpage is "www.abc.com/sport/101.htm". The processing rule obtaining device 13 extracts from the URL the domain name of the website where the webpage is located, "www.abc.com". A matching query in the processing rule base according to the block identification information alone yields the processing rule "delete", that is, the content block identified by the identification information is to be deleted. However, a matching query according to both the block identification information and the domain name of the website to which the original webpage belongs yields "show" as the processing rule predetermined for this website for the "embedded object" block identification information, that is, the content block identified by the identification information is to be displayed. In this case, the processing rule obtaining device 13 ignores the deletion rule corresponding to the block identification information and uses the processing rule predetermined for the website as the content block processing rule.
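The precedence of a website-specific rule over the general rule can be sketched as follows, using the values from the example above. The two dictionaries and the function names are hypothetical stand-ins for the processing rule base; the scheme-less URL is handled by prefixing "//" so that `urlsplit` treats the leading part as the host.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical in-memory rule base: general rules and per-website overrides
GENERIC_RULES = {"embedded object": "delete"}
SITE_RULES = {("www.abc.com", "embedded object"): "show"}


def site_domain(url: str) -> str:
    """Return the website domain name of a URL, with or without a scheme."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit read "www.abc.com/..." as host plus path
    return urlsplit(url).hostname or ""


def resolve_rule(mark_name: str, page_url: str,
                 generic: dict = GENERIC_RULES, site: dict = SITE_RULES) -> Optional[str]:
    """Prefer the rule predetermined for the website; fall back to the general rule."""
    key = (site_domain(page_url), mark_name)
    if key in site:
        return site[key]
    return generic.get(mark_name)
```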

Those skilled in the art should understand that the above manner of obtaining the processing rule is only an example, and other existing or future ways of obtaining the processing rule, if applicable to the present invention, are also included within the scope of protection of the present invention and are incorporated herein by reference.

Then, the target webpage obtaining device 14 performs corresponding processing on the content block identified by the block identification information according to the content block processing rule acquired by the processing rule obtaining device 13, to obtain the target webpage.

Here, the corresponding processing on the content block includes, but is not limited to: formatting, displaying, deleting, folding, and ordering the content in the content block.

In an example, the identification information extracting device 12 parses the HTML file of a webpage and acquires two items of block identification information, "body" and "picture". The processing rule obtaining device 13 acquires the content block processing rule corresponding to the "body" block identification information, which is to fold the content block identified by that identification information, and the content block processing rule corresponding to the "picture" block identification information, which is to reduce the pictures in the content block identified by that identification information by a predetermined ratio. The target webpage obtaining device 14 locates in the HTML file the content block identified by each item of identification information. According to the corresponding processing rules, it then folds and hides the content in the content block identified by "body", setting a predetermined triggering manner through which the body content can later be expanded and displayed, and reduces the pictures in the content block identified by "picture" by the predetermined ratio for display. The processed webpage is used as the target webpage.
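The processing step in this example can be sketched in Python under several assumptions: each content block is already available as a separate HTML fragment, folding is rendered with the HTML `<details>` element as one possible "triggering manner", pictures carry explicit `width` and `height` attributes, and the rule name "fold" is hypothetical ("show", "delete", and "zoomin" follow the rule names used in the text).

```python
import re
from dataclasses import dataclass


@dataclass
class ContentBlock:
    mark_name: str  # block identification information
    html: str       # markup of the content block


def _scale_images(html: str, ratio: float) -> str:
    """Reduce explicit width/height attribute values by the given ratio."""
    def shrink(m: re.Match) -> str:
        return f'{m.group(1)}="{max(1, round(int(m.group(2)) * ratio))}"'
    return re.sub(r'\b(width|height)="(\d+)"', shrink, html)


def apply_rule(block: ContentBlock, rule, ratio: float = 0.5) -> str:
    if rule == "delete":
        return ""
    if rule == "fold":  # hidden by default, expanded when the summary is activated
        return f"<details><summary>{block.mark_name}</summary>{block.html}</details>"
    if rule == "zoomin":  # rule name from the text; it denotes picture reduction
        return _scale_images(block.html, ratio)
    return block.html  # "show", or no rule found: leave the block unchanged


def build_target_page(blocks: list, rules: dict) -> str:
    """Apply each block's rule and join the results into the target page markup."""
    return "".join(apply_rule(b, rules.get(b.mark_name)) for b in blocks)
```

Leaving a block unchanged when no rule is found is a design choice of this sketch, not something the text prescribes.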

Those skilled in the art should understand that the above manner of obtaining the target webpage is only an example, and other existing or future ways of obtaining the target webpage, if applicable to the present invention, are also included within the scope of protection of the present invention and are incorporated herein by reference.

Preferably, the original webpage obtaining device 11, the identification information extracting device 12, the processing rule obtaining device 13, and the target webpage obtaining device 14 operate continuously. Specifically, the original webpage obtaining device 11 continuously acquires original webpages to be processed; the identification information extracting device 12 continuously extracts block identification information from the markup language file of each original webpage, wherein the block identification information is used to identify the content blocks in the markup language file; the processing rule obtaining device 13 continuously performs matching queries in the processing rule base according to the block identification information to obtain the corresponding content block processing rules; and the target webpage obtaining device 14 continuously processes the content blocks identified by the block identification information according to the content block processing rules to obtain target webpages. Here, those skilled in the art should understand that "continuously" means that each device keeps performing the acquisition of the original webpage, the extraction of the block identification information, the acquisition of the processing rule, and the acquisition of the target webpage, respectively, until a predetermined stop condition is met, for example until the original webpage obtaining device 11 has stopped acquiring original webpages to be processed for a long time.

Preferably (refer to FIG. 1), when no content block processing rule is obtained from the processing rule base, the processing rule obtaining device 13 may determine the content block processing rule according to content related information of the content block identified by the block identification information. Here, the content related information of the content block includes but is not limited to:

1) location information of the content of the content block in the original webpage;

2) the number of text characters contained in the content of the content block;

3) The tag information contained in the content block.

Those skilled in the art should understand that the above content related information is only an example, and other existing or future content related information, if applicable to the present invention, is also included within the scope of protection of the present invention and is incorporated herein by reference.

Correspondingly, the content block processing rule may be determined in the following ways:

1) The processing rule obtaining device 13 determines the processing rule according to the location of the content block in the original webpage; for example, if the content block identified by the block identification information is located at the center of the original webpage, that is, the content block is of high importance in the original webpage, the content block processing rule may be determined to be displaying the content block;

2) The processing rule obtaining device 13 determines the processing rule according to the number of text characters in the content block; for example, if the number of characters in the content block identified by the block identification information exceeds a predetermined character count threshold, the processing rule may be determined to be folding the text content in the content block;

3) The processing rule obtaining device 13 determines the processing rule according to the tag objects included in the content block; for example, if the content block identified by the block identification information in the markup language file of the original webpage includes the tag <object>, and the tag <object> contains an object that is predetermined to be restricted on mobile devices, such as ActiveX, the processing rule is determined to be deleting the content block.

In an example, the following code snippet exists in the HTML file of the original webpage:

<!-- tc block-begin: {markName: "embedded object"} -->
<OBJECT
classid="clsid:2F390484-1C7D-11D0-8908-00A0C90395F4"
codebase="ActiveXDoc.cab#version=1,0,0,0">
</OBJECT>
<!-- tc block-end -->

The block identification information is "embedded object". The processing rule obtaining device 13 fails to obtain a corresponding content block processing rule from the processing rule base according to the block identification information. It then finds that the <object> tag carries a clsid class identifier in its attributes, determines from this that the content block contains an ActiveX embedded object, and therefore determines that the processing rule corresponding to the block identification information is to delete the content block identified by the identification information.

Those skilled in the art should understand that the above manner of determining the processing rule is merely an example, and other existing or future ways of determining the processing rule, if applicable to the present invention, are also included within the scope of protection of the present invention and are incorporated herein by reference.
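The fallback determination from content related information can be sketched as a small decision function. The text does not say how the three criteria are ordered or what thresholds apply, so the precedence (restricted object first, then length, then position), the threshold values, the position measure, and the rule name "fold" are all assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BlockInfo:
    center_offset: float  # hypothetical measure: 0.0 = page center, 1.0 = page edge
    char_count: int       # number of text characters in the block
    tags: dict[str, dict[str, str]] = field(default_factory=dict)  # tag -> attributes


MAX_CHARS = 500      # assumed character count threshold
CENTER_LIMIT = 0.25  # assumed bound for "located at the center"


def infer_rule(info: BlockInfo) -> Optional[str]:
    """Determine a rule from content related information when the rule base has none."""
    obj = info.tags.get("object")
    if obj and obj.get("classid", "").lower().startswith("clsid:"):
        return "delete"  # ActiveX-style embedded object, restricted on mobile devices
    if info.char_count > MAX_CHARS:
        return "fold"    # long text is folded
    if info.center_offset <= CENTER_LIMIT:
        return "show"    # central blocks are treated as important
    return None          # nothing can be inferred from the available information
```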

FIG. 2 shows a schematic diagram of a device for processing webpage content based on content block identification in accordance with a preferred embodiment of the present invention. The processing device 1 further includes an updating device 15. The updating device 15 establishes or updates the processing rule base according to the newly determined content block processing rule.

Here, the functions of the devices 11, 12, 13, and 14 shown in FIG. 2 are the same as those of the devices 11, 12, 13, and 14 previously described with reference to FIG. 1; for the sake of brevity, they are incorporated here by reference and are not described again.

Specifically, when the processing rule obtaining device 13 does not obtain a corresponding content block processing rule from the processing rule base according to the identification information, it newly determines a content block processing rule for the identification information, and the updating device 15 writes the identification information and the corresponding newly determined processing rule into the processing rule base to update it. If it is detected that the processing rule base has not been established, the processing rule base is initialized first, and the above information is then written into it.

In an example, when the processing rule obtaining device 13 determines that the new processing rule corresponding to "embedded object" is deletion, the updating device 15 inserts into the processing rule base a data record containing the identification name and its corresponding processing rule.
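The establish-or-update behaviour can be sketched with a file-based rule base. Storing the rules as a JSON file and the function name `update_rule_base` are assumptions for illustration; a relational database or Key-Value store would follow the same pattern of initializing when absent and then writing the record.

```python
import json
from pathlib import Path


def update_rule_base(path: Path, mark_name: str, rule: str) -> dict[str, str]:
    """Write a newly determined rule into the rule base, creating it if needed."""
    if path.exists():
        rules = json.loads(path.read_text(encoding="utf-8"))
    else:
        rules = {}  # rule base not yet established: initialize it first
    rules[mark_name] = rule  # insert the record, or overwrite an older rule
    path.write_text(json.dumps(rules, ensure_ascii=False, indent=2), encoding="utf-8")
    return rules
```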

Those skilled in the art should understand that the above manner of establishing or updating the processing rule base is only an example, and other existing or future ways of establishing or updating the processing rule base, if applicable to the present invention, are also included within the scope of protection of the present invention and are incorporated herein by reference.

In another preferred embodiment (refer to FIG. 1), the processing device 1 further comprises a providing device (not shown). The original webpage obtaining device 11 acquires the original webpage according to a page access request input by the user through the mobile terminal, and the providing device provides the target webpage to the user.

This other preferred embodiment is described in detail below with reference to FIG. 1. The identification information extracting device 12 extracts block identification information from the markup language file of the original webpage, wherein the block identification information is used to identify the content blocks in the markup language file; subsequently, the processing rule obtaining device 13 performs a matching query in the processing rule base according to the block identification information to obtain a content block processing rule corresponding to the block identification information; the target webpage obtaining device 14 then performs corresponding processing on the content block identified by the block identification information according to the content block processing rule to obtain a target webpage. The specific process is the same as that performed by the identification information extracting device 12, the processing rule obtaining device 13, and the target webpage obtaining device 14 in the embodiment described above with reference to FIG. 1; for the sake of brevity, it is incorporated here by reference and is not described again.

In an example, when the user types in the address bar of the browser software of the mobile terminal, the mobile terminal acquires in real time the webpage URL input by the user and records the page access request corresponding to that input operation, wherein the page access request includes the URL; the page access request is then sent to the processing device 1 by an agreed communication method. The original webpage obtaining device 11 then receives the page access request in real time, extracts the page URL from it, sends a request for the webpage to the web server hosting the webpage pointed to by the URL, receives the webpage that the web server returns in response to the request, and uses that webpage as the original webpage to be processed.

The providing device provides the target webpage acquired by the target webpage obtaining device 14 to the user through the mobile terminal, using any known manner in which a mobile terminal presents human-readable information, such as screen display or speaker playback. Taking screen display as an example, the providing device provides the target webpage to the mobile terminal in a certain order and format through page technologies such as JSP, ASP or PHP, for example by means of a link or a page display, for the user to browse.

Those skilled in the art should understand that the above manner of obtaining the original webpage and/or of providing the target webpage is only an example; other existing or future ways of obtaining the original webpage and/or of providing the target webpage, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Preferably (refer to Fig. 1), the processing device 1 further comprises parameter obtaining means (not shown) and preferred rule obtaining means (not shown). The parameter obtaining means acquires display parameter information of the mobile terminal; the preferred rule obtaining means optimizes the content block processing rule according to the display parameter information to obtain a preferred content block processing rule; the target webpage obtaining means 14 performs corresponding processing on the content block according to the preferred content block processing rule to obtain the target webpage.

Specifically, the parameter obtaining means acquires, in an agreed manner, display parameter information of the mobile terminal by calling an API (application programming interface) provided by the mobile terminal that is to display the target webpage; the display parameter information includes but is not limited to:

1) Image formats supported by the mobile terminal, such as JPEG, PNG and GIF;

2) The screen resolution of the mobile terminal, such as its physical size in pixels and its number of color bits;

3) Whether the mobile terminal supports plug-ins, such as the Flash plug-in.

Then, the preferred rule obtaining means optimizes the content block processing rule that the processing rule obtaining means 13 acquired for each piece of identification information, according to the display parameter information of the mobile terminal acquired by the parameter obtaining means, to obtain a preferred content block processing rule. Subsequently, the target webpage obtaining means 14 performs corresponding processing on the content block according to the preferred content block processing rule to obtain the target webpage.

In an example, the block identification information acquired from the markup language file by the identification information extracting means 12 is "Flash", the identified content block contains a Flash animation, and the processing rule that the processing rule obtaining means 13 obtains from the processing rule base is to delete the Flash animation identified by the identification information. However, the display parameter information acquired by the parameter obtaining means indicates that the mobile terminal supports running the Flash plug-in. The preferred rule obtaining means therefore optimizes the original processing rule corresponding to the identification information into a rule that retains the Flash animation in the content block, which is the preferred content block processing rule. The target webpage obtaining means 14 then retains the Flash animation when processing the content block, and thereby obtains a target webpage that includes the Flash animation.

Those skilled in the art should understand that the above manner of obtaining the display parameter information, of obtaining the preferred content block processing rule and/or of obtaining the target webpage is only an example; other existing or future ways of doing so, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
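The Flash example above can be illustrated with a minimal Python sketch. It is not the implementation of the described device: the `DisplayParams` container, the `optimize_rule` function and the default values are hypothetical names chosen for illustration, and the rule names "delete" and "show" follow the rule names used in the examples of this description.

```python
from dataclasses import dataclass, field


@dataclass
class DisplayParams:
    """Hypothetical container for the display parameter information."""
    image_formats: set = field(default_factory=set)  # e.g. {"JPEG", "PNG", "GIF"}
    resolution: tuple = (320, 480)                    # assumed (width, height) in pixels
    supports_flash: bool = False                      # whether the Flash plug-in can run


def optimize_rule(block_id: str, rule: str, params: DisplayParams) -> str:
    """Return the preferred rule for one piece of block identification information."""
    # A "delete" rule for a Flash block is relaxed when the terminal can run Flash.
    if block_id == "Flash" and rule == "delete" and params.supports_flash:
        return "show"
    # In every other case the rule from the rule base is kept unchanged.
    return rule
```

Calling `optimize_rule("Flash", "delete", DisplayParams(supports_flash=True))` yields the relaxed rule, while a terminal without Flash support keeps the deletion rule.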

FIG. 3 illustrates a flow diagram of a method for processing web page content based on content block identification in accordance with an aspect of the present invention.

Here, the processing device 1 may be a network device, including but not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers. Here, the cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The processing device 1 may also be a mobile terminal, meaning computer equipment used while mobile, including but not limited to mobile phones, notebooks, POS machines and car computers, whose display is usually much smaller than that of a desktop computer.

The process of processing webpage content by processing device 1 is described in detail below with reference to FIG. 3:

Specifically, in step S1, the processing device 1 acquires the original web page to be processed.

1) According to a page access request from the mobile terminal, obtain the corresponding original webpage from the website server pointed to by the uniform resource locator (URL) in the page access request.

In an example, the user first interacts with the browser software or client software of the mobile terminal by means of an interaction device of the mobile terminal, including but not limited to a keyboard, a mouse, a remote controller, a touch pad or a handwriting device. Taking the keyboard as an example, when the user types in the address bar of the browser software of the mobile terminal, the mobile terminal acquires in real time the key sequence input by the user, for example a URL, and records the page access request corresponding to that input operation, wherein the page access request includes the URL; the page access request is then sent to the processing device 1 by an agreed communication method. Then, in step S1, the processing device 1 receives the page access request in real time, extracts the page URL from it, and sends a request for the webpage to the web server hosting the webpage pointed to by the URL, for example by encapsulating it as a request message, such as an HTTP request message, and sending it to the web server through a corresponding communication protocol, such as HTTP or HTTPS. The processing device 1 then receives the webpage that the web server returns in response to the request and uses it as the original webpage to be processed.

2) Obtain the original web page to be processed from the third-party device.

In another example, the processing device 1 is a network device. In step S1, the processing device 1, triggered by a predetermined condition or event or on a periodic basis, sends a request message for the original webpage to be processed to the third-party device through an application programming interface (API) provided by the third-party device, and receives the original webpage to be processed that the third-party device returns in response to the request message. Alternatively, the third-party device actively pushes the original webpage to be processed to the processing device 1, and in step S1 the processing device 1 receives it.

Next, in step S2, the processing device 1 extracts block identification information from the markup language file of the original webpage acquired in step S1, for example by string matching, wherein the block identification information is used to identify each content block in the markup language file.

Here, the markup language file includes but is not limited to:

4) A WML (Wireless Markup Language) file, which is a descriptive markup language used to create pages that can be displayed in a WAP browser.

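Those skilled in the art should understand that the above markup language files are only examples; other existing or future markup language files, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.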

Here, a content block means a content area composed of one or more tags in a markup language file, corresponding to specific content displayed in the webpage, such as a title content block, a body content block, a navigation content block, an image content block, an embedded object (such as a Java applet, ActiveX or Flash), and so on.

Here, the storage manner of the block identification information in the markup language file includes but is not limited to:

1) annotation in the markup language file; for example, using the JSON format, the identification information may be stored in an HTML file comment, such as <!-- tc block-begin: {type: "context"} -->, where JSON is a lightweight data exchange format that generally uses "name/value" pairs to represent data, with the name and the value separated by ":";
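A minimal Python sketch of reading such comment markers follows. It is an illustration under stated assumptions rather than the extraction method of the invention: the marker spelling `tc block-begin` is assumed, since the marker text is damaged in this copy of the description, and the function name `extract_comment_markers` is hypothetical. Because the example value `{type: "context"}` has an unquoted name, which strict JSON does not allow, the sketch reads the name/value pairs with a regular expression instead of a JSON parser.

```python
import re

# Assumed marker spelling, e.g. <!-- tc block-begin: {type: "context"} -->
MARKER = re.compile(r'<!--\s*tc block-begin:\s*\{(.*?)\}\s*-->')
# One name/value pair with an unquoted name and a double-quoted value.
PAIR = re.compile(r'(\w+)\s*:\s*"([^"]*)"')


def extract_comment_markers(markup):
    """Return one dict of name/value pairs per block-begin comment, in order."""
    markers = []
    for match in MARKER.finditer(markup):
        markers.append(dict(PAIR.findall(match.group(1))))
    return markers
```

Applied to a page containing the comment from the example above, the function returns a list with a single dictionary mapping `type` to `context`.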

In an example, the markup language file of the original webpage acquired by the processing device 1 is an XHTML file, such as:

<body>

<div markName="title">

<h2>News headline</h2>

</div>

<img src="…/flower.jpg" markName="picture" />

</body>

Here, the XHTML file predefines the content block identification information by means of a tag attribute named markName. Accordingly, in step S2, the processing device 1 parses the XHTML file and performs string matching on the keyword "markName". It thereby obtains the markName attribute of the div tag and its attribute value "title", which is the identification name of the content block corresponding to the div tag, and the markName attribute of the img tag and its attribute value "picture", which is the identification name of the content block corresponding to the img tag.
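The attribute-based extraction can be sketched in Python as follows. The description uses string matching on the keyword "markName"; this sketch swaps in the standard-library `html.parser` module instead, which is an assumption made for robustness, and the names `MarkNameExtractor` and `extract_block_ids` are hypothetical.

```python
from html.parser import HTMLParser


class MarkNameExtractor(HTMLParser):
    """Collects (tag, identification name) pairs for tags carrying markName."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # pairs in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so markName arrives as "markname".
        for name, value in attrs:
            if name == "markname":
                self.blocks.append((tag, value))


def extract_block_ids(markup):
    """Return the block identification names found in the markup."""
    parser = MarkNameExtractor()
    parser.feed(markup)
    parser.close()
    return parser.blocks
```

Self-closing tags such as the img tag are also reported, because the parser routes them through `handle_starttag` by default.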

Subsequently, in step S3, the processing device 1 performs a matching query in the processing rule base based on the block identification information acquired in step S2 to obtain a content block processing rule corresponding to the block identification information.

Specifically, in step S3, the processing device 1 performs a matching query in the processing rule base of the local or third-party device according to the block identification information to obtain a content block processing rule corresponding to the block identification information.

Here, the processing rule includes but is not limited to:

1) format the content in the content block, for example:

i) changing the text attributes in the content block, such as the font, size, color and background color of the content;

ii) reducing the picture included in the content block by a predetermined ratio;

2) display the content block;

3) delete the content block;

4) fold the content block;

5) adjust the display position of the content block.

In an example, the block identification information is "title". In step S3, the processing device 1 performs a matching query in the local processing rule base according to the block identification information, through an application programming interface (API) provided by the processing device 1, and obtains the content block processing rule corresponding to the "title" block identification information, which is "show"; that is, the content block identified by the block identification information is to be displayed.

In another example, the block identification information is "picture". In step S3, the processing device 1 sends a processing rule acquisition request containing the block identification information to the third-party device; for example, the request may be encapsulated as a request message, such as an HTTP request message, and sent to the third-party device through a corresponding communication protocol, such as HTTP or HTTPS. The third-party device, listening in real time, receives and parses the request and performs a matching query in the processing rule base according to the extracted block identification information, thereby obtaining the content block processing rule corresponding to the block identification information, which is "zoomin"; that is, the picture in the content block identified by the block identification information is to be reduced by a predetermined ratio.

Preferably, in step S3, the processing device 1 performs a matching query in the processing rule base according to the block identification information and the identification information of the website to which the original webpage belongs, to obtain a content block processing rule customized for the webpages of that website. Here, the identification information of the website to which the original webpage belongs includes, but is not limited to, a website domain name, a website IP address, a website name, and the like.

Specifically, in step S3, the processing device 1 determines the identification information of the website to which the webpage belongs, such as the website domain name or the website IP address, according to the URL of the original webpage to be processed, for example the URL obtained in step S1. It then performs a matching query in the processing rule base according to the block identification information acquired in step S2 and the identification information of the website to which the original webpage belongs; if the query matches a processing rule predetermined for the webpages of that website, that predetermined processing rule is used as the content block processing rule of the webpage.

In an example, the block identification information is "embedded object" and the URL of the original webpage is "www.abc.com/sport/101.htm". In step S3, the processing device 1 extracts from the URL the domain name of the website to which the webpage belongs, "www.abc.com". A matching query in the processing rule base according to the block identification information alone yields the processing rule "delete", that is, the content block identified by the identification information is to be deleted. However, a matching query according to both the block identification information and the domain name of the website to which the original webpage belongs yields the processing rule predetermined for that website for the "embedded object" block identification information, which is "show", that is, the content block identified by the identification information is to be displayed. The processing device 1 therefore ignores the deletion rule corresponding to the block identification information alone and uses the processing rule predetermined for the website as the content block processing rule.
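The site-specific lookup in this example can be sketched in Python as below. The two in-memory dictionaries stand in for the processing rule base and are assumptions, as are the function names `site_domain` and `lookup_rule`; the rule names and the domain are taken from the examples in this description.

```python
from urllib.parse import urlsplit

# Hypothetical rule base: generic rules keyed by block identification
# information, and site-specific rules keyed by (domain, identification).
GENERIC_RULES = {"title": "show", "picture": "zoomin", "embedded object": "delete"}
SITE_RULES = {("www.abc.com", "embedded object"): "show"}


def site_domain(url):
    """Extract the domain name from a URL that may lack a scheme."""
    # urlsplit only fills netloc when the URL has a scheme or a leading "//".
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).netloc


def lookup_rule(block_id, page_url):
    """Prefer a rule predetermined for the website over the generic rule."""
    site_rule = SITE_RULES.get((site_domain(page_url), block_id))
    if site_rule is not None:
        return site_rule
    return GENERIC_RULES.get(block_id)  # None when the rule base has no match
```

A `None` result corresponds to the case in which no content block processing rule is obtained from the processing rule base.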

Then, in step S4, the processing device 1 performs corresponding processing on the content block identified by the block identification information according to the content block processing rule acquired in step S3 to obtain a target web page.

In an example, in step S2, the processing device 1 parses the HTML file of a webpage and obtains two pieces of block identification information, "body" and "picture". In step S3, the processing device 1 obtains the content block processing rule corresponding to the "body" block identification information, which is to fold the content block identified by that identification information, and the content block processing rule corresponding to the "picture" block identification information, which is to reduce the picture in the identified content block by a predetermined ratio. Then, in step S4, the processing device 1 locates in the HTML file the content block identified by each piece of identification information and processes it according to the corresponding rule: the content of the content block identified by "body" is folded and hidden, with a predetermined trigger set so that the body content can later be expanded and displayed, and the picture in the content block identified by "picture" is reduced by the predetermined ratio for display. The processed webpage is used as the target webpage.
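Step S4 in this example can be sketched in Python as follows. The sketch makes several assumptions that are not part of the description: the rule name "fold" is hypothetical, folding is realized with the HTML5 `details` element as the expand trigger, and reduction simply scales numeric `width` and `height` attributes. The function name `apply_rule` and the default ratio are likewise illustrative.

```python
import re

# Numeric width/height attributes, e.g. width="400".
_DIMENSION = re.compile(r'\b(width|height)="(\d+)"')


def apply_rule(block_html, rule, ratio=0.5):
    """Return the markup of one content block after applying its rule."""
    if rule == "delete":
        return ""
    if rule == "fold":
        # The summary element acts as the trigger that expands the content.
        return "<details><summary>Show more</summary>" + block_html + "</details>"
    if rule == "zoomin":
        def shrink(match):
            scaled = int(int(match.group(2)) * ratio)
            return '{}="{}"'.format(match.group(1), scaled)
        return _DIMENSION.sub(shrink, block_html)
    # "show" and any unrecognized rule leave the block unchanged.
    return block_html
```

The target webpage would then be assembled by replacing each identified content block in the original markup with the value returned for it.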

Preferably, the processing device 1 performs steps S1, S2, S3 and S4 continuously. Specifically, in step S1, the processing device 1 continuously acquires original webpages to be processed; in step S2, it continuously extracts block identification information from the markup language file of each original webpage, wherein the block identification information is used to identify each content block in the markup language file; in step S3, it continuously performs a matching query in the processing rule base according to the block identification information to obtain the corresponding content block processing rule; and in step S4, it continuously performs corresponding processing on the content block identified by the block identification information according to the content block processing rule to obtain the target webpage. Here, those skilled in the art should understand that "continuously" means that the processing device 1 keeps performing the acquisition of the original webpage, the extraction of the block identification information, the acquisition of the processing rule and the acquisition of the target webpage in the respective steps until a predetermined stop condition is satisfied, for example until the processing device 1 has not acquired any original webpage to be processed for a long time.

Preferably (refer to FIG. 3), when the content block processing rule is not obtained from the processing rule base, in step S3 the processing device 1 may determine the content block processing rule according to the content related information of the content block identified by the block identification information.

Here, the content related information of the content block includes but is not limited to:

1) location information of the content of the content block in the original webpage;

2) the number of text characters contained in the content of the content block;

3) The tag information contained in the content block.

1) In step S3, the processing device 1 determines a processing rule according to the location of the content block in the original webpage; for example, if the content block identified by the block identification information is located at the center of the original webpage, which indicates that the content block is of high importance in the original webpage, the content block processing rule may be determined to be displaying the content block;

2) In step S3, the processing device 1 determines a processing rule according to the number of text characters in the content block; for example, if the number of characters in the content block identified by the block identification information exceeds a predetermined character count threshold, the processing rule may be determined to be folding the text content in the content block;

3) In step S3, the processing device 1 determines a processing rule according to the tag objects included in the content block; for example, if the content block identified by the block identification information in the markup language file of the original webpage includes the tag <object>, and the tag <object> contains an object that is predetermined to be restricted on mobile devices, such as ActiveX, the processing rule is determined to be deleting the content block. For example:

<!-- tc block-begin: {markName: "embedded object"} -->

<OBJECT …>

</OBJECT>

<!-- tc block-end -->

The block identification information present therein is "embedded object". In step S3, the processing device 1 fails to obtain a corresponding content block processing rule from the processing rule base according to the block identification information; by parsing the tag <object>, it finds that the tag has the attribute clsid and thus determines that an ActiveX embedded object is included, and therefore determines that the processing rule corresponding to the block identification information is to delete the content block identified by the identification information.
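The three fallback criteria above can be combined in a short Python sketch. It is only one possible arrangement: the order in which the criteria are checked, the character threshold of 500, the rule name "fold" and the function names are assumptions. The ActiveX check relies on the common HTML convention that such an object tag carries a `classid` attribute whose value starts with `clsid:`, which is an interpretation of the clsid attribute mentioned in the description.

```python
def has_activex(tags):
    """tags: iterable of (tag name, attribute dict) pairs found in the block."""
    for name, attrs in tags:
        classid = attrs.get("classid", "")
        if name.lower() == "object" and classid.lower().startswith("clsid:"):
            return True
    return False


def infer_rule(position, char_count, tags, char_threshold=500):
    """Determine a rule when the rule base returned no match."""
    # Criterion 3: restricted embedded objects are removed first.
    if has_activex(tags):
        return "delete"
    # Criterion 2: long text is folded so that it can be expanded on demand.
    if char_count > char_threshold:
        return "fold"
    # Criterion 1: a block at the center of the page is treated as important.
    if position == "center":
        return "show"
    return None  # no criterion applies; the caller decides what to do
```

A rule inferred this way is the "newly determined" rule that may subsequently be written back to the processing rule base.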

FIG. 4 illustrates a flow chart of a method for processing web page content based on content block identification in accordance with a preferred embodiment of the present invention. The process further includes step S5. In step S5, the processing device 1 establishes or updates the processing rule base according to the newly determined content block processing rule.

Here, the operations performed by the processing device 1 shown in FIG. 4 in steps S1, S2, S3 and S4 are the same as those performed by the processing device 1 in steps S1, S2, S3 and S4 described above with reference to FIG. 3; for the sake of brevity, they are incorporated herein by reference and not described again.

Specifically, when, in step S3, the processing device 1 fails to obtain a corresponding content block processing rule from the processing rule base according to the identification information and therefore newly determines a content block processing rule for that identification information, then in step S5 the processing device 1 writes the identification information and the corresponding newly determined processing rule into the processing rule base to update it. If it detects that the processing rule base has not yet been established, it first initializes the processing rule base and then writes the above information into it.

In an example, when the new processing rule that the processing device 1 determines in step S3 for the tag name "inline object" is deletion, then in step S5 the processing device 1 inserts into the processing rule base a data record containing the tag name and its corresponding processing rule.
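Step S5 can be sketched with Python's standard `sqlite3` module. The description does not say how the processing rule base is stored, so the use of SQLite, the table layout and the function names `open_rule_base`, `save_rule` and `find_rule` are all assumptions made for illustration.

```python
import sqlite3


def open_rule_base(path=":memory:"):
    """Open the rule base, initializing it first if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rules ("
        "block_id TEXT PRIMARY KEY, rule TEXT NOT NULL)"
    )
    return conn


def save_rule(conn, block_id, rule):
    """Insert a newly determined rule, or overwrite the stored one."""
    conn.execute(
        "INSERT OR REPLACE INTO rules (block_id, rule) VALUES (?, ?)",
        (block_id, rule),
    )
    conn.commit()


def find_rule(conn, block_id):
    """Return the stored rule for the identifier, or None if there is none."""
    row = conn.execute(
        "SELECT rule FROM rules WHERE block_id = ?", (block_id,)
    ).fetchone()
    return row[0] if row else None
```

Creating the table with `IF NOT EXISTS` covers both cases named above: an existing rule base is reused, and a missing one is initialized before the record is written.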

Those skilled in the art should understand that the above manner of establishing or updating the processing rule base is only an example; other existing or future ways of establishing or updating the processing rule base, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

In another preferred embodiment (see Figure 3), the process further includes a step S6 (not shown). In step S1, the processing device 1 acquires the original webpage according to a page access request input by the user through the mobile terminal; in step S6, the processing device 1 provides the target webpage to the user.

This other preferred embodiment is described in detail below with reference to FIG. 3. In step S2, the processing device 1 extracts block identification information from the markup language file of the original webpage, wherein the block identification information is used to identify each content block in the markup language file; subsequently, in step S3, the processing device 1 performs a matching query in the processing rule base according to the block identification information to obtain a content block processing rule corresponding to the block identification information; then, in step S4, the processing device 1 performs corresponding processing on the content block identified by the block identification information according to the content block processing rule to obtain a target webpage. The specific processes are the same as those performed by the processing device 1 in steps S2, S3 and S4 in the embodiment described above with reference to FIG. 3; for the sake of brevity, they are incorporated herein by reference and not described again.

In an example, when the user types in the address bar of the browser software of the mobile terminal, the mobile terminal acquires in real time the webpage URL input by the user and records the page access request corresponding to that input operation, wherein the page access request includes the URL; the page access request is then sent to the processing device 1 by an agreed communication method. Then, in step S1, the processing device 1 receives the page access request in real time, extracts the page URL from it, sends a request for the webpage to the web server hosting the webpage pointed to by the URL, receives the webpage that the web server returns in response to the request, and uses that webpage as the original webpage to be processed.

In step S6, the processing device 1 provides the target webpage acquired in step S4 to the user through the mobile terminal, using any known manner in which a mobile terminal presents human-readable information, such as screen display or speaker playback. Taking screen display as an example, in step S6 the processing device 1 provides the target webpage acquired in step S4 to the mobile terminal in a certain order and format through page technologies such as JSP, ASP or PHP, for example by means of a link or a page display, for the user to browse.

Those skilled in the art should understand that the above manner of obtaining the original webpage and/or of providing the target webpage is only an example; other existing or future ways of obtaining the original webpage and/or of providing the target webpage, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Preferably (see Fig. 3), the process further includes a step S7 (not shown) and a step S8 (not shown). In step S7, the processing device 1 acquires display parameter information of the mobile terminal; in step S8, the processing device 1 optimizes the content block processing rule according to the display parameter information to obtain a preferred content block processing rule; in step S4, the processing device 1 performs corresponding processing on the content block according to the preferred content block processing rule to obtain the target webpage.

Specifically, in step S7, the processing device 1 acquires, in an agreed manner, display parameter information of the mobile terminal by calling an API (application programming interface) provided by the mobile terminal that is to display the target webpage; the display parameter information includes but is not limited to:

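1) Image formats supported by the mobile terminal, such as JPEG, PNG and GIF;

2) The screen resolution of the mobile terminal, such as its physical size in pixels and its number of color bits;

3) Whether the mobile terminal supports plug-ins, such as the Flash plug-in.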

Next, in step S8, the processing device 1 optimizes the content block processing rule acquired for each piece of identification information in step S3, according to the display parameter information of the mobile terminal acquired in step S7, to obtain a preferred content block processing rule. Then, in step S4, the processing device 1 performs corresponding processing on the content block according to the preferred content block processing rule to obtain the target webpage.

In an example, the block identification information that the processing device 1 acquires from the markup language file in step S2 is "Flash", the identified content block contains a Flash animation, and the processing rule that the processing device 1 obtains from the processing rule base in step S3 is to delete the Flash animation identified by the identification information. However, the display parameter information acquired by the processing device 1 in step S7 indicates that the mobile terminal supports running the Flash plug-in. In step S8, the processing device 1 therefore optimizes the original processing rule corresponding to the identification information into a rule that retains the Flash animation in the content block, which is the preferred content block processing rule. In step S4, the processing device 1 then retains the Flash animation when processing the content block, and thereby obtains a target webpage that includes the Flash animation.

Those skilled in the art should understand that the above manner of obtaining the display parameter information, of obtaining the preferred content block processing rule and/or of obtaining the target webpage is only an example; other existing or future ways of doing so, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

It is apparent to those skilled in the art that the present invention is not limited to the details of the above-described exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from its spirit or essential characteristics. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are intended to be embraced in the present invention. Any reference signs in the claims should not be construed as limiting the claim concerned. In addition, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or devices recited in the device claims may also be implemented by a single unit or device through software or hardware. Terms such as first and second are used to denote names and do not represent any particular order.

Claims

1. A computer-implemented method for processing webpage content based on content block identification, wherein the method comprises the following steps:

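a. obtaining the original webpage to be processed;

b. extracting block identification information from the markup language file of the original webpage, wherein the block identification information is used to identify each content block in the markup language file;

c. performing, according to the block identification information, a matching query in the processing rule base to obtain a content block processing rule corresponding to the block identification information;

d. performing, according to the content block processing rule, corresponding processing on the content block identified by the block identification information to obtain a target webpage.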

2. The method according to claim 1, wherein the step c comprises:

- performing a matching query in the processing rule base according to the block identification information and identification information of the website to which the original webpage belongs, to obtain the content block processing rule.

3. The method according to claim 1 or 2, wherein the content block processing rule comprises at least one of the following:

- formatting the content in the content block;

- presenting the content block;

- deleting the content block;

- folding the content block.

4. The method according to claim 1 or 2, wherein the step c comprises:

- determining the content block processing rule according to content related information of the content block identified by the block identification information.

5. The method according to claim 4, wherein the content related information comprises at least one of:

- location information of the content of the content block in the original web page;

- the number of text characters contained in the content of the content block;

- tag information contained in the content block.

6. The method according to claim 4, wherein the method further comprises:

- establishing or updating the processing rule base according to the newly determined content block processing rule.

7. The method according to claim 1 or 2, wherein the step a comprises:

- obtaining the original webpage according to a page access request input by a user through a mobile terminal;

wherein the method further comprises:

- providing the target web page to the user.

8. The method according to claim 7, wherein the method further comprises:

- obtaining display parameter information of the mobile terminal;

- optimizing the content block processing rule according to the display parameter information to obtain a preferred content block processing rule;

wherein the step d comprises:

- performing corresponding processing on the content block according to the preferred content block processing rule to obtain the target web page.

9. The method according to claim 1 or 2, wherein the storage manner of the block identification information in the markup language file comprises at least one of the following:

- a comment in the markup language file;

- a custom tag in the markup language file;

- a tag attribute in the markup language file.

10. The method according to claim 1 or 2, wherein the markup language file comprises at least one of the following:

- HTML file;

- XML file;

- XHTML file;

- WML file.

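11. A device for processing webpage content based on content block identification, wherein the device comprises:

an original webpage obtaining means, configured to obtain an original webpage to be processed;

an identification information extracting means, configured to extract block identification information from a markup language file of the original webpage, wherein the block identification information is used to identify content blocks in the markup language file;

a processing rule obtaining means, configured to perform, according to the block identification information, a matching query in a processing rule base to obtain a content block processing rule corresponding to the block identification information;

a target webpage obtaining means, configured to perform, according to the content block processing rule, corresponding processing on the content block identified by the block identification information, to obtain a target webpage.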

12. The device according to claim 11, wherein the processing rule obtaining means is configured to perform a matching query in the processing rule base according to the block identification information and identification information of the website to which the original webpage belongs, to obtain the content block processing rule.

13. The device according to claim 11 or 12, wherein the content block processing rule comprises at least one of the following:

- formatting the content in the content block;

- presenting the content block;

- deleting the content block;

- folding the content block.

14. The device according to claim 11 or 12, wherein the processing rule obtaining means is configured to acquire content related information of the content block identified by the block identification information, and to determine the content block processing rule according to the content related information.

15. The device according to claim 14, wherein the content related information comprises at least one of the following:

- the number of text characters contained in the content of the content block;

- tag information contained in the content block.

16. The device according to claim 14, wherein the device further comprises:

an updating means, configured to establish or update the processing rule base according to the newly determined content block processing rule.

17. The device according to claim 11 or 12, wherein the original webpage obtaining means is configured to obtain the original webpage according to a page access request input by a user through a mobile terminal;

wherein the device further comprises:

a providing means, configured to provide the target webpage to the user.

18. The device according to claim 17, wherein the device further comprises:

a parameter obtaining means, configured to obtain display parameter information of the mobile terminal;

an optimization means, configured to optimize the content block processing rule according to the display parameter information to obtain a preferred content block processing rule;

wherein the target webpage obtaining means is configured to perform corresponding processing on the content block according to the preferred content block processing rule to obtain the target webpage.

19. The device according to claim 11 or 12, wherein the storage manner of the block identification information in the markup language file comprises at least one of the following: