CN103064943B - A kind of client device - Google Patents


Info

Publication number
CN103064943B
Authority
CN
China
Prior art keywords
webpage
matching
content
setting
matching setting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210573088.7A
Other languages
Chinese (zh)
Other versions
CN103064943A (en)
Inventor
谢洲为
潘洪学
糜裕峰
任寰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201210573088.7A
Publication of CN103064943A
Application granted
Publication of CN103064943B

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a client device on which a browser is installed. The browser is provided with a device capable of extracting webpage text content. According to a web page browsing instruction from the user, the client device starts the device capable of extracting webpage text content and displays the extracted webpage text content to the user in the browser. The device capable of extracting webpage text content includes: a matching setting configuration unit, adapted to preset at least one webpage text content matching setting on the browser side; a download unit, adapted to download webpage content on the browser side; a matching unit, adapted to match the webpage content against each webpage text content matching setting in turn until the webpage content is successfully matched; and an extraction unit, adapted to extract the webpage text content from the webpage content by using the webpage text content matching setting that successfully matched the webpage content.

Description

Client device
Technical Field
The present invention relates to the field of network technologies, and in particular, to a client device.
Background
With the popularization of internet technology, networks have become one of the important ways for people to obtain information, and text contents in web pages are the main carriers of information. However, in general, a web page includes a lot of useless information such as advertisement pictures and non-article content besides text content, which seriously affects the reading experience of a user.
In the prior-art scheme for extracting the text content of a web page, after the web page is loaded in the browser, its content is split, the content is then located using a matching rule file in the browser, and the required field content is extracted and displayed. The user thus sees the web page after text screening and can read it attentively.
The existing scheme for extracting the webpage text content at least has the following defects:
The existing scheme sets a matching rule file for a certain preset web page structure, and that file is only suitable for extracting web page text content under the preset structure. However, because network resources are updated very quickly, web page structures change frequently. The existing matching rule file cannot extract text from a changed web page, so a new matching rule file has to be generated and set in the browser. As a result, the matching operation is complex, the workload is large, and the efficiency is low.
Disclosure of Invention
In view of the above, the present invention has been made to provide a client device that overcomes or at least partially solves the above problems.
According to an embodiment of the present invention, a client device is provided, where a browser is installed on the client device and a device capable of extracting webpage text content is provided in the browser.
The client device starts the device capable of extracting webpage text content according to a web page browsing instruction from the user, and displays the webpage text content extracted by that device to the user in the browser.
The device capable of extracting webpage text content comprises:
a matching setting configuration unit, adapted to preset at least one webpage text content matching setting on the browser side;
a downloading unit, adapted to download the webpage content on the browser side;
a matching unit, adapted to match the webpage content against each webpage text content matching setting in turn until the webpage content is successfully matched; and
an extraction unit, adapted to extract the webpage text content from the webpage content by using the webpage text content matching setting that successfully matched the webpage content.
The matching setting configuration unit is suitable for establishing a matching setting file and storing at least one webpage text content matching setting in the matching setting file; the matching setting file comprises at least one website node, each website node comprises at least one webpage node, at least part of the webpage nodes are provided with more than two matching setting description nodes, each matching setting description node corresponds to a webpage text content matching setting, and the matching settings of at least two webpage text contents respectively comprise different matching setting items for the same type of text contents.
The matching unit is suitable for searching website nodes and webpage nodes corresponding to the webpage content in the matching setting file; under the searched webpage node, matching the webpage content with the matching setting items in the first matching setting description node in the webpage node in sequence; setting the matching result as the webpage text content extracted by using the matching setting item for the matching setting item successfully matched; and for the matching setting item with the matching failure, searching the matching setting item corresponding to the matching setting item with the matching failure in the matching setting description nodes except the first matching setting description node in the webpage nodes, matching the searched matching setting item with the webpage content until the searched matching setting item is successfully matched with the webpage content, and setting the matching result as the webpage text content extracted according to the matching setting item.
The extracting unit is suitable for taking all webpage text contents extracted according to the matching setting items successfully matched as the webpage text contents in the identified webpage contents.
The matching setting configuration unit is suitable for establishing a website node for each type of website; under a website node, establishing a webpage node for each type of webpage under a website corresponding to the website node; establishing a matching setting item in a matching setting description node of each webpage node according to the content of the webpage, wherein in a first matching setting description node of the webpage node, at least one matching setting item is established for each type of text content in the webpage corresponding to the webpage node; and for the same type of text content in the webpage, the matching setting items established in the first matching setting description nodes are different from the matching setting items established in the matching setting description nodes except the first matching setting description nodes in the webpage.
The matching setting configuration unit is further adapted to set a download mode attribute and an element filtering attribute in the web page node, where the filtering mode indicated by the element filtering attribute includes one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript, filtering frames, filtering objects, and filtering embedded content. The device also comprises a loading control unit and a filtering unit.
The loading control unit is adapted to judge, before the webpage content is sequentially matched with the matching setting items in the first matching setting description node under the found web page node, whether the attribute value of the download mode attribute in the found web page node is a preset value. If so, it starts the filtering unit, and the filtered webpage content is then sequentially matched with the matching setting items in the first matching setting description node under the found web page node; if not, the webpage content is downloaded directly into the browser.
The filtering unit is adapted to filter the content in the web page according to the filtering mode indicated by the element filtering attribute.
The webpage text content matching setting configured by the matching setting configuration unit comprises a web page URL matching setting item established for the uniform resource locator (URL) of the webpage content.
The web page URL matching setting item comprises a matching attribute setting item, which includes:
the web page URL begins with predetermined content; and/or the web page URL contains predetermined content, where a predetermined position of the predetermined content may hold any character; and/or the web page URL does not contain predetermined content, where that predetermined content may include arbitrary characters.
The web page URL matching setting item established by the matching setting configuration unit further comprises a web page identification attribute setting item, a web page identification extraction attribute setting item, and a conversion attribute setting item:
the web page identification attribute setting item uses the characters at preset positions in the web page URL as the web page identifier of the webpage content;
the web page identification extraction attribute setting item selects the characters at a preset position from the web page identifier obtained through the web page identification attribute setting item and uses them as the web page identifier;
the conversion attribute setting item combines the acquired web page identifier of the webpage content with the composition format of the URL to obtain the URL of the web page.
The webpage URL matching setting item established by the matching setting configuration unit further comprises a webpage title extraction attribute setting item, and the webpage title extraction attribute setting item comprises: extracting the content before the preset characters in the webpage content as a title.
The matching setting configuration unit is also suitable for establishing at least one matching setting item for the hypertext markup language (HTML) element of each type of text content in the webpage content in the first matching setting description node;
The matching setting items established for the HTML elements include a primary positioning matching setting item, which at least includes:
a base point lookup setting item: indicating the base point lookup mode, where the modes include lookup by identification, lookup by name, lookup by class name, lookup by content, and lookup by expression; and/or
an identification positioning setting item: locating an element that matches the identification of the HTML element; and/or
a name positioning setting item: locating an element that matches the name of the HTML element; and/or
a class name positioning setting item: locating an element that matches the class name of the HTML element; and/or
a content positioning setting item: locating an element that matches the content of the HTML element; and/or
an expression positioning setting item: locating elements in the HTML elements that match the expression; and/or
a label setting item: indicating the type and/or attribute of the element located when the element is located using the identification positioning setting item, the name positioning setting item, the class name positioning setting item, the content positioning setting item, or the expression positioning setting item.
The matching setting item established by the matching setting configuration unit for the HTML element further comprises a secondary positioning matching setting item, which at least includes:
a parent query setting item: setting the mode of searching for a parent element of the element located by the primary positioning matching setting item; or
a child query setting item: setting the mode of searching for child elements of the element located by the primary positioning matching setting item; or,
when the parent query setting item and the child query setting item both exist, the parent element of the element located by the primary positioning matching setting item is first searched for according to the parent query setting item, and then the child element is searched for under the found parent element according to the child query setting item.
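The primary-then-secondary positioning described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `locate` helper, its `parent_levels` and `child_path` parameters, and the sample markup are all hypothetical, and the standard-library `xml.etree.ElementTree` stands in for a browser DOM.

```python
import xml.etree.ElementTree as ET

def locate(root, element_id, parent_levels=0, child_path=None):
    """Primary positioning: find the element by id.
    Secondary positioning: climb parent_levels ancestors (parent query),
    then optionally descend along child_path (child query)."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = next((e for e in root.iter() if e.get("id") == element_id), None)
    for _ in range(parent_levels):
        if node is None:
            return None
        node = parents.get(node)
    if node is not None and child_path:
        node = node.find(child_path)
    return node

# Hypothetical page fragment used only for illustration.
DOC = ET.fromstring(
    "<div><div class='box'><span id='anchor'>Next</span>"
    "<p>Body text</p></div></div>"
)

anchor = locate(DOC, "anchor")                                # primary only
box = locate(DOC, "anchor", parent_levels=1)                  # parent query
body = locate(DOC, "anchor", parent_levels=1, child_path="p") # parent, then child
```

When both queries are given, the parent query runs first and the child query starts from the parent it found, which matches the order stated above.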
Wherein, the matching setting item established by the matching setting configuration unit for the HTML element further comprises: an element deletion match setting item, the element deletion match setting item including at least:
deleting predetermined contents in the elements located by the primary or secondary location matching setting items, and/or
Changing the predetermined content in the element located by the primary or secondary location matching setting item.
The device also comprises a matching setting updating unit which is suitable for updating the website node, the webpage node, the matching setting description node and/or the matching setting item in the matching setting description node according to the received updating instruction after the matching setting file is established.
The device also comprises a multithreading control unit. The multithreading control unit is suitable for distributing a thread for each webpage content when a plurality of downloaded webpage contents exist on the browser side, and controlling the matching unit to match the corresponding webpage contents with the webpage text contents respectively in the distributed threads until the webpage contents are successfully matched; and/or the multithreading control unit is suitable for distributing a plurality of threads for the webpage content at the browser side, and controls the matching unit to match the webpage content with different webpage text content matching settings in different threads respectively until the webpage content is successfully matched.
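The two threading arrangements described above can be sketched with Python's standard `concurrent.futures` module. This is only an illustrative sketch: the function names, the use of regular expressions as stand-ins for matching settings, and the sample pages are assumptions, not part of the described device.

```python
from concurrent.futures import ThreadPoolExecutor
import re

# Hypothetical matching settings, reduced to one regular expression each.
PATTERNS = [r'id="content"', r'class="body"']

def first_matching_profile(page_text, patterns):
    """Try the settings in order; return the index of the first match or -1."""
    for index, pattern in enumerate(patterns):
        if re.search(pattern, page_text):
            return index
    return -1

def match_pages(pages, patterns):
    """Arrangement 1: one task per downloaded page; results keep input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(pages))) as pool:
        return list(pool.map(lambda p: first_matching_profile(p, patterns), pages))

def match_one_page(page_text, patterns):
    """Arrangement 2: one task per matching setting for a single page.
    Returns the lowest index that matched, or -1."""
    with ThreadPoolExecutor(max_workers=max(1, len(patterns))) as pool:
        hits = list(pool.map(lambda pat: bool(re.search(pat, page_text)), patterns))
    return hits.index(True) if True in hits else -1

pages = ['<div id="content">', '<div class="body">', "<p>"]
per_page = match_pages(pages, PATTERNS)
single = match_one_page('<div class="body">', PATTERNS)
```

The first arrangement helps when several pages are downloaded at once; the second shortens the wait for a single page when many settings must be tried.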
The device further comprises an input unit and an uploading unit. The input unit is suitable for receiving a selection instruction which is sent by a user and used for selecting the webpage text content matching setting; the matching setting configuration unit is also suitable for establishing a matching setting file according to the selection instruction and storing the matching setting of the webpage text content in the selection instruction in the established matching setting file; and the uploading unit is suitable for uploading the matching setting file to the server and storing the matching setting file in the user data of the server-side user.
The device further comprises a starting control unit which is suitable for starting the matching unit to execute the operation of matching the webpage content with the webpage text content respectively when a file completion event indicating that the browser is completely loaded is monitored.
The matching unit is also suitable for analyzing the downloaded webpage content in a layering way to obtain a DOM structure of the webpage content; and matching the webpage content with the webpage text content according to the DOM structure of the webpage content.
As described above, according to the embodiment of the present invention, by establishing a plurality of web page text content matching settings on the browser side and matching the same web page text content with the plurality of web page text content matching settings, when the web page content changes, a web page text content matching setting matching the changed web page can be found from the plurality of web page text content matching settings, so that the web page text content can be extracted by using the web page text content matching setting matching successfully. In addition, the scheme avoids the operation that a new matching rule file needs to be generated and set in the browser when the webpage content changes, simplifies the operation of realizing matching, reduces the workload and improves the efficiency.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a diagram illustrating an apparatus for extracting text content of a web page according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for extracting text content of a web page according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
One embodiment of the invention provides a device capable of extracting webpage text content, which can provide more convenient and concentrated reading service for a user on the premise of ensuring the text extraction speed and stability. Referring to fig. 1, the apparatus includes a matching setting configuration unit 100, a download unit 101, a matching unit 102, an extraction unit 103, a load control unit 104, a filtering unit 105, a matching setting update unit 106, a multi-thread control unit 107, an input unit 108, and an upload unit 109. The respective units will be described below.
The matching setting configuration unit 100 is adapted to preset at least one webpage text content matching setting on the browser side. Specifically, the matching setting configuration unit 100 is adapted to establish a matching setting file and store at least one webpage text content matching setting in the matching setting file; the matching setting file comprises at least one website node, each website node comprises at least one webpage node, at least part of the webpage nodes are provided with more than two matching setting description nodes, and each matching setting description node corresponds to a webpage text content matching setting. The matching setting description node can comprise one or more matching setting items, and at least two webpage text content matching settings respectively comprise different matching setting items for the same type of text content.
The matching setting configuration unit 100 establishes a website node for each type of website, that is, one website node corresponds to one type of website; under a website node, a webpage node is established for each type of webpage under a website corresponding to the website node, namely, one webpage node corresponds to one type of webpage. And establishing a matching setting item in the matching setting description node of each webpage node according to the content of the webpage. And if the contents contained in different web pages are different, the matching setting items in the corresponding matching setting description nodes are also different.
Since there may be some fixed information that does not change frequently and some variable information that is easy to change in a common web page, the matching setting configuration unit 100 determines a matching setting description node among the matching setting description nodes in the web page node as a first matching setting description node, where a matching setting item included in the first matching setting description node is most comprehensive and includes at least one matching setting item established for each type of text content in the web page. However, in the matching setting description nodes other than the first matching setting description node, the matching setting items may be established only for the variable information in the web page, and the matching setting items established in the matching setting description nodes other than the first matching setting description node in the web page node may be different from each other.
On the one hand, this processing mode simplifies the structure of the webpage text content matching settings, avoids repeated parts across different matching settings, reduces the amount of matching setting data that must be stored, and improves resource utilization; on the other hand, it avoids repeated matching operations on the same webpage content and improves matching efficiency.
The matching setting file is described below with reference to a code example.
The nodes of the matching setting file in that code are described as follows:
<websites> general website node: this node is the outermost parent node. It corresponds to one matching setting file and is made up of several website (<website>) nodes.
<website> node: each website node represents a supported website, and one or more webpage nodes are arranged in one website node, such as the book webpage node, the catalog webpage node, and the chapter webpage node arranged under the website node www.feiku.com. A download mode (downloadmode) attribute and an element filter (elementary filter) attribute are also set in the webpage node.
<book> webpage node: describes the information of the novel's main page. Two matching setting description nodes <profile> are arranged under this webpage node. The <profile> serving as the first matching setting description node configures several matching setting items: the URL (Uniform Resource Locator) matching setting item describes the matching of the related URL and acquires the bookid (webpage identifier) information; the title matching setting item describes how to acquire the title of the novel's home page; the catalog URL matching setting item describes the catalog URL of the novel; the lastercapter (latest section) matching setting item describes the latest section; the lastercapterurl (latest chapter URL) matching setting item describes the URL of the latest chapter.
<catalog> webpage node: describes the novel's directory page information. Only one matching setting description node is set under this webpage node, and it includes: the URL matching setting item, which describes the related URL matching and acquires the bookid information; the chapterlist matching setting item, which describes the related content of the directory page; and the return book matching setting item, which describes the URL address of the first page of the novel.
<chapter> webpage node: describes the novel's chapter page information. Two <profile> nodes are set under this webpage node. The <profile> serving as the first matching setting description node configures: the URL matching setting item, which describes the related URL matching and acquires the bookid information; the title matching setting item, which describes how to obtain the title information of the first page of the novel; the text matching setting item, which describes the text content of the novel; the next matching setting item, which describes the URL of the next chapter page of the novel; the prev matching setting item, which describes the URL of the previous chapter page of the novel; the return catalog matching setting item, which describes the novel catalog page URL stored in the chapter page; and the return book matching setting item, which describes the novel home page stored in the chapter page.
<profile> matching setting description node: when several webpage text content matching settings are set under one webpage node, matching setting description nodes <profile> are configured, and each <profile> corresponds to one webpage text content matching setting. A <profile> is located under a specific webpage node, for example under the book webpage node and the chapter webpage node above, and the matching setting items are set inside the <profile>.
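To make the node hierarchy concrete, the following Python sketch reads a small matching setting file of this shape. It is an illustration under assumptions only: the XML serialization, the `host` attribute, the rule strings, and the `load_profiles` helper are hypothetical; only the node names and the `downloadmode` attribute come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical matching setting file. Node names follow the description;
# every attribute value and rule string is invented for illustration.
SETTINGS_XML = """
<websites>
  <website host="www.feiku.com">
    <book downloadmode="1">
      <profile>
        <url>rule-url-1</url>
        <title>rule-title-1</title>
      </profile>
      <profile>
        <title>rule-title-2</title>
      </profile>
    </book>
    <catalog downloadmode="0">
      <profile>
        <url>rule-catalog-url</url>
      </profile>
    </catalog>
  </website>
</websites>
"""

def load_profiles(xml_text, host, page_type):
    """Return the ordered profiles (dicts of item name -> rule) for one
    page type of one website, or [] if the site or page type is unknown."""
    root = ET.fromstring(xml_text)
    for site in root.findall("website"):
        if site.get("host") != host:
            continue
        page = site.find(page_type)
        if page is None:
            return []
        return [{item.tag: item.text for item in profile}
                for profile in page.findall("profile")]
    return []

profiles = load_profiles(SETTINGS_XML, "www.feiku.com", "book")
```

In this sketch the second `<profile>` under `<book>` repeats only the `title` item, mirroring the idea that later profiles need to cover only the parts of a page that tend to change.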
When a web page access instruction is received from the user, the downloading unit 101 downloads the web page content on the browser side: it establishes a connection with the server and downloads the web page content corresponding to the web page access instruction from the server.
The matching unit 102 matches the downloaded web page content against the webpage text content matching settings in turn until the web page content is successfully matched. Continuing with the scenario in the code, the matching unit 102 searches the matching setting file for the website node and webpage node corresponding to the web page content and finds, according to the downloaded web page content, that the corresponding website node is the website node www.feiku.com and the corresponding webpage node is the book webpage node.
Under the found webpage node, the web page content is matched in sequence with the matching setting items in the first matching setting description node. When the first matching setting description node of the book webpage node is configured as the first <profile> under that node, the web page content is first matched with the matching setting items in the first <profile>.
For a matching setting item that matches successfully, the matching result is set as the webpage text content extracted by using that matching setting item; the returned result may be the extracted text content itself, or information indicating that the result is TRUE.
For a matching setting item that fails to match, the returned matching result may be an empty character string indicating that the matching cannot be processed, or information indicating that the result is FALSE. The matching setting item corresponding to the failed item is then searched for in the matching setting description nodes other than the first one in the webpage node (for example, in the second <profile> under the book webpage node), and the found matching setting item is matched with the web page content until a found matching setting item successfully matches the web page content; the matching result is then set as the webpage text content extracted according to that matching setting item. That is, for web page content that fails to match using the first matching setting description node, as long as some <profile> can be matched with the web page content, the corresponding web page content can be extracted by using the matched <profile>.
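The per-item fallback described above can be sketched in Python as follows. This is a simplified illustration rather than the described matching unit: regular expressions stand in for matching setting items, and the `extract_with_fallback` function, the `PROFILES` data, and the sample pages are all hypothetical.

```python
import re

def extract_with_fallback(page_text, profiles):
    """Try every item of the first profile; for each item that fails,
    try the same-named item in later profiles until one succeeds."""
    if not profiles:
        return {}
    results = {}
    first, fallbacks = profiles[0], profiles[1:]
    for name, pattern in first.items():
        candidates = [pattern] + [p[name] for p in fallbacks if name in p]
        for candidate in candidates:
            found = re.search(candidate, page_text)
            if found:
                results[name] = found.group(1)
                break
        else:
            results[name] = ""  # empty string signals a failed match
    return results

# Hypothetical profiles: the second one redefines only the item that varies.
PROFILES = [
    {"title": r"<h1>(.*?)</h1>", "text": r'<div id="content">(.*?)</div>'},
    {"text": r'<div class="body">(.*?)</div>'},
]

old_layout = '<h1>Chapter 1</h1><div id="content">Once upon a time</div>'
new_layout = '<h1>Chapter 1</h1><div class="body">Once upon a time</div>'

old_result = extract_with_fallback(old_layout, PROFILES)
new_result = extract_with_fallback(new_layout, PROFILES)
```

Both layouts yield the same extracted fields: the `title` item succeeds from the first profile in each case, and only the failed `text` item is retried against the second profile.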
Since the presentation form of the web page content is usually HTML (Hypertext markup language), the matching unit 102 also needs to perform matching on HTML elements in the web page, for example, the matching unit 102 performs hierarchical parsing on the downloaded web page content to obtain a Document Object Model (DOM) structure of the web page content, and matches the web page content with the matching setting of the web page content according to the DOM structure of the web page content, so as to extract the web page text content.
The extracting unit 103 is adapted to extract the web page text content from the web page content by using the web page text content matching setting successfully matched with the web page content. Specifically, the extracting unit 103 is adapted to use all the web page text contents extracted according to the matching setting item successfully matched as the web page text contents in the identified web page contents.
Further, in this embodiment, the download of the web page content may also be controlled by using the download mode (downloadmode) attribute and the element filter (elementary filter) attribute set in the web page node by the matching setting configuration unit 100. To this end, the apparatus further comprises a load control unit 104 and a filtering unit 105.
The matching setting configuration unit 100 sets at least two attribute values for the download mode attribute. For example, an attribute value of 0 indicates that all web page content is downloaded to the browser in the browser's existing web page download mode, and an attribute value of 1 indicates that the filtering unit 105 filters the web page content and only the content remaining after filtering is downloaded to the browser.
The matching setting configuration unit 100 sets a plurality of attribute values for the element filtering attribute, each attribute value corresponding to a filtering mode, for example, attribute value 1 represents filtering picture (img), attribute value 2 represents filtering Cascading Style Sheet (CSS), attribute value 4 represents filtering frame (frame), attribute value 8 represents filtering Javascript language, attribute value 16 represents filtering object (object), and attribute value 32 represents filtering embedded (embedded) content.
When several filtering modes need to be combined, a new attribute value can be generated by applying a bitwise OR to the binary representations of the individual attribute values; the new attribute value indicates all of the combined filtering modes. For example, filtering both pictures (1) and CSS (2) yields the attribute value 3.
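The bitwise combination of filtering modes can be illustrated with Python's standard `enum.IntFlag`. The numeric values are those listed in the description; the class name `ElementFilter`, the member names, and the `modes_in` helper are assumptions made for this sketch.

```python
from enum import IntFlag

class ElementFilter(IntFlag):
    """Filter modes, using the attribute values listed in the description."""
    IMG = 1
    CSS = 2
    FRAME = 4
    JAVASCRIPT = 8
    OBJECT = 16
    EMBED = 32

# Combine picture and CSS filtering into a single attribute value.
combined = ElementFilter.IMG | ElementFilter.CSS

def modes_in(value):
    """List the individual filter modes encoded in an attribute value."""
    return [mode for mode in ElementFilter if mode & value]

decoded = modes_in(1 | 8 | 32)
```

Because each mode occupies its own bit, any combination can be stored in one integer attribute and decoded again without ambiguity.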
The loading control unit 104 is adapted to determine, before the web page content is sequentially matched with the matching setting items in the first matching setting description node under the found web page node, whether the attribute value of the download mode attribute in the found web page node is a predetermined value (e.g., 1). If so, the loading control unit 104 starts the filtering unit 105, and the filtered web page content is then sequentially matched with the matching setting items in the first matching setting description node under the found web page node; if not, the web page content is downloaded directly into the browser.
The filtering unit 105 is adapted to filter the content in the web page according to the filtering mode indicated by the element filtering attribute. For example, when the attribute value of the element filtering attribute indicates filtering pictures, the filtering unit 105 filters out the pictures in the web page content; when the attribute value indicates filtering pictures and CSS, the filtering unit 105 filters out both the pictures and the CSS in the web page content.
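The attribute values above behave like bit flags, so they can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `ElementFilter`, `DOWNLOAD_FILTERED`, and `active_filters` are hypothetical, and only the numeric values (1, 2, 4, 8, 16, 32 and download mode 0/1) come from the description.

```python
from enum import IntFlag


class ElementFilter(IntFlag):
    """Filtering modes, using the attribute values given in the description."""
    IMG = 1
    CSS = 2
    FRAME = 4
    JAVASCRIPT = 8
    OBJECT = 16
    EMBED = 32


DOWNLOAD_ALL = 0       # download everything in the browser's usual way
DOWNLOAD_FILTERED = 1  # filter first, then download what remains


def active_filters(download_mode: int, element_filter: int) -> list[str]:
    """Return the names of the filtering modes to apply before matching."""
    if download_mode != DOWNLOAD_FILTERED:
        return []
    # Each single-bit member is tested against the combined attribute value.
    return [f.name for f in ElementFilter if f & element_filter]


# Combining modes with bitwise OR: pictures (1) | CSS (2) gives 3.
combined = ElementFilter.IMG | ElementFilter.CSS
print(int(combined), active_filters(DOWNLOAD_FILTERED, combined))
```

Because every mode occupies its own bit, a single integer stored in the web page node is enough to describe any combination of filters.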
Some major matching setting items configured in the matching setting description node by the matching setting configuration unit 100 will be specifically described below.
First, extraction of web page URL
The matching setting configuration unit 100 configures the matching setting of the text content of the web page, which includes establishing a web page URL matching setting item for the URL of the web page content.
In this section, the web page URL matching setting items are explained from five aspects of Match setting, Trans setting, Bookid setting, Booksep setting, and Tabtitle setting in conjunction with the URL node in the above example.
1) Match setting: matching property settings
The web page URL matching setting item includes a matching attribute setting item, which includes the following:
a. The web page URL begins with predetermined content. If the predetermined content starts with ^, the URL must begin with the content that follows ^.
b. The web page URL contains predetermined content, and a predetermined position of the predetermined content may hold any character. If the predetermined content starts with @, the URL must contain the content that follows @; a wildcard symbol can be added to the content after @ to stand for any character.
c. The web page URL does not contain predetermined content, which may itself contain arbitrary characters. If the predetermined content starts with !, the URL must not contain the content that follows !; a wildcard symbol can be added to the content after ! to stand for any character.
When extracting the web page URL, it may be required to satisfy a, b, and c described above at the same time, or to satisfy only one or two of a, b, and c.
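The three Match prefixes can be sketched as a small rule evaluator. Several things here are assumptions rather than part of the description: the wildcard symbol is not shown in the text, so `*` is used as a stand-in that matches any run of characters; the function names are hypothetical; and `url_matches` requires every rule to hold, which is only one of the combinations the text allows.

```python
import re

WILDCARD = "*"  # assumption: the actual wildcard symbol is not shown in the text


def _to_regex(content: str) -> str:
    """Escape the literal parts and let each wildcard match any characters."""
    return ".*".join(re.escape(part) for part in content.split(WILDCARD))


def rule_matches(rule: str, url: str) -> bool:
    """Evaluate one Match rule: ^ = begins with, @ = contains, ! = does not contain."""
    prefix, content = rule[:1], rule[1:]
    pattern = _to_regex(content)
    if prefix == "^":
        return re.match(pattern, url) is not None
    if prefix == "@":
        return re.search(pattern, url) is not None
    if prefix == "!":
        return re.search(pattern, url) is None
    raise ValueError(f"unknown rule prefix: {prefix!r}")


def url_matches(rules: list[str], url: str) -> bool:
    """Here all rules must hold; a looser policy could use any() instead."""
    return all(rule_matches(rule, url) for rule in rules)


url = "http://www.qidian.com/BookReader/1,2.aspx"
print(url_matches(["^http://www.qidian.com/", "@BookReader/*.aspx", "!login"], url))
```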
2) Trans settings: conversion property settings
The URL of the web page is obtained by conversion from the acquired web page identifier of the web page content and the composition format of the URL. This operation mainly applies to the scenario in which only one matching setting description node exists under a web page node, that is, only one profile exists; in that scenario, URL conversion is performed from given web page identifiers for pages such as the first page of a novel, its directory page, and its chapter pages. The setting item describes the composition format of the URL, so that the URL is obtained simply by filling in web page identifiers such as the book ID or chapter ID, for example: trans = http://www.qidian.com/BookReader/##s,##s.aspx^bookid^chapterid
This character string gives the format of the URL: filling the book ID (bookid) into the first ##s and the chapter ID (chapterid) into the second ##s yields the URL of a chapter page.
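As a rough sketch of the Trans conversion, the function below fills the placeholders of a trans string in order. It assumes that the placeholder token is `##s` (the form the example takes once stray spaces are removed) and that the identifier names follow the template separated by `^`; the function name `build_url` and the sample ID values are hypothetical.

```python
def build_url(trans: str, ids: dict[str, str], placeholder: str = "##s") -> str:
    """Fill each placeholder, left to right, with the identifier named after '^'."""
    template, *names = trans.split("^")
    for name in names:
        # Replace only the first remaining placeholder so the order is preserved.
        template = template.replace(placeholder, ids[name], 1)
    return template


trans = "http://www.qidian.com/BookReader/##s,##s.aspx^bookid^chapterid"
print(build_url(trans, {"bookid": "1931432", "chapterid": "2"}))
```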
3) Bookid setting: web page identification attribute setting item
Characters at a predetermined position in the web page URL are taken as the web page identifier of the web page content.
This operation retrieves a web page identifier, such as the book ID string in a URL. For example, for bookid = http://www.readnovel.com/novel/.html, the position of the wildcard character in the pattern is the predetermined position, and the character string at that position in the URL is taken as the extracted web page identifier, such as the book ID string.
The web page identifier extracted in this operation is then used to convert the URL of the web page.
4) Booksep setting: web page identification extraction attribute setting item
Characters at a predetermined position are selected from the web page identifier obtained by matching the web page identification attribute setting item, and are used as the web page identifier. This operation is mainly used when the acquired web page identifier is relatively complex and needs further extraction.
For example, suppose the extraction structure booksep = "/:0" is set. When the web page identifier bookid contains a "/" symbol and only the pure number is wanted, booksep can be used: "/" is the delimiter by which bookid is split, ":" is the separator between the delimiter and the index, and "0" means that, after bookid is split into segments by "/", the segment with that index (counted from 0) is taken as the web page identifier bookid.
The web page identifier extracted in this operation is then used to convert the URL of the web page.
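The Bookid and Booksep settings can be read as a two-stage extraction, sketched below. The wildcard symbol `*` in the Bookid pattern is an assumption, because the text does not show the symbol that marks the predetermined position; the function names and the sample URL are likewise hypothetical.

```python
import re
from typing import Optional


def extract_bookid(pattern: str, url: str, wildcard: str = "*") -> Optional[str]:
    """Return the part of the URL that sits where the wildcard is in the pattern."""
    head, found, tail = pattern.partition(wildcard)
    if not found:
        return None
    regex = re.escape(head) + "(.+?)" + re.escape(tail) + "$"
    match = re.match(regex, url)
    return match.group(1) if match else None


def apply_booksep(bookid: str, booksep: str) -> str:
    """Split bookid by the delimiter before ':' and keep the segment at the index after it."""
    delimiter, _, index = booksep.rpartition(":")
    return bookid.split(delimiter)[int(index)]


raw = extract_bookid("http://www.readnovel.com/novel/*.html",
                     "http://www.readnovel.com/novel/12345/6.html")
print(raw, apply_booksep(raw, "/:0"))
```

Here the first stage yields a compound identifier containing "/", and the booksep value "/:0" reduces it to the leading numeric segment.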
5) Tabtitle settings: web page title extraction attribute setting item
The content before a predetermined character in the web page content is extracted as the title (Title) information. For example, if the extraction structure tabtitle = "-" is set, everything before the first occurrence of "-" is the title. A wildcard symbol may be used to match any character.
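The Tabtitle rule amounts to cutting the page title at the first occurrence of the configured character. The sketch below assumes plain-string matching (no wildcard handling) and trims surrounding whitespace, which the description does not specify; `extract_tab_title` is a hypothetical name.

```python
def extract_tab_title(page_title: str, tabtitle: str) -> str:
    """Keep everything before the first occurrence of the tabtitle character."""
    head, found, _ = page_title.partition(tabtitle)
    # If the character never occurs, fall back to the whole title.
    return head.strip() if found else page_title.strip()


print(extract_tab_title("Chapter 1 - Some Novel - Example Site", "-"))
```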
Second, extraction of HTML content in web page
The matching setting configuration unit 100 establishes at least one matching setting item in a matching setting description node (e.g., in a first matching setting description node) for a Hypertext Markup Language (HTML) element in the web page content of each type of text content in the web page.
The HTML elements to be extracted differ between types of web pages. Taking the scenario in the above code as an example, the HTML elements to be processed include a <title> element indicating the title, a <catalogurl> element indicating the directory URL, a <lastscapter> element indicating the latest chapter, a <lastscapular> element indicating the latest chapter URL, a <text> element indicating the body, a <next> element indicating the next page URL, a <prev> element indicating the previous page URL, a <recantcatalog> element indicating the return-to-directory URL, a <recantbook> element indicating the return-to-first-page URL, and so on.
The matching setting items established by the matching setting configuration unit 100 for the HTML element include a primary-positioning matching setting item and a secondary-positioning matching setting item. The following description will be made separately.
1) Primary positioning matching setting item
The primary positioning matching setting item includes at least:
a. Base point lookup setting item el: indicates the base point lookup mode. It may be set to a value such as 1, 2, 4, 8, or 16, where 1 corresponds to lookup by identifier (id), 2 to lookup by name, 4 to lookup by class name, 8 to lookup by content (value), and 16 to lookup by expression (regular).
b. Identification positioning setting item id: locating an element that matches the identification of the HTML element.
c. Name positioning setting item name: the element that matches the name of the HTML element is located.
d. Class name positioning setting item classname: locates an element that matches the class name of the HTML element; when multiple elements match the class name, only the first element is matched.
e. Content positioning setting item value: an element that matches the content of the HTML element (innertext) is located, and when there are multiple matching elements, only the first element is matched.
f. Expression positioning setting item regular: locates an element that matches an expression in the HTML element; for example, for the expression %CURRENTURL%, the element at the URL that the expression matches is located.
g. Tag setting item tag: indicating the type and/or attribute of the element located when the element is located using the identification location setting item, the name location setting item, the class name location setting item, the content location setting item, or the expression location setting item.
That is, tag indicates the element type and attribute used in primary positioning. For example, tag = "a-href" means that the located element is of type a and that its href attribute is taken. The tag setting item takes effect only when no secondary positioning occurs; if secondary positioning occurs, tag is used only for verification.
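Primary positioning can be pictured as choosing a lookup mode by the el value and returning the first match, after which tag checks the element type and selects an attribute. The sketch below is illustrative only: `Element`, `locate`, and `apply_tag` are hypothetical names, the element model is a flat list rather than a browser DOM, and the expression mode (el = 16) is left out because its semantics are not detailed here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""


# el values from the description (16, the expression mode, is omitted)
BY_ID, BY_NAME, BY_CLASS, BY_VALUE = 1, 2, 4, 8


def locate(elements: list, el: int, key: str) -> Optional[Element]:
    """Return the first element that matches key under lookup mode el."""
    tests = {
        BY_ID: lambda e: e.attrs.get("id") == key,
        BY_NAME: lambda e: e.attrs.get("name") == key,
        BY_CLASS: lambda e: key in e.attrs.get("class", "").split(),
        BY_VALUE: lambda e: e.text == key,
    }
    test = tests[el]
    return next((e for e in elements if test(e)), None)


def apply_tag(element: Element, tag: str) -> Optional[str]:
    """For tag 'a-href': verify the type is 'a' and take 'href'; without an attribute, take the text."""
    type_name, _, attr = tag.partition("-")
    if element.tag != type_name:
        return None
    return element.attrs.get(attr) if attr else element.text


page = [Element("div", {"id": "nav"}),
        Element("a", {"class": "next btn", "href": "/c2"}, "Next")]
print(apply_tag(locate(page, BY_CLASS, "next"), "a-href"))
```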
2) Secondary positioning match setting item
After primary positioning has been performed, secondary positioning can be performed on its result. The secondary positioning matching setting item includes:
a. Parent query setting item parentselect: sets how to search for a parent element of the element located by the primary positioning matching setting item;
b. Child query setting item childrenselect: sets how to search for a child element of the element located by the primary positioning matching setting item;
c. When the parent query setting item and the child query setting item are both present, the parent element of the element located by the primary positioning matching setting item is first searched for according to the parent query setting item, and the child element is then searched for under the found parent element according to the child query setting item.
In this embodiment, the specific positioning path in setting items such as parentselect, childrenselect, and tag is expressed by element name, element attribute, and sequence number. For example, the path "ul:0|li:1|a-href:0" indicates that the following positioning operations are performed starting from the currently located element:
1. Find the 1st (0 denotes the first) <ul> tag relative to the current element: at the previous (upper) level under parentselect, at the next (lower) level under childrenselect, and at the current level under tag.
2. Then find the 2nd (1 denotes the second) <li> tag at the next (previous, current) level of the ul element.
3. Then find the 1st (0 denotes the first) <a> tag at the next (previous, current) level of the li element.
4. After the a element is found, if href is set, the content of the href attribute of the a element is taken; if not, the element content (innertext) of the a element is taken directly.
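The path notation can be sketched as a parser plus a downward walk, which corresponds to the childrenselect direction; a parentselect walk would move upward in an analogous way. `Node`, `parse_path`, and `select_children` are hypothetical names, and the tree is a toy model rather than a real browser DOM.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


def parse_path(path: str) -> list:
    """Turn 'ul:0|li:1|a-href:0' into (tag, attribute or None, index) steps."""
    steps = []
    for part in path.split("|"):
        name, _, index = part.partition(":")
        tag, _, attr = name.partition("-")
        steps.append((tag, attr or None, int(index)))
    return steps


def select_children(start: Node, path: str) -> Optional[str]:
    """Walk down one level per step, picking the index-th child with the given tag."""
    node, attr = start, None
    for tag, attr, index in parse_path(path):
        matches = [child for child in node.children if child.tag == tag]
        if index >= len(matches):
            return None  # this step cannot be satisfied, so positioning fails
        node = matches[index]
    # Take the attribute if the last step names one, else the element text.
    return node.attrs.get(attr) if attr else node.text


items = [Node("li", children=[Node("a", {"href": "/c1"}, "One")]),
         Node("li", children=[Node("a", {"href": "/c2"}, "Two")])]
root = Node("div", children=[Node("ul", children=items)])
print(select_children(root, "ul:0|li:1|a-href:0"))
```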
3) Filter arrangement
The matching setting items established by the matching setting configuration unit 100 for the HTML element further include an element deletion matching setting item elementerase, which erases some sub-elements within the located element. The element deletion matching setting item includes at least:
deleting and/or changing the predetermined content in the elements located by the primary or secondary positioning matching setting items.
For example, when elementerase = "FONT:0|FONT:0" is set, the FONT elements in the selected content, or the content between the FONT tags, are "erased". The manner of erasure depends on the value after the ":" symbol: the value 0 corresponds to changing the element name to div style="display:none", the value 1 corresponds to changing the element name to an unrecognizable one, and the value 2 corresponds to deleting the element.
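One possible reading of the elementerase modes is sketched below on a list of child elements represented as dictionaries. The function name `erase`, the dictionary layout, and the `x-` prefix used to make a name unrecognizable are all assumptions made for illustration; only the meaning of the values 0, 1, and 2 comes from the description.

```python
HIDE, RENAME, DELETE = 0, 1, 2  # values after ':' in an elementerase entry


def erase(children: list, setting: str) -> list:
    """Apply an elementerase setting such as 'FONT:0' to a list of child elements."""
    rules = {}
    for part in setting.split("|"):
        tag, _, mode = part.partition(":")
        rules[tag.strip().lower()] = int(mode)

    result = []
    for child in children:
        mode = rules.get(child["tag"].lower())
        if mode is None:
            result.append(child)  # not targeted: keep unchanged
        elif mode == HIDE:
            # Turn the element into a hidden div.
            result.append({**child, "tag": "div", "style": "display:none"})
        elif mode == RENAME:
            # Hypothetical way to make the element name unrecognizable.
            result.append({**child, "tag": "x-" + child["tag"]})
        # mode == DELETE: the element is dropped
    return result


body = [{"tag": "FONT", "text": "ad"}, {"tag": "p", "text": "story text"}]
print(erase(body, "FONT:2"))
```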
Further, the apparatus further includes a matching setting updating unit 106 adapted to update the website node, the webpage node, the matching setting description node and/or the matching setting item in the matching setting description node according to the received updating instruction after establishing a matching setting file. For example, when a certain website does not exist in the internet or text extraction of a web page in the website is not needed, the matching setting updating unit 106 is used to delete a website node corresponding to the website and related settings under the website node from the matching setting file.
Further, the apparatus further includes a multithread control unit 107. The multithread control unit 107 is adapted to allocate a thread to each web page content when there are a plurality of downloaded web page contents on the browser side, and to control the matching unit to match each web page content with the web page text content matching settings in its allocated thread until the web page content is successfully matched; and/or the multithread control unit 107 is adapted to allocate a plurality of threads to one web page content on the browser side, and to control the matching unit to match the web page content with different web page text content matching settings in different threads until the web page content is successfully matched. By adopting multithreading, the apparatus can extract the text of one or more web page contents more quickly, shorten the web page loading time of the browser, and quickly present the extracted web page text content to the user in the browser.
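Both threading arrangements can be sketched with a thread pool. This is a simplified illustration under stated assumptions: `try_match` is a hypothetical callback that returns the extracted text or `None`, the function names are invented, and the sketch waits for all started attempts to finish instead of cancelling them once one succeeds.

```python
from concurrent.futures import ThreadPoolExecutor


def first_successful(page, settings, try_match):
    """One thread per matching setting for a single page; results are checked in setting order."""
    with ThreadPoolExecutor(max_workers=max(1, len(settings))) as pool:
        for result in pool.map(lambda s: try_match(page, s), settings):
            if result is not None:
                return result
    return None


def extract_all(pages, settings, try_match):
    """One thread per downloaded page; each page is matched independently."""
    with ThreadPoolExecutor(max_workers=max(1, len(pages))) as pool:
        return list(pool.map(lambda p: first_successful(p, settings, try_match), pages))


# Toy matcher: a setting "matches" when it occurs in the page text.
def toy_match(page, setting):
    return page.upper() if setting in page else None


print(extract_all(["abc", "xyz", "q"], ["x", "b"], toy_match))
```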
The device further comprises an input unit 108 and an uploading unit 109. The input unit 108 is adapted to receive a selection instruction sent by a user to select a webpage text content matching setting; the matching setting configuration unit 100 is further adapted to establish a matching setting file according to the selection instruction and store the web page text content matching setting in the selection instruction in the established matching setting file, and the matching setting configuration unit 100 is further adapted to update the matching setting file according to an update instruction from the user; and the uploading unit 109 is adapted to upload the matching setting file to the server and store the matching setting file in the user data of the server-side user, so that when the matching setting file on the browser side is damaged or lost, the browser side can recover or update by using the matching setting file stored on the server side.
Further, the apparatus further includes a start control unit adapted to start the matching unit to match the web page content with the web page text content matching settings when a document complete event, which indicates that the browser has finished loading the page, is monitored, since the extraction operation on the web page content can be performed from that point.
It will be appreciated that one or more of the above-described matching setting update unit 106, multi-thread control unit 107, input unit 108 and upload unit 109 may be omitted in some scenarios.
As described above, according to the embodiment of the present invention, a plurality of web page text content matching settings are established on the browser side and the same web page content is matched against them, so that when the web page content changes, a matching setting that fits the changed web page can still be found among the plurality of web page text content matching settings, and the web page text content can be extracted by using the successfully matched setting. In addition, the scheme avoids having to generate a new matching rule file and set it in the browser whenever the web page content changes, which simplifies matching, reduces the workload, and improves efficiency.
Another embodiment of the present invention further provides a client device, which is installed with a browser, wherein the browser is provided with the device for extracting the text content of the web page as described above,
the client device starts the device capable of extracting the webpage text content according to the webpage browsing instruction of the user, and displays the webpage text content extracted by the device capable of extracting the webpage text content to the user in the browser.
The specific working mode of the device capable of extracting the webpage text content in the client device may refer to the related device embodiment of the present invention, and is not described herein again.
Another embodiment of the present invention further provides a method for extracting text content of a web page, which can provide a more convenient and focused reading service to a user on the premise of ensuring the speed and stability of text extraction, and the method includes:
S200: At least one web page text content matching setting is preset on the browser side.
Specifically, a matching setting file is established and at least one web page text content matching setting is stored in it. The matching setting file includes at least one website node; each website node includes at least one web page node; at least some of the web page nodes have two or more matching setting description nodes; each matching setting description node corresponds to one web page text content matching setting; and at least two of the web page text content matching settings include different matching setting items for the same type of text content.
In the embodiment, a website node is established for each type of website; under a website node, establishing a webpage node for each type of webpage under a website corresponding to the website node; establishing a matching setting item in a matching setting description node of each webpage node according to the content of the webpage, wherein in a first matching setting description node of the webpage node, at least one matching setting item is established for each type of text content in the webpage corresponding to the webpage node; and for the same type of text content in the webpage, the matching setting items established in the first matching setting description nodes are different from the matching setting items established in the matching setting description nodes except the first matching setting description nodes in the webpage node. Therefore, for a certain webpage content, when the matching setting item in the first matching setting description node cannot be matched with the matching setting item, the webpage content can be matched with the matching setting item in other matching setting description nodes until the matching is successful.
A web page node may include a plurality of matching setting description nodes. Because a typical web page contains both fixed information that rarely changes and variable information that changes easily, one of the matching setting description nodes under the web page node is designated as the first matching setting description node; it contains the most comprehensive set of matching setting items, with at least one matching setting item established for each type of text content in the web page. In the matching setting description nodes other than the first one, matching setting items may be established only for the variable information in the web page, and the matching setting items established in those nodes may differ from one another.
The processing mode simplifies the structure of the matching setting of the webpage text content, avoids repeated parts in different matching settings, reduces the data volume of the matching setting required to be stored, and improves the resource utilization rate; on the other hand, repeated matching operation on the same webpage content is avoided, and matching efficiency is improved.
Further, the web page node includes a download mode attribute and an element filter attribute, and the filtering manner indicated by the element filter attribute includes: one or more of filtering pictures, filtering Cascading Style Sheets (CSS), filtering Javascript scripting language, filtering frames, filtering objects, and filtering embedded content,
before the step of sequentially matching the web page content with the matching setting items in the first matching setting description node in the web page node under the searched web page node, the method further includes:
judging whether the attribute value of the download mode attribute in the searched webpage node is a preset value or not, if so, filtering the content in the webpage according to the filtering mode indicated by the element filtering attribute, and then sequentially matching the filtered webpage content with the matching setting items in the first matching setting description node in the webpage node under the searched webpage node; if not, directly downloading the webpage content in the browser.
The web page text content matching setting includes a web page URL matching setting item established for the URL of the web page content. The web page URL matching setting item includes a matching attribute setting item, and the matching attribute setting item includes:
the webpage URL takes the preset content as the beginning; and/or, the webpage URL comprises predetermined content, and the predetermined position of the predetermined content comprises any character; and/or, the web page URL does not contain predetermined content that contains arbitrary characters.
Wherein, the web page URL matching setting item also comprises a web page identification attribute setting item, a web page identification extraction attribute setting item and a conversion attribute setting item,
the webpage identification attribute setting item comprises a webpage identification which takes characters at a preset position in the URL of the webpage as the webpage content; the webpage identification extracting attribute setting item comprises the step of selecting characters of a preset position from the webpage identifications obtained according to the matching of the webpage identification attribute setting item as the webpage identifications; the conversion attribute setting item comprises the URL of the webpage obtained by conversion according to the acquired webpage identification of the webpage content and the composition format of the URL.
The web page URL matching setting item further includes a web page title extraction attribute setting item, which includes extracting the content before a predetermined character in the web page content as a title.
Wherein, in the first matching setting description node of the web page node, establishing at least one matching setting item for each type of text content in the web page corresponding to the web page node includes:
establishing at least one matching setting item for a hypertext markup language (HTML) element of each type of text content in the webpage content in a first matching setting description node;
the above-mentioned match setting item established for the HTML element includes a primary positioning match setting item, and the primary positioning match setting item at least includes:
the base point searching setting item indicates a base point searching mode, and the mode comprises searching identification, searching name, searching class name, searching content and searching expression; and/or, identifying a positioning setting item to position an element that matches the identification of the HTML element; and/or, name location settings to locate elements that match the name of the HTML elements; and/or, a class name location setting to locate an element that matches the class name of the HTML element; and/or, content location settings to locate elements that match the content of the HTML elements; and/or, the expression locating setting item is used for locating the element matched with the expression in the HTML element; and/or, the tag setting item to indicate a type and/or attribute of the located element when the element is located using the identification location setting item, the name location setting item, the class name location setting item, the content location setting item, or the expression location setting item.
The matching setting items established for the HTML element further include a secondary positioning matching setting item, which includes at least:
the parent query setting item, which sets how to search for the parent element of the element located by the primary positioning matching setting item; or the child query setting item, which sets how to search for the child element of the element located by the primary positioning matching setting item; or, when the parent query setting item and the child query setting item are both present, the parent element of the element located by the primary positioning matching setting item is searched for according to the parent query setting item, and the child element is then searched for under the found parent element according to the child query setting item.
The matching setting items established for the HTML element further include an element deletion matching setting item, which includes at least: deleting and/or changing the predetermined content in the element located by the primary or secondary positioning matching setting item.
S202: Download the web page content on the browser side.
S204: Search the matching setting file for the website node and the web page node corresponding to the web page content.
S206: Under the found web page node, sequentially match the web page content with the matching setting items in the first matching setting description node of the web page node, and perform step S208 or step S210 according to the matching result.
S208: For a matching setting item that is successfully matched, set the matching result to the web page text content extracted by using that matching setting item.
S210: For a matching setting item that fails to match, search the matching setting description nodes other than the first one under the web page node for the matching setting items corresponding to the failed item, and match them with the web page content until one of them matches successfully; then set the matching result to the web page text content extracted according to that matching setting item.
S212: Extract the web page text content from the web page content by using the web page text content matching setting successfully matched with the web page content.
That is, all web page text content extracted according to the successfully matched matching setting items is taken as the identified web page text content of the web page content.
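Steps S206 to S212 describe a per-item fallback from the first matching setting description node to the other nodes, which can be sketched as follows. The data layout is an assumption: each description node is modelled as a dictionary from content type to a setting item, and `match_item` is a hypothetical callback that returns the extracted text or `None`.

```python
def extract_with_fallback(page, profiles, match_item):
    """profiles[0] is the first description node; the rest hold alternative items."""
    first, others = profiles[0], profiles[1:]
    extracted = {}
    for content_type, item in first.items():
        result = match_item(page, item)              # S206: try the first node
        if result is None:                           # S210: fall back per item
            for profile in others:
                alternative = profile.get(content_type)
                if alternative is None:
                    continue
                result = match_item(page, alternative)
                if result is not None:
                    break
        if result is not None:                       # S208: record the success
            extracted[content_type] = result
    return extracted                                 # S212: all matched content


# Toy page and matcher: a setting item is just a key looked up in the page.
page = {"h1": "Title", "div.content": "Body"}
profiles = [{"title": "h1", "text": "div#text", "next": "a.next"},
            {"text": "div.content"}]
print(extract_with_fallback(page, profiles, lambda p, item: p.get(item)))
```

Only the item that failed in the first node ("text" in this toy case) is retried against the other nodes, which mirrors the point that the later nodes need to cover only the variable parts of a page.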
After step S200, the method further includes: updating the website node, the web page node, the matching setting description node, and/or the matching setting items in the matching setting description node according to a received update instruction.
In step S206, the matching the web page content with the web page text content matching setting respectively until the web page content matching is successful includes:
when a plurality of downloaded webpage contents exist on the browser side, distributing a thread for each webpage content, and respectively matching the corresponding webpage contents with the webpage text content matching setting in the distributed threads until the webpage contents are successfully matched; and/or distributing a plurality of threads for the webpage content at the browser side, and respectively matching the webpage content with different webpage text content matching settings in different threads until the webpage content is successfully matched.
In step S206, since the web content generally has the description form of HTML, the downloaded web content can be analyzed in a layered manner to obtain a DOM structure of the web content; and matching the webpage content with the webpage text content according to the DOM structure of the webpage content.
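The layered analysis of HTML into a DOM-like structure can be illustrated with Python's standard `html.parser` module. This is only a sketch of the idea, not the browser's own parser: `TreeBuilder` and `parse_dom` are hypothetical names, the tree is built from plain dictionaries, and the list of void elements is deliberately short.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # elements with no end tag


class TreeBuilder(HTMLParser):
    """Build a nested dictionary tree that mirrors the element hierarchy."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": [], "text": ""}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": [], "text": ""}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1]["text"] += data


def parse_dom(html: str) -> dict:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


dom = parse_dom('<html><body><ul id="list"><li><a href="/c1">One</a></li></ul></body></html>')
print(dom["children"][0]["children"][0]["children"][0]["attrs"])
```

Matching setting items can then be evaluated against this tree level by level instead of against the raw HTML string.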
Step S200 further includes: receiving a selection instruction sent by a user to select a web page text content matching setting; establishing a matching setting file according to the selection instruction, and storing the web page text content matching setting indicated by the selection instruction in the established matching setting file; and uploading the matching setting file to a server and storing it in the user data of the server-side user.
Before step S204, the method further includes: when a document complete event indicating that the browser has finished loading is monitored, starting the operation of matching the web page content with the web page text content matching settings.
The specific implementation manner of each step in this embodiment may refer to the relevant content in the embodiment of the apparatus of the present invention.
As described above, according to the embodiment of the present invention, a plurality of web page text content matching settings are established on the browser side and the same web page content is matched against them, so that when the web page content changes, a matching setting that fits the changed web page can still be found among the plurality of web page text content matching settings, and the web page text content can be extracted by using the successfully matched setting. In addition, the scheme avoids having to generate a new matching rule file and set it in the browser whenever the web page content changes, which simplifies matching, reduces the workload, and improves efficiency.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in a client device according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.

Claims (16)

1. A client device on which a browser is installed, the browser being provided with a device capable of extracting webpage text content, wherein
the client device starts the device capable of extracting the webpage text content according to a webpage browsing instruction of a user, and displays the webpage text content extracted by the device capable of extracting the webpage text content to the user in a browser;
the device capable of extracting the webpage text content comprises:
the matching setting configuration unit is suitable for presetting at least one webpage text content matching setting on the browser side; each webpage text content matching setting comprises one or more matching setting items established according to the text content of the webpage; specifically, the matching setting configuration unit is adapted to establish a matching setting file and store the at least one webpage text content matching setting in the matching setting file; the matching setting file comprises at least one website node, each website node comprises at least one webpage node, at least part of the webpage nodes are provided with more than two matching setting description nodes, each matching setting description node corresponds to a webpage text content matching setting, and at least two webpage text content matching settings respectively comprise different matching setting items for the same type of text content;
the downloading unit is suitable for downloading the webpage content on the browser side;
the matching unit is suitable for matching the webpage content with the webpage text content matching setting respectively until the webpage content is successfully matched;
and the extraction unit is suitable for extracting the webpage text content in the webpage content by utilizing the webpage text content matching setting successfully matched with the webpage content.
2. The client device of claim 1,
the matching unit is suitable for searching website nodes and webpage nodes corresponding to the webpage content in the matching setting file; under the searched webpage node, matching the webpage content with the matching setting items in the first matching setting description node in the webpage node in sequence; setting the matching result as the webpage text content extracted by using the matching setting item for the matching setting item successfully matched; and for the matching setting item with the matching failure, searching the matching setting item corresponding to the matching setting item with the matching failure in the matching setting description nodes except the first matching setting description node in the webpage node, matching the searched matching setting item with the webpage content until the searched matching setting item is successfully matched with the webpage content, and setting the matching result as the webpage text content extracted according to the matching setting item.
3. The client device according to claim 2, wherein the extracting unit is adapted to use all web page text contents extracted according to the matching setting item with successful matching as the identified web page text contents in the web page contents.
4. The client device according to claim 1, wherein the matching setting configuration unit is adapted to establish a website node for each type of website; under a website node, establishing a webpage node for each type of webpage under a website corresponding to the website node; establishing a matching setting item in a matching setting description node of each webpage node according to the content of the webpage, wherein in a first matching setting description node of the webpage node, at least one matching setting item is established for each type of text content in the webpage corresponding to the webpage node; and for the same type of text content in the webpage, the matching setting items established in the first matching setting description nodes are different from the matching setting items established in the matching setting description nodes except the first matching setting description nodes in the webpage.
5. The client device according to claim 2, wherein the matching setting configuration unit is further adapted to set a download mode attribute and an element filter attribute in the webpage node, the element filter attribute indicating a filtering manner that includes one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects and filtering embedded content; the apparatus further comprises a loading control unit and a filtering unit,
the loading control unit is suitable for judging whether the attribute value of the download mode attribute in the searched webpage node is a preset value or not before the webpage content is sequentially matched with the matching setting item in the first matching setting description node in the webpage node under the searched webpage node, if so, starting the filtering unit, and then sequentially matching the filtered webpage content with the matching setting item in the first matching setting description node in the webpage node under the searched webpage node; if not, directly downloading the webpage content into a browser;
and the filtering unit is suitable for filtering the content in the webpage according to the filtering mode indicated by the element filtering attribute.
6. The client device of claim 1, wherein the matching setting configured by the matching setting configuration unit comprises establishing a web page URL matching setting item for a Uniform Resource Locator (URL) of web page content,
the webpage URL matching setting item comprises: a matching attribute setting item, the matching attribute setting item comprising:
the webpage URL begins with predetermined content; and/or
the webpage URL contains predetermined content, and a predetermined position in the predetermined content may hold any character; and/or
the web page URL does not contain predetermined contents containing arbitrary characters.
7. The client device according to claim 6, wherein the web page URL matching setting items established by the matching setting configuration unit further include a web page identification attribute setting item, a web page identification extraction attribute setting item and a conversion attribute setting item,
the web page identification attribute setting item includes: using characters at preset positions in the URL of the webpage as webpage identifiers of the webpage content;
the webpage identification extraction attribute setting item comprises: selecting characters of a preset position from the webpage identifications obtained by matching the attribute setting items of the webpage identifications as the webpage identifications;
the conversion attribute setting item includes: and converting the acquired webpage identification of the webpage content and the composition format of the URL to obtain the URL of the webpage.
8. The client device according to claim 6, wherein the URL matching setting items of the web page established by the matching setting configuration unit further include a web page title extraction attribute setting item,
the web page title extraction attribute setting item includes: extracting the content before the preset characters in the webpage content as a title.
9. The client device according to claim 4, wherein the matching setting configuration unit is further adapted to establish at least one matching setting item in the first matching setting description node for a hypertext markup language (HTML) element in the web page content for each type of text content in the web page;
the matching setting items established for the HTML elements comprise a primary positioning matching setting item, which at least comprises:
base point lookup setting item: indicating the way the base point is searched for, including searching by identifier, by name, by class name, by content and by expression; and/or
identification positioning setting item: locating an element that matches the identification of the HTML element; and/or
name positioning setting item: locating an element that matches the name of the HTML element; and/or
class name positioning setting item: locating an element that matches the class name of the HTML element; and/or
content positioning setting item: locating an element that matches the content of the HTML element; and/or
expression positioning setting item: locating an element that matches an expression in the HTML elements; and/or
label setting item: indicating the type and/or attribute of the positioned element when the element is positioned by the identification positioning setting item, the name positioning setting item, the class name positioning setting item, the content positioning setting item or the expression positioning setting item.
10. The client device according to claim 9, wherein the matching setting items established by the matching setting configuration unit for the HTML element further comprise a secondary positioning matching setting item, the secondary positioning matching setting item comprising at least one of the following setting items:
parent query setting item: setting the way to search for the parent element of the element located by the primary positioning matching setting item; or,
sub-query setting item: setting the way to search for the sub-elements of the element located by the primary positioning matching setting item; or,
when the parent query setting item and the sub-query setting item exist at the same time, the parent element of the element located by the primary positioning matching setting item is first searched for according to the parent query setting item, and the sub-element of that parent element is then searched for, starting from the found parent element, according to the sub-query setting item.
11. The client device according to claim 9, wherein the matching setting items established by the matching setting configuration unit for the HTML element further comprise an element deletion matching setting item, the element deletion matching setting item comprising at least:
deleting the predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item; and/or
changing the predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item.
12. The client device according to claim 1, wherein the apparatus further comprises a matching setting updating unit adapted to, after the matching setting file is established, update the website node, the webpage node, the matching setting description node and/or the matching setting items in the matching setting description node in the matching setting file according to a received update instruction.
13. The client device of claim 1, wherein the device capable of extracting webpage text content further comprises a multithreading control unit,
the multithreading control unit is suitable for allocating one thread to each webpage content when a plurality of downloaded webpage contents exist on the browser side, and controlling the matching unit to match, in each allocated thread, the corresponding webpage content with the webpage text content matching settings until the webpage content is successfully matched; and/or
the multithreading control unit is suitable for allocating a plurality of threads to one webpage content on the browser side, and controlling the matching unit to match the webpage content with different webpage text content matching settings in different threads until the webpage content is successfully matched.
14. The client device of claim 1, wherein the apparatus comprises an input unit and an upload unit,
the input unit is suitable for receiving a selection instruction which is sent by a user and used for selecting the webpage text content matching setting;
the matching setting configuration unit is also suitable for establishing a matching setting file according to the selection instruction and storing the matching setting of the webpage text content in the selection instruction in the established matching setting file;
and the uploading unit is suitable for uploading the matching setting file to a server and storing the matching setting file in the user data of the user at the server side.
15. The client device according to claim 1, wherein the device capable of extracting webpage text content further comprises a start control unit adapted to start the matching unit to perform an operation of matching the webpage content with the webpage text content matching setting respectively when a file completion event indicating that the browser has been loaded is monitored.
16. The client device of claim 1,
the matching unit is further adapted to parse the downloaded webpage content hierarchically to obtain a Document Object Model (DOM) structure of the webpage content, and to match the webpage content with the webpage text content matching setting according to the DOM structure of the webpage content.
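The node lookup and per-item fallback recited in claims 1 and 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the matching setting file is modelled as a nested dictionary (website node, webpage node, ordered matching setting description nodes), matching setting items are regular expressions instead of DOM-based positioning items, and the names `MATCHING_FILE`, `find_description_nodes` and `extract_with_fallback`, as well as the host `news.example.com`, are hypothetical.

```python
import re

# Hypothetical in-memory form of a matching setting file:
# website node -> webpage node -> ordered matching setting description nodes.
MATCHING_FILE = {
    "news.example.com": {
        "/article/": [
            {"title": r"<h1>(.*?)</h1>", "author": r'<span id="author">(.*?)</span>'},
            {"title": r"<h2>(.*?)</h2>", "author": r'<span class="by">(.*?)</span>'},
        ],
    },
}

def find_description_nodes(host, path, matching_file=MATCHING_FILE):
    """Find the website node by host and the webpage node by path prefix."""
    pages = matching_file.get(host, {})
    for prefix, nodes in pages.items():
        if path.startswith(prefix):
            return nodes
    return []

def extract_with_fallback(page, nodes):
    """Match the items of the first description node; for each item that
    fails, try the corresponding item in the later description nodes."""
    if not nodes:
        return {}
    extracted = {}
    for name in nodes[0]:
        for node in nodes:
            pattern = node.get(name)
            if pattern is None:
                continue  # this node has no item for this type of content
            m = re.search(pattern, page, re.DOTALL)
            if m:
                extracted[name] = m.group(1).strip()
                break  # stop at the first node whose item matches
    return extracted
```

In this sketch a page whose title still fits the first description node but whose author markup has changed is extracted by combining items from both nodes, which mirrors the item-by-item fallback of claim 2 rather than an all-or-nothing choice between settings.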
CN201210573088.7A 2012-12-25 2012-12-25 A kind of client device Expired - Fee Related CN103064943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210573088.7A CN103064943B (en) 2012-12-25 2012-12-25 A kind of client device

Publications (2)

Publication Number Publication Date
CN103064943A (en) 2013-04-24
CN103064943B (en) 2016-11-23

Family

ID=48107573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210573088.7A Expired - Fee Related CN103064943B (en) 2012-12-25 2012-12-25 A kind of client device

Country Status (1)

Country Link
CN (1) CN103064943B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302742B (en) * 2014-07-04 2018-07-20 深圳市雅都软件股份有限公司 Avoid the system and method for repeating to load dynamic buffering graph data
CN106326316B (en) * 2015-07-08 2022-11-29 腾讯科技(深圳)有限公司 Webpage advertisement filtering method and device
CN106547806B (en) * 2015-09-23 2020-12-18 阿里巴巴集团控股有限公司 Page loading method and device
CN108628860B (en) * 2017-03-15 2019-06-11 北京数聚鑫云信息技术有限公司 A kind of method and device of automatic acquisition web data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944094A (en) * 2009-07-06 2011-01-12 富士通株式会社 Webpage information extraction method and device thereof
CN102681994A (en) * 2011-03-07 2012-09-19 北京百度网讯科技有限公司 Webpage information extracting method and system
CN102708174A (en) * 2012-05-04 2012-10-03 奇智软件(北京)有限公司 Method and device for displaying rich media information in browser
CN102789484A (en) * 2012-06-28 2012-11-21 奇智软件(北京)有限公司 Method and device for webpage information processing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020032735A1 (en) * 2000-08-25 2002-03-14 Daniel Burnstein Apparatus, means and methods for automatic community formation for phones and computer networks
CN100512181C (en) * 2006-06-23 2009-07-08 腾讯科技(深圳)有限公司 Method and system for extracting information of content in Internet

Also Published As

Publication number Publication date
CN103064943A (en) 2013-04-24


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161123
