CN107741942B - Webpage content extraction method and device - Google Patents


Publication number
CN107741942B
Authority
CN
China
Prior art keywords: preset, score, visual, content, extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611126527.4A
Other languages
Chinese (zh)
Other versions
CN107741942A (en)
Inventor
赵铭鑫 (Zhao Mingxin)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201611126527.4A (patent CN107741942B)
Priority to PCT/CN2017/112866 (WO2018103540A1)
Publication of CN107741942A
Priority to US16/359,224 (US11074306B2)
Application granted
Publication of CN107741942B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the invention discloses a webpage content extraction method and device. The method comprises the following steps: determining candidate regions where target content in a webpage to be extracted is located; calculating a visual feature score for each candidate region according to preset visual features of the target content; and extracting the target content from the candidate region with the highest visual feature score. The embodiment of the invention can save human resources and improve extraction efficiency.

Description

Webpage content extraction method and device
Technical Field
Embodiments of the present invention relate to the field of computer technology, and in particular to a webpage content extraction method and device.
Background
With the continuous expansion of the internet, network information grows exponentially, and it becomes increasingly difficult for users to obtain the information they are interested in from the network. Webpage content extraction answers both the development of the network and people's growing demand for information. Through extraction of webpage content, content of interest can be stored in a database to provide stronger query services; useful content can be analyzed and processed so that it can be republished; and information from multiple websites can be integrated for comparison and analysis.
Existing webpage content extraction methods generally require manually labeling the XML Path Language (XPath) data of each webpage and storing it in the background; after a webpage finishes downloading, the stored XPath data is matched according to the webpage's Uniform Resource Locator (URL), and content is extracted from the webpage using the matched XPath data. For the massive data on the network, this approach requires a large amount of labor to label the XPath of every webpage, and its extraction efficiency is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for extracting web page content, which can save human resources and improve extraction efficiency.
The webpage content extraction method provided by the embodiment of the invention comprises the following steps:
determining a candidate area where target content in a webpage to be extracted is located;
calculating the visual feature score of each candidate region according to the preset visual features of the target content;
and extracting the target content from the candidate region with the highest visual feature score.
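The three claimed steps can be sketched as follows; this is a minimal illustration, and the dictionary-based representations of regions and preset features are assumptions for the example, not part of the claims.

```python
def visual_feature_score(region_features, preset_features):
    """Sum the preset scores of the preset visual features that the
    region's rendered features match; non-matching features score zero."""
    return sum(preset["score"]
               for name, preset in preset_features.items()
               if region_features.get(name) == preset["value"])

def extract_target_content(candidates, preset_features):
    """Return the content of the candidate region with the highest
    visual feature score."""
    best = max(candidates,
               key=lambda c: visual_feature_score(c["features"], preset_features))
    return best["content"]
```

For instance, with a single preset feature (font color red, preset score 7), a region whose rendered font color is red outscores one whose color is black, and its content is extracted.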
The web page content extraction device provided by the embodiment of the invention comprises:
the determining unit is used for determining a candidate area where the target content in the webpage to be extracted is located;
the calculating unit is used for calculating the visual feature score of each candidate region according to the preset visual features of the target content;
an extraction unit for extracting the target content from the candidate region with the highest visual feature score.
In the embodiment of the invention, the candidate regions where the target content is located are first determined; the visual feature score of each candidate region is then calculated according to the preset visual features of the target content; and finally the target content is extracted from the candidate region with the highest visual feature score. In other words, the extraction process locates the region of the target content by means of the prominent, attention-drawing design that webpage designers apply to the target content based on their experience of how human eyes acquire webpage information (i.e., the preset visual features of the target content), and extracts the target content directly from that region. XPath data therefore does not need to be manually labeled for each webpage, which saves human resources and improves extraction efficiency.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of a scene of a web content extraction method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for extracting web page content according to an embodiment of the present invention;
FIG. 3 is another schematic flow chart illustrating a method for extracting web page content according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a web content extracting apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a web content extracting apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Because the conventional webpage content extraction method requires manually labeling the XPath data of each webpage, it incurs high labor cost and low extraction efficiency; the method and device provided by the embodiments of the present invention can save human resources and improve extraction efficiency. The webpage content extraction method may be implemented in a webpage content extraction device, which may be a server. A specific implementation scenario is shown in Fig. 1: the webpage content extraction device first downloads the webpage to be extracted, then determines the candidate regions where the target content (i.e., the webpage content to be extracted, such as a title, picture, or price in the webpage) is located; there may be multiple candidate regions. The device calculates the visual feature score of each candidate region according to the preset visual features of the target content, and extracts the target content from the candidate region with the highest score. The preset visual features reflect the prominent, attention-drawing design that webpage designers apply to the target content based on their experience of how human eyes acquire webpage information; they may include information such as the font color, font size, degree of font bolding, background color, and border color of the target content. For example, in the webpage to be extracted shown in Fig. 1, if the statistical data indicates three candidate regions for the target content, the visual feature scores of the three regions are calculated and the region with the highest score is selected; if that is candidate region 2, the target content is extracted from candidate region 2. That is, the extraction process of this embodiment determines the region where the target content is located by means of statistically obtained preset visual features and then extracts the target content directly from that region, so XPath data need not be manually labeled for each webpage, saving human resources and improving extraction efficiency.
Detailed descriptions are given below; the numbering of the following embodiments is not intended to limit their preferred order.
Example one
As shown in fig. 2, the method of the present embodiment includes the following steps:
step 201, determining a candidate area where target content in a webpage to be extracted is located;
In a specific implementation, before the method of this embodiment is carried out, a set of regions where each content to be extracted is located in webpages may be compiled through manual data collection. The content to be extracted can be customized according to the webpage type: for an e-commerce webpage it may be information such as the name, price, and picture of a commodity; for a news webpage it may be information such as the title and pictures. The specific statistical method may be as follows:
In this embodiment, a preset number of representative webpages may be selected from each website (the number can be customized according to actual requirements), and the collected webpages are rendered with their visual features for convenient browsing. The collected webpages are then classified (e.g., e-commerce and news), and for each type the position of each content to be extracted is recorded across different webpages. A position can be represented by a combination of coordinates, width, and height, i.e., as a region. The positions of the same content to be extracted across webpages are then merged, finally forming a set of regions where that content is located. By analogy, for each webpage type, a set of regions can be obtained for every content to be extracted.
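The merging of position information described above can be sketched as follows. Each observed position is an (x, y, width, height) rectangle; positions collected from webpages of the same type are folded into a set of regions by merging overlapping rectangles into their bounding box. The single-pass merge rule is an illustrative assumption, not a procedure stated in the patent.

```python
def overlaps(a, b):
    """True if rectangles a and b, each (x, y, w, h), intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def merge(a, b):
    """Bounding box covering both rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x, y = min(ax, bx), min(ay, by)
    return (x, y, max(ax + aw, bx + bw) - x, max(ay + ah, by + bh) - y)

def region_set(positions):
    """Fold observed positions into a set of regions, merging overlaps."""
    regions = []
    for pos in positions:
        for i, r in enumerate(regions):
            if overlaps(pos, r):
                regions[i] = merge(pos, r)
                break
        else:
            regions.append(pos)
    return regions
```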
In a specific implementation, the type of the webpage to be extracted may be determined first; the set of regions for each content to be extracted is then looked up according to that type, and the candidate regions where the target content in the webpage is located are determined from the set.
Step 202, calculating a visual feature score of each candidate region according to preset visual features of the target content;
In a specific implementation, before this step is executed, the preset visual features of each content to be extracted, together with a preset score for each preset visual feature, may be obtained through feature training. The preset visual features reflect the prominent, attention-drawing design that webpage designers apply to the content to be extracted based on their experience of how human eyes acquire webpage information; they may include information such as the font color, font size, degree of font bolding, background color, and border color of the content to be extracted.
For example, on e-commerce webpages users easily find information such as the name, price, and picture of a commodity (i.e., the information to be extracted), because webpage designers, drawing on how human eyes acquire webpage information (i.e., the sensitivity of human vision to information characteristics), make important information more attractive and prominent. For example, the price font is made large, the price font color is made striking, and the price font may even be bolded.
Accordingly, various types of webpages can be downloaded (for example, using WebKit); for each type, the visual features of all blocks in each webpage are rendered, and the features perceivable by human eyes are stored, including but not limited to font color, font size, degree of font bolding, background color, and border color. Then, for each kind of visual feature, feature statistics over the positive examples yield the preset visual features of the content to be extracted. For example, statistics may show that the font size of commodity prices is generally 18 to 22 px, so the preset visual feature for the price font size can be set to: font size 18-22 px; similarly, statistics may show that the price font color is usually red, so the preset visual feature for the price font color can be set to: font color red.
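The positive-example statistics can be sketched as two simple reductions, both illustrative assumptions: a numerically distinguishable feature (e.g. font size) is reduced to the observed [min, max] interval, and a categorical feature (e.g. font color) to its most frequent observed value.

```python
from collections import Counter

def numeric_preset(samples):
    """Preset interval for a numerically distinguishable feature."""
    return (min(samples), max(samples))

def categorical_preset(samples):
    """Preset value for a categorical feature: the most common observation."""
    return Counter(samples).most_common(1)[0][0]
```

Applied to observed price font sizes of 18-22 px and predominantly red font colors, these reductions produce exactly the presets named in the text.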
Next, a score (i.e., a preset score) may be set for each preset visual feature. The specific value may be determined by the degree to which the corresponding preset visual feature contributes to recognizing the content; initially, the contribution degree may be determined empirically. For example, if experience shows that for the commodity price the font size contributes 30% to recognition and the font color contributes 70%, then the preset score of the font-size feature may be set to 3 and that of the font-color feature to 7. This is only an example and does not limit the implementation.
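Mapping contribution degrees to preset scores, as in the 30%/70% example, can be sketched by scaling the contributions to a fixed total; the total of 10 is an assumption chosen only to match the example values 3 and 7.

```python
def preset_scores(contributions, total=10):
    """Scale contribution degrees (fractions summing to 1.0) to preset scores."""
    return {name: round(c * total) for name, c in contributions.items()}
```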
In a specific implementation, the preset visual features of the target content may be obtained from the trained preset visual features of the contents to be recognized, and the visual feature score of each candidate region may be calculated from them as follows:
the scores of the visual features corresponding to the preset visual features of the target content existing in each candidate area are calculated. Specifically, it may be determined whether each of the visual features in each candidate region matches each of the preset visual features of the corresponding target content, and a score of the visual feature matching the corresponding preset visual feature is determined to be equal to a preset score of the corresponding preset visual feature; determining a score of a visual feature that does not match the corresponding preset visual feature, equal to zero.
Matching means: the visual feature is the same as the corresponding preset visual feature, or the parameter of the visual feature falls within the parameter interval of the corresponding preset visual feature. In practice, the matching test depends on the feature: for features that cannot be distinguished by numerical values, such as font color, border color, and font bolding, the visual feature must be identical to the preset visual feature; for features that can be distinguished by numerical values, such as font size, the parameter must fall within the preset interval.
The scores of the visual features in each candidate region are then accumulated to obtain that region's visual feature score.
The calculation of the visual feature score is illustrated by an example. Suppose the target content is a price, with preset visual features: font size 18-22 px (preset score 3) and font color red (preset score 7), and the preceding steps identified a first and a second candidate region. In the first candidate region, the visual features corresponding to the preset visual features are font size 20 px and font color red. The font size 20 px falls within the interval 18-22 px, so it matches and scores 3; the font color red is the same as the preset font color red, so it matches and scores 7; the visual feature score of the first candidate region is therefore 3 + 7 = 10. In the second candidate region, the corresponding visual features are font size 21 px and font color black. The font size 21 px falls within 18-22 px, so it matches and scores 3; the font color black differs from the preset red, does not match, and scores 0; the visual feature score of the second candidate region is therefore 3 + 0 = 3.
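Under the assumptions of this example (the data structures are illustrative), the score calculation can be written out directly: interval features match when the observed parameter falls in the preset range, categorical features when the values are equal.

```python
def feature_score(observed, preset):
    """Score one visual feature against its preset definition."""
    if "interval" in preset:              # numerically distinguishable, e.g. font size
        lo, hi = preset["interval"]
        matched = lo <= observed <= hi
    else:                                 # categorical, e.g. font color
        matched = observed == preset["value"]
    return preset["score"] if matched else 0

def region_score(region, presets):
    """Accumulate the scores of all visual features in a candidate region."""
    return sum(feature_score(region[name], preset)
               for name, preset in presets.items())

presets = {
    "font_size":  {"interval": (18, 22), "score": 3},
    "font_color": {"value": "red",       "score": 7},
}
first  = {"font_size": 20, "font_color": "red"}    # 3 + 7 = 10
second = {"font_size": 21, "font_color": "black"}  # 3 + 0 = 3
```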
Step 203, extracting the target content from the candidate region with the highest visual feature score.
In this embodiment, the candidate region with the highest visual feature score is taken as the region where the target content is located, so the target content can be extracted directly from it. In the above example, the first candidate region has the highest score (10), so the target content is extracted from the first candidate region.
In addition, experimental observation shows that the HyperText Markup Language (HTML) tag of a webpage element is strongly associated with pictures: the HTML tag of a picture is generally img. Therefore, when the target content is a picture, the HTML tag img can be added to its preset visual features. The score of the HTML-tag feature in a candidate region is calculated in the same way as the scores of the other visual features, which is not repeated here.
Further, after the target content is extracted, whether the extraction is accurate can be tested. If it is accurate, the preset scores of the preset visual features of the target content are kept unchanged; if not, they can be adjusted. During adjustment, all preset scores but one are fixed while that one is tuned until the result is optimal, and so on in turn, so that finally each preset score reaches its optimal value.
A specific adjustment example: when the target content is a title, the preset visual features are font size 20-24 px (initial preset score 6) and font bolding (initial preset score 4). During adjustment, the preset score of the bolding feature is fixed while the preset score of the font-size feature is raised or lowered, and the effect on the title extraction success rate is measured. If raising the font-size score improves the success rate, it is raised; otherwise, if raising it lowers the success rate, the initial value is kept and the preset score of the bolding feature is adjusted next.
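The adjustment procedure (fix all but one preset score, perturb that one, keep only changes that improve the measured extraction success rate, then move to the next score) can be sketched as a simple coordinate search; `success_rate` is a hypothetical evaluation callback, not an interface defined by the patent.

```python
def tune_scores(scores, success_rate, step=1):
    """Adjust one preset score at a time, keeping only changes that
    improve the extraction success rate reported by success_rate(scores)."""
    best = success_rate(scores)
    for name in list(scores):
        for delta in (+step, -step):
            trial = dict(scores, **{name: scores[name] + delta})
            rate = success_rate(trial)
            if rate > best:          # keep only improving changes
                scores, best = trial, rate
    return scores
```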
In this embodiment, the candidate regions where the target content is located are first determined; the visual feature score of each candidate region is then calculated according to the preset visual features of the target content; and finally the target content is extracted from the candidate region with the highest score. That is, the extraction process locates the region of the target content by means of the prominent, attention-drawing design that webpage designers apply to the target content based on their experience of how human eyes acquire webpage information (i.e., the preset visual features of the target content), and extracts the target content directly from that region, so XPath data need not be manually labeled for each webpage, saving human resources and improving extraction efficiency.
Example two
As shown in fig. 3, the method of the first embodiment is described in further detail by way of example, and includes:
step 301, determining the candidate regions of the target content in the webpage to be extracted according to pre-compiled sets of regions where each content to be extracted is located;
In a specific implementation, a set of regions where each content to be extracted is located in webpages may be compiled through manual data collection. The content to be extracted can be customized according to the webpage type: for an e-commerce webpage it may be information such as the name, price, and picture of a commodity; for a news webpage it may be information such as the title and pictures. The specific statistical method may be as follows:
In this embodiment, a preset number of representative webpages may be selected from each website (the number can be customized according to actual requirements), and the collected webpages are rendered with their visual features for convenient browsing. The collected webpages are then classified (e.g., e-commerce and news), and for each type the position of each content to be extracted is recorded across different webpages. A position can be represented by a combination of coordinates, width, and height, i.e., as a region. The positions of the same content to be extracted across webpages are then merged, finally forming a set of regions where that content is located. By analogy, for each webpage type, a set of regions can be obtained for every content to be extracted.
In a specific implementation, the type of the webpage to be extracted may be determined first; the set of regions for each content to be extracted is then looked up according to that type, and the candidate regions where the target content in the webpage is located are determined from the set.
Step 302, judging whether each visual feature in each candidate area is matched with each preset visual feature of corresponding target content; when a certain visual feature corresponds to a preset visual feature, executing step 303, and when a certain visual feature corresponds to a preset visual feature, executing step 304;
step 303, determining the score of that visual feature in the candidate region as equal to the preset score of the corresponding preset visual feature;
step 304, determining the score of that visual feature in the candidate region as equal to zero;
In a specific implementation, the preset visual features of each content to be extracted, together with a preset score for each preset visual feature, may be obtained through feature training. The preset visual features reflect the prominent, attention-drawing design that webpage designers apply to the content to be extracted based on their experience of how human eyes acquire webpage information; they may include information such as the font color, font size, degree of font bolding, background color, and border color of the content to be extracted.
For example, on e-commerce webpages users easily find information such as the name, price, and picture of a commodity (i.e., the information to be extracted), because webpage designers, drawing on how human eyes acquire webpage information (i.e., the sensitivity of human vision to information characteristics), make important information more attractive and prominent. For example, the price font is made large, the price font color is made striking, and the price font may even be bolded.
Accordingly, various types of webpages can be downloaded (for example, using WebKit); for each type, the visual features of all blocks in each webpage are rendered, and the features perceivable by human eyes are stored, including but not limited to font color, font size, degree of font bolding, background color, and border color. Then, for each kind of visual feature, feature statistics over the positive examples yield the preset visual features of the content to be extracted. For example, statistics may show that the font size of commodity prices is generally 18 to 22 px, so the preset visual feature for the price font size can be set to: font size 18-22 px; similarly, statistics may show that the price font color is usually red, so the preset visual feature for the price font color can be set to: font color red.
Next, a score (i.e., a preset score) may be set for each preset visual feature. The specific value may be determined by the degree to which the corresponding preset visual feature contributes to recognizing the content; initially, the contribution degree may be determined empirically. For example, if experience shows that for the commodity price the font size contributes 30% to recognition and the font color contributes 70%, then the preset score of the font-size feature may be set to 3 and that of the font-color feature to 7. This is only an example and does not limit the implementation.
In specific implementation, the preset visual features of the target content may be obtained according to the preset visual features corresponding to the trained contents to be recognized, and the visual feature score of each candidate region may be calculated according to the preset visual features of the target content, where the specific calculation method may be as follows:
and calculating scores of the visual characteristics corresponding to the preset visual characteristics of the target content in each candidate area. Specifically, it may be determined whether each of the visual features in each candidate region matches each of the preset visual features of the corresponding target content, and a score of the visual feature matching the corresponding preset visual feature is determined to be equal to a preset score of the corresponding preset visual feature; determining a score of a visual feature that does not match the corresponding preset visual feature, equal to zero.
The matching includes: the visual feature is the same as the corresponding preset visual feature, or the parameter of the visual feature falls within the parameter interval of the corresponding preset visual feature. In practice, the matching test depends on the specific visual feature. For visual features that cannot be expressed as numerical values, such as font color, border color, and font bolding, it is determined whether the visual feature is identical to the corresponding preset visual feature; for visual features that can be expressed as numerical values, such as font size, it is determined whether the parameter of the visual feature falls within the parameter interval of the corresponding preset visual feature.
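The matching rule just described can be sketched as a small predicate. This is a minimal illustration; the function names and the tuple-as-interval convention are assumptions for the sketch, not taken from the patent:

```python
# Matching rule: categorical features (e.g. font color) match by equality;
# numeric features (e.g. font size in px) match when the value falls
# inside the preset parameter interval.

def feature_matches(value, preset):
    """Return True if a rendered visual feature matches a preset feature.

    `preset` is either a single categorical value (e.g. "red") or a
    (low, high) interval for numeric features.
    """
    if isinstance(preset, tuple):          # numeric feature: interval test
        low, high = preset
        return low <= value <= high
    return value == preset                 # categorical feature: equality


def feature_score(value, preset, preset_score):
    # A matching feature contributes its preset score; otherwise zero.
    return preset_score if feature_matches(value, preset) else 0
```

For instance, `feature_score(20, (18, 22), 3)` yields the preset score 3, while a black font color scored against a preset red yields 0.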
Step 305, accumulating the scores of the visual features in each candidate region to serve as the score of the visual feature of each candidate region;
step 306, extracting target content from the candidate region with the highest visual feature score;
in this embodiment, the candidate region with the highest visual feature score is the determined region where the target content is located, so the target content can be directly extracted from the candidate region with the highest visual feature score.
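Steps 304 to 306 can be sketched together: score each candidate region against the preset visual features, accumulate the per-feature scores, and extract from the highest-scoring region. The feature names, preset values, and preset scores below are illustrative assumptions in the spirit of the price example used in this description:

```python
# Illustrative presets: feature name -> (preset value, preset score).
PRESETS = {
    "font_size": ((18, 22), 3),   # numeric: px interval
    "font_color": ("red", 7),     # categorical: exact match
}

def match(value, preset):
    if isinstance(preset, tuple):
        return preset[0] <= value <= preset[1]
    return value == preset

def region_score(region):
    # Steps 304-305: score each feature, then accumulate the scores.
    return sum(score for name, (preset, score) in PRESETS.items()
               if match(region.get(name), preset))

def extract_target(candidates):
    # Step 306: the highest-scoring region is taken to contain the target.
    return max(candidates, key=region_score)

candidates = [
    {"id": "A", "font_size": 20, "font_color": "red"},    # 3 + 7 = 10
    {"id": "B", "font_size": 21, "font_color": "black"},  # 3 + 0 = 3
]
best = extract_target(candidates)  # region "A" wins with score 10
```

The region whose rendered features both match the presets accumulates the full score and is selected for extraction.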
In addition, experimental observation shows that pictures in a web page are strongly associated with their HTML tags, and the HTML tag of a picture is generally img. Therefore, when the target content is a picture, the HTML tag img can be added to the preset visual features of the target content. The score of the HTML tag feature in a candidate region is calculated in the same way as the scores of the other visual features, which is not described again here.
Step 307, testing whether the extracted target content is accurate;
and 308, adjusting the preset scores of the preset visual features of the target content according to the test result.
After the target content is extracted, whether the extraction is accurate is tested. If it is accurate, the preset scores of the preset visual features of the target content are kept unchanged. If it is inaccurate, the preset scores of the preset visual features of the target content can be adjusted: during adjustment, all other preset scores are first held fixed and only one preset score is adjusted until the result is optimal; this is repeated for each score in turn until every preset score reaches its optimal value.
A specific adjustment example: when the target content is a title, the preset visual features include a font size of 20-24 px and font bolding; initially, the preset score of the font-size feature is 6 and the preset score of the bolding feature is 4. During adjustment, the preset score of the bolding feature is held fixed while the preset score of the font-size feature is raised or lowered, and the effect on the title extraction success rate is measured in each case. If raising the font-size score improves the success rate, it is raised further; conversely, if the success rate drops after raising it, the initially set score is restored and the preset score of the bolding feature is adjusted next.
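The adjustment procedure above (hold all scores fixed but one, nudge that one, keep the change only if the measured extraction success rate improves) can be sketched as a single coordinate sweep. `success_rate` is a hypothetical callback that would re-run extraction on a labelled test set; nothing here is taken verbatim from the patent:

```python
def tune_scores(scores, success_rate, step=1):
    """One coordinate sweep over preset scores.

    scores: dict feature -> preset score.  In practice the sweep would
    be repeated until no adjustment improves the success rate.
    """
    best = success_rate(scores)
    for name in scores:                     # adjust one score at a time
        for delta in (+step, -step):
            trial = dict(scores)            # all other scores held fixed
            trial[name] = scores[name] + delta
            rate = success_rate(trial)
            if rate > best:                 # keep only improvements
                scores, best = trial, rate
    return scores
```

For example, with a toy success-rate surface that peaks when the font-size score is 7, a sweep starting from the initial scores 6 and 4 moves the font-size score to 7 and leaves the bolding score unchanged.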
In this embodiment, the candidate areas where the target content of the web page to be extracted may be located are determined, the visual feature score of each candidate area is calculated according to the preset visual features of the target content, and the target content is extracted from the candidate area with the highest score. In other words, the extraction process relies on the prominent, eye-catching design that web page designers apply to important content (that is, the preset visual features of the target content), based on how human eyes acquire web page information, to locate the area containing the target content and extract it directly. As a result, the XPath data of each web page no longer needs to be labeled manually, which saves human resources and improves extraction efficiency.
EXAMPLE III
In order to better implement the above method, an embodiment of the present invention further provides a web content extracting apparatus, as shown in fig. 4, the web content extracting apparatus of the embodiment includes: determination section 401, calculation section 402, and extraction section 403 are as follows:
(1) a determination unit 401;
a determining unit 401, configured to determine a candidate region where a target content in a webpage to be extracted is located;
in a specific implementation, a set of regions where the content to be extracted is located in the webpage may be counted through manual data collection. The content to be extracted can be customized according to the actual webpage type, for example, for an e-commerce webpage, the content to be extracted can be information such as the name, price and picture of a commodity; for another example, for a news webpage, the content to be extracted may be information such as a title and a picture. The specific statistical method may be as follows:
In this embodiment, a preset number of representative web pages can be selected from each website (the preset number can be customized according to actual requirements), and the collected web pages are rendered with their visual features to make them convenient to browse. The collected web pages are then classified (for example, into e-commerce pages and news pages). For each type of web page, the position information of each content to be extracted in the different pages is counted; the position information can be represented by a combination of coordinates, width, and height, and usually describes an area. The position information of a given content to be extracted across the pages is then merged, forming the set of areas where that content is located. By analogy, a set of regions can be obtained for each content to be extracted in each type of web page.
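The statistics step above can be sketched as follows. The grouping keys (page type and content name) and the simple duplicate-merging rule are assumptions made for illustration:

```python
from collections import defaultdict

def collect_region_sets(samples):
    """samples: iterable of (page_type, content_name, (x, y, w, h)).

    Returns {page_type: {content_name: [regions...]}}, merging duplicate
    position records, i.e. the per-type set of areas where each content
    to be extracted is located.
    """
    sets = defaultdict(lambda: defaultdict(list))
    for page_type, content, region in samples:
        if region not in sets[page_type][content]:   # merge duplicates
            sets[page_type][content].append(region)
    return sets
```

Feeding in position records sampled from many pages of one type yields one candidate-region set per content, which is later looked up by the determined page type.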
In a specific implementation, the determining unit 401 may determine a type of a webpage to be extracted, find a set of regions of the webpage where each content to be extracted is located, according to the type of the webpage to be extracted, and determine a candidate region where the target content in the webpage to be extracted is located according to the set.
(2) A calculation unit 402;
a calculating unit 402, configured to calculate a visual feature score of each candidate region according to a preset visual feature of the target content.
In specific implementation, the preset visual features of each content to be extracted and the preset scores of the preset visual features can be obtained through feature training. The preset visual features are usually the experience of a webpage designer for acquiring webpage information according to human eyes, and the preset visual features can be information of font color, font size, font thickening degree, background color, frame color and the like of the content to be extracted aiming at the prominent design which is made by the content to be extracted and attracts users.
For example, for e-commerce web pages, users often find information such as names, prices, pictures, etc. of commodities (i.e. information to be extracted) easily, because web page designers design important information (such as names, prices, pictures, etc. of commodities) more attractive and more prominent according to the experience of acquiring web page information by human eyes (i.e. sensitivity of human visual sense to information characteristics) when designing web pages. For example, the price font is designed to be large according to the price of the commodity, the color of the price font is designed to be more striking, the price font is even thickened, and the like.
Therefore, various types of web pages can be downloaded (for example, with WebKit), the visual features of all blocks in each web page can be rendered, and the features perceivable by human eyes can be stored; these include but are not limited to font color, font size, degree of font bolding, background color, and border color. Then, for each type of visual feature, statistics are gathered over the positive examples to obtain the preset visual features of the content to be extracted. For example, statistics may show that the font size of a commodity price generally lies between 18 and 22 px, so the corresponding preset visual feature can be set as: font size 18-22 px. Likewise, statistics may show that the font color of a commodity price is usually red, so the corresponding preset visual feature can be set as: font color red.
Next, a score (i.e., a preset score) may be set for each preset visual feature, and a specific value of the score may be determined by a contribution degree of the corresponding preset visual feature to the recognition of the content to be recognized, where initially, the contribution degree may be determined according to experience. For example, it is known through experience statistics that, for a content to be identified, which is a commodity price, the contribution degree of the font size of the price to the identified commodity price is 30%, and the contribution degree of the font color of the price to the identified commodity price is 70%, then the preset score of the preset visual feature corresponding to the font size of the commodity price may be set to 3; the preset score of the preset visual feature corresponding to the font color of the commodity price may be set to 7, which is only an example and does not limit the specific implementation.
In specific implementation, the calculating unit 402 may obtain the preset visual features of the target content according to the preset visual features corresponding to each content to be recognized obtained by training, and calculate the visual feature score of each candidate region according to the preset visual features of the target content, where the calculating unit 402 may include a first calculating unit and a second calculating unit, where:
the first calculation unit may first calculate scores of the respective visual features corresponding to the respective preset visual features of the target content, which exist in each candidate region. Specifically, the first calculating unit may include a judging subunit and a determining subunit, wherein the judging subunit may judge whether each of the visual features in each of the candidate regions matches each of the preset visual features of the corresponding target content, and the determining subunit determines a score of a visual feature matching the corresponding preset visual feature to be equal to a preset score of the corresponding preset visual feature; determining a score of a visual feature that does not match the corresponding preset visual feature, equal to zero.
The matching includes: the visual feature is the same as the corresponding preset visual feature, or the parameter of the visual feature falls within the parameter interval of the corresponding preset visual feature. In practice, the matching test depends on the specific visual feature. For visual features that cannot be expressed as numerical values, such as font color, border color, and font bolding, it is determined whether the visual feature is identical to the corresponding preset visual feature; for visual features that can be expressed as numerical values, such as font size, it is determined whether the parameter of the visual feature falls within the parameter interval of the corresponding preset visual feature.
The second calculation unit may accumulate the scores of the respective visual features in each of the candidate regions as a visual feature score for each of the candidate regions.
The calculation of the visual feature score for each candidate region is illustrated by an example. Suppose the target content is a price, and its preset visual features are: font size 18-22 px (preset score 3) and font color red (preset score 7). Suppose the preceding steps identify a first candidate region and a second candidate region. In the first candidate region, the visual features corresponding to the preset visual features of the target content are font size 20px and font color red. The font size 20px falls within the 18-22 px interval of the corresponding preset visual feature, so it matches and scores 3; the font color red is the same as the corresponding preset font color red, so it matches and scores 7. The visual feature score of the first candidate region is therefore 3 + 7 = 10. In the second candidate region, the corresponding visual features are font size 21px and font color black.
The font size 21px falls within the 18-22 px interval of the corresponding preset visual feature, so it matches and scores 3; the font color black differs from the corresponding preset font color red, so it does not match and scores 0. The visual feature score of the second candidate region is therefore 3 + 0 = 3.
(3) An extraction unit 403;
an extracting unit 403, configured to extract the target content from the candidate region with the highest visual feature score.
In this embodiment, the candidate region with the highest visual feature score is the determined region where the target content is located, so the target content can be directly extracted from the candidate region with the highest visual feature score. In the above example, the target content is extracted from the first candidate region, whose score of 10 is the highest.
In addition, experimental observation shows that pictures in a web page are strongly associated with their HTML tags, and the HTML tag of a picture is generally img. Therefore, when the target content is a picture, the HTML tag img can be added to the preset visual features of the target content. The score of the HTML tag feature in a candidate region is calculated in the same way as the scores of the other visual features, which is not described again here.
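The note above can be folded into the same scoring scheme by treating the tag name as one more categorical preset feature matched by equality. The score value 5 below is purely illustrative, not a value from the patent:

```python
# Treat the HTML tag as a categorical preset visual feature for
# picture-type target content: the tag matches when it equals "img".

IMG_TAG_PRESET = ("img", 5)  # (expected tag name, illustrative preset score)

def tag_score(region_tag, preset=IMG_TAG_PRESET):
    expected, points = preset
    return points if region_tag == expected else 0
```

A candidate region wrapped in an `img` element would then gain the tag feature's preset score on top of its other visual feature scores.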
Further, the apparatus of this embodiment may also include a testing unit and an adjusting unit. After the extracting unit 403 extracts the target content, the testing unit tests whether the extraction is accurate; if it is accurate, the preset scores of the preset visual features of the target content are kept unchanged. If it is inaccurate, the adjusting unit can adjust the preset scores of the preset visual features of the target content: during adjustment, all other preset scores are first held fixed and only one preset score is adjusted until the result is optimal; this is repeated for each score in turn until every preset score reaches its optimal value.
A specific adjustment example: when the target content is a title, the preset visual features include a font size of 20-24 px and font bolding; initially, the preset score of the font-size feature is 6 and the preset score of the bolding feature is 4. During adjustment, the preset score of the bolding feature is held fixed while the preset score of the font-size feature is raised or lowered, and the effect on the title extraction success rate is measured in each case. If raising the font-size score improves the success rate, it is raised further; conversely, if the success rate drops after raising it, the initially set score is restored and the preset score of the bolding feature is adjusted next.
It should be noted that, when the web content extraction apparatus provided in the foregoing embodiment implements web content extraction, only the division of the above functional modules is taken as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the web content extraction device and the web content extraction method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
In this embodiment, the determination unit determines the candidate regions where the target content of the web page to be extracted may be located, the calculation unit calculates the visual feature score of each candidate region according to the preset visual features of the target content, and the extraction unit extracts the target content from the candidate region with the highest score. That is, the apparatus locates the area containing the target content by relying on the eye-catching design that web page designers apply to it (that is, the preset visual features of the target content), based on how human eyes acquire web page information, and extracts the target content directly from that area. The XPath data of each web page therefore no longer needs to be labeled manually, which saves human resources and improves extraction efficiency.
Example four
An embodiment of the present invention further provides a device for extracting web content, as shown in fig. 5, which shows a schematic structural diagram of the device according to the embodiment of the present invention, specifically:
the apparatus may include components such as a processor 501 of one or more processing cores, memory 502 of one or more computer-readable storage media, Radio Frequency (RF) circuitry 503, a power supply 504, an input unit 505, and a display unit 506. Those skilled in the art will appreciate that the configuration of the device shown in fig. 5 is not intended to be limiting of the device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 501 is a control center of the apparatus, connects various parts of the entire apparatus using various interfaces and lines, performs various functions of the apparatus and processes data by running or executing software programs and/or modules stored in the memory 502, and calling data stored in the memory 502, thereby monitoring the entire apparatus. Optionally, processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501.
The memory 502 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by running the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the device, and the like. Further, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.
The RF circuit 503 may be used for receiving and transmitting signals during information transmission and reception, and in particular, for receiving downlink information of a base station and then processing the received downlink information by one or more processors 501; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 503 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 503 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The apparatus further includes a power supply 504 (e.g., a battery) for supplying power to the various components, and preferably, the power supply 504 is logically connected to the processor 501 via a power management system, so that functions of managing charging, discharging, and power consumption are implemented via the power management system. The power supply 504 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The apparatus may further include an input unit 505, and the input unit 505 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, input unit 505 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 501, and can receive and execute commands sent by the processor 501. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 505 may include other input devices in addition to a touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The device may also include a display unit 506, which display unit 506 may be used to display information input by or provided to the user, as well as various graphical user interfaces of the device, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 506 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-emitting diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 501 to determine the type of the touch event, and then the processor 501 provides a corresponding visual output on the display panel according to the type of the touch event. Although in FIG. 5 the touch-sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement input and output functions.
Although not shown, the device may further include a camera, a bluetooth module, etc., which will not be described herein. Specifically, in this embodiment, the processor 501 in the apparatus loads the executable file corresponding to the process of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502, thereby implementing various functions as follows:
determining a candidate area where target content in a webpage to be extracted is located;
calculating the visual feature score of each candidate region according to the preset visual features of the target content;
and extracting the target content from the candidate region with the highest visual feature score.
Specifically, the processor 501 determines the candidate area where the target content in the web page to be extracted is located according to the following manner:
and determining a candidate area of the target content in the webpage to be extracted according to a set of areas where each content to be extracted in the webpage is counted in advance.
Specifically, processor 501 calculates the visual feature score of each of the candidate regions as follows:
calculating scores of all visual features corresponding to all the preset visual features and existing in each candidate region;
and accumulating the scores of the visual features in each candidate region to serve as the visual feature score of each candidate region.
Specifically, the processor 501 calculates scores of the respective visual features corresponding to the respective preset visual features, which exist in each of the candidate regions, as follows:
judging whether each visual feature in each candidate region is matched with each corresponding preset visual feature or not;
determining a score of a visual feature matched with the corresponding preset visual feature, wherein the score is equal to a preset score of the corresponding preset visual feature;
determining a score of a visual feature that does not match the corresponding preset visual feature, equal to zero.
Specifically, the matching includes: the visual features are the same as the corresponding preset visual features, or the parameters of the visual features belong to the corresponding parameter intervals of the preset visual features.
Further, the processor 501 is also configured to,
testing whether the extracted target content is accurate;
and adjusting the preset scores of the preset visual features of the target content according to the test result.
As can be seen from the above, after determining the candidate regions where the target content of the web page to be extracted may be located, the apparatus of this embodiment calculates the visual feature score of each candidate region according to the preset visual features of the target content and extracts the target content from the candidate region with the highest score. In the extraction process, the apparatus locates the area containing the target content by relying on the prominent, eye-catching design that web page designers apply to it (that is, the preset visual features of the target content), based on how human eyes acquire web page information, and extracts the target content directly from that area. The XPath data of each web page therefore no longer needs to be labeled manually, which saves human resources and improves extraction efficiency.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer (which may be a personal computer, an apparatus, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. A method for extracting web page content, comprising:
determining a candidate region where target content in a web page to be extracted is located, which comprises:
determining the candidate region of the target content in the web page to be extracted according to a pre-computed set of regions where each content to be extracted is located, the set being formed by collecting, in advance, position information of each content to be extracted in a preset number of web pages and merging the position information of the content to be extracted in the respective web pages;
calculating a visual feature score of each candidate region according to preset visual features of the target content; and
extracting the target content from the candidate region with the highest visual feature score.
2. The method of claim 1, wherein the calculating the visual feature score of each candidate region according to the preset visual features of the target content comprises:
calculating a score of each visual feature present in each candidate region that corresponds to one of the preset visual features; and
accumulating the scores of the visual features in each candidate region as the visual feature score of that candidate region.
3. The method of claim 2, wherein the calculating a score of each visual feature present in each candidate region that corresponds to one of the preset visual features comprises:
judging whether each visual feature in each candidate region matches the corresponding preset visual feature;
determining that the score of a visual feature that matches the corresponding preset visual feature is equal to the preset score of that preset visual feature; and
determining that the score of a visual feature that does not match the corresponding preset visual feature is equal to zero.
4. The method of claim 3, wherein the matching comprises: the visual feature being the same as the corresponding preset visual feature, or a parameter of the visual feature falling within a parameter interval of the corresponding preset visual feature.
5. The method according to claim 3 or 4, characterized in that the method further comprises:
testing whether the extracted target content is accurate; and
adjusting the preset scores of the preset visual features of the target content according to the test result.
6. A web page content extraction apparatus, comprising:
a determining unit, configured to determine a candidate region where target content in a web page to be extracted is located, specifically, to determine the candidate region of the target content in the web page to be extracted according to a pre-computed set of regions where each content to be extracted is located, the set being formed by collecting, in advance, position information of each content to be extracted in a preset number of web pages and merging the position information of the content to be extracted in the respective web pages;
a calculating unit, configured to calculate a visual feature score of each candidate region according to preset visual features of the target content; and
an extracting unit, configured to extract the target content from the candidate region with the highest visual feature score.
7. The apparatus of claim 6, wherein the calculating unit comprises:
a first calculation unit, configured to calculate a score of each visual feature present in each candidate region that corresponds to one of the preset visual features; and
a second calculation unit, configured to accumulate the scores of the visual features in each candidate region as the visual feature score of that candidate region.
8. The apparatus of claim 7, wherein the first calculation unit comprises:
a judging subunit, configured to judge whether each visual feature in each candidate region matches the corresponding preset visual feature; and
a determining subunit, configured to determine that the score of a visual feature that matches the corresponding preset visual feature is equal to the preset score of that preset visual feature, and that the score of a visual feature that does not match the corresponding preset visual feature is equal to zero.
9. The apparatus of claim 8, wherein the matching comprises: the visual feature being the same as the corresponding preset visual feature, or a parameter of the visual feature falling within a parameter interval of the corresponding preset visual feature.
10. The apparatus of claim 8 or 9, further comprising:
a testing unit, configured to test whether the extracted target content is accurate; and
an adjusting unit, configured to adjust the preset scores of the preset visual features of the target content according to the test result.
11. A computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to perform the method according to any one of claims 1 to 5.
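Read together, method claims 1-5 describe a score-and-select pipeline: gather candidate regions from a pre-computed set, score each region by its matching preset visual features, extract from the top-scoring region, and feed accuracy tests back into the preset scores. The sketch below is an illustrative reading of those claims, not the patented implementation: the data shapes (regions as dicts of visual-feature values), the preset table, and the additive update in `adjust_scores` are all assumptions.

```python
# Illustrative sketch of method claims 1-5. All names, data shapes, and
# preset values are assumptions for demonstration only; the claims do not
# prescribe a concrete representation.

def matches(feature_value, preset):
    """Claim 4: a feature matches if it equals the preset value, or if its
    numeric parameter falls within the preset parameter interval."""
    if "value" in preset and feature_value == preset["value"]:
        return True
    interval = preset.get("interval")
    if interval is not None and isinstance(feature_value, (int, float)) \
            and not isinstance(feature_value, bool):
        low, high = interval
        return low <= feature_value <= high
    return False

def feature_score(region_features, preset_features):
    """Claims 2-3: each matching feature contributes its preset score;
    a non-matching feature contributes zero. The scores are summed."""
    return sum(
        preset["score"]
        for name, preset in preset_features.items()
        if matches(region_features.get(name), preset)
    )

def extract(page_regions, candidate_ids, preset_features):
    """Claim 1: among the candidate regions (drawn from the pre-computed
    set of regions), extract content from the highest-scoring one."""
    best = max(candidate_ids,
               key=lambda rid: feature_score(page_regions[rid], preset_features))
    return page_regions[best]["content"]

def adjust_scores(preset_features, region_features, correct, step=1):
    """Claim 5: after testing whether the extraction was accurate, adjust
    the preset scores (the +/- step policy is an assumed choice)."""
    for name, preset in preset_features.items():
        if matches(region_features.get(name), preset):
            preset["score"] = max(0, preset["score"] + (step if correct else -step))
    return preset_features

# Example with assumed preset features: a title is expected to be bold,
# with a font size in an 18-30 px interval.
preset = {
    "font_size": {"interval": (18, 30), "score": 3},
    "bold": {"value": True, "score": 2},
}
regions = {
    "header": {"font_size": 24, "bold": True, "content": "Page Title"},
    "body": {"font_size": 14, "bold": False, "content": "Body text"},
}
title = extract(regions, ["header", "body"], preset)  # -> "Page Title"
```

The separation between `matches` and the preset score table mirrors why claim 5 matters: when a test shows a wrong extraction, only the scores need retuning, not the matching rules themselves.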
CN201611126527.4A 2016-12-09 2016-12-09 Webpage content extraction method and device Active CN107741942B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201611126527.4A CN107741942B (en) 2016-12-09 2016-12-09 Webpage content extraction method and device
PCT/CN2017/112866 WO2018103540A1 (en) 2016-12-09 2017-11-24 Webpage content extraction method, device, and data storage medium
US16/359,224 US11074306B2 (en) 2016-12-09 2019-03-20 Web content extraction method, device, storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611126527.4A CN107741942B (en) 2016-12-09 2016-12-09 Webpage content extraction method and device

Publications (2)

Publication Number Publication Date
CN107741942A CN107741942A (en) 2018-02-27
CN107741942B true CN107741942B (en) 2020-06-02

Family

ID=61234997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611126527.4A Active CN107741942B (en) 2016-12-09 2016-12-09 Webpage content extraction method and device

Country Status (1)

Country Link
CN (1) CN107741942B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688302B (en) * 2021-08-30 2024-03-19 百度在线网络技术(北京)有限公司 Page data analysis method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102768663A (en) * 2011-05-05 2012-11-07 腾讯科技(深圳)有限公司 Method and device for extracting webpage title and information processing system
CN103870486A (en) * 2012-12-13 2014-06-18 深圳市世纪光速信息技术有限公司 Webpage type confirming method and device
CN104346405A (en) * 2013-08-08 2015-02-11 阿里巴巴集团控股有限公司 Method and device for extracting information from webpage
US9274693B2 (en) * 2013-10-16 2016-03-01 3M Innovative Properties Company Editing digital notes representing physical notes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Application of Web Page Information Extraction Technology Based on Multiple Features; Chen Yi; China Master's Theses Full-text Database, Information Science and Technology; 2016-07-15 (No. 7); pp. 20 and 45-48 of the main text *

Also Published As

Publication number Publication date
CN107741942A (en) 2018-02-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant