WO2018103540A1 - Webpage content extraction method, apparatus, and storage medium - Google Patents

Webpage content extraction method, apparatus, and storage medium

Info

Publication number
WO2018103540A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
visual
candidate
visual features
visual feature
Application number
PCT/CN2017/112866
Other languages
English (en)
French (fr)
Inventor
赵铭鑫
Original Assignee
腾讯科技(深圳)有限公司
Priority claimed from CN201611126527.4A external-priority patent/CN107741942B/zh
Priority claimed from CN201611170430.3A external-priority patent/CN108205544A/zh
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2018103540A1 publication Critical patent/WO2018103540A1/zh
Priority to US16/359,224 priority Critical patent/US11074306B2/en

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F16/00 Information retrieval; Database structures therefor; File system structures therefor → G06F16/90 Details of database functions independent of the retrieved data types → G06F16/95 Retrieval from the web
        • G06F16/951 Indexing; Web crawling techniques
        • G06F16/953 Querying, e.g. by the use of web search engines
        • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS → G06N20/00 Machine learning

Definitions

  • The embodiments of the present application relate to the field of computer technologies, and in particular, to a method, an apparatus, and a storage medium for extracting webpage content.
  • the network service provider can extract the content of each webpage and store it in the database, thereby providing a query service for the user.
  • the embodiment of the present application provides a method and an apparatus for extracting webpage content, which can save human resources and improve extraction efficiency.
  • The webpage content extraction method provided by the embodiments of the present application is applied to a network device, where the network device includes a processor and a memory, and the processor may implement the method by executing computer readable instructions stored in the memory, where the method includes:
  • determining a plurality of candidate regions in the webpage, each candidate region including one or more page elements adjacent in position in the webpage
  • the webpage content extraction apparatus includes a processor and a memory, wherein the memory stores computer readable instructions, and the computer readable instructions are executable by the processor for:
  • each candidate region including one or more page elements adjacent in position in the webpage
  • the application further provides a computer readable storage medium storing computer readable instructions executable by a processor for implementing the methods of various embodiments of the present application.
  • The webpage content identification method, apparatus, and server of the present application convert the visual features of a webpage block into feature vectors that a training tool can learn, generate a content recognition model using the training tool, and thereby improve the efficiency and accuracy of identifying webpage content.
  • FIG. 1 is a schematic diagram of a network scenario of a method for extracting webpage content according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method for extracting webpage content according to an embodiment of the present application
  • FIG. 3 is a flowchart of a method for extracting webpage content according to an embodiment of the present application
  • FIG. 4 is a flowchart of a method for generating an extraction rule according to an embodiment of the present application
  • FIG. 5 is a schematic diagram of an application scenario of a webpage content extraction method according to an embodiment of the present application.
  • FIG. 6 is a flowchart of a method for extracting webpage content according to an embodiment of the present application.
  • FIG. 7 is a flowchart of a method for extracting webpage content according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a webpage content extraction apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a webpage content extraction apparatus according to an embodiment of the present application.
  • FIG. 10 is a flowchart of a method for extracting webpage content according to an embodiment of the present application.
  • FIG. 11 is a flowchart of a method for generating a recognition model according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
  • FIG. 13 is a flowchart of a method for identifying a webpage content according to an embodiment of the present application.
  • FIG. 14 is a flowchart of a method for identifying a webpage content according to an embodiment of the present application.
  • FIG. 15 is a schematic diagram of an interface of a webpage content identification method as shown in FIG. 14;
  • FIG. 16 is a flowchart of a method for identifying a webpage content according to an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a webpage content identification apparatus according to an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a webpage content identification apparatus according to an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of a server according to an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a webpage content extraction system according to an embodiment of the present application.
  • the system 10 can include a web content extraction device 11, a plurality of websites 12-1 to 12-N, and a network 13.
  • the webpage content extraction device 11 can obtain webpage data (hereinafter referred to as webpage) from a plurality of websites through the network 13, and extract target content information in each webpage according to the method of each embodiment.
  • the web content extracting device 11 may include a processor 15, a communication interface 18, a web page data storage device 14 to be extracted, and an extracted content information storage device 16.
  • Processor 15 can be a special purpose processor (such as an FPGA or ASIC), or a general purpose processor or other programmable processor.
  • the processor 15 can extract the target content information in the web page by executing embedded processing logic or computer readable instructions stored in the memory.
  • the webpage content extraction device 11 can acquire webpage data from each website through the communication interface 18, and store the acquired webpage data in the webpage data storage device 14 to be extracted.
  • the target content information extracted by the processor 15 can be stored in the extracted content information storage device 16 for use by other service processing devices.
  • the search engine may search for content information matching the user's search term in the content information storage device 16; the application server may query the content information storage device 16 for content information matching the user's tag to be provided to the user.
  • FIG. 2 is a schematic flowchart diagram of a method for extracting webpage content according to an embodiment of the present application.
  • the method 20 can be performed by the web content extraction device 11. As shown in FIG. 2, method 20 can include the following steps.
  • Step S21 determining a plurality of candidate regions in the webpage, each candidate region including one or more page elements adjacent in position in the webpage.
  • Step S22 extracting, for each candidate region of the plurality of candidate regions, extraction values of the plurality of visual features of the candidate region.
  • Step S23 determining, according to the extracted values of the plurality of visual features, a target region among the plurality of candidate regions that meets an extraction rule, and extracting the content information of the target region.
  • Visual features refer to features in a webpage that can be perceived by the human eye, such as the font, color, size, boldness, background color, and border color of text, the foreground color, background color, and size of an image, and the positions of page elements.
  • The extracted value of a visual feature refers to the value of that visual feature, extracted from the webpage data, as set in the data of the webpage.
  • the extracted value can be a value of a numeric type or a value of a non-numeric type.
  • the value of a numeric type is a numerically represented value, such as the size of the font, the thickness of the line, the size of the image, the position, and so on.
  • Non-numeric types of values are typically selected from a collection of multiple descriptive information, such as font, bold, italic, boxed, and so on.
  • When the values of a certain visual feature differ among the multiple page elements in a candidate region, the multiple values of the visual feature may be processed by using a preset rule to obtain the value of the visual feature of the candidate region.
  • one of the values of the visual features of the plurality of page elements may be selected as the extracted value of the visual feature in the candidate region according to a predetermined rule.
  • the plurality of values may be calculated according to a predetermined algorithm (eg, averaging, weighted averaging, etc.), and the calculated result is used as an extracted value of the visual feature of the candidate region.
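The following minimal Python sketch illustrates the aggregation step just described for a numeric visual feature; the function name, the simple "first/max/average/weighted average" rules, and the example values are assumptions made for illustration, not part of the original disclosure.

```python
from typing import List, Optional

def aggregate_feature(values: List[float],
                      weights: Optional[List[float]] = None,
                      rule: str = "average") -> float:
    """Combine the values of one numeric visual feature (e.g. font size in px)
    taken from the page elements inside a candidate region."""
    if rule == "first":                      # pick one value by a fixed rule
        return values[0]
    if rule == "max":
        return max(values)
    if rule == "weighted" and weights:       # weighted average, e.g. by element area
        total = sum(weights)
        return sum(v * w for v, w in zip(values, weights)) / total
    return sum(values) / len(values)         # plain average by default

# Example: font sizes (px) of adjacent elements in one candidate region
print(aggregate_feature([18.0, 20.0, 22.0]))                       # 20.0
print(aggregate_feature([18.0, 20.0], weights=[1, 3], rule="weighted"))  # 19.5
```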
  • a candidate area is an area where a target content in a web page may exist.
  • various regions in a web page can be used as candidate regions.
  • each constituent block of the web page can be used as a candidate area according to the structure of the web page.
  • all nodes of a certain level of the tree structure or all nodes of each level may be used as candidate areas.
  • the area extracted from the webpage may also be filtered according to a preset rule, and the area obtained after the screening is used as a candidate area. For example, an area within a webpage that is within a preset location range may be determined as a candidate area.
  • the location range can be determined based on the location of the target area marked in the plurality of webpage samples.
  • an area of the webpage including a preset content tag may be determined as a candidate area.
  • For example, when the target content includes an image, since the HTML tag of an image in the webpage data is "img", a region of the webpage that includes an "img" tag is determined as a candidate region. For example, when the webpage includes a plurality of blocks and the preset content tag is "img", the blocks among the plurality of blocks that include the HTML tag "img" are used as the target region.
  • The HTML tags of webpage elements in a webpage are strongly correlated with those elements. For example, the HTML tag of an image is usually "img"; therefore, when the target content is an image, the HTML tag "img" may be added to the preset visual features of the target content.
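A small sketch of this tag-based filtering is shown below. It assumes the third-party lxml library and a toy page (the patent does not name a parser); each top-level block is kept as a candidate region only if it contains an "img" element.

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <div id="header">Shop logo</div>
  <div id="product"><img src="tv.jpg"/><span>4K TV</span></div>
  <div id="footer">Contact us</div>
</body></html>""")

# Treat each top-level <div> block as a region; keep only the blocks that
# contain the preset content tag "img".
blocks = page.xpath("//body/div")
candidates = [b for b in blocks if b.xpath(".//img")]

for block in candidates:
    print(block.get("id"))   # -> product
```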
  • the calculation process can refer to the calculation process of other visual feature scores, and details are not described herein again.
  • Each candidate area may include one or more page elements that are adjacent in position in the web page.
  • a page element refers to the smallest unit that makes up a web page. For example, an item corresponding to an html tag can be used as a page element.
  • the target area refers to an area in which the target content to be extracted is determined by the method of each embodiment of the present application.
  • the content information extracted from the target area is information of the target content.
  • The extracted content information may be the content itself, or may be a path to the content, such as an XPath (XML Path Language) expression, a JSONPath (JavaScript Object Notation Path) expression, and the like.
  • In webpage design, key content is often made conspicuous, for example by using bright colors or large fonts, and the machine uses these visual features to identify and extract the target content, which saves manpower and time while also improving the accuracy of content extraction.
  • Various methods may be employed to determine the target region according to the extracted values of the visual features. For example, the value range of each visual feature of the target content may be counted, and the target region determined based on those value ranges. As another example, annotated webpage samples may be used to train a machine learning algorithm, and the trained machine learning algorithm is then used to identify the target region and complete the extraction of the target content information.
  • FIG. 3 is a flowchart of a method for extracting webpage content according to an embodiment of the present application. For the sake of conciseness, steps similar to those previously described in the various embodiments are omitted. As shown in FIG. 3, the method 30 can include the following steps.
  • Step S31 Extract, for each candidate region of the plurality of candidate regions, an extracted value of the plurality of visual features of the candidate region.
  • Step S32 calculating a score of each visual feature of each candidate region according to the extracted value of each visual feature and the value range of each visual feature in the extraction rule.
  • Step S33 selecting, from the plurality of candidate regions, the candidate region with the highest sum of visual feature scores as the target region.
  • the score of the visual feature may represent the degree to which the extracted value of the visual feature matches the range of values of the visual feature in the extraction rule.
  • the extraction rule may include a scoring rule, that is, different parts of the value range correspond to different scores.
  • The score of a visual feature may also indicate how important that visual feature is among all of the visual features. For example, the most dominant visual feature of the target content may correspond to a higher score upper limit, and a secondary visual feature may correspond to a lower score upper limit. For example, if the score upper limit of the font size is greater than that of the font color, the font size has a greater influence than the font color on the determination of the target region.
  • FIG. 4 is a flowchart of a method for generating an extraction rule according to an embodiment of the present application. As shown in FIG. 4, the method 40 can include the following steps.
  • Step S41 Extract sample values of a plurality of visual features of the target area marked in each webpage sample from the plurality of webpage samples.
  • the web page sample is a web page that is pre-fetched and manually marked with the target content.
  • the sample value of the visual feature refers to the value of the visual feature of the target content in the web page sample.
  • Step S42 Determine, for each of the plurality of visual features, a value range of the visual feature by using sample values of the visual features of the plurality of webpage samples.
  • the range of values can be one or more ranges of values.
  • the range of values can be a discrete set of values, including various possible values. It is also possible to convert a non-numeric sample value into a value using a word vector, and the range of values can be in the form of an array.
  • Step S43 generating the extraction rule by using a value range of the plurality of visual features.
  • the generated extraction rule may include the value range of each visual feature obtained in step S42.
  • an upper limit of the score of each visual feature may also be set.
  • In some embodiments, the method 40 may further include step S44: determining, for each of the plurality of visual features, a weight of the visual feature by using a plurality of second webpage samples and the value range of the visual feature, and adding the weight to the extraction rule.
  • the weight of the visual feature can be used to calculate an upper limit of the score of the visual feature, indicating the influence of the visual feature on the accuracy of the target content recognition.
  • weights can be directly used as a score upper limit for visual features.
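The sketch below shows one way such an extraction rule could be assembled from labeled sample values: numeric features get a min/max range, non-numeric features get the set of observed values, and a hand-assigned weight serves as the score upper limit. The data structure, the weights, and the sample values are illustrative assumptions.

```python
def build_extraction_rule(samples, weights):
    """samples: {feature_name: [sample values from marked target regions]}
       weights: {feature_name: score upper limit for that feature}"""
    rule = {}
    for name, values in samples.items():
        if all(isinstance(v, (int, float)) for v in values):
            value_range = (min(values), max(values))   # numeric feature: interval
        else:
            value_range = set(values)                   # non-numeric feature: discrete set
        rule[name] = {"range": value_range, "weight": weights.get(name, 1)}
    return rule

samples = {
    "font_size": [18, 20, 22, 21],          # px, taken from marked price regions
    "font_color": ["red", "red", "red"],
}
rule = build_extraction_rule(samples, weights={"font_size": 3, "font_color": 7})
print(rule["font_size"]["range"])    # (18, 22)
print(rule["font_color"]["range"])   # {'red'}
```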
  • FIG. 5 is a schematic diagram of an application scenario 50 of a webpage content extraction method according to an embodiment of the present application.
  • In this scenario, the webpage content extraction apparatus may first download the webpage to be extracted, then determine the candidate regions where the target content in the webpage to be extracted may be located (there may be multiple candidate regions), calculate a visual feature score for each candidate region according to the preset visual features of the target content, and extract the target content from the candidate region with the highest visual feature score.
  • the target content is the content of the webpage that needs to be extracted, such as the title, picture, price, etc. in the webpage.
  • The preset visual features reflect the webpage designer's experience of how the human eye acquires webpage information, that is, the eye-catching design made for the target content in order to attract users. For example, the preset visual features may be information such as the font color, font size, font boldness, background color, and border color of the target content.
  • The extraction process of the embodiments of the present application determines the region where the target content is located by using the statistically obtained preset visual features of the target content, and then extracts the target content directly from that region, thereby eliminating the need to manually label the XPath data of each webpage, saving human resources and improving extraction efficiency.
  • FIG. 6 is a flowchart of a method for extracting webpage content according to an embodiment of the present application. As shown in FIG. 6, method 60 can include the following steps.
  • Step S61 determining a candidate area where the target content in the webpage to be extracted is located
  • the target content belongs to the content to be extracted.
  • The type of the webpage to be extracted may be determined first; according to that type, a previously counted set of regions in which the content to be extracted is located in webpages of the corresponding type is found, and the candidate regions in which the target content in the webpage to be extracted is located are determined according to the set. Since each type of webpage usually includes multiple webpages, and the structure and layout of different webpages may differ, there are usually multiple candidate regions.
  • Step S62 Calculate a visual feature score of each of the candidate regions according to a preset visual feature of the target content.
  • The preset visual features of the target content may be acquired from the preset visual features corresponding to each kind of content to be identified, and the visual feature score of each candidate region is calculated according to the preset visual features of the target content. For example, a score is first calculated for each visual feature in each candidate region that corresponds to a preset visual feature of the target content. For example, it may be determined whether each visual feature in each candidate region matches the corresponding preset visual feature of the target content: the score of a visual feature that matches the corresponding preset visual feature is set equal to the preset score of that preset visual feature, and the score of a visual feature that does not match the corresponding preset visual feature is set equal to zero.
  • the matching includes: the visual feature is the same as the corresponding preset visual feature, or the parameter of the visual feature belongs to a corresponding parameter interval of the preset visual feature.
  • The specific method of judging a match depends on the specific visual feature. For example, for visual features that cannot be distinguished by numerical values, such as font color, border color, and font boldness, it is necessary to determine whether the visual feature is the same as the corresponding preset visual feature; for visual features that can be distinguished by numerical values, such as font size, it is necessary to determine whether the parameter of the visual feature falls within the corresponding parameter interval of the preset visual feature.
  • the scores of the respective visual features within each of the candidate regions are then accumulated as a visual feature score for each of the candidate regions.
  • For example, suppose the target content is a price, and the preset visual features of the target content are: a font size of 18 to 22 px, and a font color of red. It is determined by the previous steps that the candidate regions of the target content include a first candidate region and a second candidate region.
  • In the first candidate region, the visual features corresponding to the preset visual features of the target content are: a font size of 20 px and a font color of red. The font size of 20 px falls within the parameter interval of the corresponding preset visual feature, font size 18 to 22 px (whose preset score is 3), so the score of the font size feature is 3; the font color red is the same as the corresponding preset visual feature, font color red (whose preset score is 7), so the two match and the score of the font color feature is 7. The visual feature score of the first candidate region is therefore 3 + 7, i.e., 10 points.
  • In the second candidate region, the visual features corresponding to the preset visual features of the target content are: a font size of 21 px and a font color of black. The font size of 21 px falls within the parameter interval of the corresponding preset visual feature, font size 18 to 22 px (whose preset score is 3), so the score of the font size feature is 3; the font color black differs from the corresponding preset visual feature, font color red (whose preset score is 7), so the two do not match and the score of the font color feature is 0. The visual feature score of the second candidate region is therefore 3 + 0, i.e., 3 points.
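A short sketch of the scoring logic in this example: numeric features are matched against an interval, non-numeric features against equality, a match earns the preset score and a mismatch earns zero, and the region with the highest total is selected. The dictionary layout is an assumption made for illustration.

```python
def feature_score(value, preset):
    """preset: {"range": (lo, hi) or set of allowed values, "score": preset score}"""
    rng = preset["range"]
    if isinstance(rng, tuple):                       # numeric interval, e.g. 18-22 px
        matched = rng[0] <= value <= rng[1]
    else:                                            # discrete values, e.g. {"red"}
        matched = value in rng
    return preset["score"] if matched else 0

def region_score(region, presets):
    return sum(feature_score(region[name], preset)
               for name, preset in presets.items() if name in region)

presets = {
    "font_size":  {"range": (18, 22), "score": 3},
    "font_color": {"range": {"red"},  "score": 7},
}
regions = {
    "first":  {"font_size": 20, "font_color": "red"},    # 3 + 7 = 10
    "second": {"font_size": 21, "font_color": "black"},  # 3 + 0 = 3
}
best = max(regions, key=lambda r: region_score(regions[r], presets))
print(best, region_score(regions[best], presets))   # first 10
```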
  • Step S63 extracting the target content from the candidate region with the highest visual feature score.
  • the candidate region with the highest visual feature score is the region in which the determined target content is located, and thus the target content can be directly extracted from the candidate region with the highest visual feature score.
  • In the above example, since the first candidate region has the highest visual feature score (10 points), the target content is extracted from the first candidate region.
  • the preset visual features of each content to be extracted and the preset scores of the respective preset visual features may be obtained through feature training.
  • The preset visual features usually reflect the webpage designer's experience of how the human eye acquires webpage information, that is, the prominent design made for the content to be extracted in order to attract users. For example, the preset visual features may be information such as the font color, font size, font boldness, background color, and border color of the content to be extracted. For example, on an e-commerce webpage it is usually easy for a user to find the name, price, and picture of a commodity, because the webpage designer, drawing on experience of how the human eye acquires webpage information (i.e., the sensitivity of human visual senses to information features), designs important information (such as the name, price, and picture of the commodity) to be more attractive to users and more prominent.
  • the price font design is very large, the color design of the price font is more conspicuous, and even the price font is bolded.
  • These visual features include, but are not limited to, font color, font size, font boldness, background color, border color, and the like; then, for each type of visual feature, positive-example feature statistics are performed to obtain the preset visual features of the content to be extracted.
  • the font size of the commodity price is generally 18 to 22 px, and the preset visual feature corresponding to the font size of the commodity price can be set as: font size 18 to 22 px;
  • the font color of the commodity price is usually red, and the preset visual feature corresponding to the font color of the commodity price may be set as: the font color is red.
  • a score (ie, a preset score) is set for each preset visual feature, and the specific value of the score can be determined by the recognition contribution of the corresponding visual feature to the recognized content.
  • The contribution degree can be determined empirically. For example, empirical statistics may show that, for identifying the commodity price, the price font size contributes 30% and the price font color contributes 70% to the recognition result. Accordingly, the preset score of the preset visual feature corresponding to the price font size may be set to 3, and the preset score of the preset visual feature corresponding to the price font color may be set to 7. It should be noted that this is only an example and does not constitute a limitation on the specific implementation.
  • a collection of areas in which the content to be extracted is located in the web page may be counted by manual data collection.
  • the content to be extracted can be customized according to the actual webpage type. For example, for the e-commerce webpage, the content to be extracted may be the name, price, picture and the like of the merchandise; for example, for the news webpage, the content to be extracted may be a title, a picture, etc. .
  • Webpages are first collected from each site as sample webpages. In this embodiment, a preset number of representative webpages can be selected from each site, where the preset number can be customized according to actual needs. The collected webpages are then visually inspected and classified by type (e.g., e-commerce, news).
  • the location information of the content to be extracted in different webpages can be counted.
  • The location information can be represented by a combination of coordinates, width, and height, and is usually expressed as a region.
  • the location information of the content to be extracted in each webpage is merged, and finally a set of regions in which the respective content to be extracted is located in the webpage is formed.
  • Alternatively, machine-assisted labeling of webpages may be performed, with the machine automatically extracting the sample values of the visual features of the marked target region.
  • The webpage data of a webpage sample can be downloaded by a marking device (e.g., a computer running a marking application, or a dedicated device), and the webpage is then displayed.
  • the marking device can provide an operation interface on the user interface to receive a selection instruction for the target area in the webpage.
  • the marking device records information of the target area, such as XPath.
  • the marking device can extract the sample values of the visual features of the target area from the web page data using the information of the recorded target area.
  • the marking device may further obtain a range of values of each visual feature of the target content by using the sample values of the extracted visual features of each web page sample.
  • After extraction, whether the extracted target content is accurate may be tested. If it is accurate, the preset scores of the respective preset visual features of the target content are kept unchanged; if not, the preset scores of the respective preset visual features of the target content may be adjusted. When adjusting, several preset scores may be fixed and only one preset score adjusted at a time to optimize the result, and so on, until each preset score reaches its best value.
  • For example, when the target content is a title, the preset visual features include: a font size of 20 to 24 px, and a bold font. Initially, the preset score of the preset visual feature "font size 20 to 24 px" is 6, and the preset score of the preset visual feature "bold font" is 4. When adjusting, the preset score of the preset visual feature "bold font" can be kept fixed while the preset score of the preset visual feature "font size 20 to 24 px" is raised or lowered, and the effect of each adjustment on the title extraction success rate is counted. If raising the preset score of the preset visual feature "font size 20 to 24 px" increases the title extraction success rate, that preset score is increased; conversely, if the title extraction success rate decreases after the adjustment, the preset score is first kept at its initial value unchanged, and the preset score of the preset visual feature "bold font" is adjusted instead.
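This adjustment procedure can be read as a coordinate-wise search: hold all but one preset score fixed, try raising or lowering that score, keep the change only if the extraction success rate improves, then move on to the next score. The sketch below assumes a hypothetical evaluate(scores) function standing in for the testing step described above.

```python
def tune_scores(scores, evaluate, step=1, rounds=3):
    """scores: {feature_name: preset score}; evaluate: returns a success rate in [0, 1]."""
    best_rate = evaluate(scores)
    for _ in range(rounds):
        for name in scores:                      # adjust one score, keep the others fixed
            for delta in (+step, -step):
                trial = dict(scores, **{name: scores[name] + delta})
                rate = evaluate(trial)
                if rate > best_rate:             # keep only changes that improve the rate
                    scores, best_rate = trial, rate
    return scores, best_rate

# Toy evaluation function for illustration (assumes the optimum is 8 and 2)
toy = lambda s: 1.0 - (abs(s["font_size_20_24"] - 8) + abs(s["bold"] - 2)) / 20
initial = {"font_size_20_24": 6, "bold": 4}
print(tune_scores(initial, toy))   # -> ({'font_size_20_24': 8, 'bold': 2}, 1.0)
```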
  • In this embodiment, the candidate regions where the target content in the webpage to be extracted is located are first determined, the visual feature score of each candidate region is then calculated according to the preset visual features of the target content, and the target content is finally extracted from the candidate region with the highest visual feature score. That is, the extraction process of this embodiment determines the region where the target content is located according to the webpage designer's experience of how the human eye acquires webpage information and the eye-catching, prominent design made for users (i.e., the preset visual features of the target content), and then extracts the target content directly from that region, thereby eliminating the need to manually mark the XPath data of each webpage, saving human resources and improving extraction efficiency.
  • FIG. 7 is a flowchart of a method for extracting webpage content according to an embodiment of the present application. As shown in FIG. 7, the method 70 can include the following steps.
  • Step S71 Determine the candidate regions of the target content in the webpage to be extracted according to the previously counted set of regions in which the content to be extracted is located in webpages.
  • Step S72 Determine whether each visual feature in each candidate region matches the corresponding preset visual feature of the target content. When a visual feature matches the corresponding preset visual feature, step S73 is performed; when the visual feature does not match the preset visual feature, step S74 is performed.
  • Step S73 Determine a score of the visual feature in the candidate region, which is equal to a preset score of the corresponding preset visual feature.
  • Step S74 determining a score of the visual feature in the candidate area, equal to zero
  • the preset visual features of the target content may be acquired according to the preset visual features corresponding to the respective to-be-identified content obtained by the training, and the visual feature scores of each candidate region are calculated according to the preset visual features of the target content.
  • Step S75 accumulating scores of respective visual features in each candidate region as a visual feature score of each candidate region
  • Step S76 extracting target content from a candidate region with the highest visual feature score
  • Step S77 testing whether the extracted target content is accurate
  • Step S78 Adjust a preset score of each preset visual feature of the target content according to the test result.
  • FIG. 8 is a schematic diagram of a webpage content extraction apparatus according to an embodiment of the present application.
  • the device 80 can be provided in the web content extraction device 11, for example in the form of computer readable instructions, stored in the memory of the web content extraction device 11.
  • the web page content extraction device 80 may include a determination unit 21, a calculation unit 22, and an extraction unit 23.
  • the determining unit 21 may determine a candidate region in which the target content in the web page to be extracted is located.
  • the calculating unit 22 may calculate a visual feature score of each of the candidate regions according to a preset visual feature of the target content.
  • the calculating unit 22 may acquire preset visual features of the target content according to the preset visual features corresponding to the respective to-be-identified content obtained by the training, and calculate visual features of each candidate region according to the preset visual features of the target content. Score.
  • The calculating unit 22 can include a first calculating unit and a second calculating unit.
  • the first calculating unit may first calculate a score of each visual feature corresponding to each preset visual feature of the target content existing in each candidate region.
  • The first calculating unit can include a judging subunit and a determining subunit.
  • The judging subunit may determine whether each visual feature in each candidate region matches the corresponding preset visual feature of the target content. The determining subunit may determine the score of a visual feature that matches the corresponding preset visual feature to be equal to the preset score of that preset visual feature, and determine the score of a visual feature that does not match the corresponding preset visual feature to be equal to zero.
  • The second calculating unit may accumulate the scores of the respective visual features within each candidate region as the visual feature score of that candidate region.
  • the extracting unit 23 may extract the target content from the candidate region having the highest visual feature score.
  • device 80 may also include a test unit and an adjustment unit.
  • The testing unit may test whether the extracted target content is accurate; if it is accurate, the preset scores of the respective preset visual features of the target content are kept unchanged; if not, the adjustment unit can adjust the preset scores of the respective preset visual features of the target content. When adjusting, several other preset scores may be fixed and only one preset score adjusted to optimize the result, and so on, until each preset score reaches its best value.
  • FIG. 9 is a schematic diagram of a webpage content extraction apparatus according to an embodiment of the present application.
  • The apparatus 90 can include a processor 31 with one or more processing cores, a memory 32 with one or more computer readable storage media, a radio frequency (RF) circuit 33, a power source 34, an input unit 35, a display unit 36, and other components.
  • Memory 32 can be used to store software programs as well as modules.
  • the processor 31 executes the methods of the various embodiments by running software programs stored in the memory 32 and modules (e.g., modules of the device 80).
  • Memory 32 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device.
  • FIG. 10 is a flowchart of a method for extracting webpage content according to an embodiment of the present application. As shown in FIG. 10, the method 100 can include the following steps.
  • Step S81 generating, for each candidate region of the plurality of candidate regions, an extraction vector by using extracted values of the visual features of the candidate region;
  • Step S82 determining the target area according to an extraction vector of each candidate area by using a preset recognition model.
  • the extraction vector refers to an array in which the extracted values of the visual features are composed in a preset order.
  • numeric extraction values can be added directly to the extraction vector.
  • the extracted values can be processed to be converted to representation values, and the representation values are added to the extraction vector.
  • Step S81 of some embodiments may include: for each visual feature of a candidate region, mapping the extracted value of the visual feature to a representation value that is in a preset correspondence relationship with the value range into which the extracted value falls; and organizing the representation values of the visual features of the candidate region into the extraction vector in a preset order.
  • a non-numeric extraction value can be converted to a numerical representation value using a word vector.
  • FIG. 11 is a flowchart of a method for generating a recognition model according to an embodiment of the present application. As shown in FIG. 11, the method 110 can include the following steps.
  • Step S91 extracting sample values of a plurality of visual features of the target area marked in each webpage sample from the plurality of webpage samples.
  • Step S92 generating a sample vector by using sample values of respective visual features of each of the plurality of webpage samples.
  • Step S93 generating the recognition model by training a machine learning algorithm by using sample vectors of the plurality of webpage samples.
  • The webpage content identification method provided by the embodiments of the present application can be applied to the server 120 shown in FIG. 12.
  • the server 120 includes a memory 41, a processor 42, and a network module 43.
  • the memory 41 can be used to store software programs and modules, such as the webpage content identification method and the program instructions/modules corresponding to the system in the embodiment of the present application.
  • the processor 42 executes the various functions and data processing by executing the software program and the module stored in the memory 41, that is, the webpage content identification method and system in the embodiment of the present application.
  • Memory 41 may include high speed random access memory and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 41 may further include memory remotely located relative to processor 42 that may be connected to the server over a network.
  • the software program and module described above may further include an operating system 45 and a service module 46.
  • The operating system 45 may be LINUX, UNIX, or WINDOWS, and may include various software components and/or drivers for managing system tasks (e.g., memory management, storage device control, power management, etc.); it may communicate with various hardware or software components to provide an operating environment for other software components.
  • the service module 46 runs on the basis of the operating system 45, and listens for requests from the network through the network service of the operating system 45, completes corresponding data processing according to the request, and returns the processing result to the terminal. That is, the service module 46 is configured to provide network services to the terminal.
  • Network module 43 is used to receive and transmit network signals.
  • the above network signal may include a wireless signal or a wired signal.
  • the network signal is a wired network signal.
  • FIG. 13 is a flowchart of a method for identifying a webpage content according to an embodiment of the present application.
  • This embodiment is a webpage content identification method performed by a server through a network.
  • the webpage content identification method 130 of this embodiment may include the following steps.
  • Step S101 Determine at least one training site, and collect multiple training webpages in each training site.
  • The number of training webpages collected from each training site may be determined based on, but not limited to, the popularity of the training site: the more popular the site, the more training webpages are collected from it, so that the training tool can learn the visual features corresponding to the content of webpages with large numbers of visits, thereby increasing the accuracy of webpage recognition.
  • Step S102 Acquire visual features of the block corresponding to the selected content in each training webpage.
  • the visual features of the block are the main features that can represent the visual level of the web page block, which can be, but is not limited to, the length, width, height, block font size, web page label, and the like of the block.
  • Step S103 performing data processing on the visual feature to obtain a feature vector (ie, a sample vector).
  • In order to obtain a feature vector that the training tool can recognize, the visual features need to be processed. Specifically, if the visual features include a numerical feature, one dimension in the vector is used to represent that numerical feature. Specifically, statistics can be performed on each numerical feature and its value range divided into several parts, for example 10 parts, which are respectively mapped to 0 to 0.1, 0.1 to 0.2, 0.2 to 0.3, 0.3 to 0.4, 0.4 to 0.5, 0.5 to 0.6, 0.6 to 0.7, 0.7 to 0.8, 0.8 to 0.9, and 0.9 to 1.0.
  • one-hot representation is the simplest representation of word vector, that is, a long vector is used to represent a word.
  • the length of the vector is the size of the dictionary.
  • the component of the vector has only one "1", and the others are all "0".
  • the position of "1" corresponds to the position of the word in the dictionary.
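The sketch below illustrates the data processing just described: a numerical feature is mapped into one of ten equal parts of its value range, and a non-numerical feature is expanded into a one-hot sub-vector over a small dictionary. The dictionary, the value range, and the feature names are illustrative assumptions.

```python
def bucketize(value, lo, hi, parts=10):
    """Map a numeric feature into one of `parts` equal intervals of its range."""
    ratio = (value - lo) / (hi - lo)
    ratio = min(max(ratio, 0.0), 1.0)
    bucket = min(int(ratio * parts), parts - 1)
    return (bucket + 1) / parts            # e.g. 0.1, 0.2, ..., 1.0

def one_hot(value, dictionary):
    """One-hot word-vector style encoding: a long vector with a single 1."""
    vec = [0] * len(dictionary)
    vec[dictionary.index(value)] = 1
    return vec

COLORS = ["red", "black", "blue"]          # assumed dictionary of font colors

def block_to_vector(block):
    vec = [bucketize(block["font_size"], lo=8, hi=48)]   # numeric feature: 1 dimension
    vec += one_hot(block["font_color"], COLORS)          # non-numeric: |dictionary| dims
    vec.append(1 if block["bold"] else 0)
    return vec

print(block_to_vector({"font_size": 20, "font_color": "red", "bold": True}))
# -> [0.3, 1, 0, 0, 1]
```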
  • Step S104 The recognition model of the selected content is established according to the feature vector by using the training tool.
  • the training tool may be, but not limited to, a Gradient Boosting Decision Tree (GBDT) training tool, or other machine training tools such as a linear regression training tool.
  • Establishing a recognition model of the selected content based on the feature vectors means establishing the correspondence between the feature vectors of webpage blocks and webpage content such as the title, the price, and the like.
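A minimal sketch of training such a recognition model follows; scikit-learn's GradientBoostingClassifier stands in here for the GBDT training tool (an assumption, since the patent does not prescribe a library), and the tiny feature vectors are toy data in the format produced by the processing step above.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Toy sample vectors for webpage blocks: [bucketized font size, red, black, blue, bold]
X = [
    [0.3, 1, 0, 0, 0],   # price-like block
    [0.3, 1, 0, 0, 0],
    [0.8, 0, 1, 0, 1],   # title-like block
    [0.9, 0, 1, 0, 1],
    [0.1, 0, 1, 0, 0],   # other block
    [0.2, 0, 0, 1, 0],
]
y = ["price", "price", "title", "title", "other", "other"]

model = GradientBoostingClassifier(n_estimators=50)
model.fit(X, y)

# At identification time, every block of a new page is vectorized the same way,
# and the block predicted as "title" (or "price") points to the target XPath.
print(model.predict([[0.8, 0, 1, 0, 1]]))   # -> ['title']
```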
  • The webpage content identification method of the present application converts the visual features of a webpage block into feature vectors that the training tool can learn, generates a content recognition model using the training tool, and thereby improves the efficiency and accuracy of identifying webpage content.
  • FIG. 14 is a schematic flowchart diagram of a method for identifying a webpage content according to an embodiment of the present application.
  • FIG. 15 is a schematic diagram of an interface 150 of the webpage content identification method shown in FIG. 14. As shown in FIG. 14 and FIG. 15, the webpage content identification method 140 may include the following steps.
  • Step S111 Select the content to be marked in the training webpage.
  • contents to be marked in the training webpage such as the title 55 and the like can be manually selected.
  • Step S112 Analyze the XPath of the content to be labeled.
  • the annotation program parses its XPath and displays the XPath in the XPath display area 52.
  • The annotation program can also automatically trigger the parsing of its XPath and send the XPath directly to the backend.
  • Step S113 Find the visual features of the block corresponding to the selected content according to the XPath.
  • Since the XPath of each block in the webpage is unique, the XPath of the content to be labeled can be used to find all the visual features of the corresponding block that were stored after parsing.
  • WebKit acts as the kernel of an interfaceless browser, with the ability to parse Cascading Style Sheets (CSS) and automatically render the page. Therefore, these capabilities of WebKit can be used to extract the visual information of the corresponding block; the visual information is then processed by feature engineering, and the resulting visual features are stored for later lookup.
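As a rough illustration of this step, the sketch below uses Selenium with headless Chrome rather than a bare WebKit kernel (an assumption for illustration) to render a page, locate a block by its XPath, and read its computed CSS properties and geometry; the URL and XPath are placeholders.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # render without an interface
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")           # placeholder URL for illustration

xpath = "//h1"                              # XPath recorded for the labeled block
block = driver.find_element(By.XPATH, xpath)

visual_info = {
    "font_size": block.value_of_css_property("font-size"),
    "color": block.value_of_css_property("color"),
    "font_weight": block.value_of_css_property("font-weight"),
    "rect": block.rect,                     # x, y, width, height of the block
}
print(visual_info)
driver.quit()
```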
  • The visual features of the block corresponding to the selected content in each training webpage may be obtained by the method 140; that is, the content to be marked in the training webpage, for example the title, is selected, and its XPath is then parsed directly to obtain the visual features of the block corresponding to the selected content.
  • The webpage content identification method of the present application acquires the visual features of the block corresponding to the selected content according to the XPath of the selected content, converts the visual features of the webpage block into feature vectors that the training tool can learn, generates the content recognition model using the training tool, and thereby further improves the efficiency and accuracy of identifying webpage content.
  • FIG. 16 is a schematic flowchart diagram of a method for identifying a webpage content according to an embodiment of the present application.
  • This embodiment is a webpage content identification method performed by a server through a network.
  • The webpage content identification method 160 of the present embodiment may include the following steps.
  • Step S121 Identify at least one training site, and collect multiple training webpages in each training site;
  • Step S122 Acquire visual features of the block corresponding to the selected content in each training webpage.
  • Step S123 Perform data processing on the visual feature to obtain a feature vector.
  • Step S124 The recognition model of the selected content is established according to the feature vector by using the training tool.
  • the recognition model of the selected content is established based on the feature vector, ie, the correspondence between the feature vector of the web page and the content of the web page, such as title, price, and the like.
  • Step S125 Receive the feature identifier of the webpage, and find the webpage to be identified according to the feature identifier.
  • the feature identifier may be a Uniform Resource Locator (URL) or a name, and the feature identifier is used to uniquely identify a webpage.
  • The user may submit the feature identifier of the webpage to be identified to the server through a provided interactive interface.
  • The feature identifier of the webpage to be identified may also be submitted to the server by other servers, service platforms, and the like.
  • The feature identifier of a single webpage to be identified may be submitted to the server at a time, or the feature identifiers of a plurality of webpages to be identified may be submitted to the server for batch processing; the server determines the webpages to be identified based on the feature identifiers.
  • Step S126 Convert the visual features of all the blocks of the webpage to be recognized into feature vectors.
  • Step S127 Identifying an XPath of the corresponding content in the to-be-identified webpage according to the feature vector of the webpage to be identified by using the recognition model.
  • The recognition model includes, for multiple kinds of content such as the title and the price, the relationship between the feature vector of the content and its XPath; when an attribute of the corresponding content, such as "title", is input, the XPath of the title can be identified by using the recognition model.
  • the web content identification method 160 may further include the following steps.
  • Step S128 Extract corresponding content of the to-be-identified webpage according to the XPath of the corresponding content in the webpage to be identified.
  • The extracted content of the webpage to be identified may be used, but is not limited to being used, as data for statistical analysis, for example extracting the title and the price of the webpage to be identified so that the price trend of a commodity can be monitored.
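Once the XPath of the corresponding content has been identified, extraction itself is straightforward; the sketch below uses lxml (as in the earlier example) to pull the title and price text from a toy page, with the identified XPaths given as illustrative values.

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <h1 id="name">4K Smart TV</h1>
  <span class="price">￥2999</span>
</body></html>""")

# XPaths identified for this page by the recognition model (illustrative values)
identified = {"title": "//h1[@id='name']", "price": "//span[@class='price']"}

extracted = {field: page.xpath(xp)[0].text_content().strip()
             for field, xp in identified.items()}
print(extracted)   # {'title': '4K Smart TV', 'price': '￥2999'}
```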
  • The webpage content identification method of the present application divides the visual features of a webpage block into numerical and non-numerical features to generate feature vectors that the training tool can learn, generates a content recognition model using the training tool, and then uses the recognition model for content recognition, which can further improve the efficiency and accuracy of identifying webpage content.
  • FIG. 17 is a schematic structural diagram of a webpage content identification apparatus according to an embodiment of the present application.
  • the webpage content recognition apparatus 160 includes a data collection module 61, a visual feature acquisition module 62, a data processing module 63, and a model creation module 64.
  • the data collection module 61 can determine at least one training site and collect multiple training web pages within each training site.
  • The visual feature acquisition module 62 can acquire the visual features of the block corresponding to the selected content in each training webpage.
  • the data processing module 63 can perform data processing on the visual features to obtain feature vectors.
  • the model building module 64 can utilize the training tool to build a recognition model of the selected content from the feature vector.
  • the data collection module 61 determines the number of training web pages collected by each training site based on the popularity of the training site.
  • The webpage content recognition apparatus of the present application converts the visual features of a webpage block into feature vectors that the training tool can learn, generates a content recognition model using the training tool, and then uses the recognition model for content recognition, thereby improving the efficiency and accuracy of identifying webpage content.
  • FIG. 18 is a schematic structural diagram of a webpage content identification apparatus according to an embodiment of the present application.
  • the webpage content recognition apparatus 180 includes a data collection module 71, a visual feature acquisition module 72, a data processing module 73, and a model creation module 74.
  • the visual feature acquisition module 72 includes a selection unit 75, a parsing unit 76, and an acquisition unit 77.
  • the selection unit 75 can select the content to be marked in the training webpage.
  • Parsing unit 76 can parse the XPath of the content to be annotated.
  • the obtaining unit 77 can find the visual features of the block corresponding to the selected content according to the XPath.
  • The data processing module 73 includes a numerical feature processing unit 78 that can represent each numerical feature among the visual features by one dimension in the vector.
  • The data processing module 73 also includes a non-numerical feature processing unit 79 that can represent the non-numerical features among the visual features in a one-hot representation.
  • The webpage content recognition apparatus 180 further includes an identification module (not shown) for receiving the feature identifier of a webpage, searching for the webpage to be identified according to the feature identifier, converting the visual features of all the blocks of the webpage to be identified into feature vectors, and then identifying the XPath of the corresponding content in the webpage to be identified by using the recognition model according to the feature vectors of the webpage to be identified.
  • The webpage content recognition apparatus of the present application divides the visual features of a webpage block into numerical and non-numerical features to generate feature vectors that the training tool can learn, generates a content recognition model using the training tool, and then uses the recognition model for content recognition, which can further improve the efficiency and accuracy of identifying webpage content.
  • FIG. 19 is a schematic structural diagram of a server according to an embodiment of the present application.
  • the server 190 includes web page content identifying means.
  • the webpage content recognizing device may be a webpage content recognizing device of various embodiments of the present application, such as webpage content recognizing device 11, device 80, 90, 120, 160, 180, and the like.
  • The webpage content identification method, apparatus, and server of the present application convert the visual features of a webpage block into feature vectors that the training tool can learn, generate a content recognition model using the training tool, and thereby improve the efficiency and accuracy of identifying webpage content.
  • the hardware modules in the embodiments may be implemented in a hardware manner or a hardware platform plus software.
  • the above software includes machine readable instructions stored in a non-volatile storage medium.
  • embodiments can also be embodied as software products.
  • the hardware may be implemented by specialized hardware or hardware that executes machine readable instructions.
  • the hardware can be a specially designed permanent circuit or logic device (such as a dedicated processor such as an FPGA or ASIC) for performing a particular operation.
  • The hardware may also include a programmable logic device or circuit temporarily configured by software (such as a general purpose processor or other programmable processor) to perform particular operations.
  • the machine readable instructions corresponding to the modules in the figures may cause an operating system or the like operating on a computer to perform some or all of the operations described herein.
  • the non-transitory computer readable storage medium may be inserted into a memory provided in an expansion board within the computer or written to a memory provided in an expansion unit connected to the computer.
  • The CPU or the like installed on the expansion board or the expansion unit can perform part or all of the actual operations according to the instructions.
  • the non-transitory computer readable storage medium includes a floppy disk, a hard disk, a magneto-optical disk, an optical disk (such as a CD-ROM, a CD-R, a CD-RW, a DVD-ROM, a DVD-RAM, a DVD-RW, a DVD+RW), and a magnetic tape. , non-volatile memory card and ROM.
  • the program code can be downloaded from the server computer by the communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Transfer Between Computers (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

一种网页内容提取方法，应用于网络设备。该方法包括：确定网页中的多个候选区域，每个候选区域包括在所述网页中位置相邻的一个或多个页面元素(S21)；针对所述多个候选区域中的每个候选区域，提取该候选区域的多个视觉特征的提取值(S22)，视觉特征是网页中人眼可以感知到的特征，视觉特征的提取值是该网页的数据中设置的该视觉特征的值；根据所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域，提取所述目标区域的内容信息(S23)。

Description

网页内容提取方法、装置、存储介质
相关文件
本申请要求于2016年12月9日提交中国专利局、申请号为201611126527.4、申请名称为“一种网页内容提取方法及装置”的中国专利申请、以及于2016年12月16日提交中国专利局、申请号为201611170430.3、申请名称为“网页内容识别方法、装置、服务器”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及计算机技术领域,具体涉及网页内容提取方法、装置、存储介质。
背景
随着互联网规模的不断扩大,网络信息也呈现出指数级的递增。为了方便用户查询网络中感兴趣的内容,网络服务提供者可以提取各网页的内容并存入数据库中,据此为用户提供查询服务。
技术内容
有鉴于此,本申请实施例提供了一种网页内容提取方法及装置,能够节省人力资源,提高提取效率。
本申请实施例提供的网页内容提取方法,应用于网络设备,所述网络设备包括处理器和存储器,所述处理器可以通过执行所述存储器中存储的计算机可读指令实现所述方法,所述方法包括:
确定网页中的多个候选区域，每个候选区域包括在所述网页中位置相邻的一个或多个页面元素；
针对所述多个候选区域中的每个候选区域,提取该候选区域的多个视觉特征的提取值;
根据所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域,提取所述目标区域的内容信息。
本申请实施例提供的网页内容提取装置,包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令可以由所述处理器执行用于:
确定网页中的多个候选区域,每个候选区域包括在所述网页中位置相邻的一个或多个页面元素;
针对所述多个候选区域中的每个候选区域,提取该候选区域的多个视觉特征的提取值;
根据所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域,提取所述目标区域的内容信息。
本申请还提供一种计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令可以由处理器执行用于实现本申请各实施例的方法。
本申请的网页内容识别方法、装置及服务器将网页区块的视觉特征转换为训练工具能学习的特征向量,从而利用训练工具生成内容识别模型,进而能提高识别网页内容的效率、准确度。
附图简要说明
图1是本申请实施例的网页内容提取方法的网络场景示意图;
图2是本申请实施例的网页内容提取方法的流程图;
图3为本申请实施例的网页内容提取方法的流程图;
图4为本申请实施例的提取规则生成方法的流程图;
图5是本申请实施例的网页内容提取方法的应用场景的示意图;
图6是本申请实施例的网页内容提取方法的流程图;
图7是本申请实施例的网页内容提取方法的流程图;
图8是本申请实施例的网页内容提取装置的示意图;
图9是本申请实施例的网页内容提取装置的示意图;
图10为本申请实施例的网页内容提取方法的流程图;
图11为本申请实施例的一种识别模型的生成方法的流程图;
图12为本申请实施例的服务器的结构示意图;
图13为本申请实施例的网页内容识别方法的流程图;
图14为本申请实施例的网页内容识别方法的流程图;
图15为如图14的网页内容识别方法的界面示意图;
图16为本申请实施例的网页内容识别方法的流程图;
图17为本申请实施例的网页内容识别装置的结构示意图;
图18为本申请实施例的网页内容识别装置的结构示意图;
图19为本申请实施例的服务器的结构示意图。
实施本申请的方式
为了描述上的简洁和直观,下文通过描述若干代表性的实施例来对本申请的方案进行阐述。但本文并未示出所有实施方式。实施例中大量的细节仅用于帮助理解本申请的方案,本申请的技术方案实现时可以不局限于这些细节。为了避免不必要地模糊了本申请的方案,一些实施方式没有进行细致地描述,而是仅给出了框架。下文中,“包括”是指“包括但不限于”,“根据……”是指“至少根据……,但不限于仅根据……”。说明书和权利要求书中的“包括”是指某种程度上至少包括,应当解释为除 了包括之后提到的特征外,其它特征也可以存在。
图1是本申请实施例的一个网页内容提取系统的示意图。如图1所示,系统10可以包括网页内容提取设备11、多个网站12-1~12-N,以及网络13。网页内容提取设备11可以通过网络13从多个网站获取网页数据(以下简称网页),按照各实施例的方法提取各网页中的目标内容信息。
网页内容提取设备11可以包括处理器15、通信接口18、待提取网页数据存储装置14,以及提取内容信息存储装置16。
处理器15可以为专用处理器(如FPGA或ASIC),或通用处理器或其它可编程处理器。处理器15可以通过执行内嵌的处理逻辑或者存储器中存储的计算机可读指令来提取网页中的目标内容信息。
网页内容提取设备11可以通过通信接口18从各网站获取网页数据,并将获取到的网页数据存储在待提取网页数据存储装置14中。处理器15提取出的目标内容信息可以存储在提取内容信息存储装置16中,供其它业务处理设备使用。例如,搜索引擎可以在内容信息存储装置16中搜索与用户的检索词匹配的内容信息;应用服务器可以从内容信息存储装置16中查询与用户的标签匹配的内容信息提供给用户。
图2是本申请实施例的网页内容提取方法的流程示意图。该方法20可以由网页内容提取设备11执行。如图2所示,方法20可以包括以下步骤。
步骤S21,确定网页中的多个候选区域,每个候选区域包括在所述网页中位置相邻的一个或多个页面元素。
步骤S22,针对所述多个候选区域中的每个候选区域,提取该候选区域的多个视觉特征的提取值。
步骤S23，根据所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域，提取所述目标区域的内容信息。
视觉特征是指网页中人眼可以感知到的特征,例如,文本的字体、颜色、大小、加粗程度、背景颜色、边框颜色等,图像的前景色、背景色、大小,页面元素的位置等。视觉特征的提取值是指从网页数据中提取出的、该网页的数据中设置的该视觉特征的取值。提取值可以为数值类型的值,也可以是非数值类型的值。数值类型的值即用数字表示的值,例如字体的大小、线条的粗细磅数、图片的尺寸、位置,等。非数值类型的值一般从一个具有多个描述信息的集合中选取,例如字体、是否加粗、是否斜体、是否加框,等。一些实施例中,当候选区域中的多个页面元素的某种视觉特征的值不同,则可以利用预设的规则对多个视觉特征的值进行处理,得到该候选区域的该视觉特征的值。例如,可以按照预定的规则从多个页面元素的该视觉特征的值中选择一个作为该候选区域中该视觉特征的提取值。又例如,可以按照预定的算法(如,求平均、加权平均,等),对这多个值进行计算,将计算结果作为该候选区域的该视觉特征的提取值。
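例如，候选区域内多个页面元素的同一视觉特征值的聚合，可以用如下Python示意代码表达（其中的函数名、聚合规则仅为一种可能实现的示例假设，并非本申请限定的具体实现）：

```python
from collections import Counter

def aggregate_feature(values, numeric=True):
    """把候选区域内多个页面元素的同一视觉特征值聚合为该区域的提取值。

    numeric 为 True 时按预定算法计算（此处示例为求平均）；
    numeric 为 False 时按预定规则选择（此处示例为取出现次数最多的值）。
    """
    if not values:
        return None
    if numeric:
        return sum(values) / len(values)              # 数值型：求平均
    return Counter(values).most_common(1)[0][0]       # 非数值型：取多数

# 使用示例：候选区域内三个页面元素的字体大小与字体颜色
print(aggregate_feature([18, 20, 22], numeric=True))               # 20.0
print(aggregate_feature(["red", "red", "black"], numeric=False))   # red
```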
候选区域是指网页中目标内容可能存在的区域。一些实施例中,可以将网页中的各个区域作为候选区域。例如,可以根据网页的结构,将网页的每个组成区块作为一个候选区域。又例如,当网页元素按树形结构组织时,可以将树结构的某个层级的所有节点或者各层级的所有节点作为候选区域。一些实施例中,还可以根据预设的规则对从网页中提取的区域进行筛选,将筛选后得到的区域作为候选区域。例如,可以将网页中位于预设位置范围内的区域确定为候选区域。该位置范围可以根据多个网页样本中标记的目标区域所在的位置确定。又例如,可以将网页中包括预设的内容标签的区域确定为候选区域。例如,目标内容包括图片时,由于图片在网页数据中的html标签为“img”,则将该网页中包括 “img”标签的区域确定为候选区域。例如,该网页包括多个分块,预设的内容标签为“img”时,则将多个分块中包括html标签“img”的分块作为目标区域。以上仅为几个例子,其它实施例可以根据需要采用其它的筛选方式。
网页中网页元素的超文本标记语言(HyperText Markup Language,HTML)标签和网页元素有很强的关联。例如,图片的HTML标签通常为img。因此,针对目标内容为图片的,可以在目标内容的预设视觉特征中加入HTML标签img。针对候选区域内HTML标签这一特征的得分,其计算过程可参阅其它视觉特征得分的计算过程,此处不再赘述。
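例如，按预设内容标签“img”筛选候选区域的过程，可以用如下基于lxml的Python示意代码表达（网页内容与分块划分方式均为虚构示例，仅用于说明筛选思路）：

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <div id="block1"><p>纯文本分块</p></div>
  <div id="block2"><img src="a.jpg"/><span>99</span></div>
</body></html>
""")

# 以一级 div 作为网页的分块，保留其中包含预设内容标签 img 的分块作为候选区域
candidate_blocks = [div for div in page.xpath("//body/div")
                    if div.xpath(".//img")]

print([div.get("id") for div in candidate_blocks])   # ['block2']
```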
每个候选区域可以包括在所述网页中位置相邻的一个或多个页面元素。页面元素指组成网页的最小单元,例如,一个html标签对应的一项内容可以作为一个页面元素。
目标区域是指利用本申请各实施例的方法确定的要提取的目标内容所在的区域。从目标区域提取的内容信息即目标内容的信息。一些实施例中,提取的内容信息可以是内容本身,也可以是内容的路径,如XPath(Extensible Markup Language Path Language)、JSON-Path(JavaScript Object Notation-Path),等。
根据本申请各实施例,由于网站运营者为了吸引用户注意,往往将关键内容设置得很醒目,如采用鲜艳的颜色、较大的字体等,采用机器利用这些视觉特征来识别并提取目标内容,在节省人力和时间的同时,还提高了内容提取的准确性。
本申请实施例中，可以采用各种方法来根据视觉特征的提取值确定目标区域。例如，可以统计目标内容的各视觉特征的取值范围，根据这些取值范围来确定目标区域。又例如，可以利用标注的网页样本来训练机器学习算法，利用训练好的机器学习算法来识别目标区域，完成目标内容信息的提取。
一些实施例可以统计目标内容的各视觉特征的取值范围,根据这些取值范围来确定目标区域。图3为本申请实施例的网页内容提取方法的流程图。为了行文简练,各实施例中与前文描述过的步骤相似的步骤均被忽略。如图3所示,该方法30可以包括以下步骤。
步骤S31,针对所述多个候选区域中的每个候选区域,提取该候选区域的多个视觉特征的提取值。
步骤S32,根据各视觉特征的提取值和提取规则中各视觉特征的取值范围,计算各候选区域的各视觉特征的得分。
步骤S33,选择多个候选区域中各视觉特征的得分之和最高的候选区域作为目标区域。
一些实施例中,视觉特征的得分可以表示该视觉特征的提取值与提取规则中该视觉特征的取值范围的匹配程度。例如,提取规则中可以包括打分规则,即,取值范围中不同的部分对应不同的分值。一些实施例中,视觉特征的得分可以表示该视觉特征在所有视觉特征中的重要程度。例如,目标内容最主要的视觉特征可以对应较高的分值上限,较次要的视觉特征可以对应低一些的分值上限。比如,字体大小的分值上限大于字体颜色的分值上限时,说明字体颜色对于目标区域的判断的影响力没有字体大小的影响力大。下面给出一种提取规则生成方法的例子。图4为本申请实施例的提取规则生成方法的流程图。如图4所示,该方法40可以包括以下步骤。
步骤S41,从多个网页样本中提取各网页样本中标记的目标区域的多个视觉特征的样本值。
网页样本为预先提取并对其中的目标内容进行了人工标记的网页，用于从中提取目标内容的视觉特征的取值范围。视觉特征的样本值指网页样本中目标内容的视觉特征的值。
步骤S42,针对所述多个视觉特征中的每个视觉特征,利用所述多个网页样本的该视觉特征的样本值确定该视觉特征的取值范围。
对于数值型的样本值,取值范围可以为一个或者多个数值范围。对于非数值型的样本值,取值范围可以为一个离散的集合,其中包括各种可能的取值。还可以将非数值型的样本值利用词向量转换为数值,则取值范围可以是数组的形式。
步骤S43,利用所述多个视觉特征的取值范围生成所述提取规则。
生成的提取规则可以包括步骤S42中得到的各视觉特征的取值范围。
一些实施例中,为了确定各视觉特征对目标内容识别准确度的影响力大小,还可以设置各视觉特征的分值上限。例如,方法40还可以包括:步骤S44,针对所述多个视觉特征中的每个视觉特征,利用多个第二网页样本和该视觉特征的取值范围确定该视觉特征的权重,并将所述权重加入所述提取规则。计算各候选区域的各视觉特征的得分时,当一候选区域的第一视觉特征的提取值在所述提取规则中该第一视觉特征的取值范围内时,可以将该第一候选区域的第一视觉特征的得分设置为提取规则中该第一视觉特征的权重。
这里,视觉特征的权重可以用于计算该视觉特征的分值上限,表示该视觉特征对目标内容识别准确度的影响力。一些实施例中,可以直接将权重用作视觉特征的分值上限。
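例如，提取规则的生成可以用如下Python示意代码表达（数据结构与权重值均为示例假设，权重实际可按步骤S44由第二批网页样本统计得到，此处直接给定）：

```python
def build_extraction_rule(samples, weights):
    """samples: {特征名: [各网页样本中目标区域该特征的样本值, ...]}
    weights: {特征名: 权重}
    返回提取规则：每个视觉特征的取值范围及其权重。"""
    rule = {}
    for name, values in samples.items():
        if all(isinstance(v, (int, float)) for v in values):
            value_range = (min(values), max(values))   # 数值型：数值区间
        else:
            value_range = set(values)                  # 非数值型：离散取值集合
        rule[name] = {"range": value_range, "weight": weights.get(name, 1)}
    return rule

rule = build_extraction_rule(
    {"font_size": [18, 20, 22], "font_color": ["red", "red"]},
    {"font_size": 3, "font_color": 7},
)
print(rule)
```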
图5为本申请实施例的网页内容提取方法的应用场景50的示意图。如图5所示,网页内容提取装置可以先下载待提取网页,然后确定待提取网页中目标内容所在的候选区域,该候选区域可为多个,根据所述目 标内容的预设视觉特征,计算每个候选区域的视觉特征得分,从视觉特征得分最高的候选区域提取所述目标内容。目标内容即需要提取的网页内容,例如网页中的标题、图片、价格等。预设视觉特征可为网页设计者根据人眼获取网页信息的经验,针对目标内容所作出的吸引用户的、突出的设计,预设视觉特征可为目标内容的字体颜色、字体大小、字体加粗程度、背景颜色、边框颜色等信息。
例如,在图5所示的待提取网页中,根据统计数据可知,目标内容的候选区域有三个,则分别计算这三个候选区域的视觉特征得分,选取视觉特征得分最高的候选区域,例如视觉特征得分最高的候选区域为候选区域2,则从候选区域2提取目标内容。即本申请实施例的提取过程依赖统计所得的目标内容的预设视觉特征确定目标内容所在的区域,进而直接从该区域提取目标内容,因而不再需要人工标注每个网页的XPath数据,节省了人力资源,提高了提取效率。
图6是本申请实施例的网页内容提取方法的流程图。如图6所示,方法60可以包括以下步骤。
步骤S61、确定待提取网页中目标内容所在的候选区域;
目标内容属于待提取内容。可以先确定待提取网页的类型,根据待提取网页的类型,找出所统计的对应类型网页的,各个待提取内容在网页中所在的区域的集合。根据该集合确定待提取网页中目标内容所在的候选区域。由于每类网页通常包括多个网页,不同网页的结构、布局可能不同,所以候选区域通常为多个。
步骤S62、根据所述目标内容的预设视觉特征,计算每个所述候选区域的视觉特征得分;
具体实现中,可以根据各个待识别内容对应的预设视觉特征,获取目标内容的预设视觉特征。根据目标内容的预设视觉特征,计算每个候 选区域的视觉特征得分。例如,先计算每个候选区域内存在的,与目标内容的各个预设视觉特征对应的各个视觉特征的得分。例如,可以判断每个候选区域内的所述各个视觉特征,是否与对应的目标内容的各个所述预设视觉特征匹配。例如,可以确定与对应的所述预设视觉特征匹配的视觉特征的得分等于该预设视觉特征的预设分值;确定与对应的所述预设视觉特征不匹配的视觉特征的得分等于零。
上述匹配包括:所述视觉特征与对应的所述预设视觉特征相同,或所述视觉特征的参数属于对应的所述预设视觉特征的参数区间。在实际应用中,具体匹配的判断方法需要视具体视觉特征来确定。例如:针对无法用数值区分的视觉特征,例如字体颜色、边框颜色、字体加粗这些视觉特征,需要判断该视觉特征是否与对应的所述预设视觉特征相同;而对于可以用数值区分的视觉特征,例如字体大小,则需要判断该视觉特征的参数是否属于对应的所述预设视觉特征的参数区间。
然后将每个所述候选区域内的所述各个视觉特征的得分累加,作为每个所述候选区域的视觉特征得分。
针对每个候选区域的视觉特征得分的计算，下面举例进行说明。例如，目标内容为价格，目标内容的预设视觉特征为：字体大小18~22px，以及字体颜色红色。经前面的步骤确认，目标内容的候选区域包括第一候选区域及第二候选区域。第一候选区域内，与目标内容的预设视觉特征对应的视觉特征分别为：字体大小20px，字体颜色红色。字体大小20px这一视觉特征，在对应的预设视觉特征字体大小18~22px（对应的预设分值为3）的参数区间内，二者匹配，则字体大小20px这一视觉特征的得分为3；字体颜色红色这一视觉特征，与对应的预设视觉特征字体颜色红色（对应的预设分值为7）相同，二者匹配，则字体颜色红色这一视觉特征的得分为7，则第一候选区域的视觉特征得分为3+7，即10分。第二候选区域内，与目标内容的预设视觉特征对应的视觉特征分别为：字体大小21px，字体颜色黑色。字体大小21px这一视觉特征，在对应的预设视觉特征字体大小18~22px（对应的预设分值为3）的参数区间内，二者匹配，则字体大小21px这一视觉特征的得分为3；字体颜色黑色这一视觉特征，与对应的预设视觉特征字体颜色红色（对应的预设分值为7）不同，二者不匹配，则字体颜色黑色这一视觉特征的得分为0，则第二候选区域的视觉特征得分为3+0，即3分。
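上述打分过程也可以用如下Python示意代码复现（预设视觉特征、预设分值与候选区域的取值均沿用上例，函数与数据结构仅为示例假设）：

```python
def feature_score(value, preset):
    """单个视觉特征的得分：匹配则取预设分值，否则为 0。
    preset 的 range 为数值区间（元组）或离散取值集合。"""
    value_range = preset["range"]
    if isinstance(value_range, tuple):
        matched = value_range[0] <= value <= value_range[1]
    else:
        matched = value in value_range
    return preset["score"] if matched else 0

presets = {
    "font_size": {"range": (18, 22), "score": 3},
    "font_color": {"range": {"red"}, "score": 7},
}
regions = {
    "第一候选区域": {"font_size": 20, "font_color": "red"},
    "第二候选区域": {"font_size": 21, "font_color": "black"},
}
scores = {name: sum(feature_score(feats[k], presets[k]) for k in presets)
          for name, feats in regions.items()}
print(scores)                        # {'第一候选区域': 10, '第二候选区域': 3}
print(max(scores, key=scores.get))   # 第一候选区域
```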
步骤S63、从视觉特征得分最高的候选区域提取所述目标内容。
本实施例中，视觉特征得分最高的候选区域即所确定的目标内容所在的区域，因此可以直接从视觉特征得分最高的候选区域提取所述目标内容。上面的例子中，即从第一候选区域提取目标内容。
各实施例中,执行上述方法前,可以先通过特征训练得到各个待提取内容的预设视觉特征及各个预设视觉特征的预设分值。预设视觉特征,通常为网页设计者根据人眼获取网页信息的经验,针对待提取内容所作出的吸引用户的、突出的设计,预设视觉特征可为待提取内容的字体颜色、字体大小、字体加粗程度、背景颜色、边框颜色等信息。例如,针对电商类网页,用户往往很容易找到商品的名称、价格、图片等(即待提取信息)信息,这是因为网页设计者在设计网页时,根据人眼获取网页信息的经验(即人的视觉感官对信息特征的敏感度),把重要的信息(例如商品的名称、价格、图片等信息)设计的更吸引用户、更突出。例如,针对商品的价格,把价格字体设计的很大,价格字体的颜色设计的更醒目,甚至把价格字体加粗等。
例如,可以先下载(例如利用webkit下载)各类网页,针对每类网页,渲染各个网页内所有区块的视觉特征,保存人眼能够感知的视觉特 征,这些视觉特征包括但不限于字体颜色、字体大小、字体加粗程度、背景颜色、边框颜色等,然后针对每类视觉特征,做正例的特征统计,以得到待提取内容的预设视觉特征。例如针对商品价格的字体大小这一视觉特征,经统计,商品价格的字体大小一般在18~22px,则与商品价格的字体大小对应的预设视觉特征,可以设置为:字体大小18~22px;再例如,针对商品价格的字体颜色这一视觉特征,经统计,商品价格的字体颜色通常为红色,则与商品价格的字体颜色对应的预设视觉特征,可以设置为:字体颜色红色。
再针对每个预设视觉特征设置一个分值(即预设分值),该分值的具体取值,可以通过对应预设视觉特征对待识别内容的识别贡献度来决定,初始时,贡献度大小可依据经验来确定。例如,通过经验统计得知,针对商品价格这一待识别内容,价格字体大小对识别商品价格的贡献度为30%,价格字体颜色对识别商品价格的贡献度为70%,则商品价格的字体大小对应的预设视觉特征的预设分值,可以设置为3;商品价格的字体颜色对应的预设视觉特征的预设分值,可以设置为7,需要说明的是,此处仅为举例,并不构成对具体实施的限定。
一些实施例中,可以通过人工数据收集,统计出待提取内容在网页中所在的区域的集合。待提取内容可根据实际网页类型自定义,例如,针对电商网页,待提取内容可以是商品的名称、价格、图片等信息;再例如,针对新闻网页,待提取内容可以是标题,图片等信息。先收集各个站点的网页(作为网页样本)。本实施例中,可以从每个站点中选取预设数量的具有代表性的网页。预设数量可根据实际需求自定义。对收集的网页进行视觉特征渲染以便浏览。对收集的网页进行分类(例如电商类,新闻类)。针对每类网页,可以统计不同网页中待提取内容的位置信息。该位置信息可用坐标、宽度、高度组合来表示。位置信息通常 表现为一个区域。合并待提取内容在各个网页中的位置信息,最终形成各个待提取内容在网页中所在的区域的集合。以此类推,可以得到,针对每类网页统计的,各个待提取内容在网页中所在的区域的集合。
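合并各网页中位置信息的一种可能做法如下（以坐标、宽、高表示位置，取各样本区域的最小外接矩形作为合并后的区域，数据为虚构示例，仅为示意）：

```python
def merge_regions(regions):
    """regions: [(x, y, width, height), ...]，返回能覆盖全部样本区域的合并区域。"""
    x1 = min(x for x, _, _, _ in regions)
    y1 = min(y for _, y, _, _ in regions)
    x2 = max(x + w for x, _, w, _ in regions)
    y2 = max(y + h for _, y, _, h in regions)
    return (x1, y1, x2 - x1, y2 - y1)

# 同一类网页中，“价格”在三个样本网页中的位置信息
price_positions = [(600, 200, 120, 30), (580, 220, 150, 40), (620, 180, 100, 30)]
print(merge_regions(price_positions))   # (580, 180, 150, 80)
```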
另一些实施例中,可以由机器辅助网页的标记工作,并由机器来自动提取标记的目标区域的视觉特征的样本值。例如,可以由一标记设备(例如运行有标记应用程序的计算机,或专用设备,等)下载网页样本的网页数据,并展示网页。同时,该标记设备可以在用户界面上提供操作接口,接收对网页中目标区域的选择指令。响应于对目标区域的选择指令,标记设备记录该目标区域的信息,例如XPath。对于已经标记了目标区域的网页,标记设备可以利用记录的目标区域的信息从网页数据中提取目标区域的各视觉特征的样本值。一些实施例中,标记设备还可以利用提取出的各网页样本的各视觉特征的样本值获得目标内容的各视觉特征的取值范围。
一些实施例中,在提取目标内容之后,可以测试所提取的所述目标内容是否准确。如果准确,则保持目标内容的各个预设视觉特征的预设分值不变;如果不准确,可以调整目标内容的各个预设视觉特征的预设分值。调整的时候,可以先固定其他几个预设分值,只调整某一个预设分值,让结果最优。以此类推,最后每一个预设分值都是得到的最优结果。例如,当目标内容为标题时,预设视觉特征包括:字体大小20~24px,以及字体加粗;初始时,针对字体大小20~24px这一预设视觉特征的预设分值为6,针对字体加粗这一预设视觉特征的预设分值为4。调整的时候,可以固定字体加粗这一预设视觉特征的预设分值不变,将字体大小20~24px这一预设视觉特征的预设分值调高或调低,统计将字体大小20~24px这一预设视觉特征的预设分值调高或调低时,对标题提取成功率的影响,如果将字体大小20~24px这一预设视觉特征的预设分值调高, 会提高标题提取成功率,则将字体大小20~24px这一预设视觉特征的预设分值调高,反之,若调高之后导致标题提取成功率降低,则先保持初始设置的分值不变,改而调整字体加粗这一预设视觉特征的预设分值。
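这种“固定其余分值、只调整某一个分值”的调整过程，可以用如下Python示意代码表达（其中的提取成功率评估函数 success_rate 为假设的外部函数，需结合具体测试集实现；步长与轮数也只是示例参数）：

```python
def tune_preset_scores(presets, success_rate, step=1, rounds=3):
    """presets: {特征名: 预设分值}；success_rate(presets) 返回当前分值下的提取成功率。
    每次只调整一个特征的分值（调高或调低 step），若成功率提高则保留，否则回退。"""
    best = success_rate(presets)
    for _ in range(rounds):
        for name in presets:
            for delta in (step, -step):
                presets[name] += delta
                rate = success_rate(presets)
                if rate > best:
                    best = rate              # 成功率提高，保留本次调整
                else:
                    presets[name] -= delta   # 否则回退，保持原分值
    return presets, best

# 使用示例（success_rate 此处用一个玩具函数代替，仅为演示调整流程）
demo = lambda p: 1.0 - abs(p["font_size"] - 5) * 0.1 - abs(p["font_bold"] - 5) * 0.1
print(tune_preset_scores({"font_size": 6, "font_bold": 4}, demo))
```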
本实施例中,先确定待提取网页中目标内容所在的候选区域,然后根据所述目标内容的预设视觉特征,计算每个所述候选区域的视觉特征得分,最后从视觉特征得分最高的候选区域提取所述目标内容,即本实施例的提取过程,依据网页设计者根据人眼获取网页信息的经验,针对目标内容所作出的吸引用户的、突出的设计(即目标内容的预设视觉特征),确定目标内容所在的区域,进而直接从该区域提取目标内容,因而不再需要人工标注每个网页的XPath数据,节省了人力资源,提高了提取效率。
图7是本申请实施例的网页内容提取方法的流程图。如图7所示,该方法70可以包括以下步骤。
步骤S71、根据预先统计的网页中各个待提取内容所在的区域的集合,确定目标内容在待提取网页中的候选区域。
步骤S72、判断每个候选区域内的各个视觉特征，是否与对应的目标内容的各个预设视觉特征匹配。当某个视觉特征与对应的预设视觉特征匹配时，执行步骤S73；而当某个视觉特征与对应的预设视觉特征不匹配时，执行步骤S74。
步骤S73、确定候选区域内的该视觉特征的得分，等于对应的预设视觉特征的预设分值。
步骤S74、确定候选区域内的该视觉特征的得分，等于零；
具体实现中,可以根据训练所得的各个待识别内容对应的预设视觉特征,获取目标内容的预设视觉特征,根据目标内容的预设视觉特征,计算每个候选区域的视觉特征得分。
步骤S75、将每个候选区域内的各个视觉特征的得分累加,作为每个候选区域的视觉特征得分;
步骤S76、从视觉特征得分最高的候选区域提取目标内容;
步骤S77、测试所提取的目标内容是否准确;
步骤S78、根据测试结果调整目标内容的各个预设视觉特征的预设分值。
图8为本申请实施例的一种网页内容提取装置的示意图。该装置80可以设置在网页内容提取设备11中,例如以计算机可读指令的形式存储在网页内容提取设备11的存储器中。该网页内容提取装置80可以包括:确定单元21、计算单元22及提取单元23。
确定单元21可以确定待提取网页中目标内容所在的候选区域。
计算单元22可以根据所述目标内容的预设视觉特征,计算每个所述候选区域的视觉特征得分。
一些实施例中，计算单元22可以根据训练所得的各个待识别内容对应的预设视觉特征，获取目标内容的预设视觉特征，根据目标内容的预设视觉特征，计算每个候选区域的视觉特征得分。计算单元22可以包括第一计算单元及第二计算单元。
第一计算单元可以先计算每个候选区域内存在的,与目标内容的各个预设视觉特征对应的各个视觉特征的得分。一些例子中,第一计算单元可以包括判断子单元和确定子单元。判断子单元可以判断每个候选区域内的所述各个视觉特征,是否与对应的目标内容的各个所述预设视觉特征匹配。确定子单元确定与对应的所述预设视觉特征匹配的视觉特征的得分,等于对应的所述预设视觉特征的预设分值;确定与对应的所述预设视觉特征不匹配的视觉特征的得分,等于零。
第二计算单元可以将每个所述候选区域内的所述各个视觉特征的得分累加，作为每个所述候选区域的视觉特征得分。
提取单元23可以从视觉特征得分最高的候选区域提取所述目标内容。
一些实施例中,装置80还可以包括测试单元和调整单元。在提取单元23提取目标内容之后,测试单元可以测试所提取的所述目标内容是否准确;如果准确,则保持目标内容的各个预设视觉特征的预设分值不变;如果不准确,则调整单元可以调整目标内容的各个预设视觉特征的预设分值。调整的时候,可以先固定其他几个预设分值,只调整某一个预设分值,让结果最优,以此类推,最后每一个预设分值都是得到的最优结果。
图9为本申请实施例的一种网页内容提取装置的示意图。该装置90可以包括一个或者一个以上处理核心的处理器31、一个或一个以上计算机可读存储介质的存储器32、射频(Radio Frequency,RF)电路33、电源34、输入单元35、以及显示单元36等部件。本领域技术人员可以理解,本申请各设备的示意图中示出的结构并不构成对装置的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
存储器32可用于存储软件程序以及模块。处理器31通过运行存储在存储器32的软件程序以及模块（例如装置80的各模块），从而执行各实施例的方法。一些实施例中，存储器32可以包括高速随机存取存储器，还可以包括非易失性存储器，例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。
一些实施例可以利用预先训练好的机器学习算法来识别目标区域，完成目标内容信息的提取。图10为本申请实施例的网页内容提取方法的流程图。如图10所示，该方法100可以包括以下步骤。
步骤S81,针对所述多个候选区域中的每个候选区域,利用该候选区域的各视觉特征的提取值生成一个提取向量;
步骤S82,利用预设的识别模型根据各候选区域的提取向量确定所述目标区域。
提取向量是指将各视觉特征的提取值按照预设的顺序组成的数组。一些例子中,数值型的提取值可以直接加入提取向量。另一些例子中,可以对提取值进行处理,以转换为表示值,再将表示值加入提取向量。例如,一些实施例的步骤S81可以包括,针对一个候选区域的每个视觉特征,将该视觉特征的提取值映射到一表示值,该表示值在预设的对应关系中与该提取值所属的预设取值范围相对应。将该候选区域的各视觉特征的表示值按照预设的顺序组织为所述提取向量。例如,对于数值型提取值,不同的取值范围对应不同的预设值(即表示值),如0-127对应1,128-255对应2,等。对于非数值型的提取值,可以利用词向量将非数值型的提取值转换为数值形式的表示值。
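例如，把提取值映射为表示值并按预设顺序组织为提取向量的过程，可以用如下Python示意代码表达（区间划分、词表与特征顺序均为示例假设）：

```python
def to_representation(value, bins=None, vocab=None):
    """数值型提取值按预设取值范围映射为表示值；非数值型按词表映射为编号。"""
    if bins is not None:                       # 数值型：如 0-127 -> 1，128-255 -> 2
        for idx, (lo, hi) in enumerate(bins, start=1):
            if lo <= value <= hi:
                return idx
        return 0
    return vocab.get(value, 0)                 # 非数值型：词表中的编号

SIZE_BINS = [(0, 127), (128, 255)]
COLOR_VOCAB = {"red": 1, "black": 2}

def to_vector(region_features):
    """按照预设顺序把候选区域各视觉特征的表示值组织为提取向量。"""
    return [
        to_representation(region_features["font_size"], bins=SIZE_BINS),
        to_representation(region_features["font_color"], vocab=COLOR_VOCAB),
    ]

print(to_vector({"font_size": 20, "font_color": "red"}))   # [1, 1]
```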
上述识别模型可以通过使用标注的网页样本来训练机器学习算法得到。图11为本申请实施例的一种识别模型的生成方法的流程图。如图11所示,该方法110可以包括以下步骤。
步骤S91,从多个网页样本中提取各网页样本中标记的目标区域的多个视觉特征的样本值。
步骤S92,利用所述多个网页样本中每个网页样本的各视觉特征的样本值生成一个样本向量。
步骤S93,通过利用所述多个网页样本的样本向量训练机器学习算法来生成所述识别模型。
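本申请未限定具体的机器学习算法，下面以scikit-learn中的逻辑回归为例，给出利用样本向量训练识别模型并在提取阶段使用的示意代码（样本数据为虚构，仅用于说明训练与识别的流程）：

```python
from sklearn.linear_model import LogisticRegression

# 样本向量：每行是一个区域各视觉特征的表示值，标签 1 表示被标记的目标区域
X = [
    [1, 1, 20],   # 字号区间、颜色编号、区块高度等表示值（示例）
    [1, 2, 18],
    [2, 2, 40],
    [2, 1, 38],
]
y = [1, 1, 0, 0]

model = LogisticRegression()
model.fit(X, y)                        # 训练得到识别模型

# 提取阶段：对各候选区域的提取向量打分，选概率最高者作为目标区域
candidates = [[1, 1, 19], [2, 2, 41]]
probs = model.predict_proba(candidates)[:, 1]
print(int(probs.argmax()))             # 0，即第一个候选区域为目标区域
```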
本申请各实施例所提供的网页内容识别方法，可应用于如图12所示的服务器120。如图12所示，服务器120包括：存储器41、处理器42以及网络模块43。
存储器41可用于存储软件程序以及模块,如本申请实施例中的网页内容识别方法及系统对应的程序指令/模块。处理器42通过运行存储在存储器41内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现本申请实施例中的网页内容识别方法及系统。存储器41可包括高速随机存储器,还可包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器41可进一步包括相对于处理器42远程设置的存储器,这些远程存储器可以通过网络连接至服务器。一些实施例中,上述软件程序以及模块还可包括:操作系统45以及服务模块46。其中操作系统45,例如可为LINUX、UNIX、WINDOWS,其可包括各种用于管理系统任务(例如内存管理、存储设备控制、电源管理等)的软件组件和/或驱动,并可与各种硬件或软件组件相互通讯,从而提供其他软件组件的运行环境。服务模块46运行在操作系统45的基础上,并通过操作系统45的网络服务监听来自网络的请求,根据请求完成相应的数据处理,并返回处理结果给终端。也就是说,服务模块46用于向终端提供网络服务。
网络模块43用于接收以及发送网络信号。上述网络信号可包括无线信号或者有线信号。在一个实例中,上述网络信号为有线网络信号。
图13为本申请实施例提供的网页内容识别方法的流程图。本实施例为服务器通过网络所执行的网页内容识别方法。如图13所示,本实施例的网页内容识别方法130可包括以下步骤。
步骤S101:确定至少一个训练站点,并在每个训练站点内采集多个训练网页。
一些实施例中，可以但不限于根据训练站点的流行度确定每个训练站点采集的训练网页的数量，越流行的站点采集的训练网页的数量越多，从而使得训练工具能学习到访问量大的网页的内容对应的视觉特征，进而增加网页识别的准确率。
步骤S102:获取每个训练网页内被选定的内容对应的区块的视觉特征。
一些实施例中,区块的视觉特征即为能够表示该网页区块视觉层面的主要特征,其可以但不限于为区块的长、宽、高、区块字体大小、网页标签等等。
步骤S103:对视觉特征进行数据处理得到特征向量(即,样本向量)。
为了得到可被训练工具识别的特征向量,需对视觉特征进行处理。具体地,若视觉特征包括数值型特征,则在向量中占一位表示一种数值型特征。具体可以是:对于每一种数值型特征进行数值统计,再等量的划分成若干份,例如10份,分别映射到0~0.1,0.1~0.2,0.2~0.3,0.3~0.4,0.4~0.5,0.5~0.6,0.6~0.7,0.7~0.8,0.8~0.9,0.9~1.0这10个区间中。
若视觉特征包括非数值型特征,则以横向的one-hot representation模式表示非数值型特征。其中,one-hot representation是一种最简单的词向量表示方式,即用一个长向量来表示一个词,向量的长度为词典的大小,向量的分量只有一个“1”,其它全为“0”,“1”的位置对应该词在词典中的位置。
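例如，上述数值型特征的等量分段与非数值型特征的one-hot表示，可以用如下Python示意代码实现（分段方式与词典内容均为示例假设）：

```python
import numpy as np

def numeric_to_bin(values):
    """对一种数值型特征做数值统计，等量划分为 10 份，
    并把每个取值映射到 0~0.1、0.1~0.2、……、0.9~1.0 这 10 个区间（此处返回区间上界）。"""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, 11))    # 11 个分位点 -> 10 份
    idx = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, 9)
    return (idx + 1) / 10.0

def one_hot(value, dictionary):
    """非数值型特征的 one-hot 表示：向量长度为词典大小，只有对应位置为 1。"""
    vec = [0] * len(dictionary)
    vec[dictionary.index(value)] = 1
    return vec

print(numeric_to_bin([12, 14, 16, 18, 20, 22, 24, 26, 28, 30]))
print(one_hot("red", ["black", "red", "blue"]))    # [0, 1, 0]
```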
步骤S104:利用训练工具根据特征向量建立被选定的内容的识别模型。
一些实施例中,训练工具可以但不限于为迭代的决策树(Gradient Boosting Decision Tree,GBDT)训练工具,也可以为线性回归训练工具等其它机器训练工具。
一些实施例中，根据特征向量建立被选定的内容的识别模型即建立网页的特征向量与网页内容例如标题、价格等等之间的对应关系。
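例如，若采用前述GBDT一类的训练工具，可以用scikit-learn中的GradientBoostingClassifier给出一个示意性的建模流程（特征向量与标签均为虚构示例，参数仅为演示取值，并非本申请限定的实现）：

```python
from sklearn.ensemble import GradientBoostingClassifier

# 每行是一个网页区块的特征向量（数值型分段值 + 非数值型 one-hot），
# 标签 1 表示该区块是被选定的内容（例如标题）所在的区块
X = [
    [0.9, 1, 0, 0],
    [0.8, 1, 0, 0],
    [0.2, 0, 1, 0],
    [0.1, 0, 0, 1],
]
y = [1, 1, 0, 0]

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=2)
gbdt.fit(X, y)      # 建立特征向量与“是否为被选定内容”之间的对应关系

print(gbdt.predict([[0.85, 1, 0, 0]]))   # [1]
```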
本申请的网页内容识别方法将网页区块的视觉特征转换为训练工具能学习的特征向量,从而利用训练工具生成内容识别模型,从而能提高识别网页内容的效率、准确度。
图14为本申请实施例的网页内容识别方法的流程示意图。图15为如图14所示的网页内容识别方法的界面150示意图。如图14与图15所示,网页内容识别方法140可以包括以下步骤。
步骤S111:选定训练网页内需标注的内容。
如图15所示,在界面150中,可以手动选定训练网页内需标注的内容例如标题55等等。
步骤S112:解析需标注的内容的XPath。
一些实施例中，在预览XPath按钮51接收到触发信号时，标注程序就会解析出其XPath，并将XPath在XPath显示区域52进行显示；当然，标注程序也可以自动触发解析出其XPath后直接发送给后台。
一些实施例中,在需要对多种内容进行标注时,需要在属性输入区53输入内容的属性例如“标题”,并将内容的属性与其XPath对应存储起来。
步骤S113:根据XPath查找被选定的内容对应的区块的视觉特征。
一些实施例中,由于网页内每个区块的XPath是唯一的,因此,根据需标注的内容的XPath就可以查找到解析后存储起来的对应区块的全部视觉特征。
一些实施例中，webkit作为一个无界面浏览器的内核，具有解析层叠样式表（Cascading Style Sheets，CSS）并自动渲染界面的功能。因此，可以利用webkit的上述功能提取对应的区块的视觉信息，再利用特征工程的方法对视觉信息进行加工处理，得到视觉特征后存储起来，以备查找。
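下面给出一个简化的Python示意代码：用lxml按XPath定位被标注内容对应的区块，并从其内联样式中读出部分视觉特征。真实实现中可借助webkit等无界面浏览器得到渲染后的完整CSS，这里的网页内容、XPath与样式解析方式均为示例假设，仅演示“按XPath查找区块并读取视觉特征”的流程：

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <div id="main">
    <h1 style="font-size:24px;font-weight:bold">某商品标题</h1>
    <span style="font-size:20px;color:red">199</span>
  </div>
</body></html>
""")

xpath = '//*[@id="main"]/h1'             # 标注程序解析出的需标注内容的 XPath
block = page.xpath(xpath)[0]

# 从内联样式中解析该区块的视觉特征（示意：真实场景需解析渲染后的 CSS）
style = dict(item.split(":") for item in block.get("style", "").split(";") if item)
visual_features = {"tag": block.tag, "text": block.text_content(), **style}
print(visual_features)
```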
其中,可以通过上述方法140获取每个训练网页内被选定的内容对应的区块的视觉特征,也可以选定训练网页内需标注的内容例如标题,然后直接解析得到选定的内容对应的区块的视觉特征。
本申请的网页内容识别方法根据被选定内容的XPath获取被选定的内容对应的区块的视觉特征,并将网页区块的视觉特征转换为训练工具能学习的特征向量,从而利用训练工具生成内容识别模型,从而能进一步提高识别网页内容的效率、准确度。
图16为本申请实施例的网页内容识别方法的流程示意图。本实施例为服务器通过网络所执行的网页内容识别方法。如图16所示，本实施例的网页内容识别方法160可包括以下步骤。
步骤S121:确定至少一个训练站点,并在每个训练站点内采集多个训练网页;
步骤S122:获取每个训练网页内被选定的内容对应的区块的视觉特征。
步骤S123:对视觉特征进行数据处理得到特征向量。
步骤S124:利用训练工具根据特征向量建立被选定的内容的识别模型。
一些实施例中,根据特征向量建立被选定的内容的识别模型即建立网页的特征向量与网页内容例如标题、价格等等之间的对应关系。
步骤S125:接收网页的特征标识,并根据特征标识查找到待识别网页。
其中,特征标识具体可以是统一资源定位符(Uniform Resource Locator,URL)或名称等,特征标识用于唯一标识一个网页。
一些实施例中,可以是用户通过提供的交互界面向服务器提交待识 别网页的特征标识,也可以是其它服务器、业务平台等向服务器提交待识别网页的特征标识。可以向服务器一次提交一个待识别网页的特征标识,也可以向服务器一次提交多个待识别网页的特征标识以进行批量处理,服务器基于特征标识确定需进行内容识别的待识别网页。
步骤S126:将待识别网页的所有区块的视觉特征转换为特征向量。
步骤S127:利用识别模型根据待识别网页的特征向量识别出待识别网页中相应的内容的XPath。
一些实施例中,如果识别模型包括多种内容例如标题、价格的特征向量与其XPath的关系,则输入相应内容的属性例如“标题”以利用识别模型识别出标题的XPath。
一些实施例中,网页内容识别方法160还可以包括以下步骤。
步骤S128:根据待识别网页中相应的内容的XPath抽取待识别网页的相应内容。
一些实施例中，抽取出的待识别网页的相应内容可以但不限于用作统计分析的数据，例如，抽取待识别网页的标题和价格，可以用于检测商品的价格趋势等。
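例如，识别模型给出相应内容的XPath之后，可以用如下Python示意代码完成抽取（网页内容与XPath均为虚构示例）：

```python
from lxml import html

page_source = """
<html><body>
  <h1 class="title">示例商品</h1>
  <span class="price">199</span>
</body></html>
"""
tree = html.fromstring(page_source)

# 识别模型输出的各内容属性对应的 XPath（示例）
xpaths = {"标题": '//h1[@class="title"]', "价格": '//span[@class="price"]'}

extracted = {name: tree.xpath(xp)[0].text_content().strip()
             for name, xp in xpaths.items()}
print(extracted)   # {'标题': '示例商品', '价格': '199'}
```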
本申请的网页内容识别方法将网页区块的视觉特征分为数值型和非数值型特征分别进行转换,以生成训练工具能学习的特征向量,从而利用训练工具生成内容识别模型,进而利用识别模型进行内容识别,能进一步提高识别网页内容的效率、准确度。
图17为本申请实施例的网页内容识别装置的结构示意图。如图17所示,网页内容识别装置160包括数据采集模块61、视觉特征获取模块62、数据处理模块63、模型建立模块64。
数据采集模块61可以确定至少一个训练站点，并在每个训练站点内采集多个训练网页。视觉特征获取模块62可以获取每个训练网页内被选定的内容对应的视觉特征。数据处理模块63可以对视觉特征进行数据处理得到特征向量。模型建立模块64可以利用训练工具根据特征向量建立被选定的内容的识别模型。
一些实施例中,数据采集模块61根据训练站点的流行度确定每个训练站点采集的训练网页的数量。
本申请的网页内容识别装置将网页区块的视觉特征转换为训练工具能学习的特征向量,从而利用训练工具生成内容识别模型,进而利用识别模型进行内容识别,能提高识别网页内容的效率、准确度。
图18为本申请实施例的网页内容识别装置的结构示意图。如图18所示,网页内容识别装置180包括数据采集模块71、视觉特征获取模块72、数据处理模块73、模型建立模块74。
一些实施例中,视觉特征获取模块72包括选定单元75、解析单元76、获取单元77。选定单元75可以选定训练网页内需标注的内容。解析单元76可以解析需标注的内容的XPath。获取单元77可以根据XPath查找被选定的内容对应的区块的视觉特征。
一些实施例中,数据处理模块73包括数值型特征处理单元78,可以将视觉特征中的数值型特征在向量中占一位表示。
一些实施例中,数据处理模块73包括非数值型特征处理单元79,可以将视觉特征中的非数值型特征以横向的one-hot representation模式表示。
一些实施例中,网页内容识别装置180还包括识别模块(图中未示出),其用于接收网页的特征标识,并根据特征标识查找到待识别网页,并将待识别网页的所有区块的视觉特征转换为特征向量后,利用识别模型根据待识别网页的特征向量识别出待识别网页中相应的内容的XPath。
本申请的网页内容识别装置将网页区块的视觉特征分为数值型和非数值型特征分别进行转换,以生成训练工具能学习的特征向量,从而利用训练工具生成内容识别模型,进而利用识别模型进行内容识别,能进一步提高识别网页内容的效率、准确度。
图19为本申请实施例的服务器的结构示意图。如图19所示，服务器190包括网页内容识别装置。网页内容识别装置可以是本申请各实施例的网页内容识别装置，例如网页内容识别设备11、装置80、90、120、160、180等。
本申请的网页内容识别方法、装置及服务器将网页区块的视觉特征转换为训练工具能学习的特征向量,从而利用训练工具生成内容识别模型,从而能提高识别网页内容的效率、准确度。
需要说明的是,上述各流程和各结构图中不是所有的步骤和模块都是必须的,可以根据实际的需要忽略某些步骤或模块。各步骤的执行顺序不是固定的,可以根据需要进行调整。各模块的划分仅仅是为了便于描述采用的功能上的划分,实际实现时,一个模块可以分由多个模块实现,多个模块的功能也可以由同一个模块实现,这些模块可以位于同一个设备中,也可以位于不同的设备中。另外,上面描述中采用“第一”、“第二”仅仅为了方便区分具有同一含义的两个对象,并不表示其有实质的区别。
各实施例中的硬件模块可以以硬件方式或硬件平台加软件的方式实现。上述软件包括机器可读指令,存储在非易失性存储介质中。因此,各实施例也可以体现为软件产品。
各例中，硬件可以由专门的硬件或执行机器可读指令的硬件实现。例如，硬件可以为专门设计的永久性电路或逻辑器件（如专用处理器，如FPGA或ASIC）用于完成特定的操作。硬件也可以包括由软件临时配置的可编程逻辑器件或电路（如包括通用处理器或其它可编程处理器）用于执行特定操作。
图中的模块对应的机器可读指令可以使计算机上操作的操作系统等来完成这里描述的部分或者全部操作。非易失性计算机可读存储介质可以是插入计算机内的扩展板中所设置的存储器中或者写到与计算机相连接的扩展单元中设置的存储器。安装在扩展板或者扩展单元上的CPU等可以根据指令执行部分和全部实际操作。
非易失性计算机可读存储介质包括软盘、硬盘、磁光盘、光盘(如CD-ROM、CD-R、CD-RW、DVD-ROM、DVD-RAM、DVD-RW、DVD+RW)、磁带、非易失性存储卡和ROM。可选择地,可以由通信网络从服务器计算机上下载程序代码。
综上所述,权利要求的范围不应局限于以上描述的例子中的实施方式,而应当将说明书作为一个整体并给予最宽泛的解释。

Claims (21)

  1. 一种网页内容提取方法,应用于网络设备,所述网络设备包括处理器和存储器,所述处理器可以通过执行所述存储器中存储的计算机可读指令实现所述方法,所述方法包括:
    确定网页中的多个候选区域,每个候选区域包括在所述网页中位置相邻的一个或多个页面元素;
    针对所述多个候选区域中的每个候选区域,提取该候选区域的多个视觉特征的提取值,视觉特征是网页中人眼可以感知到的特征,视觉特征的提取值是该网页的数据中设置的该视觉特征的值;
    根据所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域,提取所述目标区域的内容信息。
  2. 根据权利要求1所述的方法,根据所述多个候选区域的所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域包括:
    根据所述提取值和所述提取规则中各视觉特征的取值范围,计算各候选区域的各视觉特征的得分;
    选择所述多个候选区域中各视觉特征的得分之和最高的候选区域作为所述目标区域。
  3. 根据权利要求1所述的方法,根据所述多个候选区域的所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域包括:
    针对所述多个候选区域中的每个候选区域,利用该候选区域的各视觉特征的提取值生成一个提取向量;
    利用预设的识别模型根据各候选区域的提取向量确定所述目标区域。
  4. 根据权利要求2所述的方法,进一步包括:
    从多个网页样本中提取各网页样本中标记的目标区域的多个视觉特征的样本值;
    针对所述多个视觉特征中的每个视觉特征,利用所述多个网页样本的该视觉特征的样本值确定该视觉特征的取值范围;
    利用所述多个视觉特征的取值范围生成所述提取规则。
  5. 根据权利要求4所述的方法,进一步包括:
    针对所述多个视觉特征中的每个视觉特征,利用多个第二网页样本和该视觉特征的取值范围确定该视觉特征的权重,并将所述权重加入所述提取规则;
    根据所述提取规则中各视觉特征的取值范围,计算各候选区域的各视觉特征的得分包括:
    针对所述多个候选区域中的第一候选区域的第一视觉特征,当所述候选区域的第一视觉特征的提取值在所述提取规则中所述第一视觉特征的取值范围内时,将所述第一候选区域的第一视觉特征的得分设置为所述第一视觉特征的权重。
  6. 根据权利要求3所述的方法,进一步包括:
    从多个网页样本中提取各网页样本中标记的目标区域的多个视觉特征的样本值;
    利用所述多个网页样本中每个网页样本的各视觉特征的样本值生成一个样本向量;
    通过利用所述多个网页样本的样本向量训练机器学习算法来生成所述识别模型。
  7. 根据权利要求3所述的方法，利用该候选区域的各视觉特征的提取值生成一个提取向量包括：
    针对该候选区域的每个视觉特征,将该视觉特征的提取值映射到一表示值,该表示值在预设的对应关系中与该提取值所属的预设取值范围相对应;
    将该候选区域的各视觉特征的表示值按照预设的顺序组织为所述提取向量。
  8. 根据权利要求1所述的方法,确定网页中的多个候选区域包括:
    确定所述网页中位于预设位置范围内的多个区域为所述多个候选区域。
  9. 根据权利要求8所述的方法,进一步包括:
    根据多个网页样本中标记的目标区域所在的位置确定所述位置范围。
  10. 根据权利要求1所述的方法,确定网页中的多个候选区域包括:
    确定所述网页中包括预设的内容标签的多个区域为所述多个候选区域。
  11. 一种网页内容提取设备,包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令可以由所述处理器执行用于:
    确定网页中的多个候选区域,每个候选区域包括在所述网页中位置相邻的一个或多个页面元素;
    针对所述多个候选区域中的每个候选区域，提取该候选区域的多个视觉特征的提取值，视觉特征是网页中人眼可以感知到的特征，视觉特征的提取值是该网页的数据中设置的该视觉特征的值；
    根据所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域,提取所述目标区域的内容信息。
  12. 根据权利要求11所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    根据所述提取值和所述提取规则中各视觉特征的取值范围,计算各候选区域的各视觉特征的得分;
    选择所述多个候选区域中各视觉特征的得分之和最高的候选区域作为所述目标区域。
  13. 根据权利要求11所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    针对所述多个候选区域中的每个候选区域,利用该候选区域的各视觉特征的提取值生成一个提取向量;
    利用预设的识别模型根据各候选区域的提取向量确定所述目标区域。
  14. 根据权利要求12所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    从多个网页样本中提取各网页样本中标记的目标区域的多个视觉特征的样本值;
    针对所述多个视觉特征中的每个视觉特征,利用所述多个网页样本的该视觉特征的样本值确定该视觉特征的取值范围;
    利用所述多个视觉特征的取值范围生成所述提取规则。
  15. 根据权利要求14所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    针对所述多个视觉特征中的每个视觉特征,利用多个第二网页样本和该视觉特征的取值范围确定该视觉特征的权重,并将所述权重加入所述提取规则;
    针对所述多个候选区域中的第一候选区域的第一视觉特征，当所述候选区域的第一视觉特征的提取值在所述提取规则中所述第一视觉特征的取值范围内时，将所述第一候选区域的第一视觉特征的得分设置为所述第一视觉特征的权重。
  16. 根据权利要求13所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    从多个网页样本中提取各网页样本中标记的目标区域的多个视觉特征的样本值;
    利用所述多个网页样本中每个网页样本的各视觉特征的样本值生成一个样本向量;
    通过利用所述多个网页样本的样本向量训练机器学习算法来生成所述识别模型。
  17. 根据权利要求13所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    针对该候选区域的每个视觉特征,将该视觉特征的提取值映射到一表示值,该表示值在预设的对应关系中与该提取值所属的预设取值范围相对应;
    将该候选区域的各视觉特征的表示值按照预设的顺序组织为所述提取向量。
  18. 根据权利要求11所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    确定所述网页中位于预设位置范围内的多个区域为所述多个候选区域。
  19. 根据权利要求18所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    根据多个网页样本中标记的目标区域所在的位置确定所述位置范围。
  20. 根据权利要求11所述的设备,所述计算机可读指令可以由所述处理器执行用于:
    确定所述网页中包括预设的内容标签的多个区域为所述多个候选区域。
  21. 一种计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令可以由处理器执行用于:
    确定网页中的多个候选区域,每个候选区域包括在所述网页中位置相邻的一个或多个页面元素;
    针对所述多个候选区域中的每个候选区域,提取该候选区域的多个视觉特征的提取值;
    根据所述多个视觉特征的提取值确定所述多个候选区域中符合提取规则的目标区域,提取所述目标区域的内容信息。
PCT/CN2017/112866 2016-12-09 2017-11-24 网页内容提取方法、装置、存储介质 WO2018103540A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/359,224 US11074306B2 (en) 2016-12-09 2019-03-20 Web content extraction method, device, storage medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201611126527.4A CN107741942B (zh) 2016-12-09 2016-12-09 一种网页内容提取方法及装置
CN201611126527.4 2016-12-09
CN201611170430.3A CN108205544A (zh) 2016-12-16 2016-12-16 网页内容识别方法、装置、服务器
CN201611170430.3 2016-12-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/359,224 Continuation US11074306B2 (en) 2016-12-09 2019-03-20 Web content extraction method, device, storage medium

Publications (1)

Publication Number Publication Date
WO2018103540A1 true WO2018103540A1 (zh) 2018-06-14

Family

ID=62491724

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/112866 WO2018103540A1 (zh) 2016-12-09 2017-11-24 网页内容提取方法、装置、存储介质

Country Status (2)

Country Link
US (1) US11074306B2 (zh)
WO (1) WO2018103540A1 (zh)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8505074B2 (en) * 2008-11-21 2013-08-06 Sharp Laboratories Of America, Inc. Selective web content controls for MFP web pages across firewalls
CN102768663A (zh) 2011-05-05 2012-11-07 腾讯科技(深圳)有限公司 一种网页标题的提取方法、装置及信息处理系统
CN102799647B (zh) * 2012-06-30 2015-01-21 华为技术有限公司 网页去重方法和设备
CN103870486A (zh) 2012-12-13 2014-06-18 深圳市世纪光速信息技术有限公司 确定网页类型的方法和装置
CN104346405B (zh) 2013-08-08 2018-05-22 阿里巴巴集团控股有限公司 一种从网页中抽取信息的方法及装置
US9274693B2 (en) 2013-10-16 2016-03-01 3M Innovative Properties Company Editing digital notes representing physical notes
JP6350069B2 (ja) * 2014-07-22 2018-07-04 富士ゼロックス株式会社 情報処理システム、情報処理装置およびプログラム

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050066269A1 (en) * 2003-09-18 2005-03-24 Fujitsu Limited Information block extraction apparatus and method for Web pages
CN101515272A (zh) * 2008-02-18 2009-08-26 株式会社理光 提取网页内容的方法和装置
CN103488746A (zh) * 2013-09-22 2014-01-01 成都锐理开创信息技术有限公司 一种获取业务信息的方法及装置
CN105631008A (zh) * 2015-12-28 2016-06-01 深圳市万普拉斯科技有限公司 网页的显示方法和系统

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569119A (zh) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 一种基于多模态机器学习的新闻网页正文抽取系统及方法
CN113688302A (zh) * 2021-08-30 2021-11-23 百度在线网络技术(北京)有限公司 页面数据分析方法、装置、设备和介质
CN113688302B (zh) * 2021-08-30 2024-03-19 百度在线网络技术(北京)有限公司 页面数据分析方法、装置、设备和介质

Also Published As

Publication number Publication date
US20190220488A1 (en) 2019-07-18
US11074306B2 (en) 2021-07-27

Similar Documents

Publication Publication Date Title
US10642892B2 (en) Video search method and apparatus
US11487844B2 (en) System and method for automatic detection of webpage zones of interest
KR101523450B1 (ko) 관련어 등록 장치, 관련어 등록 방법, 기록 매체 및, 관련어 등록 시스템
CN110110075A (zh) 网页分类方法、装置以及计算机可读存储介质
JP2013531289A (ja) 検索におけるモデル情報群の使用
CN112434691A (zh) 基于智能解析识别的hs编码匹配、展示方法、系统及存储介质
CN103294781A (zh) 一种用于处理页面数据的方法与设备
US20160292275A1 (en) System and method for extracting and searching for design
WO2015066891A1 (en) Systems and methods for extracting and generating images for display content
CN102193946A (zh) 为媒体文件添加标签方法和使用该方法的系统
CN111192176A (zh) 一种支持教育信息化评估的在线数据采集方法及装置
CN104899306A (zh) 信息处理方法、信息显示方法及装置
JP2019032704A (ja) 表データ構造化システムおよび表データ構造化方法
WO2018103540A1 (zh) 网页内容提取方法、装置、存储介质
CN106446123A (zh) 一种网页中验证码元素识别方法
CN108280102B (zh) 上网行为记录方法、装置及用户终端
CN104036189A (zh) 页面篡改检测方法及黑链数据库生成方法
EP3408797B1 (en) Image-based quality control
TW201705021A (zh) 利用網頁視覺特徵及網頁語法特徵之資訊擷取系統與方法
CN112612990A (zh) 网页解析方法、系统及计算机可读存储介质
CN109948015B (zh) 一种元搜索列表结果抽取方法及系统
CN115546815A (zh) 一种表格识别方法、装置、设备及存储介质
CN108171074A (zh) 一种基于内容关联的Web追踪自动检测方法
CN110147477B (zh) Web系统的数据资源模型化提取方法、装置以及设备
US9753901B1 (en) Identifying important document content using geometries of page elements

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17878141

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17878141

Country of ref document: EP

Kind code of ref document: A1