CN110633399A - Data processing method and device and data processing device - Google Patents

Publication number
CN110633399A
Authority
CN
China
Prior art keywords
picture
page
webpage
around
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810555878.XA
Other languages
Chinese (zh)
Inventor
孙玉玺
丁文彪
周泽南
苏雪峰
佟子健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Sogou Hangzhou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd, Sogou Hangzhou Intelligent Technology Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201810555878.XA priority Critical patent/CN110633399A/en
Publication of CN110633399A publication Critical patent/CN110633399A/en
Pending legal-status Critical Current

Abstract

Embodiments of the invention provide a data processing method, a data processing apparatus, and a device for data processing. The method comprises: dividing a webpage into blocks according to the page elements included in the webpage source code, to obtain a plurality of page blocks included in the webpage; determining picture features of a picture corresponding to a page block, the picture features including text features around the picture and page structure features; and judging, according to the picture features, whether the corresponding picture is a target picture. Embodiments of the invention can save the system resources and time consumed in rendering the webpage and can improve the recall efficiency of the target picture.

Description

Data processing method and device and data processing device
Technical Field
The present invention relates to the field of internet information processing technologies, and in particular, to a data processing method and apparatus, and an apparatus for data processing.
Background
Picture search refers to an information retrieval process that searches picture data according to a user's search request and returns the picture results to the user, sorted by indicators such as relevance. Picture search can meet users' need to find pictures on the Internet. In the process of accumulating picture data, the recall of the target pictures of webpages has an important influence on the ranking results of picture search. A target picture of a webpage may refer to a picture in the webpage that is closely related to the body text. By contrast, pictures that are only weakly related to the body text, such as advertisement pictures, recommended-content pictures, and website logos, are hereinafter referred to as non-target pictures.
One related technology can render a webpage and recall a target picture on the basis of the rendered webpage; however, the rendering process of the web page needs to consume more system resources and time resources, and the recall efficiency of the target picture is low.
Disclosure of Invention
Embodiments of the present invention provide a data processing method and apparatus, and an apparatus for data processing, which can save system resources and time resources consumed in a rendering process of a webpage, and can improve recall efficiency of a target picture.
In order to solve the above problem, an embodiment of the present invention discloses a data processing method, including:
according to page elements included by a webpage source code, partitioning a webpage to obtain a plurality of page blocks included by the webpage;
determining picture characteristics of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture;
and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
On the other hand, the embodiment of the invention discloses a data processing device, which comprises:
the webpage blocking module is used for blocking a webpage according to page elements included by the webpage source code to obtain a plurality of page blocks included by the webpage;
the image characteristic determining module is used for determining the image characteristics of the image corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture; and
and the target picture judging module is used for judging whether the corresponding picture is the target picture or not according to the picture characteristics.
In yet another aspect, an embodiment of the present invention discloses an apparatus for data processing, including a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for:
according to page elements included by a webpage source code, partitioning a webpage to obtain a plurality of page blocks included by the webpage;
determining picture characteristics of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture;
and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
In yet another aspect, an embodiment of the invention discloses a machine-readable medium having stored thereon instructions, which, when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
according to the embodiment of the invention, the webpage is partitioned according to the page elements included by the webpage source code, and the partitioning can be carried out without rendering the webpage, so that the system resource and the time resource consumed in the rendering process of the webpage can be saved, and the recall efficiency of the target picture can be improved.
In addition, the embodiment of the invention determines the picture features in units of page blocks, so that the picture features of different pictures can be told apart. This overcomes the problem that picture features determined over the whole page discriminate poorly between different pictures; that is, it improves the degree of discrimination of the picture features between different pictures. On this basis, whether the corresponding picture is the target picture is judged according to picture features with a higher degree of discrimination, which can improve the accuracy of judging the target picture.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic representation of an application environment for a data processing method of an embodiment of the present invention;
FIG. 2 is a flow chart of steps of a first embodiment of a data processing method of the present invention;
FIG. 3 is a schematic representation of a tree structure of an embodiment of the present invention;
FIG. 4 is a flowchart illustrating steps of a second embodiment of a data processing method according to the present invention;
FIG. 5 is a flowchart of the steps of a third embodiment of a data processing method of the present invention;
FIG. 6 is a diagram of a picture corresponding to a page block according to the present invention;
FIG. 7 is a block diagram of an embodiment of a data processing apparatus of the present invention;
FIG. 8 is a block diagram of an apparatus 800 for data processing of the present invention; and
FIG. 9 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a data processing scheme, which can be used for partitioning a webpage according to page elements included in a webpage source code to obtain a plurality of page blocks included in the webpage; determining picture characteristics of a picture corresponding to the page block; the picture features may include: surrounding text features and page structure features around the picture; and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
According to the embodiment of the invention, the webpage is partitioned according to the page elements included by the webpage source code, and the partitioning can be carried out without rendering the webpage, so that the system resource and the time resource consumed in the rendering process of the webpage can be saved, and the recall efficiency of the target picture can be improved.
In addition, the embodiment of the invention determines the picture features in units of page blocks, so that the picture features of different pictures can be told apart. This overcomes the problem that picture features determined over the whole page discriminate poorly between different pictures; that is, it improves the degree of discrimination of the picture features between different pictures. On this basis, whether the corresponding picture is the target picture is judged according to picture features with a higher degree of discrimination, which can improve the accuracy of judging the target picture.
The data processing method provided by the embodiment of the invention can be applied to application environments such as websites and/or apps (application programs) to recall the target pictures of webpages.
The data processing method provided by the embodiment of the present invention can be applied to the application environment shown in fig. 1, as shown in fig. 1, the client 100 and the server 200 are located in a wired or wireless network, and the client 100 and the server 200 perform data interaction through the wired or wireless network.
In an embodiment of the present invention, a picture recall interface may be provided for calling, where the picture recall interface may be used to recall a target picture in a web page, and specifically, the picture recall interface may recall the target picture in the web page according to an input web page by using the data processing method according to the embodiment of the present invention, and output the target picture. Optionally, the user may upload a webpage code of the webpage during the process of calling the picture recall interface, or may upload information such as a URL (Uniform Resource Locator) of the webpage.
In another embodiment of the present invention, a server (e.g., a picture search server) may determine a web page library, and recall a target picture in a web page of the web page library by using the data processing method of the embodiment of the present invention. Recalled pictures may be used in a picture search scene. Alternatively, the web pages of the web page library may be obtained by web page crawling.
It can be understood that the above-mentioned picture recall interface and picture search scenario are only examples of application scenarios according to the embodiment of the present invention, and the embodiment of the present invention does not impose any limitation on specific application scenarios. The categories of the web pages in the embodiment of the invention can include: news category, blog category, forum category, etc., and the specific category of the web page is not limited by the embodiment of the present invention.
Optionally, the client 100 may run on a terminal, which specifically includes but is not limited to: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.
Method embodiment one
Referring to fig. 2, a flowchart illustrating steps of a first embodiment of a data processing method according to the present invention is shown, which may specifically include the following steps:
step 201, according to page elements included in a webpage source code, partitioning a webpage to obtain a plurality of page blocks included in the webpage;
step 202, determining picture characteristics of a picture corresponding to the page block; the picture features may include: surrounding text features and page structure features around the picture;
step 203, judging whether the corresponding picture is the target picture according to the picture characteristics.
At least one step of the embodiment shown in fig. 2 may be performed by a server and/or a client, and of course, the embodiment of the present invention does not limit the specific execution subject of each step.
In step 201, the web page source code may refer to the source code of the web page, which may represent the language composition of the web page.
The computer languages corresponding to the webpage code mainly include: HTML (Hypertext Markup Language), VB (Visual Basic), Java, and the like. Among them, HTML is the most common and most basic language, and is indispensable in webpages. The title, frame, background, font, hyperlinks, colors, and so on of a webpage may be set through HTML. Of course, the embodiment of the present invention does not limit the specific computer language corresponding to the webpage code.
The web page source code is actually a web page file consisting of a large number of various page elements, and the browser can generally directly run the web page file such as an HTML file. The page elements may serve as basic objects constituting the web page file. The page elements may be defined by tags.
Tags are used to mark HTML elements. The text located between the start tag and the end tag may serve as the content of the page element. In one example, tags are objects enclosed by the angle brackets "<" and ">", such as <head> (used to define information about a document), <body> (used to define the body of a document), and <table> (used to define a table). Some tags appear in pairs, such as <table></table> and <form></form>, where <form> represents an HTML form used to define input by a user. Of course, there are also tags that do not appear in pairs, such as <br> (used to define a line break) and <hr> (used to define a horizontal rule). Tags and page elements correspond to each other, so a page element can be represented by its tag.
The page elements also correspond to attributes. Attributes are used to provide additional information for page elements. The attribute may be present in the form of a name-value pair such as "attribute name-attribute value", and the attribute may be defined in a start tag of a page element.
Step 201 may perform blocking on the web page according to a relationship between page elements included in the web page source code to obtain a plurality of page blocks included in the web page.
In an optional embodiment of the present invention, step 201 may specifically include: obtaining a corresponding one-dimensional vector according to page elements in the webpage source code; and partitioning the one-dimensional vector according to preset boundary elements to obtain a plurality of page blocks included by the web page.
According to one embodiment, the relationship between page elements included in the source code of a web page may be represented by a tree structure, which may include a DOM (Document Object Model) tree or the like. According to the HTML DOM standard, the content in an HTML document can be treated as a node in the DOM tree:
the whole document is a document node;
HTML elements are element nodes;
the text in the HTML elements is a text node;
HTML attributes are attribute nodes; and
comments are comment nodes.
The embodiment of the invention can convert the tree structure into a one-dimensional vector, where the elements of the one-dimensional vector represent nodes in the tree structure. The process of converting the tree structure into the one-dimensional vector may include: taking the root node as the first element of the one-dimensional vector, and arranging the nodes of the branches below the root node after the first element in order from left to right. Specifically, the nodes of the 1st branch from the left below the root node are arranged after the first element, the nodes of the 2nd branch from the left are arranged after the last node of the 1st branch, and so on: the nodes of the (i+1)-th branch from the left are arranged after the last node of the i-th branch, where i is a natural number.
Referring to FIG. 3, an example of a tree structure according to an embodiment of the present invention is shown. The corresponding page elements include an html element, a head element, a body element, a p element, and an img element, and the relationship between the page elements is:
<html><head></head><body><p></p><img/></body></html>
where <p> is used to define a paragraph.
The tree structure shown in fig. 3 is converted into a one-dimensional vector: [ html, head, body, p, img ].
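The conversion described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: it assumes that "root first, then each branch from left to right" corresponds to the document order of the start tags, and it uses the standard-library html.parser module instead of building a full DOM tree. The names VectorBuilder and to_vector are hypothetical.

```python
from html.parser import HTMLParser


class VectorBuilder(HTMLParser):
    """Records nodes in document order: root first, then each branch left to right."""

    def __init__(self):
        super().__init__()
        self.vector = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img/>.
        self.vector.append(tag)

    def handle_data(self, data):
        # Assumption: whitespace-only text nodes are ignored.
        if data.strip():
            self.vector.append("text")

    def handle_comment(self, data):
        self.vector.append("comment")


def to_vector(source):
    """Flatten webpage source code into a one-dimensional vector of node names."""
    builder = VectorBuilder()
    builder.feed(source)
    builder.close()
    return builder.vector


print(to_vector("<html><head></head><body><p></p><img/></body></html>"))
# ['html', 'head', 'body', 'p', 'img']
```

Because only the source code is parsed, no rendering step is involved in this sketch.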
According to the embodiment of the invention, the one-dimensional vector is partitioned according to preset boundary elements; wherein the preset boundary element may function as a boundary.
In an optional embodiment of the present invention, the preset boundary element may include: a style (style is used to define style information for an HTML document) element, a script (script is used to define a client-side script, such as JavaScript) element, a comment (comment) element, an external content (object) element, or an element containing a non-numeric Identification (ID).
A preset boundary element generally marks the end of a logical unit and can therefore serve as a boundary. Page elements that carry an ID generally belong to regions that the page editor pays particular attention to, and they are more likely to contain body-text pictures. Pictures under elements with numeric IDs, however, are more likely, judging from the hierarchical structure, to be avatars (for example, the avatars of forum users), which generally are not valid body-text pictures. Elements containing non-numeric IDs can therefore be used as preset boundary elements.
Optionally, the element including the non-numeric identifier may specifically include: a table element containing a non-numeric identifier, a section (div) element containing a non-numeric identifier, or an unordered list (ul) element containing a non-numeric identifier, etc.
It is understood that, one skilled in the art may use an element with boundary properties as a preset boundary element according to the actual application requirement, and the embodiment of the present invention does not limit the specific preset boundary element.
In an application example of the present invention, assume that the one-dimensional vector is:
[body,p,div,img,comment,div,p,span,text,script,text,div(id='foot')]
The one-dimensional vector may be divided into the following page blocks:
[body,p,div,img,comment|,div,p,span,text,script|,text,div(id='foot')]
where "|" marks a segmentation point, span denotes an inline section of a document, and text denotes the text within a page element. This example segments the one-dimensional vector using comment, div, and script as boundary elements to obtain the corresponding 4 page blocks.
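The segmentation by preset boundary elements might be sketched as below. The sketch makes several assumptions that the text does not fix: each vector element is a hypothetical (tag, id) pair, a block is closed immediately after a boundary element, and only table, div, and ul elements with a non-numeric ID count as ID-based boundaries. The names is_boundary and split_blocks are illustrative.

```python
# Tags that always act as boundaries in this sketch.
BOUNDARY_TAGS = {"style", "script", "comment", "object"}
# Tags that act as boundaries only when they carry a non-numeric ID (assumption).
ID_BOUNDARY_TAGS = {"table", "div", "ul"}


def is_boundary(tag, element_id=None):
    """Return True if the element is treated as a preset boundary element."""
    if tag in BOUNDARY_TAGS:
        return True
    has_non_numeric_id = bool(element_id) and not element_id.isdigit()
    return tag in ID_BOUNDARY_TAGS and has_non_numeric_id


def split_blocks(vector):
    """Split a one-dimensional vector of (tag, id) pairs into page blocks."""
    blocks, current = [], []
    for tag, element_id in vector:
        current.append(tag)
        if is_boundary(tag, element_id):
            # Assumption: the boundary element closes the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


vector = [
    ("body", None), ("p", None), ("div", None), ("img", None), ("comment", None),
    ("div", None), ("p", None), ("span", None), ("text", None), ("script", None),
    ("text", None), ("div", "foot"),
]
for block in split_blocks(vector):
    print(block)
```

Under these assumptions, the first two blocks end at the comment and script elements, matching the two segmentation points marked in the example vector.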
Step 201 divides the web page into a plurality of page blocks. A page block may or may not include an img element (the page element used to represent a picture). The embodiment of the invention can take the picture corresponding to the img element in a page block as the picture corresponding to that page block, and can extract the corresponding picture features for that picture. Because the picture features are determined in units of page blocks, the picture features of different pictures can be told apart, which overcomes the low degree of discrimination that results from determining picture features over the whole page.
The picture features of the embodiment of the invention can include: surrounding text features around the picture and page structure features.
The surrounding text feature around the picture can be used for representing the property of the surrounding text around the picture.
According to one embodiment, the nature of the surrounding text around the picture may include: the density of the surrounding text around the picture. Generally, the higher the density, the greater the probability that a picture is the target picture; conversely, the lower the density, the smaller the probability that the picture is the target picture.
According to another embodiment, the properties of the surrounding text around the picture may include: the degree of variation in the length of the surrounding text around the picture. Generally, the greater the degree of variation, the greater the probability that the picture is the target picture; conversely, the smaller the degree of variation, that is, the more similar the lengths of the surrounding texts, the smaller the probability that the picture is the target picture.
According to yet another embodiment, the properties of the surrounding text around the picture may include: recommended text around the picture. Examples of recommended text may include: "guess you like", "popular short video recommendations", "others are also watching", "behind-the-scenes clips", and the like. When recommended text appears around the picture, the probability that the picture is the target picture is small.
In an optional embodiment of the present invention, the feature of the surrounding text around the picture may include, but is not limited to, at least one of the following features:
feature A1, recommended text feature;
feature A2, variance of the lengths of the surrounding texts around the picture;
feature A3, total number of characters of the surrounding text around the picture;
feature A4, average length of the surrounding texts around the picture; and
feature A5, the proportion of hyperlinked text in the surrounding text around the picture.
Feature A1 is used to characterize the recommended text around the picture. Optionally, the recommended text may also correspond to a position feature, which characterizes the positional relationship between the recommended text and the picture. The position feature may include: a first position feature of the picture relative to the recommended text, or a second position feature of the recommended text relative to the picture. Therefore, the recommended text feature of the embodiment of the present invention may include: the recommended text and its position feature. Taking "guess you like" as an example, the picture is usually located below "guess you like", so the first position feature is "below".
The embodiment of the invention can judge, by means of a judging method, whether a short text in a webpage is a recommended text, and thereby collect recommended texts. The judging method may include: natural language processing algorithms, manual screening, and the like. The natural language processing algorithms may include a syntactic analysis method, a machine classification method, and the like. The syntactic analysis method judges whether a short text in a webpage is a recommended text at the grammatical level, and the machine classification method makes the same judgment at the classification level. Natural language processing algorithms can mine more recommended texts, which improves the coverage of recommended texts and, in turn, the accuracy of judging the target picture.
The embodiment of the invention can also determine the corresponding position characteristics of the collected recommended texts. Taking the location feature as the first location feature, the first location feature may include: "below," "above," "left," or "right," etc., it is to be understood that embodiments of the present invention are not limited to specific positional features.
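One possible way to look up a recommended text and its first position feature is sketched below. Everything here is an assumption made for illustration: the lexicon RECOMMENDED_TEXTS is a hypothetical hand-collected list, each block element is a hypothetical (kind, text) pair, and the position of the picture is approximated from the order of elements in the one-dimensional vector (a recommended text that precedes the picture yields "below"), since no rendered layout is available.

```python
# Hypothetical lexicon of collected recommended texts (lower-cased).
RECOMMENDED_TEXTS = ("guess you like", "popular short video recommendations")


def recommended_text_feature(block, img_index):
    """Return (recommended_text, position_of_picture) or None if no hit.

    block is a list of (kind, text) pairs in vector order;
    img_index is the index of the picture inside the block.
    """
    for index, (kind, text) in enumerate(block):
        if kind != "text":
            continue
        normalized = text.strip().lower()
        if normalized in RECOMMENDED_TEXTS:
            # Assumption: vector order stands in for vertical layout.
            position = "below" if index < img_index else "above"
            return normalized, position
    return None


block = [("text", "Guess you like"), ("img", ""), ("text", "A caption")]
print(recommended_text_feature(block, img_index=1))
# ('guess you like', 'below')
```

A lexicon lookup is only the simplest option; the grammar-based or classifier-based judging methods mentioned above would replace the membership test.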
Feature A2 is used to characterize the degree of variation in the length of the surrounding text around the picture. A process for determining the variance of the lengths of the surrounding texts may comprise: within the page block corresponding to the picture, collecting the lengths of N texts before and after the picture, where N may be a natural number greater than 1, for example 2, 3, 4, or 5. "Before and after" may follow the order of elements within the one-dimensional vector. For example, if the vector segment of a certain page block is [a, b, c, img, e, f, g], N non-empty text elements may be taken from before and after the img element in the vector segment, the lengths of these N text elements may be determined, and the variance of those lengths may be calculated.
Feature A3 and feature A4 are used to characterize how dense the surrounding text around the picture is. Feature A3 is the total number of characters of the surrounding text around the picture; feature A4 is the average number of characters per surrounding text around the picture.
In feature A5, a hyperlink is a link that, when clicked, jumps to the page to which the link points. Hyperlinks may link pictures, programs, music, videos, even non-page elements, and the like. The higher the proportion of the hyperlink text in the surrounding text around the picture is, the higher the probability that the picture is the recommended content picture is, that is, the lower the probability that the picture is the target picture is. Conversely, the lower the proportion of the hyperlink text in the surrounding text around the picture is, the lower the probability that the picture is the recommended content picture is, that is, the higher the probability that the picture is the target picture is.
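Features A2 to A5 could be computed along the following lines. This is a sketch under stated assumptions rather than the method of the embodiment: each block element is a hypothetical (kind, text, is_hyperlink) tuple, up to N non-empty texts are taken on each side of the picture, the population variance is used, and the hyperlink proportion is measured by text count rather than by character count.

```python
from statistics import pvariance


def surrounding_text_features(block, img_index, n=3):
    """Compute features A2-A5 for the picture at block[img_index]."""

    def non_empty_texts(items):
        return [(text, link) for kind, text, link in items
                if kind == "text" and text.strip()]

    # Assumption: up to n texts on each side, following vector order.
    before = non_empty_texts(block[:img_index])[-n:]
    after = non_empty_texts(block[img_index + 1:])[:n]
    texts = before + after
    if not texts:
        return {"A2": 0.0, "A3": 0, "A4": 0.0, "A5": 0.0}

    lengths = [len(text) for text, _ in texts]
    total = sum(lengths)
    linked = sum(1 for _, link in texts if link)
    return {
        "A2": pvariance(lengths),    # variance of text lengths
        "A3": total,                 # total number of characters
        "A4": total / len(lengths),  # average length per text
        "A5": linked / len(texts),   # share of hyperlinked texts
    }


block = [
    ("text", "abcd", False),
    ("text", "ab", True),
    ("img", "", False),
    ("text", "abcdef", False),
]
print(surrounding_text_features(block, img_index=2))
```

For a short text page or a single-picture page these values carry little signal, which is why they are combined with the page structure features.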
When there is a large amount of surrounding text around the picture, the surrounding text features provide a good basis for judging whether the picture is the target picture. However, in a short text page or a single-picture page, a target picture has little surrounding text, while a non-target picture also has some surrounding text. As a result, the target pictures and non-target pictures in such pages cannot be effectively distinguished, the accuracy of judging the target picture is low, and the recall effect of the target picture is poor.
The page structure feature may refer to the page structure corresponding to the page block, that is, the layout of the content of the page block where the picture is located. The page structure features reflect characteristics different from the surrounding text features around the picture, and can therefore supplement them. In particular, when the picture has little surrounding text, combining the surrounding text features with the page structure features can improve the accuracy of judging the target picture and the recall effect of the target picture.
In an optional embodiment of the present invention, the page structure feature may include: page block features corresponding to the picture, and/or features of the page elements around the picture.
The page block features may reflect the type of the page block to some extent. For example, the types of page blocks may include: a body-text type, a comment type, a directory area type, a personal information type, a recommended area type, and so on. If the type of the page block is the body-text type, the probability that the picture is the target picture is higher. If the type of the page block is the comment type, the directory area type, the personal information type, or the recommended area type, the probability that the picture is the target picture is small.
Optionally, the page block feature may include at least one of the following features:
feature B1, a first number of page elements included in the page block;
feature B2, a ratio of the first number of page elements included in the page block to a second number of page elements included in the web page;
feature B3, a third number of pieces of time information contained in the page block;
feature B4, a ratio of the third number of pieces of time information contained in the page block to a fourth number of pieces of time information contained in the web page;
feature B5, a ratio of a fifth number of hyperlinked texts contained in the page block to a sixth number of texts contained in the page block;
feature B6, a ratio of a seventh number of hyperlinked pictures contained in the page block to an eighth number of pictures contained in the page block.
Examples of feature B1 may include: a first number of li elements included in the page block, a first number of br elements included in the page block, a first number of p elements included in the page block, and so on, where li elements are used to define list items, br elements are used to define line breaks, and p elements are used to define paragraphs.
Examples of feature B2 may include: the ratio of the first number of li elements included in the page block to the second number of li elements included in the web page, the ratio of the first number of br elements included in the page block to the second number of br elements included in the web page, the ratio of the first number of p elements included in the page block to the second number of p elements included in the web page, and the like.
Both feature B1 and feature B2 may reflect how many page elements a page block contains. In one example of the invention, if a page block contains many br elements and p elements, the type of the page block may be the text type (that is, the body of the page).
For feature B3 and feature B4, the time information may refer to information representing time such as "x month x day of x year", "x month x day", "x point", and the like. If the page block contains more time information, the probability of belonging to the comment type is higher.
For the feature B5 and the feature B6, the hyperlink text and the hyperlink picture belong to hyperlinks, and if the ratio of the number of hyperlinks included in the page block to the number of normal texts or normal pictures included in the page block is high, it indicates that the probability that the page block is the recommended area type is high.
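The page block features B1 to B6 described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the names `page_block_features` and `TIME_RE` are hypothetical, the regular expression for time information is an assumed stand-in, and "amount of text" is approximated by character counts.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for time information such as "2018-03-22" or "12:30".
TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}|\d{1,2}:\d{2}")


class _Stats(HTMLParser):
    """Collects tag counts, text lengths and picture counts from HTML."""

    def __init__(self):
        super().__init__()
        self.tags = {}          # tag name -> number of occurrences
        self.link_depth = 0     # greater than 0 while inside an <a> element
        self.text_len = 0       # characters of all text
        self.link_text_len = 0  # characters of hyperlinked text
        self.imgs = 0           # all pictures
        self.link_imgs = 0      # pictures nested inside an <a> element
        self.time_count = 0     # matches of TIME_RE

    def handle_starttag(self, tag, attrs):
        self.tags[tag] = self.tags.get(tag, 0) + 1
        if tag == "a":
            self.link_depth += 1
        elif tag == "img":
            self.imgs += 1
            if self.link_depth:
                self.link_imgs += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        self.text_len += len(text)
        if self.link_depth:
            self.link_text_len += len(text)
        self.time_count += len(TIME_RE.findall(text))


def _stats(html):
    parser = _Stats()
    parser.feed(html)
    parser.close()
    return parser


def _ratio(a, b):
    return a / b if b else 0.0


def page_block_features(block_html, page_html, tag="p"):
    """Compute B1..B6 for one page block, counting `tag` for B1 and B2."""
    block, page = _stats(block_html), _stats(page_html)
    b1 = block.tags.get(tag, 0)
    return {
        "B1": b1,
        "B2": _ratio(b1, page.tags.get(tag, 0)),
        "B3": block.time_count,
        "B4": _ratio(block.time_count, page.time_count),
        "B5": _ratio(block.link_text_len, block.text_len),
        "B6": _ratio(block.link_imgs, block.imgs),
    }
```

In this sketch a block with a high B5 or B6 would lean toward the recommended area type, and a block with a high B3 toward the comment type, in line with the description above.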
In an optional embodiment of the present invention, the feature of the page elements around the picture may specifically include: the number of page elements around the picture, and/or the proportion of hyperlinked pictures around the picture.
The number of page elements around the picture may include the number of each of the following elements before (after) the picture: h1, h2, h3 and h4 elements (each used to define an HTML heading), input elements (used to define an input control), select elements (used to define a selection list, i.e. a drop-down list), form elements (used to define an HTML form for user input), option elements (used to define the options in a selection list), p elements, div elements, span elements, li elements, ol elements (used to define an ordered list), a elements (used to define an anchor), br elements, img elements, and so on. The proportion of hyperlinked pictures around the picture may refer to the ratio of a ninth number of hyperlinked pictures around the picture, within the page block where the picture is located, to the eighth number of pictures contained in that page block. It may include the proportion of hyperlinked pictures before the picture, and/or the proportion of hyperlinked pictures after the picture, and the like.
The page elements around the picture can reflect, to a certain extent, whether the picture is the target picture. For example, for a non-list page, if there are many li elements around the picture, the content around the picture may be recommended content, that is, the probability that the picture is the target picture is small. For another example, if an input element appears around the picture, the picture may be a small picture near a search box, so the probability that the picture is the target picture is also small.
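Counting the page elements before and after a picture can be illustrated with a short Python sketch. It assumes that the page block has already been flattened into a list of element names; the function name `surrounding_element_counts` and the `window` parameter are hypothetical and do not come from the embodiment.

```python
from collections import Counter


def surrounding_element_counts(tags, picture_index, window=10):
    """Count element names within `window` elements before and after the picture.

    `tags` is a flat list of element names for one page block and
    `picture_index` is the position of the picture (an img element) in it.
    """
    before = tags[max(0, picture_index - window):picture_index]
    after = tags[picture_index + 1:picture_index + 1 + window]
    return Counter(before), Counter(after)


tags = ["h1", "p", "p", "img", "li", "li", "li", "input"]
before, after = surrounding_element_counts(tags, 3)
# Many li elements or an input element after the picture would, per the
# description, lower the probability that it is the target picture.
```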
In an optional embodiment of the present invention, the picture features may further include: features of the picture itself and/or page features corresponding to the picture. This increases the diversity of the picture features and thereby improves the accuracy of the target picture judgment.
The features of the picture itself may specifically include, but are not limited to: the URL length, the length of the picture name, the proportion of digits in the picture name, the width and height attribute values of the picture, the similarity between the alt attribute and the HTML title, the number of unconventional attributes of the picture, the position of the picture in the web page source code, whether the picture is a hyperlinked picture, whether the picture is below the content title, the height of the picture in the corresponding DOM tree, and the like.
The alt attribute is an attribute in the web page languages HTML and XHTML (extensible hypertext markup language) that supplies plain text; the alt text is displayed as a fallback when the object of the HTML element cannot be rendered.
The unconventional attributes of a picture element mainly refer to attributes defined by the web page editor, as distinguished from conventional attributes (such as width, height, class name, etc.).
The features of the picture itself are strong features and can distinguish target pictures from non-target pictures well.
Taking the URL length as an example, generally, the longer the URL, the greater the probability that the picture is the target picture. For example, the URL of picture 1 is http://static.nipic.com/images/oh_help.jpg, and the URL of picture 2 is http://pic158.nipic.com/file/20180322/10185515-142500532000-2.jpg. Judged by URL length alone, picture 2 has the higher probability of being the target picture.
The page features corresponding to the picture can be used to determine the category of the web page, and the category of the web page can in turn serve as a basis for judging whether the picture is the target picture. The page features may include, but are not limited to: the number of items of key information such as "original poster" or "post", whether the page contains a title, the position of the page title, the position of the page breadcrumbs, the position of the page number, etc. For example, if the page includes key information such as "original poster" or "post", the web page may be of the forum type.
In step 203, it is determined whether the corresponding picture is a target picture according to the picture characteristics.
In an optional embodiment of the present invention, a mapping relationship between the combination of the picture features and the determination result of the target picture may be preset, and step 203 may determine whether the corresponding picture is the target picture by using the mapping relationship.
Referring to table 1, an example of a mapping relationship of an embodiment of the present invention is shown.
In one record of the mapping relationship, the surrounding text feature A is "no recommended text appears around the picture" and the page structure feature A is "the page block contains many br elements and p elements", so the picture is a target picture.
In another record of the mapping relationship, the surrounding text feature B is "recommended text appears around the picture" and the page structure feature B is "many li elements appear around the picture", so the picture is a non-target picture.
In another record of the mapping relationship, the surrounding text feature C is "the variance of the length of the surrounding text around the picture is large", the page structure feature C is "the page block contains many br elements and p elements, and no time information appears around the picture", and the URL of the picture is long, so the picture is a target picture.
In another record of the mapping relationship, the surrounding text feature D is "the variance of the length of the surrounding text around the picture is large", the page structure feature D is "the page block contains many br elements and p elements, and no time information appears around the picture", the URL of the picture is long, and the page contains key information such as posts, so the picture is a target picture.
It should be understood that the above record of the mapping relationship is only an example, and a person skilled in the art may determine the required mapping relationship according to the actual application requirement, and the embodiment of the present invention does not limit the specific mapping relationship.
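A preset mapping relationship of this kind can be sketched as a small rule table in Python. This is a minimal sketch under stated assumptions: the boolean feature names (`recommended_text_nearby`, `many_br_and_p_in_block`, `many_li_nearby`) and the function `judge_by_mapping` are hypothetical, and only the first two example records are encoded.

```python
# Each record pairs a combination of (hypothetically named) boolean features
# with a judgment result; the two records mirror the first two examples above.
MAPPING = [
    ({"recommended_text_nearby": False, "many_br_and_p_in_block": True}, True),
    ({"recommended_text_nearby": True, "many_li_nearby": True}, False),
]


def judge_by_mapping(features, mapping=MAPPING):
    """Return True (target) or False (non-target) for the first matching record.

    Returns None when no record matches, leaving the decision open.
    """
    for conditions, is_target in mapping:
        if all(features.get(name) == value for name, value in conditions.items()):
            return is_target
    return None
```

Returning None for unmatched combinations is a design choice of this sketch; a real table would need either a default result or a complete set of records.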
TABLE 1 (reproduced as images in the original publication: Figure BDA0001682391210000121 and Figure BDA0001682391210000131)
In another optional embodiment of the present invention, the step 203 determines whether the corresponding picture is the target picture according to the picture characteristics, and specifically includes: obtaining a feature vector according to the picture features; inputting the feature vectors into a classification model to obtain a classification result output by the classification model; the categories corresponding to the classification models may include: a target picture category and a non-target picture category.
The classification model may be a binary classification model, that is, the classification result output by the classification model may be either the target picture category or the non-target picture category.
This scheme treats the recall of target pictures as a binary classification problem over the pictures of a page, with each picture as a sample. Picture features are extracted by analyzing the web page source code, a machine learning algorithm learns from large-scale training data to train a classification model, and the classification model predicts whether the picture corresponding to a page block is a target picture.
In practical applications, a machine learning algorithm may be trained on training data to obtain the classification model. The training data may include picture features corresponding to target picture samples and to non-target picture samples, respectively. Examples of the machine learning algorithm may include: nearest-neighbor classification, Bayes, LR (Logistic Regression), SVM (Support Vector Machine), AdaBoost (adaptive boosting), neural networks, and the like. It can be understood that the embodiment of the present application does not limit the machine learning algorithm corresponding to the classification model.
In an application example of the present application, assume that the training data set is $\{(x_i, y_i)\}$, where $i = 1, \ldots, n$, $x_i$ is the picture feature corresponding to a target picture sample or a non-target picture sample, and $y_i$ is the picture category corresponding to $x_i$. The number of values that $y_i$ can take may be consistent with the number of picture categories. Specifically, in the embodiment of the present invention there may be two picture categories, the target picture category and the non-target picture category, so $y_i$ may take the value 1 or -1 to represent the target picture category and the non-target picture category, respectively. The SVM model trained by the machine learning algorithm can then be represented as

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i^{*}\, y_i\, (x_i \cdot x) + b^{*}\right)$$

where sgn is the sign function, $b^{*}$ is the classification threshold, $\alpha_i^{*}$ are the optimal classification parameters obtained by training, $x$ represents the picture features, and $f(x)$ is the output function.
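Evaluating a trained SVM decision function of the usual form sgn(sum of alpha_i * y_i * (x_i . x) + b) can be sketched in plain Python. The sketch assumes a linear kernel (a plain inner product), and the function name `svm_decide` and the toy parameters are hypothetical; mapping a score of exactly zero to 1 is also a convention chosen here.

```python
def svm_decide(x, support_vectors, labels, alphas, bias):
    """Return 1 (target picture) or -1 (non-target picture).

    Evaluates sgn(sum_i alpha_i * y_i * <x_i, x> + b) with a linear kernel.
    """
    score = bias
    for x_i, y_i, alpha_i in zip(support_vectors, labels, alphas):
        dot = sum(a * b for a, b in zip(x_i, x))  # inner product <x_i, x>
        score += alpha_i * y_i * dot
    return 1 if score >= 0 else -1


# Toy parameters, chosen by hand for illustration rather than obtained by training.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.5, 0.5]
print(svm_decide([2.0, 1.0], support_vectors, labels, alphas, bias=0.0))
```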
The embodiment of the present invention does not limit the determination manner of the training data. Optionally, the target picture may be obtained offline in a web page rendering manner, and then the classification model of the embodiment of the present invention is trained by using the obtained target picture as training data, so as to improve the accuracy rate and the recall rate of the target picture.
In an optional embodiment of the present invention, a target picture may be recalled incorrectly, that is, the determination in step 203 may be wrong. In this case, the parameters of the classification model can be updated using the incorrectly recalled picture as a non-target picture sample. In addition, the embodiment of the invention can also obtain web pages similar to the web page where the incorrectly recalled picture is located, obtain the corresponding non-target pictures from those similar web pages, and update the parameters of the classification model using these non-target pictures as non-target picture samples. Because pictures at the same position are usually recalled incorrectly on similar web pages, the non-target pictures acquired in the embodiment of the present invention may include pictures that are prone to false recall.
This scheme can improve the accuracy of target picture recall to a certain extent and effectively reduce the mistaken recall of non-target pictures. It also adds more training data, which improves the overall effect of the classification model while reducing the labor cost of collecting training data.
In summary, the data processing method according to the embodiment of the present invention partitions the web page according to the page elements included in the web page source code, and can be performed without rendering the web page, so that system resources and time resources consumed in the rendering process of the web page can be saved, and the recall efficiency of the target picture can be improved.
In addition, the embodiment of the invention determines the picture features in units of page blocks. Determining picture features over the whole page yields features that differ little between pictures; determining them per page block improves the degree to which the picture features distinguish different pictures. On this basis, judging whether a picture is the target picture according to these more discriminative picture features can improve the accuracy of the target picture judgment.
Method embodiment two
Referring to fig. 4, a flowchart illustrating steps of a second embodiment of the data processing method of the present invention is shown, which may specifically include the following steps:
step 401, according to page elements included in a webpage source code, partitioning a webpage to obtain a plurality of page blocks included in the webpage;
step 402, determining picture characteristics of a picture corresponding to the page block; the picture features may include: surrounding text features and page structure features around the picture;
step 403, obtaining a feature vector according to the picture features;
step 404, if the webpage is a list page, inputting the feature vector into a first classification model to obtain a first classification result output by the first classification model; the first classification model can be obtained according to training data corresponding to the list page; or
Step 405, if the webpage is a non-list page, inputting the feature vector into a second classification model to obtain a second classification result output by the second classification model; the second classification model may be obtained according to training data corresponding to the non-list page.
Compared with the first embodiment of the method shown in fig. 2, the present embodiment refines the process of determining whether the corresponding picture is the target picture according to the picture features through steps 403 to 405. In practical applications, step 403 and step 404 may be performed, or step 403 and step 405 may be performed.
Since the layout corresponding to the recommended pictures of the non-list pages is similar to the layout corresponding to the target pictures of the list pages, in order to avoid the classification model from erroneously recognizing the recommended pictures of the non-list pages as the target pictures, the embodiments of the present invention separately train the classification models for the list pages and the non-list pages, specifically, train the first classification model for the list pages and train the second classification model for the non-list pages. In this way, the classification effect of the first classification model and the second classification model can be improved.
The embodiment of the present invention does not limit the method for determining whether the web page is a list page. For example, the determination method may be a rule determination method, or the determination method may be a web address matching method, or the determination method may be a classification method, or the like.
In addition, the embodiment of the present invention does not limit the execution timing for determining whether the web page is a list page. For example, it may be determined whether the web page is a list page before step 403, or after step 403, or simultaneously with step 403, and so on.
In summary, the data processing method according to the embodiment of the present invention trains the first classification model for the list pages and the second classification model for the non-list pages, so as to prevent the second classification model from erroneously recognizing the recommended pictures of the non-list pages as the target pictures.
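The choice between the first and the second classification model can be sketched as a simple dispatch in Python. This is only a sketch: `classify_picture` is a hypothetical name, and the two models are stand-in callables rather than trained classifiers.

```python
def classify_picture(feature_vector, is_list_page, first_model, second_model):
    """Route the feature vector to the model trained for the page type.

    `first_model` stands for the model trained on list pages and
    `second_model` for the model trained on non-list pages; both are
    assumed to be callables returning 1 (target) or -1 (non-target).
    """
    model = first_model if is_list_page else second_model
    return model(feature_vector)


# Stand-in models for illustration only; real models would be trained.
first_model = lambda v: 1 if v[0] > 0.5 else -1
second_model = lambda v: 1 if v[1] > 0.5 else -1
result = classify_picture([0.9, 0.1], is_list_page=False,
                          first_model=first_model, second_model=second_model)
```

Keeping the page-type decision outside the models means that the same feature vector can be scored differently on list pages and non-list pages, which is the point of training the two models separately.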
Method embodiment three
Referring to fig. 5, a flowchart illustrating steps of a third embodiment of the data processing method according to the present invention is shown, which may specifically include the following steps:
step 501, judging whether the webpage is a list page or not by using a machine learning algorithm according to the webpage source code to obtain a corresponding judgment result;
for example, a webpage with a URL of http:// blog.sina.com.cn/s/blog _4cd1c1670102x4ku.html can be determined to be a non-list page.
step 502, converting page elements in a webpage source code into a one-dimensional vector, and partitioning the one-dimensional vector according to preset boundary elements to obtain a plurality of page blocks included by a webpage;
step 503, extracting corresponding picture features for pictures corresponding to the page blocks;
for example, for the page corresponding to the hot event shown in fig. 6, the page may specifically include: picture a, and text 1 and text 2 located above and below picture a, where the URL corresponding to picture a is:
http://s5.sinaimg.cn/mw690/001pdJDVzy7g3dSt8FKd4&690
the picture features corresponding to picture a are as follows:
recommended text feature: no recommended text appears around the picture;
features of the picture itself: the URL length is 52, and the proportion of digits in the URL is 14/52;
features of the elements around the picture: the total number of surrounding texts before (after) the picture is 301 (501);
features of the page where the picture is located: a title exists.
Step 504, inputting the feature vector corresponding to the picture features into the classification model corresponding to the judgment result to obtain the classification result output by the classification model.
For the picture shown in fig. 6, the second classification model will output 1, i.e. the picture is the target picture.
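The URL length and digit proportion quoted for picture a can be reproduced with a few lines of Python. The helper name `url_features` is hypothetical; the URL is the one given in the example above.

```python
def url_features(url):
    """Return the URL length, the number of digits and the digit proportion."""
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "digit_count": digits,
        "digit_ratio": digits / len(url) if url else 0.0,
    }


features = url_features("http://s5.sinaimg.cn/mw690/001pdJDVzy7g3dSt8FKd4&690")
print(features["url_length"], features["digit_count"])  # prints "52 14"
```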
In summary, the data processing method according to the embodiment of the present invention trains the second classification model for the non-list page, so that the second classification model can be prevented from erroneously recognizing the recommended picture of the non-list page as the target picture.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiment. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments and that the acts involved are not necessarily required by the embodiments of the application.
Device embodiment
Referring to fig. 7, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, which may specifically include: a web page blocking module 701, a picture feature determining module 702 and a target picture determining module 703.
The webpage blocking module 701 is configured to block a webpage according to page elements included in a webpage source code to obtain a plurality of page blocks included in the webpage;
a picture feature determining module 702, configured to determine a picture feature of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture; and
the target picture determining module 703 is configured to determine whether the corresponding picture is the target picture according to the picture characteristics.
Optionally, the page structure features may include: page block features corresponding to the picture, and/or features of the page elements around the picture.
Optionally, the page block feature may include at least one of the following features:
a first number of page elements included in the page block;
a ratio of the first number of page elements included in the page block to a second number of page elements included in the web page;
a third amount of time information contained in the page block;
a ratio of the third amount of time information contained in the page block to a fourth amount of time information contained in the web page;
a ratio of a fifth amount of hyperlinked text contained in the page block to a sixth amount of text contained in the page block;
a ratio of a seventh number of hyperlinked pictures contained in the page block to an eighth number of pictures contained in the page block.
Optionally, the picture surrounding page element characteristics may include: the number of page elements around the picture, and/or the proportion of hyperlinked pictures around the picture.
Optionally, the feature of surrounding text around the picture may include at least one of the following features:
recommending text features;
variance of length of surrounding text around the picture;
total number of surrounding texts around the picture;
the proportion of hyperlink texts in surrounding texts around the picture; and
average length of surrounding text around the picture.
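The numeric surrounding-text features in this list can be sketched in Python with the standard `statistics` module. The function name `surrounding_text_features` is hypothetical, text length is measured in characters, and the use of the population variance is an assumption; the recommended-text feature is left out because the text does not specify how it is computed.

```python
from statistics import mean, pvariance


def surrounding_text_features(texts, hyperlink_flags):
    """Compute count, average length, length variance and hyperlink proportion.

    `texts` are the surrounding texts of one picture and `hyperlink_flags`
    marks, per text, whether it is hyperlinked text.
    """
    lengths = [len(t) for t in texts]
    n = len(texts)
    return {
        "total_number": n,
        "average_length": mean(lengths) if n else 0.0,
        "length_variance": pvariance(lengths) if n else 0.0,
        "hyperlink_ratio": sum(hyperlink_flags) / n if n else 0.0,
    }
```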
Optionally, the picture feature may further include: the picture self-characteristics and/or the page characteristics corresponding to the picture.
Optionally, the web page blocking module 701 may include:
the one-dimensional vector determining submodule is used for obtaining a corresponding one-dimensional vector according to the page elements in the webpage source code; and
and the vector blocking submodule is used for blocking the one-dimensional vector according to a preset boundary element so as to obtain a plurality of page blocks which can be included in the webpage.
Optionally, the preset boundary element may include: a style element, a script element, an annotation element, an external content element, or an element containing a non-numeric identifier.
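The two submodules can be illustrated with a Python sketch that flattens the page elements into a one-dimensional sequence and cuts it at boundary elements. The sketch is built on assumptions: `flatten`, `split_blocks` and `BOUNDARY_TAGS` are hypothetical names, HTML comments stand in for annotation elements, `iframe` stands in for an external content element, and elements containing particular identifiers are not handled.

```python
from html.parser import HTMLParser

# Hypothetical choice of boundary elements for this sketch.
BOUNDARY_TAGS = {"style", "script", "iframe"}


class _Flattener(HTMLParser):
    """Records element names in source order as a one-dimensional list."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append(tag)

    def handle_comment(self, data):
        self.elements.append("#comment")


def flatten(html):
    parser = _Flattener()
    parser.feed(html)
    parser.close()
    return parser.elements


def split_blocks(elements, boundaries=BOUNDARY_TAGS | {"#comment"}):
    """Cut the flat element list into page blocks at every boundary element."""
    blocks, current = [], []
    for element in elements:
        if element in boundaries:
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(element)
    if current:
        blocks.append(current)
    return blocks
```

Because only the source code is parsed, this kind of blocking needs no rendering, which is the resource saving the embodiments emphasize.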
Optionally, the target picture determining module 703 may include:
the characteristic vector determining submodule is used for obtaining a characteristic vector according to the picture characteristic; and
the model judgment submodule is used for inputting the feature vectors into a classification model so as to obtain a classification result output by the classification model; the categories corresponding to the classification models may include: a target picture category and a non-target picture category.
Optionally, the target picture determining module 703 may include:
the feature vector determination submodule is used for obtaining a feature vector according to the picture features;
the first model judgment submodule is used for inputting the feature vector into a first classification model if the webpage is a list page so as to obtain a first classification result output by the first classification model; the first classification model is obtained according to training data corresponding to the list page; or
The second model judgment submodule is used for inputting the feature vector into a second classification model if the webpage is a non-list page so as to obtain a second classification result output by the second classification model; and the second classification model is obtained according to the training data corresponding to the non-list page.
Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for data processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for: according to page elements included by a webpage source code, partitioning a webpage to obtain a plurality of page blocks included by the webpage; determining picture characteristics of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture; and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
Fig. 8 is a block diagram illustrating an apparatus 800 for data processing in accordance with an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice data processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions is also provided, such as the memory 804 comprising instructions, the instructions being executable by the processor 820 of the apparatus 800 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 9 is a schematic diagram of a server in some embodiments of the invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the data processing method shown in fig. 2 or 3 or 4 or 5.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a data processing method, the method comprising: according to page elements included by a webpage source code, partitioning a webpage to obtain a plurality of page blocks included by the webpage; determining picture characteristics of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture; and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
An embodiment of the invention discloses A1, a data processing method, the method comprising:
according to page elements included by a webpage source code, partitioning a webpage to obtain a plurality of page blocks included by the webpage;
determining picture characteristics of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture;
and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
A2, the method of A1, wherein the page structure features include: page block features corresponding to the picture and/or page element features around the picture.
A3, according to the method of A2, the page block features include at least one of the following:
a first number of page elements included in a page block;
a ratio of a first number of page elements included in the page block to a second number of page elements included in the web page;
a third amount of time information contained in the page block;
a ratio of a third amount of time information included in the page block to a fourth amount of time information included in the page block;
a ratio of a fifth amount of hyperlinked text contained by the page block relative to a sixth amount of text contained by the page block;
a ratio of the seventh number of hyperlinked pictures comprised by the page block to the eighth number of pictures comprised by the page block.
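The page block features listed above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `Element` record, the `DATE_RE` pattern used as a stand-in for "time information", and the choice to measure text amounts in characters are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical flattened page element (not defined in the source)."""
    tag: str               # e.g. "img", "a", "p"
    text: str = ""         # visible text carried by the element
    in_link: bool = False  # True if the element sits inside a hyperlink


# Assumption: "time information" means ISO-like dates such as 2018-06-01.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def safe_ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def page_block_features(block, page_element_count):
    """Compute a subset of the A3 page block features for one block."""
    pictures = [e for e in block if e.tag == "img"]
    link_chars = sum(len(e.text) for e in block if e.in_link)
    all_chars = sum(len(e.text) for e in block)
    return {
        "element_count": len(block),
        "element_ratio": safe_ratio(len(block), page_element_count),
        "time_count": sum(len(DATE_RE.findall(e.text)) for e in block),
        "link_text_ratio": safe_ratio(link_chars, all_chars),
        "link_picture_ratio": safe_ratio(
            sum(1 for p in pictures if p.in_link), len(pictures)
        ),
    }


block = [
    Element("img", in_link=True),
    Element("img"),
    Element("a", "more news", in_link=True),
    Element("p", "posted 2018-06-01"),
]
features = page_block_features(block, page_element_count=16)
```

A block dominated by hyperlinked text and hyperlinked pictures yields ratios close to 1, which is the kind of signal these features are meant to expose to the later judging step.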
A4, according to the method of A2, the picture surrounding page element features include: the number of page elements around the picture, and/or the proportion of hyperlinked pictures around the picture.
A5, the method of A1, wherein the surrounding text features of the picture comprise at least one of the following features:
recommending text features;
variance of length of surrounding text around the picture;
total number of surrounding texts around the picture;
the proportion of hyperlink texts in surrounding texts around the picture; and
average length of surrounding text around the picture.
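As a rough sketch of the surrounding text features in A5, the following Python function computes them from a list of nearby texts. The `RECOMMEND_WORDS` keyword list is hypothetical (the source does not say how recommending text is detected), and the hyperlink proportion is assumed to be counted per text rather than per character.

```python
from statistics import mean, pvariance

# Hypothetical keyword list; the source does not define recommending text.
RECOMMEND_WORDS = ("recommended", "related", "you may like")


def surrounding_text_features(texts):
    """texts: list of (text, is_hyperlink) pairs found around the picture."""
    lengths = [len(t) for t, _ in texts]
    n = len(texts)
    return {
        "has_recommend_text": any(
            w in t.lower() for t, _ in texts for w in RECOMMEND_WORDS
        ),
        "length_variance": pvariance(lengths) if n else 0.0,
        "text_count": n,
        "link_text_ratio": sum(1 for _, is_link in texts if is_link) / n if n else 0.0,
        "mean_length": mean(lengths) if n else 0.0,
    }


nearby = [("Related", True), ("News", True), ("Read now", False), ("ok", False)]
text_features = surrounding_text_features(nearby)
```

Population variance is used here; a sample variance would serve equally well as long as the same definition is applied when the training data is built.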
A6, the method of any of A1-A5, the picture feature further comprising: the picture self-characteristics and/or the page characteristics corresponding to the picture.
A7, the method of any one of A1 to A5, wherein the blocking the web page comprises:
obtaining a corresponding one-dimensional vector according to page elements in the webpage source code;
and partitioning the one-dimensional vector according to preset boundary elements to obtain a plurality of page blocks included by the web page.
A8, the method of A7, the preset boundary element comprising: a style element, a script element, an annotation element, an external content element, or an element containing a non-numeric identifier.
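The blocking step in A7 and A8 can be sketched without rendering the page, using only Python's standard `html.parser`. This is one possible reading rather than the patent's implementation: the "annotation element" is assumed to be an HTML comment, the "external content element" is assumed to cover `iframe`, `embed` and `object`, and the "element containing a non-numeric identifier" is left out because the source does not define it.

```python
from html.parser import HTMLParser

# Assumed boundary tags: style, script, and external content elements.
BOUNDARY_TAGS = {"style", "script", "iframe", "embed", "object"}
COMMENT = "#comment"  # marker used here for HTML comments


class Flattener(HTMLParser):
    """Flatten the page source into a one-dimensional sequence of tag names."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append(tag)

    def handle_comment(self, data):
        self.sequence.append(COMMENT)


def split_into_blocks(html):
    """Split the flattened sequence into page blocks at boundary elements."""
    parser = Flattener()
    parser.feed(html)
    parser.close()
    blocks, current = [], []
    for tag in parser.sequence:
        if tag in BOUNDARY_TAGS or tag == COMMENT:
            if current:  # close the block that ends at this boundary
                blocks.append(current)
            current = []
        else:
            current.append(tag)
    if current:
        blocks.append(current)
    return blocks


html = (
    '<div><img src="a.png"><p>caption</p></div>'
    "<script>var x = 1;</script>"
    '<ul><li><a href="/n">news</a></li></ul>'
    "<!-- footer -->"
    "<p>contact</p>"
)
blocks = split_into_blocks(html)
```

Because the sequence is built from the source code alone, no layout or rendering pass is needed, which matches the resource saving stated in the abstract.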
A9, the method of any one of A1 to A5, wherein the determining whether the corresponding picture is the target picture according to the picture features includes:
obtaining a feature vector according to the picture features;
inputting the feature vectors into a classification model to obtain a classification result output by the classification model; the classification model corresponds to categories including: a target picture category and a non-target picture category.
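The two steps of A9 can be sketched as follows. The source does not name a model type, so a tiny linear classifier with a logistic score stands in for it here; the feature order, weights and bias are illustrative placeholders, not trained values.

```python
import math

# Assumed fixed feature order so every picture maps to the same vector layout.
FEATURE_ORDER = ("link_text_ratio", "link_picture_ratio", "text_count")


def to_feature_vector(features):
    """Turn a feature dictionary into an ordered list of floats."""
    return [float(features[name]) for name in FEATURE_ORDER]


class LinearClassifier:
    """Stand-in two-class model; weights are hypothetical, not learned."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, vector):
        score = self.bias + sum(w * x for w, x in zip(self.weights, vector))
        probability = 1.0 / (1.0 + math.exp(-score))
        return "target" if probability >= 0.5 else "non-target"


model = LinearClassifier(weights=[-3.0, -2.0, 0.1], bias=1.0)

content_picture = {"link_text_ratio": 0.1, "link_picture_ratio": 0.0, "text_count": 3}
navigation_picture = {"link_text_ratio": 0.9, "link_picture_ratio": 1.0, "text_count": 2}

label_a = model.predict(to_feature_vector(content_picture))
label_b = model.predict(to_feature_vector(navigation_picture))
```

In practice the stand-in would be replaced by whatever trained classifier is available; only the contract (feature vector in, one of two categories out) comes from the text.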
A10, the method of any one of A1 to A5, wherein the determining whether the corresponding picture is the target picture according to the picture features includes:
obtaining a feature vector according to the picture features;
if the webpage is a list page, inputting the feature vector into a first classification model to obtain a first classification result output by the first classification model; the first classification model is obtained according to training data corresponding to the list page; or
if the webpage is a non-list page, inputting the feature vector into a second classification model to obtain a second classification result output by the second classification model; and the second classification model is obtained according to the training data corresponding to the non-list page.
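The model selection in A10 amounts to a dispatch on the page type. The sketch below assumes the list/non-list decision is already available as a boolean and uses two hypothetical threshold functions in place of the separately trained models.

```python
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], str]


def classify_picture(
    vector: Sequence[float],
    is_list_page: bool,
    list_model: Model,
    non_list_model: Model,
) -> str:
    """Route the feature vector to the model trained for this page type."""
    model = list_model if is_list_page else non_list_model
    return model(vector)


# Hypothetical stand-ins for models trained on list and non-list pages.
def list_page_model(vector):
    return "target" if vector[0] < 0.3 else "non-target"


def non_list_page_model(vector):
    return "target" if vector[0] < 0.7 else "non-target"


vector = [0.5]
on_list_page = classify_picture(vector, True, list_page_model, non_list_page_model)
on_other_page = classify_picture(vector, False, list_page_model, non_list_page_model)
```

The same feature vector can therefore receive different results depending on the page type, which is the point of training the two models on separate data.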
An embodiment of the invention discloses B11, a data processing apparatus, comprising:
the webpage blocking module is used for blocking a webpage according to page elements included by the webpage source code to obtain a plurality of page blocks included by the webpage;
the image characteristic determining module is used for determining the image characteristics of the image corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture; and
the target picture judging module is used for judging whether the corresponding picture is the target picture or not according to the picture characteristics.
B12, the apparatus according to B11, wherein the page structure features include: page block features corresponding to the picture and/or page element features around the picture.
B13, the apparatus of B12, the page block features including at least one of:
a first number of page elements included in a page block;
a ratio of a first number of page elements included in the page block to a second number of page elements included in the web page;
a third amount of time information contained in the page block;
a ratio of a third amount of time information included in the page block to a fourth amount of time information included in the page block;
a ratio of a fifth amount of hyperlinked text contained by the page block relative to a sixth amount of text contained by the page block;
a ratio of the seventh number of hyperlinked pictures comprised by the page block to the eighth number of pictures comprised by the page block.
B14, the apparatus of B12, the picture surrounding page element features comprising: the number of page elements around the picture, and/or the proportion of hyperlinked pictures around the picture.
B15, the apparatus of B11, the surrounding text features around the picture comprising at least one of:
recommending text features;
variance of length of surrounding text around the picture;
total number of surrounding texts around the picture;
the proportion of hyperlink texts in surrounding texts around the picture; and
average length of surrounding text around the picture.
B16, the apparatus of any of B11-B15, the picture feature further comprising: the picture self-characteristics and/or the page characteristics corresponding to the picture.
B17, the apparatus according to any one of B11 to B15, the webpage blocking module comprising:
the one-dimensional vector determining submodule is used for obtaining a corresponding one-dimensional vector according to the page elements in the webpage source code; and
the vector blocking submodule is used for blocking the one-dimensional vector according to preset boundary elements so as to obtain a plurality of page blocks included by the webpage.
B18, the apparatus of B17, the preset boundary element comprising: a style element, a script element, an annotation element, an external content element, or an element containing a non-numeric identifier.
B19, the apparatus according to any one of B11 to B15, the target picture judging module comprising:
the characteristic vector determining submodule is used for obtaining a characteristic vector according to the picture characteristic; and
the model judgment submodule is used for inputting the feature vectors into a classification model so as to obtain a classification result output by the classification model; the classification model corresponds to categories including: a target picture category and a non-target picture category.
B20, the apparatus according to any one of B11 to B15, the target picture judging module comprising:
the feature vector determination submodule is used for obtaining a feature vector according to the picture features;
the first model judgment submodule is used for inputting the feature vector into a first classification model if the webpage is a list page so as to obtain a first classification result output by the first classification model; the first classification model is obtained according to training data corresponding to the list page; or
the second model judgment submodule is used for inputting the feature vector into a second classification model if the webpage is a non-list page so as to obtain a second classification result output by the second classification model; and the second classification model is obtained according to the training data corresponding to the non-list page.
An embodiment of the invention discloses C21, an apparatus for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
according to page elements included by a webpage source code, partitioning a webpage to obtain a plurality of page blocks included by the webpage;
determining picture characteristics of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture;
and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
C22, the apparatus of C21, wherein the page structure features include: page block features corresponding to the picture and/or page element features around the picture.
C23, the apparatus of C22, the page block features including at least one of:
a first number of page elements included in a page block;
a ratio of a first number of page elements included in the page block to a second number of page elements included in the web page;
a third amount of time information contained in the page block;
a ratio of a third amount of time information included in the page block to a fourth amount of time information included in the page block;
a ratio of a fifth amount of hyperlinked text contained by the page block relative to a sixth amount of text contained by the page block;
a ratio of the seventh number of hyperlinked pictures comprised by the page block to the eighth number of pictures comprised by the page block.
C24, the apparatus of C22, the picture surrounding page element features comprising: the number of page elements around the picture, and/or the proportion of hyperlinked pictures around the picture.
C25, the apparatus of C21, the surrounding text features around the picture comprising at least one of:
recommending text features;
variance of length of surrounding text around the picture;
total number of surrounding texts around the picture;
the proportion of hyperlink texts in surrounding texts around the picture; and
average length of surrounding text around the picture.
C26, the device of any of C21-C25, the picture feature further comprising: the picture self-characteristics and/or the page characteristics corresponding to the picture.
C27, the apparatus of any one of C21 to C25, wherein the blocking the web page comprises:
obtaining a corresponding one-dimensional vector according to page elements in the webpage source code;
and partitioning the one-dimensional vector according to preset boundary elements to obtain a plurality of page blocks included by the web page.
C28, the apparatus of C27, the preset boundary element comprising: a style element, a script element, an annotation element, an external content element, or an element containing a non-numeric identifier.
C29, the apparatus of any one of C21 to C25, wherein the determining whether the corresponding picture is the target picture according to the picture features includes:
obtaining a feature vector according to the picture features;
inputting the feature vectors into a classification model to obtain a classification result output by the classification model; the classification model corresponds to categories including: a target picture category and a non-target picture category.
C30, the apparatus of any one of C21 to C25, wherein the determining whether the corresponding picture is the target picture according to the picture features includes:
obtaining a feature vector according to the picture features;
if the webpage is a list page, inputting the feature vector into a first classification model to obtain a first classification result output by the first classification model; the first classification model is obtained according to training data corresponding to the list page; or
if the webpage is a non-list page, inputting the feature vector into a second classification model to obtain a second classification result output by the second classification model; and the second classification model is obtained according to the training data corresponding to the non-list page.
An embodiment of the present invention discloses D31, a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the data processing method described in one or more of A1 to A10.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The data processing method, the data processing apparatus and the apparatus for data processing provided by the present invention are described in detail above, and specific examples are applied herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of data processing, the method comprising:
according to page elements included by a webpage source code, partitioning a webpage to obtain a plurality of page blocks included by the webpage;
determining picture characteristics of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture;
and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
2. The method of claim 1, wherein the page structure features comprise: page block features corresponding to the picture and/or page element features around the picture.
3. The method of claim 2, wherein the page block characteristics comprise at least one of:
a first number of page elements included in a page block;
a ratio of a first number of page elements included in the page block to a second number of page elements included in the web page;
a third amount of time information contained in the page block;
a ratio of a third amount of time information included in the page block to a fourth amount of time information included in the page block;
a ratio of a fifth amount of hyperlinked text contained by the page block relative to a sixth amount of text contained by the page block;
a ratio of the seventh number of hyperlinked pictures comprised by the page block to the eighth number of pictures comprised by the page block.
4. The method of claim 2, wherein the picture surrounding page element features comprise: the number of page elements around the picture, and/or the proportion of hyperlinked pictures around the picture.
5. The method of claim 1, wherein the surrounding text features around the picture comprise at least one of:
recommending text features;
variance of length of surrounding text around the picture;
total number of surrounding texts around the picture;
the proportion of hyperlink texts in surrounding texts around the picture; and
average length of surrounding text around the picture.
6. The method of any of claims 1 to 5, wherein the picture feature further comprises: the picture self-characteristics and/or the page characteristics corresponding to the picture.
7. The method of any of claims 1 to 5, wherein the blocking the web page comprises:
obtaining a corresponding one-dimensional vector according to page elements in the webpage source code;
and partitioning the one-dimensional vector according to preset boundary elements to obtain a plurality of page blocks included by the web page.
8. A data processing apparatus, comprising:
the webpage blocking module is used for blocking a webpage according to page elements included by the webpage source code to obtain a plurality of page blocks included by the webpage;
the image characteristic determining module is used for determining the image characteristics of the image corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture; and
the target picture judging module is used for judging whether the corresponding picture is the target picture or not according to the picture characteristics.
9. An apparatus for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
according to page elements included by a webpage source code, partitioning a webpage to obtain a plurality of page blocks included by the webpage;
determining picture characteristics of a picture corresponding to the page block; the picture features include: surrounding text features and page structure features around the picture;
and judging whether the corresponding picture is the target picture or not according to the picture characteristics.
10. A machine-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform a data processing method as claimed in one or more of claims 1 to 7.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810555878.XA CN110633399A (en) 2018-06-01 2018-06-01 Data processing method and device and data processing device

Publications (1)

Publication Number Publication Date
CN110633399A true CN110633399A (en) 2019-12-31

Family

ID=68967545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810555878.XA Pending CN110633399A (en) 2018-06-01 2018-06-01 Data processing method and device and data processing device

Country Status (1)

Country Link
CN (1) CN110633399A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110082868A1 (en) * 2009-10-02 2011-04-07 Aravind Musuluri System and method for block segmenting, identifying and indexing visual elements, and searching documents
CN101706793A (en) * 2009-11-16 2010-05-12 中兴通讯股份有限公司 Method and device for searching picture
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
CN103942211A (en) * 2013-01-21 2014-07-23 腾讯科技(深圳)有限公司 Text page recognition method and device
CN103473338A (en) * 2013-09-22 2013-12-25 北京奇虎科技有限公司 Webpage content extraction method and webpage content extraction system
CN104679788A (en) * 2013-12-02 2015-06-03 中国移动通信集团广东有限公司 Image processing method and device as well as terminal equipment
CN105183478A (en) * 2015-09-11 2015-12-23 中山大学 Webpage reestablishing method and device based on color transmission
CN105528758A (en) * 2016-01-12 2016-04-27 武汉精测电子技术股份有限公司 Image remapping method and device based on programmable logic device
WO2017177957A1 (en) * 2016-04-14 2017-10-19 Mediatek Inc. Non-local adaptive loop filter
CN107341162A (en) * 2016-05-03 2017-11-10 北京搜狗科技发展有限公司 Web page processing method and device, the device for Web Page Processing
CN106446139A (en) * 2016-09-20 2017-02-22 微梦创科网络科技(中国)有限公司 Webpage content extracting method and device
CN106649767A (en) * 2016-12-27 2017-05-10 东软集团股份有限公司 Web page information extraction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
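Gao Le et al.: "Improvement and Implementation of a Vision-Based Web Page Blocking Algorithm" (基于视觉的Web页面分块算法的改进与实现), 《计算机系统应用》 (Computer Systems & Applications), no. 04, 15 April 2009 (2009-04-15) *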

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343046A (en) * 2021-05-20 2021-09-03 成都美尔贝科技股份有限公司 Intelligent search sequencing system
CN113343046B (en) * 2021-05-20 2023-08-25 成都美尔贝科技股份有限公司 Intelligent search ordering system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220729

Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Applicant before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.