CN113688302B - Page data analysis method, device, equipment and medium - Google Patents
- Publication number
- CN113688302B (application CN202111005656.9A)
- Authority
- CN
- China
- Prior art keywords
- page
- node
- visual area
- visual
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure provides a method, apparatus, device, and medium for analyzing page data, and relates to the field of computers, in particular to computer network technology, search engine technology, and software application technology. The method comprises the following steps: determining a plurality of page nodes in a target page; determining at least one visual area in the target page, wherein each visual area in the at least one visual area comprises at least one page node in a plurality of page nodes; for each of the at least one visual region, determining a score for the visual region based on at least one page node included in the visual region; determining important visual areas based on the respective scores of the at least one visual area; and analyzing the important visual area.
Description
Technical Field
The present disclosure relates to the field of computers, and in particular to computer network technology, search engine technology, and software application technology, and more particularly to a page data analysis method, apparatus, electronic device, computer readable storage medium, and computer program product.
Background
A search engine crawls a large number of web pages, filters them, and adds the filtered web pages to an index library. After a user sends a query request to the search engine, the search engine screens out relevant pages according to the request, ranks the pages by various means, and presents all or part of the relevant pages to the user based on the ranking result.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a page data analysis method, apparatus, electronic device, computer readable storage medium, and computer program product.
According to an aspect of the present disclosure, a page data analysis method is provided. The page data analysis method comprises the following steps: determining a plurality of page nodes in a target page; determining at least one visual area in the target page, wherein each visual area in the at least one visual area comprises at least one page node in a plurality of page nodes; for each of the at least one visual region, determining a score for the visual region based on at least one page node included in the visual region; determining important visual areas based on the respective scores of the at least one visual area; and analyzing the important visual area.
According to another aspect of the present disclosure, there is provided a page data analysis apparatus. The page data analysis device includes: a first determination unit configured to determine a plurality of page nodes in a target page; a second determining unit configured to determine at least one visual area in the target page, wherein each of the at least one visual area includes at least one of a plurality of page nodes; a third determination unit configured to determine, for each of the at least one visual area, a score for the visual area based on at least one page node included in the visual area; a fourth determination unit configured to determine important visual areas based on the respective scores of the at least one visual area; and an analysis unit configured to analyze the important visual area.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the page data analysis method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described page data analysis method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described page data analysis method.
According to one or more embodiments of the present disclosure, by determining at least one visual area in a page and calculating scores of the visual areas based on at least one page node included in the visual areas, important visual areas can be determined in the visual areas, and further analysis work is performed on the important visual areas to complete tasks such as extraction of key information of page data and evaluation of page data quality, so that an efficient and accurate page data analysis method is achieved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a flow chart of a page data analysis method according to an exemplary embodiment of the present disclosure;
FIGS. 2A-2C illustrate schematic diagrams of a target page, page nodes, and visual areas according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flowchart for determining a score for a visual area according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a flowchart for determining an important visual area according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a block diagram of a page data analysis device according to an exemplary embodiment of the present disclosure; and
FIG. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the related art, existing methods for analyzing page data generally treat the page data as a whole, for example, evaluating the quality of the page data according to the density of text, links, and images, or the number of text items, links, and images in a page, or extracting key information from the whole page. Such analysis methods have the disadvantages of a single dimension, poor universality, and low accuracy.
In order to solve the above problems, the present disclosure determines at least one visual area in a page, and calculates a score of the visual area based on at least one page node included in the visual area, so that important visual areas can be determined in the visual areas, and further analysis work is performed on the important visual areas, so as to complete tasks such as extraction of key information of page data, evaluation of page data quality, and the like, thereby realizing an efficient and accurate page data analysis method.
According to an aspect of the present disclosure, a page data analysis method is provided. As shown in fig. 1, the page data analysis method includes: step S101, determining a plurality of page nodes in a target page; step S102, determining at least one visual area in a target page, wherein each visual area in the at least one visual area comprises at least one page node in a plurality of page nodes; step S103, for each visual area in at least one visual area, determining a score of the visual area based on at least one page node included in the visual area; step S104, determining important visual areas based on the respective scores of at least one visual area; and step S105, analyzing the important visual area.
Therefore, by determining at least one visual area in the page and calculating the scores of the visual areas based on at least one page node included in the visual areas, important visual areas can be determined in the visual areas, and further analysis work is carried out on the important visual areas so as to finish tasks such as page data key information extraction and page data quality assessment, and an efficient and accurate page data analysis method is realized.
According to some embodiments, the target page may be, for example, a web page crawled by a search engine. The page data analysis method described in the present disclosure may be applied to page data that has been crawled but not yet stored in the index library, or may be applied to page data that already exists in the index library, which is not limited herein.
According to some embodiments, the page structure of the target page may be, for example, a DOM tree structure, so that a plurality of page nodes and corresponding page node information in the target page may be determined according to the DOM tree structure.
According to some embodiments, the page node information may include, for example, at least one of: node abscissa, node ordinate, node width, node height, and relationships with other nodes. In one exemplary embodiment, for node information such as node abscissa, node ordinate, node width, and node height, for example, the target page may be drawn onto the canvas, and these node information may be determined from the nodes drawn on the canvas. In another exemplary embodiment, the relationship with other nodes may be obtained directly from the page data.
According to some embodiments, the relationships with other nodes may include, for example, parent-child nodes, ancestor-offspring nodes, co-parent nodes, co-ancestor nodes, sibling nodes, and unrelated nodes, among others.
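As a minimal sketch of how such node information might be held in memory, the following Python record stores the node position, size, and parent link. The class name `PageNode`, its field names, and the `relationship` helper are hypothetical and not taken from the disclosure; the helper distinguishes only three of the relationships listed above and uses only direct parent links.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageNode:
    """Hypothetical record for one page node; field names are illustrative."""
    node_id: str
    parent_id: Optional[str]  # None for the root of the DOM tree
    x: float       # node abscissa (top-left corner assumed here)
    y: float       # node ordinate (top-left corner assumed here)
    width: float
    height: float


def relationship(a: PageNode, b: PageNode) -> str:
    """Classify two nodes using only their direct parent links (simplified)."""
    if a.parent_id == b.node_id or b.parent_id == a.node_id:
        return "parent-child"
    if a.parent_id is not None and a.parent_id == b.parent_id:
        return "sibling"
    return "unrelated"


title = PageNode("title", "header", x=100, y=40, width=600, height=48)
source = PageNode("source", "header", x=100, y=96, width=600, height=20)
header = PageNode("header", None, x=100, y=40, width=600, height=80)
print(relationship(title, source))  # sibling
print(relationship(header, title))  # parent-child
```

Ancestor-descendant and co-ancestor relationships would require walking the full tree and are omitted from this sketch.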
According to some embodiments, step S102, determining at least one visual area in the target page may include: partially merging the plurality of page nodes based on the node information of each of the plurality of page nodes to obtain at least one visual area. Because the page nodes are merged based on the node information, each visual area comprises at least one page node with the same or similar characteristics, and each visual area can be analyzed further as an independent whole with a certain commonality, which improves the accuracy of the page data analysis method.
According to some embodiments, partially merging the plurality of page nodes may include at least one of: merging, among the plurality of page nodes, at least one page node having the same node abscissa and the same node width and whose relationship with other nodes satisfies a preset relationship; and merging, among the plurality of page nodes, at least one page node having the same node ordinate and the same node height and whose relationship with other nodes satisfies the preset relationship. In this way, page nodes that match in size and whose relationships with other nodes satisfy the preset relationship are merged in the transverse and longitudinal directions, yielding at least one visual area that is regular in shape and whose nodes satisfy the preset relationship. It is understood that the node abscissa may include the abscissa of the node center position, the abscissas of the four corners of the node, the abscissas of the center points of the four sides of the node, and the like, which are not limited herein. Similarly, the node ordinate may include the ordinate of the node center position, the ordinates of the four corners of the node, the ordinates of the center points of the four sides of the node, and the like, which are not limited herein.
According to some embodiments, the preset relationship may include, for example, co-parent nodes, co-ancestor nodes, and sibling nodes. In some embodiments, co-parent nodes and co-ancestor nodes are in a longitudinally aligned relationship. In other embodiments, sibling nodes are in a laterally aligned relationship.
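The first merging rule can be illustrated with a short Python sketch. It groups nodes that share the same abscissa, the same width, and the same parent; sharing a parent stands in for the preset relationship, which is an assumption made only for this example. The function name `merge_vertically` and the dictionary keys are hypothetical.

```python
from collections import defaultdict


def merge_vertically(nodes):
    """Group nodes that share x, width and parent into candidate visual areas.

    Sharing a parent is an assumed stand-in for the preset relationship.
    """
    groups = defaultdict(list)
    for node in nodes:
        key = (node["x"], node["width"], node["parent"])
        groups[key].append(node["id"])
    # Dicts preserve insertion order, so areas appear in first-seen order.
    return list(groups.values())


nodes = [
    {"id": "p1", "x": 100, "width": 600, "parent": "article"},
    {"id": "p2", "x": 100, "width": 600, "parent": "article"},
    {"id": "img", "x": 100, "width": 600, "parent": "article"},
    {"id": "nav1", "x": 0, "width": 80, "parent": "nav"},
]
areas = merge_vertically(nodes)
print(areas)  # [['p1', 'p2', 'img'], ['nav1']]
```

The second rule (same ordinate and same height) would follow the same pattern with `y` and `height` in the grouping key.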
In one exemplary embodiment, as shown in FIGS. 2A-2B, in a target page 200, the multiple sibling nodes 202 in the navigation bar at the top of the page may be merged into a visual area 220; the title node 204 and the publication time and source node 206, which share the same parent node, may be merged into a visual area 222; the plurality of sibling nodes 208 of the sharing region on the left may be merged into a visual area 224; the two text paragraph nodes 210 and the image node 212 of the body part may be merged into a visual area 226; and the "forward review" heading node 214 on the right, together with the article heading nodes 216 and publication time nodes 218 of the plurality of articles, which share the same parent node, may be merged into a visual area 228.
According to some embodiments, the size of each visual area may not exceed the size of the display interface used to display the target page. In some embodiments, some web pages, such as long articles, may be very long, and a user browsing such a page can typically see only part of it in the display interface of the browser. Limiting the size of the visual area to the size of the display interface therefore keeps the visual areas obtained by the method closer to what a user can actually browse, and also avoids generating overly large visual areas on large pages, which would affect the processing speed and accuracy of the subsequent analysis.
According to some embodiments, as shown in fig. 3, step S103, determining the score of the visual area may include: step S301, determining respective first areas and node weights of at least one page node included in the visual area, wherein the first areas are determined according to the width and the height of the corresponding page node; step S302, multiplying the respective first areas of at least one page node by node weights to obtain at least one first score corresponding to the at least one page node one by one; and step S303, summing at least one first score to obtain a second score. Thus, the area of each node and the corresponding weight are multiplied to obtain the score of each node, and the score of each node is summed to obtain the score of the visual area. By using the method, when the score of the visual area is calculated, the area of the page node included in the visual area and the importance of the page node can be integrated, so that the calculated score of the visual area is more reasonable, and the accuracy of subsequent page data analysis is improved.
According to some embodiments, the first area may be, for example, the area of the page node itself, and may be determined according to the width and the height of the corresponding page node. In some embodiments, the upper left corner coordinates (lefttop_xpos, lefttop_ypos) and the lower right corner coordinates (rightbottom_xpos, rightbottom_ypos) of the page node may be determined, so that the width and the height of the page node are (rightbottom_xpos-lefttop_xpos) and (rightbottom_ypos-lefttop_ypos), respectively, and the first area is calculated as:
vnode=(rightbottom_xpos-lefttop_xpos)*(rightbottom_ypos-lefttop_ypos)
It will be appreciated that the calculation may also be performed using the upper right and lower left coordinates of the page node, or the first area of the page node may be obtained by other means, without limitation.
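The first-area formula can be written directly in Python. The function name `first_area` and the sample coordinates are illustrative; the parameter names follow the identifiers used in the formula above.

```python
def first_area(lefttop_xpos, lefttop_ypos, rightbottom_xpos, rightbottom_ypos):
    """Area of the node's own box, from its top-left and bottom-right corners."""
    width = rightbottom_xpos - lefttop_xpos
    height = rightbottom_ypos - lefttop_ypos
    return width * height


# Illustrative node: 300 wide, 200 high.
vnode = first_area(100, 50, 400, 250)
print(vnode)  # 60000
```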
According to some embodiments, each of the at least one page node may have a node type, and the node weight of the page node may be determined by the node type of the page node. Therefore, by setting different node weights for the page nodes of different node types, the score of the node type with more attention or more information content can be improved, the score of the node type with less attention or less information content can be reduced, the score of the visual area obtained by calculation is more reasonable, and the accuracy of the subsequent page data analysis is improved.
According to some embodiments, the node type comprises, for example, at least one of: images, videos, text, and links. In one exemplary embodiment, the video node corresponds to the highest node weight, the image node corresponds to the second highest node weight, the link node corresponds to the third highest node weight, and the text node corresponds to the lowest node weight. The node weight can also be adjusted according to the page type of the target page. In another embodiment, the node weight ranking of the content page may be, for example, "video" above "image" above "link" above "text", while the node weight ranking of the catalog page may be, for example, "link" above "text" above "video" above "image". It will be appreciated that the foregoing is merely exemplary of a node weight setting manner, and those skilled in the art may set corresponding node weights for different nodes by themselves, which is not limited herein.
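One way to encode these rankings is a lookup table keyed by page type and node type, as in the sketch below. The numeric weights are hypothetical; the disclosure gives only the orderings, not values. The names `NODE_WEIGHTS` and `node_weight` are likewise illustrative.

```python
# Hypothetical weights: only the orderings come from the text above.
NODE_WEIGHTS = {
    "content": {"video": 4.0, "image": 3.0, "link": 2.0, "text": 1.0},
    "catalog": {"link": 4.0, "text": 3.0, "video": 2.0, "image": 1.0},
}


def node_weight(node_type, page_type="content"):
    """Look up the weight of a node type for a given page type."""
    return NODE_WEIGHTS[page_type][node_type]


print(node_weight("video"))            # 4.0
print(node_weight("link", "catalog"))  # 4.0
```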
According to some embodiments, after obtaining the respective first area vnode and node weight w_res of the at least one page node, the first area and the node weight of each page node may be multiplied to obtain at least one first score (w_res*vnode) in one-to-one correspondence with the at least one page node, and the first scores are summed to obtain a second score, i.e., the score of the visual area calculated based on the first areas of the page nodes:
varea_1=∑(w_res*vnode)
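The sum above can be sketched in Python as follows. The function name `visual_area_score`, the tuple layout of the nodes, and the weight values are assumptions made for illustration.

```python
def visual_area_score(nodes, weights):
    """Second score: sum of w_res * vnode over the nodes of one visual area.

    Each node is given as (node_type, width, height).
    """
    return sum(weights[node_type] * width * height
               for node_type, width, height in nodes)


weights = {"text": 1.0, "image": 3.0}  # hypothetical weights
body_nodes = [("text", 600, 200), ("image", 600, 300)]
varea_1 = visual_area_score(body_nodes, weights)
print(varea_1)  # 660000.0
```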
According to some embodiments, the periphery of a page node may also include a padding area, a border area, and a margin area. These areas surround the page node and can make its content stand out, so their area can be counted as part of the node area. In one exemplary embodiment, as shown in FIG. 2C, the title node 204 and its peripheral area together form a region 230.
According to some embodiments, as shown in FIG. 3, step S103, determining the score of the visual area may further include: step S304, determining respective second areas of the at least one page node included in the visual area, wherein each second area comprises the first area of the corresponding page node and at least one of the padding area, the border area, and the margin area of the corresponding page node; step S305, multiplying the respective second areas of the at least one page node by the node weights to obtain at least one third score in one-to-one correspondence with the at least one page node; and step S306, summing the at least one third score to obtain a fourth score. Because the padding area, the border area, and the margin area around the node are also counted as part of the node area, the calculated score of the visual area is more reasonable, which improves the accuracy of subsequent page data analysis.
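A minimal sketch of the second area follows, assuming for simplicity that the padding, border, and margin have the same thickness on all four sides of the node. That uniformity, the function name `second_area`, and the sample values are assumptions, not part of the disclosure.

```python
def second_area(width, height, padding=0.0, border=0.0, margin=0.0):
    """Area of the node box plus its surrounding padding, border and margin.

    Assumes each surrounding band has uniform thickness on all four sides.
    """
    extra = 2 * (padding + border + margin)
    return (width + extra) * (height + extra)


# Illustrative title node: 300 x 40 with generous surrounding space.
area_1 = 300 * 40                                              # first area
area_2 = second_area(300, 40, padding=10, border=1, margin=19)  # second area
print(area_1, area_2)  # 12000 36000
```

Multiplying each second area by the node weight and summing over the nodes of a visual area gives the fourth score, in the same way that the first areas give the second score.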
According to some embodiments, some page nodes have large peripheral areas; such nodes are typically the title of the page or other relatively important regions of the page. Therefore, a visual area with a large peripheral area can be directly used as the important visual area. As shown in FIG. 3, step S103, determining the score of the visual area may further include: step S307, in response to determining that the ratio of the fourth score to the second score is greater than a preset ratio, directly determining the visual area as the important visual area. This further improves the efficiency of determining the important visual area and the accuracy of subsequent page data analysis.
According to some embodiments, the preset ratio may be, for example, 1.5:1, 2:1, 2.5:1, 3:1, or another ratio, which is not limited herein. It will be appreciated that those skilled in the art may also use other methods to directly determine a visual area with a large peripheral area as the important visual area, which is not limited herein.
It will be appreciated that the important visual area in the page may include only one visual area, or may include multiple visual areas. In the case where a certain visual area is directly determined as an important visual area because of its large peripheral area, that visual area alone may be used as the important visual area of the target page, or further potentially important visual areas may continue to be sought among the other visual areas, which is not limited herein.
According to some embodiments, in a web page, the attention that users pay to different positions varies greatly, and the producer of the web page will accordingly place the most important content in the area that receives the most user attention. As shown in FIG. 3, step S103, determining the score of the visual area may further include: step S308, adjusting the score of the visual area based on the position of the visual area on the target page. In this way, the scores of visual areas located at different positions in the target page are differentiated, so that the scores of the visual areas are more reasonable and the accuracy of subsequent page data analysis is improved.
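The ratio test of step S307 can be sketched as below. The function name `is_directly_important` and the default of 2:1 are illustrative choices; the guard against a non-positive second score is an added assumption.

```python
def is_directly_important(second_score, fourth_score, preset_ratio=2.0):
    """Return True when fourth_score / second_score exceeds the preset ratio."""
    if second_score <= 0:
        return False  # assumed guard: no meaningful ratio without node area
    return fourth_score / second_score > preset_ratio


print(is_directly_important(12000, 36000))  # True  (ratio 3.0)
print(is_directly_important(12000, 18000))  # False (ratio 1.5)
```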
According to some embodiments, adjusting the score of the visual area based on the location of the visual area may include, for example: in response to detecting that the visual area is located at the first screen of the target page, the score of the visual area is adjusted upward. Therefore, the important visual area is more prone to appear in the first screen by improving the score of the visual area in the first screen, so that the score of the visual area is more reasonable, and the accuracy of subsequent page data analysis is improved.
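A possible form of this adjustment is sketched below. It treats a visual area as being in the first screen when its top edge lies above the bottom of the viewport, and multiplies its score by a boost factor. The test for "first screen", the factor of 1.5, and the name `adjust_for_position` are all assumptions; the disclosure states only that the score is adjusted upward.

```python
def adjust_for_position(score, area_top, viewport_height, first_screen_boost=1.5):
    """Raise the score of a visual area that starts within the first screen."""
    if area_top < viewport_height:
        return score * first_screen_boost
    return score


print(adjust_for_position(1000.0, area_top=300, viewport_height=900))   # 1500.0
print(adjust_for_position(1000.0, area_top=2400, viewport_height=900))  # 1000.0
```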
According to some embodiments, as shown in fig. 4, step S104, determining the important visual area includes: step S401, calculating a total score of the page based on the respective scores of at least one visual area; step S402, calculating the respective resource density proportion of at least one visual area based on the respective score of the at least one visual area and the total score of the page; and step S403, determining important visual areas in the at least one visual area based on the respective resource density proportions of the at least one visual area. Therefore, a reasonable determination mode of the important visual areas is provided by calculating the total score of the page and calculating the proportion of the score of each visual area to the total score of the page, namely the resource density proportion, so that the important visual areas are determined based on the resource density proportion of each visual area.
According to some embodiments, the total score of the page may be, for example, the sum of the respective scores of the at least one visual area:
all_varea=∑varea
According to some embodiments, the resource density ratio of each of the at least one visual area may be, for example, the quotient of the score of that visual area divided by the total score of the page:
varea_density_ratio=varea/all_varea
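The two formulas combine into a few lines of Python. The function name `resource_density_ratios`, the area names, and the scores are illustrative, and the handling of a zero total is an added assumption.

```python
def resource_density_ratios(scores):
    """Map each visual area to varea / all_varea."""
    all_varea = sum(scores.values())
    if all_varea == 0:
        return {name: 0.0 for name in scores}  # assumed guard for empty pages
    return {name: varea / all_varea for name, varea in scores.items()}


scores = {"nav": 100.0, "title": 300.0, "body": 600.0}
ratios = resource_density_ratios(scores)
print(ratios)  # {'nav': 0.1, 'title': 0.3, 'body': 0.6}
```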
According to some embodiments, the important visual area comprises at least one of: a preset number of visual areas having the highest resource density ratios among the at least one visual area; and a visual area, among the at least one visual area, whose resource density ratio exceeds a preset proportion. In this way, the visual areas with the highest resource density ratios, or those meeting the preset proportion requirement, can be used as the important visual areas, which provides a reasonable way of determining the important visual areas. It will be appreciated that a person skilled in the art may set the preset number and the preset proportion as needed to obtain a suitable number of visual areas as important visual areas, or may determine the important visual areas based on the resource density ratio of each visual area in other manners, which is not limited herein.
According to some embodiments, step S105, analyzing the important visual area may include, for example: performing data mining on the important visual area to extract key information or core content of the target page. Analyzing the important visual area may further include, for example, evaluating the quality of resources such as text, links, images, and video in the important visual area to obtain an evaluation result of the quality of the target page, and determining whether to include the target page in the index library of the search engine according to the evaluation result. It will be appreciated that those skilled in the art can determine, as needed, the analysis to be performed on the important visual areas, which is not limited herein.
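The two selection criteria can be sketched as follows. Taking the union of the two criteria when both are supplied is one possible reading of "at least one of" and is an assumption of this example, as are the function name `select_important` and the parameter defaults.

```python
def select_important(ratios, preset_number=1, preset_proportion=None):
    """Pick the top-N areas by density ratio, plus any above a threshold."""
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    selected = set(ranked[:preset_number])
    if preset_proportion is not None:
        selected |= {name for name, r in ratios.items() if r > preset_proportion}
    # Return in descending order of resource density ratio.
    return [name for name in ranked if name in selected]


ratios = {"nav": 0.1, "title": 0.3, "body": 0.6}
print(select_important(ratios, preset_number=1))                          # ['body']
print(select_important(ratios, preset_number=1, preset_proportion=0.25))  # ['body', 'title']
```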
According to another aspect of the present disclosure, there is also provided a page data analysis apparatus. As shown in fig. 5, the page data analysis device 500 includes: a first determining unit 510 configured to determine a plurality of page nodes in a target page; a second determining unit 520 configured to determine at least one visual area in the target page, wherein each of the at least one visual area comprises at least one of a plurality of page nodes; a third determining unit 530 configured to determine, for each of the at least one visual area, a score of the visual area based on at least one page node included in the visual area; a fourth determining unit 540 configured to determine important visual areas based on the respective scores of the at least one visual area; and an analysis unit 550 configured to analyze the important visual area.
The operations of the units 510-550 of the page data analysis device 500 are similar to those of the steps S101-S105 of the page data analysis method described above, and will not be described here.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Referring to fig. 6, a block diagram of an electronic device 600 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the device 600. It may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the device 600 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as a page data analysis method. For example, in some embodiments, the page data analysis method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When a computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the page data analysis method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the page data analysis method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the high management difficulty and weak service scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed technical solutions are achieved; no limitation is imposed herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.
Claims (13)
1. A method of page data analysis, comprising:
determining a plurality of page nodes in a target page;
determining at least one visual area in the target page, wherein each visual area in the at least one visual area comprises at least one page node in the plurality of page nodes;
for each of the at least one visual area, determining a score for the visual area based on at least one page node included in the visual area, comprising:
determining respective first areas and node weights of at least one page node included in the visual area, wherein the first areas are determined according to the width and the height of the corresponding page node;
multiplying the respective first areas of the at least one page node by node weights to obtain at least one first score corresponding to the at least one page node one by one;
summing the at least one first score to obtain a second score for the visual area;
determining respective second areas of at least one page node included in the visual area, wherein the second areas comprise first areas of corresponding page nodes and at least one of filling areas, boundary areas and blank areas of the corresponding page nodes;
multiplying the second areas of the at least one page node and the node weights respectively to obtain at least one third score corresponding to the at least one page node one by one; and
summing the at least one third score to obtain a fourth score for the visual area;
determining important visual areas based on the respective scores of the at least one visual area, comprising:
in response to determining that the ratio of the fourth score to the second score for one of the at least one visual area is greater than a preset ratio, directly determining that visual area as an important visual area; and
analyzing the important visual area.
2. The method of claim 1, wherein determining at least one visual area in the target page comprises:
partially merging the plurality of page nodes, based on the node information of each of the plurality of page nodes, to obtain the at least one visual area.
3. The method of claim 2, wherein the node information comprises at least one of: node abscissa, node ordinate, node width, node height, and relationships with other nodes.
4. The method of claim 3, wherein partially merging the plurality of page nodes comprises at least one of:
merging those page nodes of the plurality of page nodes that have the same node abscissa and the same node width and whose relationships with other nodes satisfy a preset relationship; and
merging those page nodes of the plurality of page nodes that have the same node ordinate and the same node height and whose relationships with other nodes satisfy the preset relationship.
5. The method of any of claims 1-4, wherein the size of each visual area does not exceed a display interface size for displaying the target page.
6. The method of claim 1, wherein each of the at least one page node has a node type, and the node weight of the page node is determined by the node type of the page node.
7. The method of claim 6, wherein the node type comprises at least one of: images, videos, text, and links.
8. The method of claim 1, wherein determining the score for the visual area further comprises:
adjusting the score of the visual area based on the location of the visual area in the target page.
9. The method of claim 8, wherein adjusting the score of the visual area based on the location of the visual area comprises:
in response to detecting that the visual area is located on the first screen of the target page, adjusting the score of the visual area upward.
10. A page data analysis device comprising:
a first determination unit configured to determine a plurality of page nodes in a target page;
a second determining unit configured to determine at least one visual area in the target page, wherein each of the at least one visual area includes at least one of the plurality of page nodes;
a third determining unit configured to determine, for each of the at least one visual area, a score of the visual area based on at least one page node included in the visual area, including:
determining respective first areas and node weights of at least one page node included in the visual area, wherein the first areas are determined according to the width and the height of the corresponding page node;
multiplying the respective first areas of the at least one page node by node weights to obtain at least one first score corresponding to the at least one page node one by one;
summing the at least one first score to obtain a second score for the visual area;
determining respective second areas of at least one page node included in the visual area, wherein the second areas comprise first areas of corresponding page nodes and at least one of filling areas, boundary areas and blank areas of the corresponding page nodes;
multiplying the second areas of the at least one page node and the node weights respectively to obtain at least one third score corresponding to the at least one page node one by one; and
summing the at least one third score to obtain a fourth score for the visual area;
a fourth determination unit configured to determine important visual areas based on the respective scores of the at least one visual area, including:
in response to determining that the ratio of the fourth score to the second score for one of the at least one visual area is greater than a preset ratio, directly determining that visual area as an important visual area; and
an analysis unit configured to analyze the important visual area.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
13. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-9.
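The scoring recited in claims 1, 6, and 9 can be illustrated with a short worked sketch. It makes several assumptions that the claims do not fix: the weight values in `NODE_WEIGHTS` are invented, the "filling", "boundary", and "blank" areas are read as uniform per-side padding, border, and margin widths, "first screen" is read as the area's top edge lying within one screen height, and the `boost` factor is arbitrary. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed weights per node type; claims 6-7 name the types, not the values.
NODE_WEIGHTS = {"image": 3.0, "video": 4.0, "text": 1.0, "link": 2.0}


@dataclass(frozen=True)
class Node:
    node_type: str
    width: float
    height: float
    padding: float = 0.0  # "filling" width per side (assumed uniform)
    border: float = 0.0   # "boundary" width per side (assumed uniform)
    margin: float = 0.0   # "blank" width per side (assumed uniform)

    @property
    def first_area(self) -> float:
        # First area: determined by the node's width and height.
        return self.width * self.height

    @property
    def second_area(self) -> float:
        # Second area: first area plus the surrounding padding/border/margin.
        extra = 2 * (self.padding + self.border + self.margin)
        return (self.width + extra) * (self.height + extra)


def second_score(area: Sequence[Node]) -> float:
    """Sum of the first scores (first area * node weight)."""
    return sum(n.first_area * NODE_WEIGHTS[n.node_type] for n in area)


def fourth_score(area: Sequence[Node]) -> float:
    """Sum of the third scores (second area * node weight)."""
    return sum(n.second_area * NODE_WEIGHTS[n.node_type] for n in area)


def is_important(area: Sequence[Node], preset_ratio: float) -> bool:
    """True when fourth score / second score exceeds the preset ratio."""
    base = second_score(area)
    if base == 0:
        return False  # No measurable content; the ratio is undefined.
    return fourth_score(area) / base > preset_ratio


def adjust_for_first_screen(score: float, area_top: float,
                            screen_height: float, boost: float = 1.5) -> float:
    """Raise the score when the area starts within the first screen."""
    return score * boost if area_top < screen_height else score
```

For an area holding a 100 x 50 image with 5 units of padding and a 100 x 20 text node, the second score is 5000 * 3 + 2000 * 1 = 17000 and the fourth score is 6600 * 3 + 2000 * 1 = 21800, so the ratio is roughly 1.28; under these assumed weights the area would count as important for a preset ratio of 1.2 but not for 1.3.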
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111005656.9A CN113688302B (en) | 2021-08-30 | 2021-08-30 | Page data analysis method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113688302A (en) | 2021-11-23 |
CN113688302B (en) | 2024-03-19 |
Family
ID=78584033
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111005656.9A Active CN113688302B (en) | 2021-08-30 | 2021-08-30 | Page data analysis method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113688302B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107741942A (en) * | 2016-12-09 | 2018-02-27 | Tencent Technology (Shenzhen) Co., Ltd. | Webpage content extraction method and device |
WO2018103540A1 (en) * | 2016-12-09 | 2018-06-14 | Tencent Technology (Shenzhen) Co., Ltd. | Webpage content extraction method, device, and data storage medium |
CN113076480A (en) * | 2021-04-21 | 2021-07-06 | Baidu Online Network Technology (Beijing) Co., Ltd. | Page recommendation method and device, electronic equipment and medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11194884B2 (en) * | 2019-06-19 | 2021-12-07 | International Business Machines Corporation | Method for facilitating identification of navigation regions in a web page based on document object model analysis |
Non-Patent Citations (2)
Title |
---|
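Research on an algorithm for discovering the largest meaningful node in Web pages; Li Yazi; Fang An; Chen Wei; Zhu Feng; New Technology of Library and Information Service (No. 10); full text * |
Visual model and algorithm for automatic keyword extraction from navigation-oriented web pages; Peng Hao; Cai Meiling; Chen Jifeng; Liu Chi; Yu Bingrui; Journal of Computer Applications (No. 08); paragraphs 2360-2368 * |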
Also Published As
Publication number | Publication date |
---|---|
CN113688302A (en) | 2021-11-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |