CN114139078B - Method and device for extracting elements from web page, computer equipment and readable storage medium - Google Patents
- Publication number
- CN114139078B (application number CN202111437278.1A)
- Authority
- CN
- China
- Prior art keywords
- score
- area
- elements
- sub
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
Abstract
The invention discloses a method and a device for extracting elements from a web page, a computer device, and a readable storage medium. The method comprises the following steps: obtaining influence factor information of each element in a target web page, wherein the influence factor information at least comprises the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element, and the distance from the element to a root node; calculating the final score of each element according to the influence factor information of each element in the target web page; and sorting the elements in descending order of final score and acquiring a preset number of top-ranked elements. The method and the device can rapidly and accurately extract target data from a web page. The invention also relates to blockchain technology.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for extracting elements from a web page, a computer device, and a readable storage medium.
Background
Web pages are everywhere. A user browsing a web page can easily find the data of interest, but automatically marking all the elements of interest by means of an algorithm is far less straightforward. Most current web page data extraction still relies on people manually customizing different filtering rules for different web pages.
A common example is car models: a user who is interested in certain models can conveniently find all of them when visiting the manufacturer's official website. However, if this process is handed to a program, it is not easy to find a suitable algorithm or rule for obtaining the data the user needs, because web page structure differs greatly between brands.
Therefore, how to quickly and accurately extract the target data from the web page is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
Based on the above, the invention provides a method, a device, a computer device and a readable storage medium for extracting elements in a webpage so as to extract target data from the webpage rapidly and accurately.
In order to achieve the above object, the present invention provides a method for extracting elements from a web page, the method comprising:
obtaining influence factor information of each element in a target webpage, wherein the influence factor information at least comprises the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element and the distance from the element to a root node;
Calculating an average occupation area score of the element according to the average occupation area of the element, calculating a subelement number score of the element according to the number of subelements in the element, calculating a subelement area variance score of the element according to the area variance of the subelements in the element, calculating a root node distance score of the element according to the distance from the element to the root node, and calculating a final score of the element according to the average occupation area score, the subelement number score, the subelement area variance score and the root node distance score of the element;
and sorting the elements in descending order of final score, and acquiring a preset number of top-ranked elements.
Preferably, the step of calculating an average footprint score for an element from the average footprint of the element comprises:
acquiring the area of each subelement in the element;
judging the size relation between the sum of the areas of all the subelements in the element and a preset area reference value;
if the sum of the areas of all the subelements in the element is larger than or equal to a preset area reference value, calculating the average occupation area score of the element by adopting the following formula;
S(E)=a1+(S/V)*b1;
wherein S (E) represents an average occupation area score of the element, S represents a sum of areas of all sub-elements in the element, V represents an area reference value, a1 represents a first set value, and b1 represents a first scale factor;
If the sum of the areas of all the subelements in the element is smaller than a preset area reference value, calculating the average occupation area score of the element by adopting the following formula:
S(E)=a2+(S/V)*b2;
where a2 represents a second set value and b2 represents a second scaling factor.
Preferably, after calculating the average footprint score of the element, the method further comprises:
Judging whether the calculated average area score of the elements is larger than 1;
if the calculated average area score of the element is larger than 1, taking 1 as the final average area score of the element;
If the calculated average area score of the element is less than or equal to 1, the actual calculation result is taken as the final average area score of the element.
Preferably, the step of calculating the subelement number score of the element from the number of subelements in the element comprises:
acquiring three number intervals, namely a first interval, a second interval and a third interval whose numerical ranges increase in sequence, wherein the first interval is (0, x), the second interval is [x, y), the third interval is [y, +∞), and 0 < x < y;
acquiring the number of sub-elements in the element;
if the number of the sub-elements in the element is in the first interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a3*n;
wherein N(E) is the sub-element number score of the element, a3 is a third set value, and n is the number of sub-elements in the element;
If the number of the sub-elements in the element is in the second interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a4+(n-x)*[a5/(y-x)];
wherein a4 is a fourth set value, and a5 is a quantity coefficient;
And if the number of the sub-elements in the element is in the third interval, the sub-element number score of the element takes a fixed value.
Preferably, the step of calculating the sub-element area variance score of the element from the area variance of the sub-element of the element comprises:
acquiring the area of each subelement in the element;
Dividing the area of each subelement by the sum of the areas of the subelements to obtain an array;
calculating the variance of the array;
And calculating the subelement area variance score of the element according to the variance, wherein the calculation formula is as follows:
D(E)=F-d;
wherein D(E) is the sub-element area variance score of the element, d is the variance of the array, and F is an empirical value.
Preferably, in the step of calculating the root node distance score of the element according to the element-to-root node distance, the root node distance score of the element is calculated using the following formula:
L(E)=1-e^(-Path(E));
Where L (E) is the root node distance score of the element, E represents the element being calculated, and Path (E) represents the distance of the element E to the root node.
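As an illustration only, the root node distance score can be sketched in Python. The sketch reads the formula as L(E) = 1 - e^(-Path(E)) and assumes that Path(E) is the number of edges between the element and the root node; the function name `distance_score` is hypothetical and does not come from the patent.

```python
import math


def distance_score(path_length: int) -> float:
    """Return L(E) = 1 - e^(-Path(E)).

    path_length is assumed to be the number of edges from the element
    to the root node (an assumption; the text only calls it a distance).
    """
    if path_length < 0:
        raise ValueError("path_length must be non-negative")
    return 1.0 - math.exp(-path_length)
```

Under these assumptions the root itself scores 0, and the score rises toward 1 as the element sits deeper in the tree, which matches the statement that elements farther from the root carry more specific data.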
Preferably, in the step of calculating the final score of the element according to the average occupation area score, the number of sub-elements score, the sub-element area variance score and the root node distance score of the element, the final score of the element is calculated by adopting the following formula:
Score(E)=S(E)*N(E)*D(E)*L(E)
Where Score (E) represents the final Score of the element, S (E) represents the average footprint Score of the element, N (E) represents the number of sub-elements Score of the element, D (E) represents the sub-element area variance Score of the element, and L (E) represents the root node distance Score of the element.
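The combination step can be sketched as a plain product of the four partial scores. The function name `final_score` and the sample values in the comment are illustrative assumptions, not values taken from the patent.

```python
import math


def final_score(s: float, n: float, d: float, l: float) -> float:
    """Return Score(E) = S(E) * N(E) * D(E) * L(E)."""
    return math.prod((s, n, d, l))


# Hypothetical partial scores for one element:
# final_score(0.74, 0.8, 0.98, 0.95) is roughly 0.551
```

Because the factors are multiplied rather than added, a very low value for any single factor pulls the final score down sharply, and a factor of 0 removes the element from consideration.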
In order to achieve the above object, the present invention further provides a device for extracting elements from a web page, where the device includes:
the first acquisition module is used for acquiring influence factor information of each element in the target webpage, wherein the influence factor information at least comprises the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element and the distance between the element and the root node;
The score calculating module is used for calculating the average occupation area score of the element according to the average occupation area of the element, calculating the subelement quantity score of the element according to the quantity of subelements in the element, calculating the subelement area variance score of the element according to the area variance of the subelement in the element, calculating the root node distance score of the element according to the distance from the element to the root node, and calculating the final score of the element according to the average occupation area score, the subelement quantity score, the subelement area variance score and the root node distance score of the element;
and the second acquisition module is used for sorting the elements in descending order of final score and acquiring a preset number of top-ranked elements.
Preferably, the score calculating module specifically includes a first calculating unit;
the first computing unit is specifically configured to:
acquiring the area of each subelement in the element;
judging the size relation between the sum of the areas of all the subelements in the element and a preset area reference value;
if the sum of the areas of all the subelements in the element is larger than or equal to a preset area reference value, calculating the average occupation area score of the element by adopting the following formula;
S(E)=a1+(S/V)*b1;
wherein S (E) represents an average occupation area score of the element, S represents a sum of areas of all sub-elements in the element, V represents an area reference value, a1 represents a first set value, and b1 represents a first scale factor;
If the sum of the areas of all the subelements in the element is smaller than a preset area reference value, calculating the average occupation area score of the element by adopting the following formula:
S(E)=a2+(S/V)*b2;
wherein a2 represents a second set value, and b2 represents a second scaling factor;
Judging whether the calculated average area score of the elements is larger than 1;
if the calculated average area score of the element is larger than 1, taking 1 as the final average area score of the element;
If the calculated average area score of the element is less than or equal to 1, the actual calculation result is taken as the final average area score of the element.
Preferably, the score calculating module specifically includes a second calculating unit;
the second computing unit is specifically configured to:
acquiring three number intervals, namely a first interval, a second interval and a third interval whose numerical ranges increase in sequence, wherein the first interval is (0, x), the second interval is [x, y), the third interval is [y, +∞), and 0 < x < y;
acquiring the number of sub-elements in the element;
if the number of the sub-elements in the element is in the first interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a3*n;
wherein N(E) is the sub-element number score of the element, a3 is a third set value, and n is the number of sub-elements in the element;
If the number of the sub-elements in the element is in the second interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a4+(n-x)*[a5/(y-x)];
wherein a4 is a fourth set value, and a5 is a quantity coefficient;
And if the number of the sub-elements in the element is in the third interval, the sub-element number score of the element takes a fixed value.
Preferably, the score calculating module specifically includes a third calculating unit;
the third computing unit is specifically configured to:
acquiring the area of each subelement in the element;
Dividing the area of each subelement by the sum of the areas of the subelements to obtain an array;
calculating the variance of the array;
And calculating the subelement area variance score of the element according to the variance, wherein the calculation formula is as follows:
D(E)=F-d;
wherein D(E) is the sub-element area variance score of the element, d is the variance of the array, and F is an empirical value.
Preferably, the score calculating module specifically includes a fourth calculating unit;
the fourth computing unit is specifically configured to compute a root node distance score of the element using the following formula:
L(E)=1-e^(-Path(E));
Where L (E) is the root node distance score of the element, E represents the element being calculated, and Path (E) represents the distance of the element E to the root node.
Preferably, the score calculating module specifically includes a fifth calculating unit;
the fifth calculation unit is specifically configured to calculate a final score of the element by using the following formula:
Score(E)=S(E)*N(E)*D(E)*L(E)
Where Score (E) represents the final Score of the element, S (E) represents the average footprint Score of the element, N (E) represents the number of sub-elements Score of the element, D (E) represents the sub-element area variance Score of the element, and L (E) represents the root node distance Score of the element.
To achieve the above object, the present invention also provides a computer device including a memory and a processor, wherein the memory stores computer readable instructions that, when executed by the processor, cause the processor to perform the steps of the element extraction method in a web page as described above.
To achieve the above object, the present invention also provides a readable storage medium storing computer readable instructions which, when executed by a processor, implement the steps of the element extraction method in a web page as described above.
The method obtains influence factor information of each element in a target web page, wherein the influence factor information at least comprises the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element, and the distance from the element to a root node. It then calculates the final score of each element according to the influence factor information of that element, sorts the elements in descending order of final score, and acquires a preset number of top-ranked elements, thereby automatically extracting the core elements from the target web page.
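The final selection step (sort by final score, keep a preset number of elements) might be sketched as follows. The function name `top_elements`, the use of string identifiers for elements, and the sample scores are assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def top_elements(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring elements, best first.

    scores maps a hypothetical element identifier to its final score.
    """
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Made-up scores for four elements of a page:
example = {"nav": 0.12, "car-list": 0.71, "footer": 0.05, "banner": 0.33}
best_two = top_elements(example, 2)
```

`heapq.nlargest` returns its result already sorted in descending order, so no separate sort is needed when only a small preset number of elements is wanted.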
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for extracting elements from a web page according to an embodiment of the present invention;
Fig. 2 is a detailed flowchart of step S102 in fig. 1;
Fig. 3 is a detailed flowchart of step S1021 in fig. 2;
Fig. 4 is a detailed flowchart of step S1022 in fig. 2;
Fig. 5 is a detailed flowchart of step S1023 in fig. 2;
FIG. 6 is a schematic diagram illustrating a device for extracting elements from a web page according to an embodiment of the invention;
FIG. 7 is a schematic diagram of the score computation module of FIG. 6;
FIG. 8 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the method for extracting elements in a web page according to an embodiment of the invention includes steps S101 to S103 as follows.
S101: acquiring influence factor information of each element in the target web page, wherein the influence factor information at least comprises the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element, and the distance from the element to the root node.
The method for extracting elements from a web page in this embodiment is described as applied to a terminal. It can be understood that the method can also be applied to a server, or to a system comprising a terminal and a server, in which case it is implemented through interaction between the terminal and the server. The terminal may be configured with a browser, such as IE, Firefox, Chrome, Safari or Opera, and the method for extracting elements from a web page provided in the embodiment of the present invention may be implemented through the browser on the terminal.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
A web page is typically made up of a plurality of elements, each of which in turn contains a plurality of sub-elements. The target web page is the web page from which elements need to be extracted; it may be the web page currently displayed on the terminal's screen, a web page currently selected by the user, or a web page automatically selected by software. Element extraction refers to automatically extracting the core elements, which are typically the elements of interest to the user, from all the elements in the web page.
In this embodiment, it is necessary to acquire and analyze the influence factor information of each element, where the influence factor information includes at least an average occupied area of the element, the number of sub-elements in the element, an area variance of the sub-elements in the element, and a distance from the element to the root node.
The average occupied area of the element is denoted by the symbol S. The larger the value of S, the more easily the element draws visual attention and the higher its importance.
The number of sub-elements in the element is denoted by the symbol N. The value of N should lie within a certain interval, neither too low nor too high.
The area variance of the sub-elements in the element is denoted by the symbol D. D represents how close the areas of the sub-elements are to one another, i.e., the degree of fluctuation of the data; a smaller D indicates that the areas of the sub-elements are closer to one another, which is more desirable.
The distance from the element to the root node is denoted by the symbol L. A larger L indicates that the element is farther from the root node and that its data is more specific.
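The four influence factors can be gathered in a single pass over the element tree. The following Python sketch assumes a simplified tree in which every node records its rendered width and height; `Node`, `Factors` and `collect_factors` are hypothetical names, and a real implementation would read these values from the browser's DOM instead.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a rendered page element (not a browser API)."""
    tag: str
    width: float = 0.0   # rendered width in pixels
    height: float = 0.0  # rendered height in pixels
    children: List["Node"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


@dataclass
class Factors:
    child_area_sum: float     # sum of sub-element areas, input to S(E)
    child_count: int          # number of sub-elements, input to N(E)
    child_areas: List[float]  # individual sub-element areas, input to D(E)
    depth: int                # edges to the root node, input to L(E)


def collect_factors(root: Node) -> Iterator[Tuple[Node, Factors]]:
    """Walk the tree depth-first and yield the factors of every element."""
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        areas = [child.area for child in node.children]
        yield node, Factors(sum(areas), len(areas), areas, depth)
        stack.extend((child, depth + 1) for child in node.children)
```

Only direct children are counted as sub-elements here, and depth is counted in edges; both are assumptions, since the text does not define them more precisely.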
S102, respectively calculating the final score of each element according to the influence factor information of each element in the target webpage.
In one embodiment, referring to fig. 2, step S102 specifically includes steps S1021 to S1025.
S1021, calculating the average occupation area score of the element according to the average occupation area of the element.
Referring to fig. 3, step S1021 specifically includes steps S10211 to S10217.
S10211, obtaining the area of each subelement in the element;
S10212, judging the size relation between the sum of the areas of all the sub-elements in the element and a preset area reference value;
S10213, if the sum of the areas of all the subelements in the element is larger than or equal to a preset area reference value, calculating the average occupation area score of the element by adopting the following formula:
S(E)=a1+(S/V)*b1; (1)
Where S (E) represents the average area score of the element, S represents the sum of the areas of all the sub-elements in the element (the unit is the square of the pixel), V represents the area reference value (the unit is the square of the pixel), a1 represents the first set value, b1 represents the first scale factor, and neither a1 nor b1 has a unit.
In this embodiment, a1 is specifically 0.8, and b1 is specifically 0.003, and it should be noted that, in implementation, the values of a1 and b1 may be manually set first, then a plurality of web pages are calculated, and the values of a1 and b1 are finely tuned by comparing the differences between the actual results and the estimated results.
S10214, if the sum of the areas of all the sub-elements in the element is smaller than a preset area reference value, calculating the average occupation area score of the element by adopting the following formula:
S(E)=a2+(S/V)*b2; (2)
wherein a2 represents a second set value, b2 represents a second scaling factor, and neither a2 nor b2 has a unit.
In this embodiment, a2 is specifically 0.5 and b2 is specifically 0.3. It should be noted that, in implementation, the values of a2 and b2 may first be set manually; a plurality of web pages are then evaluated, and the values of a2 and b2 are fine-tuned by comparing the differences between the actual results and the expected results.
S10215, judging whether the calculated average area score of the elements is larger than 1;
S10216, if the calculated average area score of the elements is larger than 1, taking 1 as the final average area score of the elements;
S10217, if the calculated average area score of the element is less than or equal to 1, taking the actual calculation result as the final average area score of the element.
In this embodiment, the S(E) calculated in step S10213 or S10214 should normally be a value in the range [0, 1], in which case the actual calculation result is taken as the final average occupation area score of the element. In some special cases, the S(E) calculated in step S10213 or S10214 may be greater than 1; 1 is then taken directly as the final average occupation area score of the element, i.e., S(E)=1.
For example, a region of a page that a person readily notices is generally about 100 pixels by 100 pixels, and thus V takes the value 10000 (unit: square pixels).
For an element A in the target web page, the sum of the areas of all its sub-elements is 800 (unit: square pixels), so formula (2) is used: S(E)=0.5+(800/10000)*0.3=0.524.
For an element B in the target web page, the sum of the areas of all its sub-elements is 8000 (unit: square pixels), so formula (2) is used: S(E)=0.5+(8000/10000)*0.3=0.74.
For an element C in the target web page, the sum of the areas of all its sub-elements is 10000 (unit: square pixels), so formula (1) is used: S(E)=0.8+(10000/10000)*0.003=0.803.
For an element D in the target web page, the sum of the areas of all its sub-elements is 40000 (unit: square pixels), so formula (1) is used: S(E)=0.8+(40000/10000)*0.003=0.812.
Comparing elements A, B, C and D shows that the larger the sum of the areas of the sub-elements, the higher the average occupation area score of the corresponding element, i.e., the higher its importance. However, once the sum of the areas of the sub-elements reaches the area reference value V, S(E) still increases as the sum increases, but much more slowly. For example, the sum of the areas of all the sub-elements in element C is 10000 and that in element D is 40000; the latter is four times the former, but the value of S(E) increases by only 0.009. This is because the area of the former is already large enough to attract the user's attention on the web page, so the latter should score higher than the former, but the two should not differ too much.
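Steps S10211 to S10217 can be sketched as a single Python function. The default parameter values are the ones given in this embodiment (V=10000, a1=0.8, b1=0.003, a2=0.5, b2=0.3); the function name `area_score` is a hypothetical choice.

```python
def area_score(child_area_sum: float, v: float = 10_000.0,
               a1: float = 0.8, b1: float = 0.003,
               a2: float = 0.5, b2: float = 0.3) -> float:
    """Return S(E) from the summed sub-element area, capped at 1."""
    ratio = child_area_sum / v
    if child_area_sum >= v:
        raw = a1 + ratio * b1   # formula (1): slow growth above the reference area
    else:
        raw = a2 + ratio * b2   # formula (2): faster growth below the reference area
    return min(raw, 1.0)        # steps S10215 to S10217
```

With these defaults, `area_score(8000)`, `area_score(10000)` and `area_score(40000)` come out at approximately 0.74, 0.803 and 0.812, in line with elements B, C and D, while a very large sum such as 1000000 is capped at 1.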
S1022, calculating the subelement number score of the element according to the number of subelements in the element.
Referring to fig. 4, step S1022 specifically includes steps S10221 to S10225.
S10221, acquiring three number intervals, namely a first interval, a second interval and a third interval whose numerical ranges increase in sequence;
wherein the first interval is (0, x), the second interval is [x, y), and the third interval is [y, +∞), where 0 < x < y. In this embodiment, x is, for example, 3, and y is, for example, 50. That is, the first interval is (0, 3), the second interval is [3, 50), and the third interval is [50, +∞).
S10222, acquiring the number of sub-elements in the element;
s10223, if the number of the sub-elements in the element is in the first interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a3*n; (3)
wherein N(E) is the sub-element number score of the element, a3 is a third set value, and n is the number of sub-elements in the element; in this embodiment, a3 is, for example, 0.1, i.e., N(E)=0.1*n.
S10224, if the number of the sub-elements in the element is in the second interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a4+(n-x)*[a5/(y-x)]; (4)
in this embodiment, a4 is, for example, 0.8, and a5 is, for example, 0.2, that is, N(E)=0.8+(n-3)*[0.2/(50-3)].
S10225, if the number of the sub-elements in the element is in the third interval, the sub-element number score of the element takes a fixed value.
In this embodiment, if the number of sub-elements in the element is in the third interval, the sub-element number score of the element takes a fixed value of 0.5.
It should be noted that if the number of sub-elements in an element is in the first interval, the number of sub-elements is too small and the element is likely not a required element. Formula (3), N(E)=0.1*n, is therefore used, which is equivalent to crediting each sub-element with 10% as loss compensation when the number of sub-elements is in the first interval.
If the number of sub-elements in the element is in the second interval, the importance of the element is higher, and the sub-element number score of the element is calculated using formula (4).
If the number of sub-elements in the element is in the third interval, the number of sub-elements is too large: the element may contain many meaningful sub-elements, but it may also be mixed with many unnecessary clutter elements. To balance these cases, in this embodiment the sub-element number score of the element takes a fixed value of 0.5 when the number of sub-elements is in the third interval; this fixed value can be adjusted according to the actual situation.
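The three-interval rule of steps S10221 to S10225 can be sketched as a piecewise function. The defaults follow this embodiment (x=3, y=50, a3=0.1, a4=0.8, a5=0.2, fixed value 0.5); the name `count_score` and the handling of n=0, which falls outside the open interval (0, x), are assumptions.

```python
def count_score(n: int, x: int = 3, y: int = 50,
                a3: float = 0.1, a4: float = 0.8, a5: float = 0.2,
                fixed: float = 0.5) -> float:
    """Return N(E) for an element with n sub-elements."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < x:
        # First interval (0, x); n == 0 is also scored with formula (3),
        # which gives 0 (an assumption, the text does not cover n == 0).
        return a3 * n
    if n < y:
        # Second interval [x, y): formula (4), rising linearly from a4.
        return a4 + (n - x) * (a5 / (y - x))
    # Third interval [y, +inf): fixed value.
    return fixed
```

With these defaults the score jumps from 0.2 at n=2 to 0.8 at n=3, climbs to just under 1 at n=49, and drops to 0.5 from n=50 onward.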
S1023, calculating the subelement area variance score of the element according to the area variance of the subelement in the element.
Referring to fig. 5, step S1023 specifically includes steps S10231 to S10234.
S10231, obtaining the area of each subelement in the element;
S10232, dividing the area of each subelement by the sum of the areas of the subelements to obtain an array;
S10233, calculating the variance of the array;
S10234, calculating the subelement area variance score of the element according to the variance.
For example, suppose an element in the target web page has three sub-elements whose areas are 10, 20 and 30 (unit: square pixels). Dividing the area of each sub-element by the sum of the areas of the sub-elements gives the array [1/6,1/3,1/2]. The variance d of the array [1/6,1/3,1/2] is then calculated. Finally, the subelement area variance score D(E) of the element is calculated using the following formula:
D(E)=F-d.
It should be noted that the smaller the variance, the more even the areas, and the more likely the corresponding element is the expected one, i.e. the more important it is. To keep the meaning consistent with the other functions (S(E), N(E), etc.), the variance d is subtracted from F. F is an empirical value, specifically a positive number; in this embodiment F is taken as 1, i.e. D(E)=1-d, and the value of F may vary according to the specific situation.
In addition, the variance reflects the degree of fluctuation of the data. Consider the two groups of data [10,20,30] and [100010, 100040, 100030]: the variance of the latter group is higher than that of the former, yet empirically the latter group better meets the requirement. Therefore, in this embodiment each value in a group is divided by the sum of the group before the variance is calculated; that is, adopting steps S10231 to S10234 makes the calculation more accurate.
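Steps S10231 to S10234 can be illustrated with the short Python sketch below. It is a sketch under stated assumptions rather than the definitive method: the function name is hypothetical, F defaults to 1 as in this embodiment, and the population variance is used because the text does not say whether the population or the sample variance is intended.

```python
from statistics import pvariance


def area_variance_score(areas, f=1.0):
    """Sub-element area variance score D(E) = F - d."""
    total = sum(areas)
    if total <= 0:
        raise ValueError("the sum of the sub-element areas must be positive")
    # S10232: divide each area by the sum of the areas.
    shares = [area / total for area in areas]
    # S10233: variance of the normalized array (assumption: population variance).
    d = pvariance(shares)
    # S10234: D(E) = F - d
    return f - d


if __name__ == "__main__":
    print(area_variance_score([10, 20, 30]))
    print(area_variance_score([100010, 100040, 100030]))
```

For the areas 10, 20 and 30 the normalized array is [1/6, 1/3, 1/2], whose population variance is 1/54, so the score is 53/54. The second group scores higher because, after normalization, its values are almost equal.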
S1024, calculating the root node distance score of the element according to the distance from the element to the root node.
In this embodiment, the root node distance score of the element is calculated specifically by adopting the following formula:
L(E)=1-e^(-Path(E));
Where L(E) is the root node distance score of the element, E represents the element being calculated, and Path(E) represents the distance from the element E to the root node; the unit is omitted.
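A minimal Python sketch of this formula is given below. The function name is hypothetical, and Path(E) is assumed here to be a non-negative number such as the depth of the element in the DOM tree; the text only says that it is the distance from the element to the root node.

```python
import math


def root_distance_score(path_length):
    """Root node distance score L(E) = 1 - e^(-Path(E))."""
    if path_length < 0:
        raise ValueError("the distance to the root node cannot be negative")
    return 1.0 - math.exp(-path_length)


if __name__ == "__main__":
    for depth in (0, 1, 3, 10):
        print(depth, round(root_distance_score(depth), 4))
```

The score is 0 for the root node itself and approaches 1 as the distance grows, so elements farther from the root are weighted more heavily.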
S1025, calculating the final score of the element according to the average occupation area score, the number score, the area variance score and the root node distance score of the element.
Specifically, the final score of the element is calculated using the following formula:
Score(E)=S(E)*N(E)*D(E)*L(E)
where Score(E) represents the final score of the element. According to the calculation of S(E), N(E), D(E) and L(E) in the above steps, Score(E) is a value in the interval [0,1]; the larger the value, the more important the corresponding element and the more likely the sub-elements in the element are the expected data. In practice, a corresponding final score Score(E) needs to be calculated for each element in the target web page.
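Since the final score is a plain product of the four partial scores, it can be sketched in a few lines of Python. The function name and the range check are illustrative assumptions, not part of the described method.

```python
def final_score(s, n, d, l):
    """Final score Score(E) = S(E) * N(E) * D(E) * L(E)."""
    for name, value in (("S(E)", s), ("N(E)", n), ("D(E)", d), ("L(E)", l)):
        # Assumption: each partial score is expected to lie in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is outside [0, 1]: {value}")
    return s * n * d * l


if __name__ == "__main__":
    print(final_score(1.0, 0.8, 0.5, 0.5))
```

Because the factors are multiplied, a single partial score close to 0 pulls the final score close to 0 regardless of the other three.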
S103, sorting the elements in descending order of final score, and obtaining a preset number of top-ranked elements.
In step S103, the elements are sorted in descending order of final score and the top-ranked elements are taken, for example the first 5 elements (i.e., the elements ranked 1st to 5th), so that the effect of automatically extracting the core elements from the target web page is achieved.
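Step S103 amounts to a descending sort followed by a cut-off, which might look like the following Python sketch. The function name, the dictionary layout and the element identifiers are hypothetical; the default of 5 is the example value given above.

```python
def top_elements(scores, preset_number=5):
    """Return the identifiers of the top-ranked elements.

    `scores` maps an element identifier to its final score Score(E).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [element_id for element_id, _ in ranked[:preset_number]]


if __name__ == "__main__":
    example_scores = {"nav": 0.10, "product-list": 0.72, "footer": 0.05, "sidebar": 0.31}
    print(top_elements(example_scores, preset_number=2))
```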
According to the method for extracting elements from a web page, influence factor information of each element in the target web page is acquired, the influence factor information including at least the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element, and the distance from the element to the root node, so that the effect of automatically extracting the core elements from the target web page is achieved.
It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation process of the embodiment of the present invention.
In an alternative embodiment, the result of the method for extracting elements from a web page may also be uploaded to a blockchain.
Specifically, corresponding summary information is obtained based on the result of the method for extracting elements from a web page; the summary information is obtained by hashing the result, for example using the sha256s algorithm. Uploading the summary information to the blockchain ensures its security as well as fairness and transparency for the user. The user may download the summary information from the blockchain to verify whether the result of the method for extracting elements from a web page has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, each of which contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
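How the summary information might be derived and later checked can be sketched with Python's standard `hashlib` module. This sketch rests on several assumptions: that "sha256s" refers to the SHA-256 hash function, that the extraction result is serialized as canonical JSON before hashing, and that the function names are hypothetical. The upload to and download from a blockchain are not shown, since the text does not name a platform.

```python
import hashlib
import hmac
import json


def result_digest(result):
    """Hash an extraction result (assumption: SHA-256 over canonical JSON)."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify_result(result, expected_digest):
    """Check a result against summary information retrieved earlier."""
    return hmac.compare_digest(result_digest(result), expected_digest)


if __name__ == "__main__":
    extraction = {"page": "https://example.com", "elements": ["product-list", "sidebar"]}
    digest = result_digest(extraction)
    print(digest)
    print(verify_result(extraction, digest))
```

Sorting the keys before serialization keeps the digest stable when the same result is produced with a different key order.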
In an embodiment, a device for extracting elements from a web page is provided, which corresponds one-to-one to the method for extracting elements from a web page in the above embodiment. As shown in fig. 6, the device 100 for extracting elements from a web page includes a first acquisition module 10, a score calculation module 20, and a second acquisition module 30. The functional modules are described in detail as follows:
The first obtaining module 10 is configured to obtain influence factor information of each element in the target web page, where the influence factor information includes at least an average occupation area of the element, a number of sub-elements in the element, an area variance of the sub-elements in the element, and a distance between the element and a root node;
The score calculating module 20 is configured to calculate an average footprint score of the element according to the average footprint of the element, calculate a subelement number score of the element according to the number of subelements in the element, calculate a subelement area variance score of the element according to the area variance of the subelement in the element, calculate a root node distance score of the element according to the distance from the element to the root node, and calculate a final score of the element according to the average footprint score, the subelement number score, the subelement area variance score and the root node distance score of the element;
The second obtaining module 30 is configured to sort the elements in descending order of final score, and obtain a preset number of top-ranked elements.
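The division into three modules can be pictured as a small pipeline. The Python sketch below is only one possible arrangement: the class name, field names and the shape of the factor dictionaries are hypothetical, and the acquisition and scoring logic are passed in as callables so that the structure stays independent of any particular browser or DOM library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ElementExtractionDevice:
    # Role of the first obtaining module 10: page -> {element id: influence factors}
    acquire: Callable[[Any], Dict[str, Dict[str, float]]]
    # Role of the score calculating module 20: influence factors -> final score
    score: Callable[[Dict[str, float]], float]
    preset_number: int = 5

    def run(self, page: Any) -> List[str]:
        factors = self.acquire(page)
        scores = {eid: self.score(info) for eid, info in factors.items()}
        # Role of the second obtaining module 30: sort descending, keep the top ones.
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[: self.preset_number]


if __name__ == "__main__":
    # Stub data standing in for a real page; the factor names are illustrative.
    page = {"list": {"s": 1.0, "n": 0.9}, "nav": {"s": 0.5, "n": 0.5}, "ad": {"s": 0.1, "n": 0.1}}
    device = ElementExtractionDevice(acquire=lambda p: p,
                                     score=lambda f: f["s"] * f["n"],
                                     preset_number=2)
    print(device.run(page))
```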
As shown in fig. 7, in this embodiment, the score calculating module 20 specifically includes:
a first calculation unit 21 for calculating an average footprint score of the element according to the average footprint of the element;
a second calculation unit 22 for calculating a subelement number score of the element according to the number of subelements in the element;
a third calculation unit 23 for calculating a subelement area variance score of the element according to the area variance of the subelements in the element;
a fourth calculation unit 24 for calculating a root node distance score of the element according to the distance from the element to the root node;
a fifth calculation unit 25 for calculating a final score of the element according to the average footprint score, the subelement number score, the subelement area variance score, and the root node distance score of the element.
In this embodiment, the first computing unit 21 is specifically configured to:
acquiring the area of each subelement in the element;
judging the size relation between the sum of the areas of all the subelements in the element and a preset area reference value;
if the sum of the areas of all the subelements in the element is larger than or equal to a preset area reference value, calculating the average occupation area score of the element by adopting the following formula:
S(E)=a1+(S/V)*b1;
wherein S (E) represents an average occupation area score of the element, S represents a sum of areas of all sub-elements in the element, V represents an area reference value, a1 represents a first set value, and b1 represents a first scale factor;
If the sum of the areas of all the subelements in the element is smaller than a preset area reference value, calculating the average occupation area score of the element by adopting the following formula:
S(E)=a2+(S/V)*b2;
wherein a2 represents a second set value, and b2 represents a second scaling factor;
Judging whether the calculated average area score of the elements is larger than 1;
if the calculated average area score of the element is larger than 1, taking 1 as the final average area score of the element;
If the calculated average area score of the element is less than or equal to 1, the actual calculation result is taken as the final average area score of the element.
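The behaviour of the first computing unit can be sketched in Python as below. The function name is hypothetical, and the set values a1, a2 and scale factors b1, b2 are deliberately left as required parameters because this passage does not give their values; the numbers in the usage lines are made-up placeholders for illustration only.

```python
def average_area_score(subelement_areas, v, a1, b1, a2, b2):
    """Average occupation area score S(E), capped at 1.

    v is the preset area reference value; a1, b1 apply when S >= V,
    a2, b2 apply when S < V.
    """
    if v <= 0:
        raise ValueError("the area reference value must be positive")
    s = sum(subelement_areas)
    if s >= v:
        raw = a1 + (s / v) * b1
    else:
        raw = a2 + (s / v) * b2
    # A calculated score larger than 1 is replaced by 1.
    return min(raw, 1.0)


if __name__ == "__main__":
    # Placeholder parameters, not values taken from the patent.
    params = dict(v=1000.0, a1=0.5, b1=0.1, a2=0.0, b2=0.5)
    print(average_area_score([300, 200], **params))
    print(average_area_score([3000, 3000], **params))
```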
In this embodiment, the second computing unit 22 is specifically configured to:
Obtaining three number intervals, namely a first interval, a second interval and a third interval whose numerical ranges increase in sequence, wherein the first interval is (0, x), the second interval is [x, y], the third interval is [y, +∞), and 0 < x < y;
Acquiring the number of sub-elements in the element;
if the number of the sub-elements in the element is in the first interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a3*n;
wherein N(E) is the sub-element number score of the element, a3 is a third set value, and n is the number of the sub-elements in the element;
If the number of the sub-elements in the element is in the second interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a4+(n-x)*[a5/(y-x)];
wherein a4 is a fourth set value, and a5 is a quantity coefficient;
And if the number of the sub-elements in the element is in the third interval, the sub-element number score of the element takes a fixed value.
In this embodiment, the third computing unit 23 is specifically configured to:
acquiring the area of each subelement in the element;
Dividing the area of each subelement by the sum of the areas of the subelements to obtain an array;
calculating the variance of the array;
And calculating the subelement area variance score of the element according to the variance, wherein the calculation formula is as follows:
D(E)=F-d;
wherein D(E) is the subelement area variance score of the element, d is the variance of the array, and F is the empirical value.
In this embodiment, the fourth calculating unit 24 is specifically configured to calculate the root node distance score of the element by using the following formula:
L(E)=1-e^(-Path(E));
Where L (E) is the root node distance score of the element, E represents the element being calculated, and Path (E) represents the distance of the element E to the root node.
In this embodiment, the fifth calculating unit 25 is specifically configured to calculate the final score of the element by using the following formula:
Score(E)=S(E)*N(E)*D(E)*L(E)
Where Score (E) represents the final Score of the element, S (E) represents the average footprint Score of the element, N (E) represents the number of sub-elements Score of the element, D (E) represents the sub-element area variance Score of the element, and L (E) represents the root node distance Score of the element.
According to the device for extracting elements from a web page, influence factor information of each element in the target web page is acquired, the influence factor information including at least the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element, and the distance from the element to the root node. The final score of each element is then calculated according to its influence factor information, the elements are sorted in descending order of final score, and a preset number of top-ranked elements are obtained. This achieves the effect of automatically extracting the core elements from the target web page, with higher efficiency than the traditional manual labeling and extraction method.
The terms "first" and "second" in the above modules/units merely distinguish different modules/units, and do not indicate that any module/unit has a higher priority or carry any other limiting meaning. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules that are expressly listed, but may include other steps or modules that are not expressly listed or that are inherent to such process, method, article, or apparatus; the division of modules described herein may also be implemented in other ways.
For specific limitations on the device for extracting elements from a web page, reference may be made to the above limitations on the method for extracting elements from a web page, which are not repeated here. The modules in the device for extracting elements from a web page may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, a processor in the computer device, or may be stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is for communicating with an external server via a network connection. The computer program is executed by a processor to implement a method of element extraction in a web page.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the steps of the element extraction method in the web page in the above embodiment, such as steps S101 to S103 shown in fig. 1 and other extensions of the method and related steps. Or the processor when executing the computer program implements the functions of the modules/units of the element extraction device in the web page in the above embodiments, such as the functions of the modules 10 to 30 shown in fig. 6. In order to avoid repetition, a description thereof is omitted.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor may implement various functions of the computer device by running or executing the computer program and/or modules stored in the memory, and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the cellular phone, etc.
The memory may be integrated in the processor or may be provided separately from the processor.
In one embodiment, a readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the steps of the method for extracting elements in a web page in the above embodiment, such as steps S101 to S103 shown in fig. 1, and other extensions of the method and extensions of related steps. Or the computer program when executed by a processor, implements the functions of the modules/units of the element extraction apparatus in the web page in the above embodiments, such as the functions of the modules 10 to 30 shown in fig. 6. In order to avoid repetition, a description thereof is omitted.
Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.
Claims (6)
1. A method for extracting elements from a web page, the method comprising:
obtaining influence factor information of each element in a target webpage, wherein the influence factor information at least comprises the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element and the distance from the element to a root node;
Calculating an average occupation area score of the element according to the average occupation area of the element, calculating a subelement number score of the element according to the number of subelements in the element, calculating a subelement area variance score of the element according to the area variance of the subelements in the element, calculating a root node distance score of the element according to the distance from the element to the root node, and calculating a final score of the element according to the average occupation area score, the subelement number score, the subelement area variance score and the root node distance score of the element;
sorting the elements in descending order of final score, and obtaining a preset number of top-ranked elements;
the step of calculating an average footprint score for the element from the average footprint of the element comprises:
acquiring the area of each subelement in the element;
judging the size relation between the sum of the areas of all the subelements in the element and a preset area reference value;
if the sum of the areas of all the subelements in the element is larger than or equal to a preset area reference value, calculating the average occupation area score of the element by adopting the following formula:
S(E)= a1 + (S/V) * b1;
wherein S (E) represents an average occupation area score of the element, S represents a sum of areas of all sub-elements in the element, V represents an area reference value, a1 represents a first set value, and b1 represents a first scale factor;
If the sum of the areas of all the subelements in the element is smaller than a preset area reference value, calculating the average occupation area score of the element by adopting the following formula:
S(E)= a2 + (S/V) * b2;
wherein a2 represents a second set value, and b2 represents a second scaling factor;
The step of calculating a subelement number score for an element based on the number of subelements in the element comprises:
Obtaining three number intervals, namely a first interval, a second interval and a third interval whose numerical ranges increase in sequence, wherein the first interval is (0, x), the second interval is [x, y], the third interval is [y, +∞), and 0 < x < y;
Acquiring the number of sub-elements in the element;
if the number of the sub-elements in the element is in the first interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)=a3*n;
wherein N(E) is the sub-element number score of the element, a3 is a third set value, and n is the number of the sub-elements in the element;
If the number of the sub-elements in the element is in the second interval, calculating a sub-element number score of the element by adopting the following formula:
N(E)= a4 + (n - x) * [a5/(y - x)];
wherein a4 is a fourth set value, and a5 is a quantity coefficient;
If the number of the sub-elements in the element is in the third interval, the sub-element number score of the element takes a fixed value;
The step of calculating the sub-element area variance score of the element according to the area variance of the sub-element of the element comprises the following steps:
acquiring the area of each subelement in the element;
Dividing the area of each subelement by the sum of the areas of the subelements to obtain an array;
calculating the variance of the array;
And calculating the subelement area variance score of the element according to the variance, wherein the calculation formula is as follows:
D(E)=F-d;
Wherein D(E) is the subelement area variance score of the element, d is the variance of the array, and F is the empirical value;
in the step of calculating the root node distance score of the element according to the distance between the element and the root node, the root node distance score of the element is calculated by adopting the following formula:
L(E)= 1 - e^(-Path(E));
Where L (E) is the root node distance score of the element, E represents the element being calculated, and Path (E) represents the distance of the element E to the root node.
2. The method for extracting elements from a web page according to claim 1, wherein after calculating the average occupation area score of the element, the method further comprises:
Judging whether the calculated average area score of the elements is larger than 1;
if the calculated average area score of the element is larger than 1, taking 1 as the final average area score of the element;
If the calculated average area score of the element is less than or equal to 1, the actual calculation result is taken as the final average area score of the element.
3. The method for extracting elements from a web page according to claim 1, wherein in the step of calculating the final score of the element based on the average occupation area score, the number of sub-elements score, the sub-element area variance score, and the root node distance score, the final score of the element is calculated using the following formula:
Score(E)= S(E) * N(E) * D(E) * L(E)
Where Score (E) represents the final Score of the element, S (E) represents the average footprint Score of the element, N (E) represents the number of sub-elements Score of the element, D (E) represents the sub-element area variance Score of the element, and L (E) represents the root node distance Score of the element.
4. A device for extracting elements from a web page, the device being configured to implement a method for extracting elements from a web page according to any one of claims 1 to 3, the device comprising:
the first acquisition module is used for acquiring influence factor information of each element in the target webpage, wherein the influence factor information at least comprises the average occupied area of the element, the number of sub-elements in the element, the area variance of the sub-elements in the element and the distance between the element and the root node;
The score calculating module is used for calculating the average occupation area score of the element according to the average occupation area of the element, calculating the subelement quantity score of the element according to the quantity of subelements in the element, calculating the subelement area variance score of the element according to the area variance of the subelement in the element, calculating the root node distance score of the element according to the distance from the element to the root node, and calculating the final score of the element according to the average occupation area score, the subelement quantity score, the subelement area variance score and the root node distance score of the element;
And the second acquisition module is used for sorting the elements in descending order of final score and obtaining a preset number of top-ranked elements.
5. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the element extraction method in a web page as claimed in any one of claims 1 to 3.
6. A readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the method for extracting elements in a web page according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111437278.1A CN114139078B (en) | 2021-11-29 | 2021-11-29 | Method and device for extracting elements from web page, computer equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111437278.1A CN114139078B (en) | 2021-11-29 | 2021-11-29 | Method and device for extracting elements from web page, computer equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114139078A CN114139078A (en) | 2022-03-04 |
CN114139078B true CN114139078B (en) | 2024-05-24 |
Family
ID=80389248
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111437278.1A Active CN114139078B (en) | 2021-11-29 | 2021-11-29 | Method and device for extracting elements from web page, computer equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114139078B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605783A (en) * | 2013-11-29 | 2014-02-26 | 优视科技有限公司 | Webpage display method and device |
CN104794118A (en) * | 2014-01-17 | 2015-07-22 | 腾讯科技(深圳)有限公司 | Webpage information processing method, device and system |
CN109783355A (en) * | 2018-12-14 | 2019-05-21 | 深圳壹账通智能科技有限公司 | Page elements acquisition methods, system, computer equipment and readable storage medium storing program for executing |
Also Published As
Publication number | Publication date |
---|---|
CN114139078A (en) | 2022-03-04 |
Similar Documents
Publication | Title |
---|---|
WO2019223154A1 (en) | Single-page high-load image recognition method, device, computer apparatus, and storage medium |
WO2019148669A1 (en) | Method and apparatus for generating machine learning model, computer device, and storage medium |
CN109242002A (en) | High dimensional data classification method, device and terminal device |
CN110399487B (en) | Text classification method and device, electronic equipment and storage medium |
CN109376318A (en) | Page loading method, computer readable storage medium and terminal device |
JP5717921B2 (en) | System and method for recommending fonts |
CN106855952A (en) | Computing method and device based on neural network |
CN110825977A (en) | Data recommendation method and related equipment |
CN114359563A (en) | Model training method and device, computer equipment and storage medium |
CN112232933A (en) | House source information recommendation method, device, equipment and readable storage medium |
CN110085292B (en) | Medicine recommendation method and device and computer-readable storage medium |
CN112561644B (en) | Commodity recommendation method and device based on link prediction and related equipment |
CN112667754B (en) | Big data processing method and device, computer equipment and storage medium |
CN114139078B (en) | Method and device for extracting elements from web page, computer equipment and readable storage medium |
CN106649748B (en) | Information recommendation method and device |
CN112801134A (en) | Gesture recognition model training and distributing method and device based on block chain and image |
CN111914199B (en) | Page element filtering method, device, equipment and storage medium |
CN104899232A (en) | Cooperative clustering method and cooperative clustering equipment |
WO2019091443A1 (en) | Neural network-based adjustment method, apparatus and device |
CN116596617A (en) | Insurance product cross recommendation method and device, computer equipment and storage medium |
DE112015004968T5 (en) | SYSTEM AND METHOD FOR RECOMMENDING A PACKAGE OF ELEMENTS BASED ON ELEMENT / USER TAGS AND CO INSTALLATION GRAPH |
CN117009621A (en) | Information searching method, device, electronic equipment, storage medium and program product |
CN113360744B (en) | Media content recommendation method, device, computer equipment and storage medium |
CN112259239A (en) | Parameter processing method and device, electronic equipment and storage medium |
CN110264306B (en) | Big data-based product recommendation method, device, server and medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |