CN113065600A - Page element classification method, parser, medium and device

Info

Publication number: CN113065600A
Application number: CN202110378864.7A
Authority: CN
Inventors: 游海涛; 梁兴通; 王琳; 杨丰佳
Original assignee: Ylz Information Technology Co ltd
Current assignee: Ylz Information Technology Co ltd
Priority date: 2021-04-08
Filing date: 2021-04-08
Publication date: 2021-07-02

Abstract

The invention relates to the technical field of page element classification, and in particular to a page element classification method, parser, medium and device. The page element classification method comprises: extracting and classifying known page elements, and extracting element features of the classified known page elements; establishing a logistic regression model and performing classification training on it; and classifying the page elements on a page file according to the trained logistic regression model. Because the known page elements are classified and their element features are extracted first, the logistic regression model can be conveniently trained for classification, and the trained model is then used to classify the page elements on the page file. This improves the accuracy of page element classification and brings the classification closer to the essential characteristics of the page elements, so that developers can obtain more reasonable classification results during upgrading and transformation without reading source code, which reduces labor cost.

Description

Page element classification method, parser, medium and device

Technical Field

The invention relates to the technical field of page element classification, and in particular to a page element classification method, parser, medium and device.

Background

With the popularization of computer technology, people's lives have gradually entered the intelligent era. Beyond computers, mobile phones and tablets, intelligent technology such as smart televisions, intelligent navigation and smart homes is being applied to every aspect of daily life, including clothing, food, housing and transportation, and provides convenient and fast services. For example, intelligent voice interaction is a new generation of interaction mode based on voice input, in which feedback results can be obtained by speaking.

Internet products are complex and varied, and their page designs differ even more. Text and pictures are the two most basic elements of a web page: the text carries the content of the page, and the pictures determine its appearance. Web page elements also include animation, music, programs and the like. By extracting and classifying page elements, the interaction behavior of users can be analyzed to help subsequent optimization of products and operations. For example, Chinese patent application publication No. CN111310044A discloses a method, an apparatus, a device and a storage medium for extracting page element information, but it does not mention how the page elements are classified. Page element classification in the prior art is mainly based on HTML tag classification, classification by tag name, or page rendering by a page parser.

HTML tag-based classification comprises element categories such as block-level elements and inline elements, but this scheme is only suitable for layout and structure construction and cannot be used for designing precise operation schemes. Tag names alone cannot truly and comprehensively reflect the operation logic contained in the elements and easily cause ambiguity. The page parser is likewise only suitable for fixed presentation scenarios and cannot produce an effective classification. If page elements are to be operated on in a unified and standardized manner, the elements on the page file have to be labeled manually one by one, which requires a large amount of labor and hinders upgrading and transformation.

Disclosure of Invention

In order to overcome the low efficiency of manually labeling the elements on a page file one by one in the prior art, the invention provides a page element classification method that improves the accuracy of page element classification and reduces labor cost.

The invention provides a page element classification method, which comprises the following steps:

S100: extracting and classifying known page elements, and extracting element features of the classified known page elements;

S200: establishing a logistic regression model, and performing classification training on the logistic regression model;

S300: classifying the page elements on the page file according to the trained logistic regression model.

Further, the known page elements are classified according to their functional characteristics, wherein the classified known page elements include but are not limited to presentation elements, operable elements, list elements or external elements;

the element type is determined by identifying the influence factors contained in the element features and comparing the content, order or proportion of the influence factors, and the element features are extracted according to the element type, wherein the element features comprise but are not limited to tags, structures, naming habits or attribute events.

Further, the logistic regression model is established based on the logistic distribution function

$$F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

where μ is a location parameter and γ > 0 is a shape parameter.

Further, feature encoding is performed with one-hot encoding to extract the element features, wherein when the influence factors corresponding to the element features are known influence factors, the element features form classification samples; the logistic regression model is trained for classification with the classification samples, and a decision boundary is fitted to establish the relation between the decision boundary and the classification probability, so that the logistic regression model yields the classification probability of the page elements.

Further, when the influence factor corresponding to an element feature is an unknown influence factor, feature screening is performed by using randomized logistic regression, a stability selection method; the screened supplementary element features are then added to the logistic regression model, and the element features and the corresponding influence factors in the logistic regression model are corrected by back-propagation.

Further, classifying the page elements on the page file includes the following steps:

S301: extracting the page elements on the page file;

S302: inputting the extracted page elements to the trained logistic regression model;

S303: outputting, by the logistic regression model, the classified page element groups.

Further, in step S301, XPath fuzzy lookup is used on the page file and the DOM nodes are parsed layer by layer starting from the document object, so as to extract the page elements.
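The extraction step can be illustrated with a minimal sketch. It assumes a small well-formed page so that Python's standard `xml.etree.ElementTree` can parse it, walks the DOM breadth-first (layer by layer) from the root, and imitates an XPath fuzzy lookup such as `contains(@class, 'btn')` with a plain substring test. The sample page, the function name `extract_elements` and the `class_fragment` parameter are illustrative assumptions, not part of the original disclosure.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical sample page (well-formed so the stdlib XML parser accepts it).
PAGE = """
<html><body>
  <div class="toolbar">
    <button class="btn-save" onclick="save()">Save</button>
    <span class="title">Orders</span>
  </div>
  <ul class="order-list"><li>A</li><li>B</li></ul>
</body></html>
"""

def extract_elements(markup, class_fragment=None):
    """Walk the DOM breadth-first (layer by layer) from the document root.

    If class_fragment is given, keep only nodes whose class attribute
    contains it, which imitates XPath's contains(@class, '...').
    """
    root = ET.fromstring(markup.strip())
    queue = deque([(root, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        css_class = node.get("class", "")
        if class_fragment is None or class_fragment in css_class:
            found.append((depth, node.tag, dict(node.attrib)))
        for child in node:
            queue.append((child, depth + 1))
    return found

all_nodes = extract_elements(PAGE)                        # every element, layer by layer
buttons = extract_elements(PAGE, class_fragment="btn")    # fuzzy match on class
```

Real pages are rarely well-formed XML, so a production parser would more likely rely on an HTML-aware library with full XPath support; the traversal order and the fuzzy match are the points of the sketch.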

The invention also provides a page element classification parser, which comprises:

the element extraction module is used for extracting and classifying known page elements and extracting element features of the classified known page elements;

the modeling training module is used for establishing a logistic regression model and carrying out classification training on the logistic regression model;

and the element classification module is used for classifying the page elements on the page file according to the trained logistic regression model.

The invention also provides a computer readable storage medium storing computer instructions which, when executed by a processor, implement a page element classification method as described in any one of the above.

The present invention also provides a computer device comprising at least one processor, and a memory communicatively coupled to the processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to cause the processor to perform a method of page element classification as described in any one of the above.

Compared with the prior art, the page element classification method provided by the invention has the advantages that the known page elements are classified and the element characteristics are extracted, so that classification samples are formed to perform classification training on the logistic regression model, and the trained logistic regression model is used for classifying the page elements on the page file; the accuracy of page element classification is improved, classification is closer to essential characteristics of page elements, and therefore developers can obtain more reasonable classification results without reading source codes in upgrading and transformation, and labor cost is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart of a page element classification method provided by the present invention;

FIG. 2 is a functional diagram of a logistic regression model provided by the present invention;

FIG. 3 is a flowchart of the classification of document page elements provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. Furthermore, the technical features designed in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in fig. 1, the page element classification method provided by the present invention includes the following steps:

step 1, extracting and classifying known page elements, and extracting element features of the classified known page elements; step 2, establishing a logistic regression model, and performing classification training on the logistic regression model; and step 3, classifying the page elements on the page file according to the trained logistic regression model.

Step 1, extracting and classifying known page elements, and extracting element features of the classified known page elements.

In specific implementation, as shown in fig. 1, known page elements are first extracted and classified. In this embodiment, the known page elements are extracted by using XPath fuzzy lookup and parsing the DOM nodes layer by layer starting from the document object. After extraction, the known page elements can be classified according to their functional characteristics; the classified known page elements include, but are not limited to, presentation elements, operable elements, list elements or external elements.

A presentation element may be an element on the page that cannot be operated, such as a picture, document text or an icon used only for display. An operable element may be an element that triggers the logic method it points to in response to a user operation such as clicking, checking or sliding. A list element may be an element with embedded, indexed sub-elements; ordered lists, unordered lists and the options in a drop-down selection box all belong to list elements. An external element may be an element that is not actually displayed in the page body and typically controls special operations of the whole application, such as return or close.

Element features are then extracted from the classified known page elements. During extraction, the influence factors contained in the element features are identified, and their content, order, proportion and the like are compared to determine the element type; the element features are then extracted according to the element type. The element type may be a presentation element, an operable element, a list element or an external element. The element features may be extracted by feature encoding with one-hot encoding, and include but are not limited to tags, structures, naming habits or attribute events.

Specifically, regarding tags in the element features: an element containing influence factors such as <ul> or <ol> can be regarded as a list element, and an element containing text tags such as <span>, <div>, <h1> or <p> can usually be regarded as a presentation element; a tag to which an onclick attribute is added becomes a link element, which is one kind of operable element.

Therefore, it is necessary to identify the influence factors contained in the element features, compare their content, order, proportion and the like to determine the element type, and extract the element features according to the element type. Tags in the element features can serve as sufficient but not necessary conditions for element classification; they are not all listed here, and their relative weights are learned through model training.

Similarly, regarding structure in the element features: apart from the tag, the document context in which the page element is located and the embedded structure of the page element can usually distinguish whether the element is a list element. An important feature of a list element is that it contains a series of indexed sub-elements at the same level.

Regarding naming habits in the element features: through semantic analysis of the common operations in self-built application pages, some commonly used Chinese text can be found in various page elements. For example, buttons are usually labeled with words such as 'confirm', 'save' or 'OK', so the classification of page elements can be determined by searching the document for key Chinese characters.

Attribute event features attached inside the element tag can be regarded as empowering the element, and the same element may belong to different categories if it carries different attribute event features. For example, an ordinary presentation element that carries a contextmenu may be regarded as an operable element, and an element with events such as onclick or onchange may also be regarded as an operable element.
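The four feature groups described above (tag, structure, naming habit, attribute event) can be sketched as a small extractor. The keyword and event lists only repeat the examples given in the text, English keywords stand in for the Chinese ones, and the helper names `element_features` and `rule_of_thumb` are hypothetical. The hand-written rule at the end merely restates the cues from the description; in the disclosed method the weighting of these cues is learned by the model rather than hard-coded.

```python
import xml.etree.ElementTree as ET

EVENT_ATTRS = ("onclick", "onchange", "contextmenu")   # events named in the text
KEYWORDS = ("confirm", "save", "ok")                    # button words named in the text
LIST_TAGS = ("ul", "ol")
TEXT_TAGS = ("span", "div", "h1", "p")

def element_features(node):
    """Return [tag, structure, naming habit, attribute event] for one element."""
    tag = node.tag.lower()
    structure = "with substructure" if len(node) > 0 else "no substructure"
    words = (node.text or "").strip().lower().split()
    naming = next((k for k in KEYWORDS if k in words), "")
    attrs = {name.lower() for name in node.attrib}
    event = next((a for a in EVENT_ATTRS if a in attrs), "")
    return [tag, structure, naming, event]

def rule_of_thumb(features):
    """Illustrative baseline only: events and button words outrank the tag."""
    tag, structure, naming, event = features
    if event or naming:
        return "operable"
    if tag in LIST_TAGS:
        return "list"
    if tag in TEXT_TAGS:
        return "presentation"
    return "unknown"

button = ET.fromstring('<button onclick="submit()">Confirm</button>')
menu_span = ET.fromstring('<span contextmenu="m">Menu</span>')
button_features = element_features(button)
span_category = rule_of_thumb(element_features(menu_span))
```

The span example shows the point made in the text: the same tag changes category once it carries an attribute event.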

Step 2, establishing a logistic regression model, and performing classification training on the logistic regression model.

In specific implementation, as shown in figs. 1 and 2, the logistic regression model is established based on the logistic distribution function. The logistic distribution is a continuous distribution defined by its location and scale parameters; its shape is similar to that of the normal distribution, but its tails are longer and its peak is higher. The logistic distribution is therefore used for modeling in this embodiment. The logistic distribution function is

$$F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

where μ is a location parameter and γ > 0 is a shape parameter.
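As a small numerical illustration, the sketch below evaluates the logistic distribution function in its standard form F(x) = 1 / (1 + e^(-(x - μ)/γ)), which is assumed here because the formula itself is not reproduced in the text. The function name `logistic_cdf` is hypothetical, and the two-branch evaluation is only a common trick to avoid overflow for large negative arguments.

```python
import math

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Logistic distribution function with location mu and shape gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    z = (x - mu) / gamma
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equal form to avoid overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)

at_location = logistic_cdf(2.0, mu=2.0, gamma=3.0)   # value at x = mu
far_left = logistic_cdf(-1000.0)                     # deep in the left tail
far_right = logistic_cdf(1000.0)                     # deep in the right tail
```

The curve is symmetric about μ, where it equals one half, and a larger γ stretches it horizontally.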

In general, the page element data set may contain hundreds of thousands of element features. In this embodiment, feature encoding is performed with one-hot encoding to extract the element features, and the element features are categorical values; it is therefore necessary to select the element features that significantly affect the classification result for further modeling and training.

The influence factors corresponding to the element features that significantly affect the classification result are known influence factors, and these element features form the classification samples. The logistic regression model is trained for classification with the classification samples, and a decision boundary is then fitted to establish the relation between the decision boundary and the classification probability, so that the logistic regression model yields the classification probability of the page elements.
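The training step can be sketched in plain Python. The source does not say how the four categories are handled by the logistic regression model, so this sketch assumes a multinomial (softmax) formulation trained by stochastic gradient descent on the cross-entropy loss. The four binary input features, the toy samples and the hyperparameters are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, biases, x):
    """Classification probability of one feature vector for every class."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)

def train(samples, labels, n_classes, lr=0.5, epochs=500):
    """Fit softmax regression by stochastic gradient descent."""
    n_features = len(samples[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = predict_proba(weights, biases, x)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == y else 0.0)  # gradient of the loss
                for j in range(n_features):
                    weights[k][j] -= lr * err * x[j]
                biases[k] -= lr * err
    return weights, biases

# Hypothetical features: [has event, is list tag, is text tag, has return/close word]
samples = [[1, 0, 0, 0], [1, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
labels = [1, 1, 0, 2, 3]   # 0 presentation, 1 operable, 2 list, 3 external

weights, biases = train(samples, labels, n_classes=4)
predicted = [max(range(4), key=predict_proba(weights, biases, x).__getitem__)
             for x in samples]
```

The fitted weights define linear decision boundaries between the categories, and `predict_proba` gives the classification probability on either side of them.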

In the embodiment, obvious and predictable element characteristics such as tags, structures, naming habits or attribute events are selected as classification bases to classify known page elements; the element features are extracted using one-hot encoding as follows,

tags: ["div", "span", "a", "button", ... N],

structures: ["no substructure", "with substructure", "is substructure", "not substructure"],

naming habits: ["confirm", "cancel", "xx list", ... N],

attribute events: ["onClick", "onShow", "onBlur", "contextmenu", ... N],

wherein Y0 is a presentation element, Y1 is an operable element, Y2 is a list element, and Y3 is an external element;

This gives the following samples:

Y1 (<i class="reduce" onclick="countminus(index)">decrease): ["i", "no substructure", "decrease", "onclick"] is denoted as [10000..N100010000..N,100000N];

Y1 (<span class="underfold" @click="openscrolldialog(index)">): ["span", "no substructure", "expanded", "onclick"] is denoted as [00010..N10000000.1..N,100000N];

Y0 (<img v-if="key == 'sign.bs'" src='./assets/images/xuetang.png'/>): ["img", "no substructure", "", ""] is denoted as [01000..N100001000..N,0100000N];

Y2 (<div v-for="(item, index) in eatData" :key="index" class="cars">): ["div", "with substructure", "food list", "v-for"] is denoted as [00010..N010000000.1..N,000010N].
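The one-hot encoding of such samples can be sketched as follows. The vocabularies are hypothetical, shortened lists that cover only the samples above, so the resulting vectors will not match the positions of the (partly elided) vectors in the text; the names `one_hot` and `encode` are illustrative.

```python
# Hypothetical, shortened vocabularies; the real lists run to N entries.
TAGS = ["div", "span", "a", "button", "i", "img"]
STRUCTURES = ["no substructure", "with substructure",
              "is substructure", "not substructure"]
NAMES = ["confirm", "cancel", "decrease", "expanded", "food list"]
EVENTS = ["onclick", "onshow", "onblur", "contextmenu", "v-for"]

def one_hot(value, vocabulary):
    """1 at the position of value, 0 elsewhere; all zeros if value is absent."""
    return [1 if value == item else 0 for item in vocabulary]

def encode(features):
    """Concatenate the one-hot codes of the four feature groups."""
    tag, structure, name, event = features
    return (one_hot(tag, TAGS) + one_hot(structure, STRUCTURES)
            + one_hot(name, NAMES) + one_hot(event, EVENTS))

decrease_button = encode(["i", "no substructure", "decrease", "onclick"])
plain_image = encode(["img", "no substructure", "", ""])
```

An empty naming or event field simply leaves its whole group at zero, which is how the image sample with no text and no event is represented here.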

Preferably, each element feature contains various different influence factors. When the influence factor corresponding to an element feature is an unknown influence factor, the F value and P value of each original feature are obtained through an F test (the f_regression test, abbreviated as F test), so that the variables can be screened; that is, the element features with large F values and small P values are selected.

The screened supplementary element features are added to the logistic regression model, and when the logistic regression model is trained for classification, the element features and the corresponding influence factors in the model are corrected by back-propagation, which improves the classification accuracy of the logistic regression model.
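The screening idea can be sketched with a univariate F statistic computed per feature column. The sketch assumes a binary target (for example, operable versus not operable) and uses the usual relation F = r² / (1 - r²) · (n - 2), where r is the Pearson correlation between the column and the target. P values are left out because they need the F distribution's cumulative function, which Python's standard library does not provide. The data and the names `f_statistic` and `screen_features` are hypothetical.

```python
import math

def f_statistic(column, target):
    """Univariate F statistic between one feature column and the target."""
    n = len(column)
    mean_x = sum(column) / n
    mean_y = sum(target) / n
    sxx = sum((x - mean_x) ** 2 for x in column)
    syy = sum((y - mean_y) ** 2 for y in target)
    if sxx == 0 or syy == 0:
        return 0.0                      # a constant column carries no signal
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(column, target))
    r2 = (sxy * sxy) / (sxx * syy)      # squared Pearson correlation
    if r2 >= 1.0 - 1e-12:
        return math.inf                 # perfectly correlated column
    return r2 / (1.0 - r2) * (n - 2)

def screen_features(rows, target, top_k):
    """Return the indices of the top_k columns by F statistic, plus all scores."""
    n_features = len(rows[0])
    scores = [f_statistic([row[j] for row in rows], target)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k], scores

# Hypothetical one-hot columns: column 0 tracks the target, column 1 is constant.
rows = [[1, 1, 1], [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 1, 0]]
target = [1, 1, 1, 0, 0, 0]
selected, scores = screen_features(rows, target, top_k=2)
```

Columns with a large F statistic (and correspondingly small P value) would be kept as supplementary element features before the model is retrained.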

Step 3, classifying the page elements on the page file according to the trained logistic regression model.

In specific implementation, as shown in figs. 1 and 3, once the logistic regression model has been trained, the page elements on the page file are classified as follows. First, the page elements are extracted from the page file; specifically, XPath fuzzy lookup is used on the page file and the DOM nodes are parsed layer by layer starting from the document object.

The extracted page elements are then input to the trained logistic regression model, and finally the logistic regression model outputs the classified page element groups.
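The final step, turning per-element probabilities into classified page element groups, can be sketched as below. The `toy_model` function is a hypothetical stand-in for a trained logistic regression model and returns hand-picked probabilities; the element names and two-value feature vectors are likewise illustrative.

```python
from collections import defaultdict

CATEGORIES = ["presentation", "operable", "list", "external"]   # Y0..Y3

def group_by_category(elements, predict_proba):
    """Assign each element to its most probable category and group the names."""
    groups = defaultdict(list)
    for name, vector in elements:
        probs = predict_proba(vector)
        best = max(range(len(probs)), key=probs.__getitem__)
        groups[CATEGORIES[best]].append(name)
    return dict(groups)

def toy_model(vector):
    """Hypothetical stand-in for the trained model: [has event, is list tag]."""
    has_event, is_list = vector
    if has_event:
        return [0.10, 0.80, 0.05, 0.05]
    if is_list:
        return [0.10, 0.10, 0.70, 0.10]
    return [0.70, 0.10, 0.10, 0.10]

elements = [("save-button", [1, 0]), ("order-list", [0, 1]), ("logo", [0, 0])]
groups = group_by_category(elements, toy_model)
```

Taking the most probable category is one simple choice; a threshold on the probability could instead route uncertain elements to manual review.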

The invention also provides a page element classification parser comprising an element extraction module, a modeling training module and an element classification module. These modules implement the page element classification method described above; the implementation principle and technical effect are similar and are not repeated here.

In specific implementation, the element extraction module is used for extracting and classifying known page elements and extracting element features of the classified known page elements; the modeling training module is used for establishing a logistic regression model and carrying out classification training on the logistic regression model; and the element classification module is used for classifying the page elements on the page file according to the trained logistic regression model.

The present invention also provides a computer readable storage medium storing computer instructions, which when executed by a processor implement a page element classification method as described in any one of the above.

In this embodiment, the computer-readable storage medium is a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the computer-readable storage medium may also include a combination of the above kinds of memory.

In specific implementation, in this embodiment, there may be one or more processors, and the processor may be a central processing unit (CPU). The processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof. A general-purpose processor may be a microprocessor or any conventional processor.

The memory may be communicatively coupled to the processors via a bus or otherwise, the memory storing instructions executable by the at least one processor to cause the processor to perform a page element classification method as described in any of the above embodiments.

Compared with the prior art, the page element classification method, parser, medium and device provided by the invention classify known page elements and extract their element features to form classification samples for classification training of the logistic regression model, and then use the trained logistic regression model to classify the page elements on the page file. This improves the accuracy of page element classification and brings the classification closer to the essential characteristics of the page elements, so that developers can obtain more reasonable classification results during upgrading and transformation without reading source code, which reduces labor cost.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

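1. A page element classification method, characterized by comprising the following steps:

S100: extracting and classifying known page elements, and extracting element features of the classified known page elements;

S200: establishing a logistic regression model, and performing classification training on the logistic regression model;

S300: classifying the page elements on the page file according to the trained logistic regression model.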

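2. The method for classifying page elements according to claim 1, wherein: the known page elements are classified according to their functional characteristics, wherein the classified known page elements comprise but are not limited to presentation elements, operable elements, list elements or external elements;

the element type is determined by identifying the influence factors contained in the element features and comparing the content, order or proportion of the influence factors, and the element features are extracted according to the element type, wherein the element features comprise but are not limited to tags, structures, naming habits or attribute events.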

3. The method for classifying page elements according to claim 2, wherein: the logistic regression model is established based on the logistic distribution function

$$F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

where μ is a location parameter and γ > 0 is a shape parameter.

4. The method for classifying page elements according to claim 3, wherein: feature encoding is performed with one-hot encoding to extract the element features, wherein when the influence factors corresponding to the element features are known influence factors, the element features form classification samples;

the logistic regression model is trained for classification with the classification samples, and a decision boundary is fitted to establish the relation between the decision boundary and the classification probability, so that the logistic regression model yields the classification probability of the page elements.

5. The method for classifying page elements according to claim 4, wherein: when the influence factor corresponding to an element feature is an unknown influence factor, feature screening is performed by using randomized logistic regression, a stability selection method; the screened supplementary element features are added to the logistic regression model, and the element features and the corresponding influence factors in the logistic regression model are corrected by back-propagation.

6. The method of claim 1, wherein classifying the page elements on the page file comprises:

S301: extracting the page elements on the page file;

S302: inputting the extracted page elements to the trained logistic regression model;

S303: outputting, by the logistic regression model, the classified page element groups.

7. The method for classifying page elements according to claim 6, wherein: in step S301, XPath fuzzy lookup is used on the page file and the DOM nodes are parsed layer by layer starting from the document object, so as to extract the page elements.

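8. A page element classification parser, comprising:

an element extraction module for extracting and classifying known page elements and extracting element features of the classified known page elements;

a modeling training module for establishing a logistic regression model and performing classification training on the logistic regression model;

and an element classification module for classifying the page elements on the page file according to the trained logistic regression model.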

9. A computer-readable storage medium characterized by: the computer-readable storage medium stores computer instructions which, when executed by a processor, implement a page element classification method according to any one of claims 1 to 7.

10. A computer device, characterized by: comprising at least one processor, and a memory communicatively coupled to the processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to cause the processor to perform a method of page element classification as claimed in any one of claims 1 to 7.