CN108446136B - Element code extraction method and system - Google Patents

Element code extraction method and system

Info

Publication number
CN108446136B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810239789.4A
Other languages
Chinese (zh)
Other versions
CN108446136A (en)
Inventor
沈科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sohu New Media Information Technology Co Ltd
Original Assignee
Beijing Jiaodian Xinganxian Information Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/74 Reverse engineering; Extracting design information from source code

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an element code extraction method, comprising: obtaining an element to be extracted from the document object model of a target website; parsing the target metadata contained in the element to be extracted; performing style deduplication on the target metadata according to a preset style deduplication algorithm to obtain target style data; and converting the target style data into extractable target code and extracting the target code. Because the method deduplicates the styles in the target metadata after obtaining it, it addresses the problem that existing one-key extraction of element code applies no processing during extraction, so that the extracted styles remain redundant, incomplete, and inaccurate in value and cannot serve as base code for secondary development.

Description

Element code extraction method and system
Technical Field
The invention relates to the technical field of Web front ends, in particular to a method and a system for extracting element codes.
Background
A front-end page element is part of a document that front-end engineers write in several computer languages (including, but not limited to, HTML, CSS, and JavaScript), following the visual and interaction design, and that is ultimately rendered and run in a browser. HTML determines the structure of the element, CSS determines its appearance, and JavaScript optionally controls it after it has loaded. CSS development is therefore the main work in the early stages of element development. Because cascading style sheets are complex, and because of engineering factors, element styles readily influence one another in complicated ways; as code is refactored and the project iterates, the source code of an element becomes increasingly difficult to reuse, which greatly slows software development. To make elements highly reusable, SnappySnippet, an extension to the Chrome browser developer tools, provides one-key extraction of the code of a selected element.
The inventor studied the extraction of element code and found that SnappySnippet performs one-key extraction with no processing during extraction, so the extracted styles remain redundant, incomplete, and inaccurate in value and cannot serve as base code for secondary development.
Disclosure of Invention
In view of this, the present invention provides an element code extraction method to solve the problem that, in the prior art, element code is extracted in a single one-key step with no processing during extraction, so that the extracted styles remain redundant, incomplete, and inaccurate in value and cannot serve as base code for secondary development. The specific scheme is as follows:
an element code extraction method, comprising:
acquiring elements to be extracted in a document object model of a target website;
analyzing target metadata contained in the element to be extracted;
carrying out style deduplication processing on the target metadata according to a preset style deduplication processing algorithm to obtain target style data;
and converting the target style data into an extractable target code, and extracting the target code.
In the foregoing method, optionally, analyzing the target metadata included in the element to be extracted includes:
acquiring each node contained in the element to be extracted and a tag attribute data set and a style sheet data set corresponding to each node;
comparing the data in the tag attribute data set and the style sheet data set with all necessary data in a preset necessary attribute and style database;
and deleting data which do not exist in the preset necessary attribute and style database in the tag attribute dataset and the style sheet dataset to obtain a first tag attribute dataset and a first style sheet dataset, wherein the first tag attribute dataset and the first style sheet dataset form target metadata contained in the element to be extracted.
Optionally, in the method, performing style deduplication on the target metadata according to a preset style deduplication processing algorithm includes:
filtering default style data contained in the target metadata to obtain first target style data;
restoring style attribute values in the first target style data that differ from the style attribute values defined by the webpage developer, to obtain second target style data;
counting and extracting the second target style data to obtain public style data;
and integrating the public style data and the second target style data to obtain target style data.
The above method, optionally, further includes:
judging whether the element to be extracted contains a pseudo element or not;
if yes, acquiring first style data of the pseudo element, and storing the first style data into the target style data.
The above method, optionally, further includes:
judging whether an external link file exists in the element to be extracted;
if so, analyzing second style data corresponding to the external link file, and storing the second style data into the target style data.
An extraction system of element codes, comprising:
the acquisition module is used for acquiring elements to be extracted in the document object model of the target website;
the analysis module is used for analyzing the target metadata contained in the element to be extracted;
the deduplication module is used for carrying out style deduplication processing on the target metadata according to a preset style deduplication processing algorithm to obtain target style data;
and the conversion module is used for converting the target style data into an extractable target code and extracting the target code.
In the above system, optionally, the parsing module includes:
the acquisition unit is used for acquiring each node contained in the element to be extracted and a tag attribute data set and a style sheet data set corresponding to each node;
the comparison unit is used for comparing the data in the tag attribute data set and the style sheet data set with all necessary data in a preset necessary attribute and style database;
and a deleting unit, configured to delete data that is not present in the preset necessary attribute and style database in the tag attribute data set and the style sheet data set, so as to obtain a first tag attribute data set and a first style sheet data set, where the first tag attribute data set and the first style sheet data set constitute target metadata included in the element to be extracted.
The above system, optionally, the deduplication module includes:
the filtering unit is used for filtering default style data contained in the target metadata to obtain first target style data;
the restoring unit is used for restoring style attribute values in the first target style data that differ from the style attribute values defined by the webpage developer, to obtain second target style data;
the extraction unit is used for counting and extracting the second target style data to obtain public style data;
and the integration unit is used for integrating the public style data and the second target style data to obtain target style data.
The above system, optionally, further includes:
the first judgment module is used for judging whether the element to be extracted contains a pseudo element;
and the first storage module is used for acquiring first style data of the pseudo element and storing the first style data into the target style data if so.
The above system, optionally, further includes:
the second judgment module is used for judging whether the element to be extracted has an external link file;
and the second storage module is used for analyzing second style data corresponding to the external link file and storing the second style data into the target style data if so.
Compared with the prior art, the invention has the following advantages:
The invention discloses an element code extraction method, comprising: obtaining an element to be extracted from the document object model of a target website; parsing the target metadata contained in the element to be extracted; performing style deduplication on the target metadata according to a preset style deduplication algorithm to obtain target style data; and converting the target style data into extractable target code and extracting the target code. Because the method deduplicates the styles in the target metadata after obtaining it, it addresses the problem that existing one-key extraction of element code applies no processing during extraction, so that the extracted styles remain redundant, incomplete, and inaccurate in value and cannot serve as base code for secondary development.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an element code extraction method disclosed in an embodiment of the present application;
Fig. 2 is a flowchart of another method for extracting element codes according to the embodiment of the present application;
Fig. 3 is a flowchart of another method for extracting element codes according to the embodiment of the present application;
Fig. 4 is a schematic diagram of an extraction method of an element code disclosed in an embodiment of the present application;
Fig. 5 is a block diagram of an extraction system of element codes according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses an element code extraction method. In the embodiments of the invention, an element refers specifically to an HTML element, and an element comprises at least one node. The method applies to extracting the code of elements in the Web front-end field. Web front-end technology centers on the website foreground: it controls the display and operation logic of a website at the browser end, and generates and manages all resource files that run in the user's browser, such as HTML document files, CSS files, and JavaScript script files. The method is executed by a processor, controller, parser, or similar component that implements the extraction method, and it is applied at the browser end. The execution flow of the extraction method is shown in Fig. 1 and comprises the following steps:
S101, obtaining the element to be extracted from the document object model of the target website;
In the embodiment of the present invention, the target website is the website on which extraction of the code of the element to be extracted is currently being performed. The Web front-end page of each website corresponds to a Document Object Model (DOM). The DOM is defined by the W3C standards organization and is the programming interface to an HTML/XML document implemented by the browser; everything from the whole HTML document down to an element attribute is an object in the DOM tree, and the term is generally used to refer to the corresponding element. The element to be extracted is obtained from the DOM tree.
S102, analyzing target metadata contained in the element to be extracted;
In the embodiment of the invention, the element to be extracted is analyzed to obtain its target metadata, which comprises the necessary attribute data and style data of each node in the element to be extracted.
S103, carrying out style deduplication processing on the target metadata according to a preset style deduplication processing algorithm to obtain target style data;
In the embodiment of the present invention, the target metadata may still include default styles, hidden styles, and styles duplicated across elements similar to the element to be extracted; the preset style deduplication processing algorithm removes them to obtain the target style data.
S104, converting the target style data into an extractable target code, and extracting the target code.
In the embodiment of the present invention, the element to be extracted is converted into the corresponding target code according to the target style data; preferably, the target code is written as HTML/CSS/SASS code.
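The conversion of step S104 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the target style data has already been reduced to a mapping from CSS selectors to declarations; the function name `to_css` and the sample selectors are hypothetical and do not come from the patent, which does not specify the serialization.

```python
def to_css(rules: dict) -> str:
    """Serialize {selector: {property: value}} into CSS text.

    Hypothetical helper: the patent only states that target style data is
    converted into HTML/CSS/SASS code, not how the text is produced.
    """
    blocks = []
    for selector, declarations in rules.items():
        # One "property: value;" line per declaration, in insertion order.
        body = "\n".join(f"  {prop}: {value};" for prop, value in declarations.items())
        blocks.append(f"{selector} {{\n{body}\n}}")
    return "\n\n".join(blocks)


# Illustrative target style data for an extracted element.
target_style_data = {
    ".card": {"color": "red", "padding": "4px"},
    ".card > img": {"width": "120px"},
}
css_text = to_css(target_style_data)
print(css_text)
```

Keeping the style data as plain mappings until this last step means the deduplication stages can work on data rather than on CSS text.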
The invention discloses an element code extraction method, comprising: obtaining an element to be extracted from the document object model of a target website; parsing the target metadata contained in the element to be extracted; performing style deduplication on the target metadata according to a preset style deduplication algorithm to obtain target style data; and converting the target style data into extractable target code and extracting the target code. Because the method deduplicates the styles in the target metadata after obtaining it, it addresses the problem that existing one-key extraction of element code applies no processing during extraction, so that the extracted styles remain redundant, incomplete, and inaccurate in value and cannot serve as base code for secondary development.
In the embodiment of the present invention, the element to be extracted further includes target structure data, which refers to the DOM tree of the element to be extracted. The DOM tree is accessed recursively with the target element as the root node, which is essentially a depth-first traversal. For accessing tags and attributes, the scheme uses a blacklist and whitelist mechanism that defines which rules are to be ignored and which are to be retained, thereby improving the quality of the metadata. For example, comments and scripts can be filtered out because they do not affect element structure or style. As another example, tag attributes are selectively retained according to the type of the node, and attributes outside the HTML standard are filtered out. In particular, SVG elements are not HTML elements, but they also expose the DOM interface and are often important to the presentation of a page; SVG elements are therefore handled compatibly, and parsing of SVG tags is supported.
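The traversal and the blacklist and whitelist mechanism can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: the DOM is modelled as nested dictionaries, and the rule sets `IGNORED_NODE_TYPES` and `ALLOWED_ATTRIBUTES` are hypothetical, since the patent does not list the actual rules.

```python
from typing import Optional

# Hypothetical rule sets; the patent does not enumerate its actual entries.
IGNORED_NODE_TYPES = {"comment", "script"}   # blacklist: nodes skipped entirely
ALLOWED_ATTRIBUTES = {                       # whitelist, keyed by tag name
    "*": {"id", "class", "title"},           # attributes kept on every tag
    "img": {"src", "alt", "width", "height"},
    "a": {"href", "target"},
}


def parse_node(node: dict) -> Optional[dict]:
    """Depth-first copy of a DOM-like tree that keeps only whitelisted data."""
    tag = node["tag"]
    if tag in IGNORED_NODE_TYPES:
        return None  # comments and scripts do not affect structure or style
    allowed = ALLOWED_ATTRIBUTES["*"] | ALLOWED_ATTRIBUTES.get(tag, set())
    attrs = {k: v for k, v in node.get("attrs", {}).items() if k in allowed}
    children = []
    for child in node.get("children", []):
        parsed = parse_node(child)  # recurse into the child before its siblings
        if parsed is not None:
            children.append(parsed)
    return {"tag": tag, "attrs": attrs, "children": children}


tree = {
    "tag": "div",
    "attrs": {"class": "card", "ng-click": "open()"},
    "children": [
        {"tag": "script", "attrs": {}, "children": []},
        {"tag": "img", "attrs": {"src": "/a.png", "onclick": "go()"}, "children": []},
    ],
}
metadata = parse_node(tree)
print(metadata)
```

The result has the same shape as the input tree, which mirrors the statement that parsing yields an object isomorphic to the DOM tree.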
In the embodiment of the present invention, an execution flow for analyzing the target metadata included in the element to be extracted is shown in fig. 2, and includes the steps of:
S201, acquiring each node contained in the element to be extracted and a tag attribute data set and a style sheet data set corresponding to each node;
In the embodiment of the invention, the DOM tree is the programming interface to an HTML/XML document implemented by the browser, and it is also intermediate data held in memory. Parsing the necessary data of a page document out of the DOM tree is a required step in simplifying the static content of the document. Taking the target element as the root node of a DOM tree, each node is accessed to parse its tag attributes and computed style, and its child nodes are accessed recursively. The result is a JavaScript object isomorphic to the DOM tree, from which the tag attribute data set and the style sheet data set of the element to be extracted are obtained. The tag attribute data set contains the structure data of the element to be extracted.
S202, comparing the data in the tag attribute data set and the style sheet data set with all necessary data in a preset necessary attribute and style database;
In the embodiment of the invention, a necessary attribute and style database is preset for each type of element; it stores the set of necessary attributes and styles that make up the target metadata.
S203, deleting data which do not exist in the preset necessary attribute and style database in the tag attribute data set and the style sheet data set to obtain a first tag attribute data set and a first style sheet data set, wherein the first tag attribute data set and the first style sheet data set form target metadata contained in the element to be extracted.
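Steps S202 and S203 amount to filtering each data set against a per-type lookup. The Python sketch below shows that filtering under stated assumptions: the `NECESSARY` table stands in for the preset necessary attribute and style database, and its contents, like the function name `build_target_metadata`, are hypothetical.

```python
# Hypothetical stand-in for the preset necessary attribute and style database,
# keyed by element type. The real contents are not given in the patent.
NECESSARY = {
    "img": {"attributes": {"src", "alt"}, "styles": {"width", "height", "display"}},
    "p": {"attributes": {"class"}, "styles": {"color", "font-size", "margin-top"}},
}


def build_target_metadata(tag: str, tag_attributes: dict, style_sheet: dict) -> dict:
    """Keep only the entries that appear in the database for this element type."""
    entry = NECESSARY.get(tag, {"attributes": set(), "styles": set()})
    first_attributes = {k: v for k, v in tag_attributes.items() if k in entry["attributes"]}
    first_styles = {k: v for k, v in style_sheet.items() if k in entry["styles"]}
    # Together the two filtered sets form the target metadata of the node.
    return {"attributes": first_attributes, "styles": first_styles}


meta = build_target_metadata(
    "img",
    {"src": "a.png", "data-id": "7"},
    {"width": "120px", "cursor": "auto"},
)
print(meta)
```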
In the embodiment of the present invention, the element code extraction method further includes: judging whether the element to be extracted contains a pseudo-element. The pseudo-elements concerned are mainly the ::before and ::after pseudo-elements. A pseudo-element can be regarded as a special node that is attached to an ordinary element and can have independent style rules. The pseudo-element styles of an element are identified and form part of the node metadata. If the element contains a pseudo-element, first style data of the pseudo-element is acquired and stored into the target style data.
In the embodiment of the present invention, the method for extracting an element further includes:
judging whether an external link file exists in the element to be extracted; if so, analyzing second style data corresponding to the external link file, and storing the second style data into the target style data.
Preferably, when the external link file is a custom font, the font is defined through the @font-face rule of CSS3. Custom fonts defined with @font-face are a technique commonly used in modern front ends and can be used to display personalized fonts and icons. These CSS rules are not present in the style rules of any element; they are defined inside a style tag or in a CSS file linked through a link tag. To parse the CSS rules of such external link files, the contents of all style sheets in the Web front-end page must be analyzed. Loading a CSS file into a page as a style sheet raises no cross-domain access problem, but to obtain the content of the CSS file the Web front-end page must request it over HTTP, and the resource files holding style data often belong to a different domain from the page. To avoid this problem, in the embodiment of the invention the server side parses the online files and returns the parsing result to the browser side.
All style sheets are traversed at the browser end, and all inline and externally linked style sheets are sent to the server end. The inline style sheet is obtained by analyzing the target style data obtained through S101-S104. After receiving a request, the server downloads the resource at the URL of each CSS file one by one, splices the results into one complete style sheet in style sheet order, and then parses the style rules. The server obtains the font rules from the style sheet by string matching, collects them, and returns them to the browser end, where the style sheet data corresponding to the custom font is added to the target style data.
Compared with searching the CSS rules for the rules that match the element to be extracted, the embodiment of the invention obtains the computed values of the complete style of the element directly through the computed style (getComputedStyle), which is more complete, accurate, and efficient. However, the computed style has two distinct disadvantages: the complete style of an element is far larger than what was actually declared, and the computed values of some style attributes do not match the declared values. To address these two disadvantages, the embodiment of the invention presets a style deduplication algorithm that deduplicates the target metadata of the element to be extracted to obtain the target style data. The specific execution flow of the deduplication is shown in Fig. 3 and comprises the following steps:
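The server-side splicing and string matching can be illustrated with a short Python sketch. It assumes the style sheet texts have already been downloaded, so no network access is shown, and it uses a regular expression as one plausible form of the string matching mentioned; the name `collect_font_rules` and the pattern are illustrative assumptions.

```python
import re

# Matches one @font-face block. It relies on @font-face bodies containing no
# nested braces, which holds for ordinary style sheets.
FONT_FACE = re.compile(r"@font-face\s*\{[^}]*\}", re.IGNORECASE)


def collect_font_rules(style_sheets: list) -> list:
    """Splice style sheets in order and return every @font-face rule found."""
    combined = "\n".join(style_sheets)  # preserve the original style sheet order
    return FONT_FACE.findall(combined)


# Illustrative style sheet contents (already downloaded by the server).
sheets = [
    "body{margin:0}@font-face{font-family:'Icons';src:url(a.woff)}",
    ".x{color:red}\n@font-face {\n  font-family: 'B';\n}",
]
font_rules = collect_font_rules(sheets)
print(font_rules)
```

A production parser would also need to handle comments and braces inside quoted URLs, which this pattern does not attempt.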
S301, filtering default style data contained in the target metadata to obtain first target style data;
In the embodiment of the present invention, the style sheet data in the target metadata may contain default styles. These are not styles declared by the website developer, so they have no value worth retaining in the metadata and need to be identified and filtered out. The embodiment uses an iframe to create a style-isolated environment, creates a tag of the same type inside the iframe, obtains the style data of that tag, and compares it with the style data of the original element, which effectively identifies every style attribute that holds its default value.
However, a style holding its default value is not a sufficient condition for the style being undeclared, because some styles are inheritable. If a style is not inheritable, its declaration may be ignored when its value equals the default value; if a style is inheritable, its declaration may be ignored when its value equals the value of the corresponding style attribute of the parent element.
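The two-branch rule for default values can be written as a small Python sketch. The inputs are assumed to be plain mappings: `computed` for the element, `defaults` for a same-type tag measured in the isolated iframe, and `parent_computed` for the parent element. The `INHERITED` set is a deliberately partial list of inheritable CSS properties, and the function name is hypothetical.

```python
# Partial list of inheritable CSS properties, for illustration only.
INHERITED = {"color", "font-size", "font-family", "line-height"}


def filter_default_styles(computed: dict, defaults: dict, parent_computed: dict) -> dict:
    """Drop declarations whose value is explained by a default or by inheritance."""
    kept = {}
    for prop, value in computed.items():
        if prop in INHERITED:
            # Inheritable: compare with the parent, falling back to the default
            # when the parent has no value for this property.
            baseline = parent_computed.get(prop, defaults.get(prop))
        else:
            # Not inheritable: compare with the default of a same-type tag.
            baseline = defaults.get(prop)
        if value != baseline:
            kept[prop] = value
    return kept


computed = {"color": "rgb(51, 51, 51)", "display": "block", "margin-top": "8px", "font-size": "14px"}
defaults = {"color": "rgb(0, 0, 0)", "display": "block", "margin-top": "0px", "font-size": "16px"}
parent = {"color": "rgb(51, 51, 51)", "font-size": "16px"}
first_target_style = filter_default_styles(computed, defaults, parent)
print(first_target_style)
```

In this example color is dropped even though it differs from the default, because it matches the inherited value from the parent.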
S302, restoring style attribute values in the first target style data that differ from the style attribute values defined by the webpage developer, to obtain second target style data;
In the embodiment of the present invention, not every style attribute value in the first target style data is accurate, because the computed style is essentially intermediate data that the browser calculates in order to render the DOM tree. Some attribute values are only transformed in form, such as color attributes. Others change from the default value auto to a specific value, such as the layout-related style attributes width and right; in such cases a special method is needed to restore the original style value. In the embodiment of the invention, the original declaration of a layout-related element style is obtained indirectly by hiding the element, and the style value in that original declaration is taken to obtain the second target style data. The second target style data may be partly identical to the first target style data.
S303, counting and extracting the second target style data to obtain public style data;
In the embodiment of the present invention, the second target style data includes many repeated style declarations. Many elements are structurally close; they are treated as elements of the same type when the web page is designed and the same CSS rule is applied to them, so shared styles inevitably occur. To handle shared styles, and considering the complexity of cascading style sheets, an algorithm identifies elements of the same type among sibling elements, counts the styles of those siblings, extracts their shared styles into a single sibling-level rule as far as possible, and retains the common style data corresponding to each sibling element.
S304, integrating the public style data and the second target style data to obtain target style data.
In the embodiment of the present invention, the data in the second target style data that is the same as the common style is replaced with the common style data, and the target style data is finally obtained.
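Steps S303 and S304 can be sketched in Python under a simplifying assumption: a declaration counts as common only when every same-type sibling carries it with the same value. The patent describes a statistical extraction without giving its threshold, so this intersection rule and the name `extract_common_styles` are illustrative choices.

```python
def extract_common_styles(sibling_styles: list) -> tuple:
    """Split sibling style dicts into one shared rule plus per-element remainders."""
    if not sibling_styles:
        return {}, []
    common = dict(sibling_styles[0])
    for styles in sibling_styles[1:]:
        # Keep a declaration only if this sibling has the identical value.
        common = {p: v for p, v in common.items() if styles.get(p) == v}
    # Integration: each element keeps only what the shared rule does not cover.
    remainders = [{p: v for p, v in s.items() if p not in common} for s in sibling_styles]
    return common, remainders


items = [
    {"color": "red", "padding": "4px", "font-weight": "bold"},
    {"color": "red", "padding": "4px"},
    {"color": "red", "padding": "8px"},
]
common, remainders = extract_common_styles(items)
print(common)
print(remainders)
```

A looser rule, such as promoting declarations shared by most siblings and overriding the exceptions, would shrink the output further at the cost of relying on cascade order.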
When tag attributes are accessed and an attribute value is a URL under the HTTP or HTTPS protocol, the program repairs the URL according to the address of the current page. The main purpose of repairing links is to ensure that the finally extracted HTML code still displays normally once separated from the original page. For a media tag such as an IMG tag, the src attribute defines the address of the picture; if the address of the picture resource is a URL under the http(s) protocol, the complete URL should be retained and added to the metadata of that picture node.
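One plausible reading of link repair is resolving each attribute value against the address of the current page so that the result is a complete URL. The Python sketch below shows that reading using the standard library function `urllib.parse.urljoin`; the function name `repair_url` and the example addresses are assumptions for illustration.

```python
from urllib.parse import urljoin


def repair_url(page_url: str, value: str) -> str:
    """Resolve an attribute value against the page address to get a complete URL."""
    return urljoin(page_url, value)


page = "https://example.com/news/index.html"  # illustrative page address
print(repair_url(page, "img/a.png"))                   # relative to the page directory
print(repair_url(page, "/static/b.png"))               # relative to the site root
print(repair_url(page, "//cdn.example.com/c.png"))     # protocol-relative
print(repair_url(page, "http://other.example/d.png"))  # already complete, unchanged
```

A URL that is already complete passes through unchanged, which matches the requirement that complete http(s) picture addresses be retained.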
In the embodiment of the present invention, the complete execution diagram of the element code extraction method is shown in Fig. 4. The DOM tree corresponding to a webpage is acquired, and a target element, which is the element to be extracted, is selected in the DOM tree through the element panel of the developer tools. Through the developer tool extension, an embedded script analyzes the target element to obtain the DOM tree metadata corresponding to it. The element style deduplication algorithm module then processes the DOM tree metadata to obtain a reconstructed DOM tree and style data, which correspond to the target structure data and the target style data respectively. If the target style data also includes external style sheet links, the URLs of those links are sent to the CSS file analysis service of a server, the supplementary style data parsed according to the @font-face rule is passed to the main-line module, and the developer tool extension converts the data in the main-line module into the element code corresponding to the target element.
The element code extraction method parses the metadata of the target element from a front-end page and implements an element style deduplication algorithm through data cleaning and repair, so that the HTML structure and CSS style of the target element are restored and optimized to a large degree. In addition, the external style analysis service parses font information in cross-domain CSS files, making up for the limitations of analysis performed purely at the browser end. The technical scheme thus comprises two parts, a browser extension and a resource analysis service: the browser extension extracts the target element with one key at the front end, and the two together generate high-quality HTML/CSS code suitable for secondary development and page reconstruction.
In the embodiment of the present invention, corresponding to the above method for extracting an element code, the present invention further provides an element code extraction system, where a structure of the extraction system is shown in fig. 5, and the extraction system includes:
an obtaining module 401, a parsing module 402, a deduplication module 403 and a conversion module 404.
Wherein,
the obtaining module 401 is configured to obtain an element to be extracted in a document object model of a target website;
the parsing module 402 is configured to parse target metadata included in the element to be extracted;
the deduplication module 403 is configured to perform style deduplication processing on the target metadata according to a preset style deduplication processing algorithm to obtain target style data;
the conversion module 404 is configured to convert the target style data into an extractable target code, and extract the target code.
The invention discloses an element code extraction system, which obtains an element to be extracted from the document object model of a target website; parses the target metadata contained in the element to be extracted; performs style deduplication on the target metadata according to a preset style deduplication algorithm to obtain target style data; and converts the target style data into extractable target code and extracts the target code. Because the system deduplicates the styles in the target metadata after obtaining it, it addresses the problem that existing one-key extraction of element code applies no processing during extraction, so that the extracted styles remain redundant, incomplete, and inaccurate in value and cannot serve as base code for secondary development.
In this embodiment of the present invention, the parsing module 402 includes:
an obtaining unit 405, a comparison unit 406 and a deleting unit 407.
Wherein,
the obtaining unit 405 is configured to obtain each node included in the target element and a tag attribute data set and a style sheet data set corresponding to the node;
the comparison unit 406 is configured to compare the data in the tag attribute data set and the style sheet data set with the necessary data in a preset necessary attribute and style database;
the deleting unit 407 is configured to delete, from the tag attribute data set and the style sheet data set, data that is not present in the preset necessary attribute and style database, so as to obtain a first tag attribute data set and a first style sheet data set, where the first tag attribute data set and the first style sheet data set constitute the target metadata included in the element to be extracted.
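The comparison and deletion steps amount to filtering each node against an allow-list. The sketch below assumes the "necessary attribute and style database" can be modelled as two sets of names; the set contents and the function name `clean_node` are illustrative assumptions, not values taken from the patent.

```python
from typing import Dict, Set, Tuple

# Hypothetical contents of the preset necessary attribute and style database.
NECESSARY_ATTRS: Set[str] = {"id", "class", "href", "src", "alt"}
NECESSARY_STYLES: Set[str] = {"color", "font-size", "margin", "padding", "display"}


def clean_node(
    attrs: Dict[str, str], styles: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return the first tag attribute data set and first style sheet data set."""
    # Keep only entries whose name appears in the database; delete the rest.
    first_attrs = {k: v for k, v in attrs.items() if k in NECESSARY_ATTRS}
    first_styles = {k: v for k, v in styles.items() if k in NECESSARY_STYLES}
    return first_attrs, first_styles
```

For example, a node carrying `class`, `data-track` and `onclick` attributes would keep only `class` under this allow-list.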
In this embodiment of the present invention, the deduplication module 403 includes:
a filtering unit 408, a restoring unit 409, an extracting unit 410 and an integration unit 411.
Wherein,
the filtering unit 408 is configured to filter default style data included in the target metadata to obtain first target style data;
the restoring unit 409 is configured to restore style attribute values in the first target style data that differ from the style attribute values defined by the web page developer, to obtain second target style data;
the extracting unit 410 is configured to count and extract the second target style data to obtain common style data;
the integration unit 411 is configured to integrate the common style data and the second target style data to obtain target style data.
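The four units can be sketched as small functions over plain dictionaries. This is an illustration under assumptions rather than the patented algorithm: `defaults` stands for the baseline styles that the claims obtain from a same-type tag inside an isolating iframe, `authored` stands for values as written by the page developer (for example `50%` where the browser reports `120px`), and "common" is interpreted here as a declaration shared by every node.

```python
from collections import Counter
from typing import Dict, List, Tuple

Style = Dict[str, str]


def filter_defaults(styles: Style, defaults: Style) -> Style:
    # Filtering unit 408: drop declarations equal to the tag's default value.
    return {k: v for k, v in styles.items() if defaults.get(k) != v}


def restore_authored(styles: Style, authored: Style) -> Style:
    # Restoring unit 409: prefer the developer-defined value when one is known.
    return {k: authored.get(k, v) for k, v in styles.items()}


def extract_common(per_node: List[Style]) -> Tuple[Style, List[Style]]:
    # Extracting unit 410: count each (property, value) pair and keep the
    # pairs that occur on every node (an assumed definition of "common").
    counts = Counter(pair for s in per_node for pair in s.items())
    common = {k: v for (k, v), c in counts.items() if c == len(per_node)}
    rest = [
        {k: v for k, v in s.items() if common.get(k) != v} for s in per_node
    ]
    return common, rest


def integrate(common: Style, rest: List[Style]) -> Dict[str, object]:
    # Integration unit 411: combine shared and per-node styles in one result.
    return {"common": common, "nodes": rest}
```

Other thresholds are possible for the counting step, such as treating a declaration as common once it appears on a majority of nodes; in that case conflicting values for the same property would need a tie-breaking rule.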
In this embodiment of the present invention, the element code extraction system further includes:
a first determining module 412 and a first storage module 413.
Wherein,
the first determining module 412 is configured to determine whether the element to be extracted includes a pseudo element;
the first storage module 413 is configured to, if yes, obtain first style data of the pseudo element, and store the first style data in the target style data.
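Pseudo elements such as `::before` and `::after` are not DOM nodes, so their styles have to be collected separately. The sketch below assumes the styles of each pseudo element have already been read into a dictionary (hypothetical input), and that a `content` value of `none` or `normal` means no pseudo element is generated; the function name and the selector-keyed output are illustrative.

```python
from typing import Dict

Style = Dict[str, str]


def collect_pseudo_styles(
    selector: str,
    pseudo_styles: Dict[str, Style],
    target: Dict[str, Style],
) -> Dict[str, Style]:
    """Store first style data of generated pseudo elements in the target data."""
    for pseudo, styles in pseudo_styles.items():
        # First determining module 412: does this pseudo element exist?
        content = styles.get("content", "none")
        if content in ("none", "normal"):
            continue
        # First storage module 413: keep its styles under "selector::pseudo".
        target[f"{selector}{pseudo}"] = dict(styles)
    return target
```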
In this embodiment of the present invention, the element code extraction system further includes:
a second determining module 414 and a second storage module 415.
Wherein,
the second determining module 414 is configured to determine whether an external link file exists in the element to be extracted;
the second storage module 415 is configured to, if yes, parse second style data corresponding to the external link file, and store the second style data in the target style data.
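For the custom-font case named in the claims, the second style data comes from `@font-face` rules in the external CSS file. A minimal sketch of that parsing step follows, assuming the CSS text has already been fetched by the resource analysis service; the regular expression is a simplification and would mis-split declarations whose values contain semicolons or closing braces, such as base64 data URIs.

```python
import re
from typing import Dict, List

# Matches the body of each @font-face rule; assumes no nested braces.
FONT_FACE_RE = re.compile(r"@font-face\s*\{([^}]*)\}", re.IGNORECASE)


def parse_font_faces(css_text: str) -> List[Dict[str, str]]:
    """Return one dict of declarations per @font-face rule in css_text."""
    faces = []
    for body in FONT_FACE_RE.findall(css_text):
        decls = {}
        for decl in body.split(";"):
            if ":" not in decl:
                continue  # skip empty fragments between semicolons
            # Split on the first colon only, so URLs in values stay intact.
            name, value = decl.split(":", 1)
            decls[name.strip().lower()] = value.strip()
        if decls:
            faces.append(decls)
    return faces
```

The URL in the test input for this sketch is a placeholder. A production implementation would more likely use a real CSS parser than a regular expression.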
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant points, reference may be made to the corresponding description of the method embodiment.
Finally, it should be further noted that, in the present application, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, a magnetic disk or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The element code extraction method and system provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. A person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (4)

1. An element code extraction method, comprising:
acquiring an element to be extracted in a document object model of a target website;
parsing target metadata contained in the element to be extracted, including:
acquiring each node contained in a target element and a tag attribute data set and a style sheet data set corresponding to each node, wherein the target elements further comprise SVG elements;
comparing the data in the tag attribute data set and the style sheet data set with the necessary data in a preset necessary attribute and style database;
deleting data which do not exist in the preset necessary attribute and style database in the tag attribute dataset and the style sheet dataset to obtain a first tag attribute dataset and a first style sheet dataset, wherein the first tag attribute dataset and the first style sheet dataset form target metadata contained in the element to be extracted;
performing style deduplication processing on the target metadata according to a preset style deduplication processing algorithm to obtain target style data, including:
filtering the default style data contained in the target metadata to obtain first target style data, specifically: using an iframe to create a style-isolated environment, creating a tag of the same type in the iframe, acquiring the style data of that tag, comparing it with the style data of the original element, identifying the style attributes that have default values, and filtering them out;
restoring style attribute values different from style attribute values defined by a webpage developer in the first target style data to obtain second target style data;
counting and extracting the second target style data to obtain common style data;
integrating the common style data and the second target style data to obtain target style data;
converting the target style data into an extractable target code, and extracting the target code;
determining whether an external link file exists in the element to be extracted; and if the external link file is a custom font, parsing second style data corresponding to the external link file and storing the second style data in the target style data, wherein, when the external link file is a custom font, a supplementary style is obtained by parsing the @font-face rule of cascading style sheets CSS3 and is used as the second style data.
2. The method of claim 1, further comprising:
judging whether the element to be extracted contains a pseudo element or not;
if yes, acquiring first style data of the pseudo element, and storing the first style data in the target style data.
3. An element code extraction system, comprising:
an acquisition module, configured to acquire an element to be extracted in a document object model of a target website;
a parsing module, configured to parse target metadata contained in the element to be extracted;
wherein the parsing module includes:
an acquisition unit, configured to acquire each node contained in a target element and a tag attribute data set and a style sheet data set corresponding to the node, wherein the target elements further comprise SVG elements;
a comparison unit, configured to compare the data in the tag attribute data set and the style sheet data set with the necessary data in a preset necessary attribute and style database;
a deleting unit, configured to delete data that is not present in the preset necessary attribute and style database in the tag attribute dataset and the style sheet dataset to obtain a first tag attribute dataset and a first style sheet dataset, where the first tag attribute dataset and the first style sheet dataset constitute target metadata included in the element to be extracted;
a deduplication module, configured to perform style deduplication processing on the target metadata according to a preset style deduplication processing algorithm to obtain target style data;
wherein the deduplication module includes:
a filtering unit, configured to filter default style data included in the target metadata to obtain first target style data, specifically: using an iframe to create a style-isolated environment, creating a tag of the same type in the iframe, acquiring the style data of that tag, comparing it with the style data of the original element, identifying the style attributes that have default values, and filtering them out;
a restoring unit, configured to restore style attribute values in the first target style data that differ from the style attribute values defined by the web page developer, to obtain second target style data;
an extraction unit, configured to count and extract the second target style data to obtain common style data;
an integration unit, configured to integrate the common style data and the second target style data to obtain target style data;
a conversion module, configured to convert the target style data into an extractable target code and extract the target code;
a second determining module, configured to determine whether an external link file exists in the element to be extracted;
and a second storage module, configured to parse second style data corresponding to the external link file if the external link file is a custom font, wherein, when the external link file is a custom font, a supplementary style is obtained by parsing the @font-face rule of cascading style sheets CSS3 and is used as the second style data.
4. The system of claim 3, further comprising:
a first determining module, configured to determine whether the element to be extracted contains a pseudo element;
and a first storage module, configured to, if so, acquire first style data of the pseudo element and store the first style data in the target style data.
CN201810239789.4A 2018-03-22 2018-03-22 Element code extraction method and system Active CN108446136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810239789.4A CN108446136B (en) 2018-03-22 2018-03-22 Element code extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810239789.4A CN108446136B (en) 2018-03-22 2018-03-22 Element code extraction method and system

Publications (2)

Publication Number Publication Date
CN108446136A CN108446136A (en) 2018-08-24
CN108446136B true CN108446136B (en) 2021-10-15

Family

ID=63196553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810239789.4A Active CN108446136B (en) 2018-03-22 2018-03-22 Element code extraction method and system

Country Status (1)

Country Link
CN (1) CN108446136B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492201A (en) * 2018-11-08 2019-03-19 大连瀚闻资讯有限公司 Document format conversion method applied to magnitude comparison
CN109815455B (en) * 2019-02-02 2023-05-05 天津字节跳动科技有限公司 Project file processing method and device
CN109783139B (en) * 2019-02-21 2020-12-04 四川大学 Software interface feature extraction method and device and electronic equipment
CN113741822B (en) * 2021-11-05 2022-02-15 腾讯科技(深圳)有限公司 Data storage method, data reading method and related device
CN114676369A (en) * 2022-03-10 2022-06-28 平安国际智慧城市科技股份有限公司 Webpage embedding method, device, equipment and computer readable storage medium
CN114661279A (en) * 2022-04-11 2022-06-24 平安资产管理有限责任公司 Method, system and computer equipment for extracting source codes of page components

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8397159B2 (en) * 2009-01-14 2013-03-12 International Business Machines Corporation Method and apparatus for solving UI style conflicts in web application composition
CN103336812A (en) * 2013-06-27 2013-10-02 优视科技有限公司 Webpage resource caching method and device for improving secondary loading efficiency
CN103500118A (en) * 2013-10-24 2014-01-08 北京奇虎科技有限公司 Method and device for optimizing cascading style sheet
CN106528068A (en) * 2015-09-15 2017-03-22 中国电信股份有限公司 Webpage content reconstruction method and system
CN106886398A (en) * 2016-06-20 2017-06-23 阿里巴巴集团控股有限公司 The extracting method and equipment of a kind of CSS
CN107783764A (en) * 2017-09-29 2018-03-09 厦门集微科技有限公司 Remove the method and device of front end pattern redundancy

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8819819B1 (en) * 2011-04-11 2014-08-26 Symantec Corporation Method and system for automatically obtaining webpage content in the presence of javascript
US9430583B1 (en) * 2011-06-10 2016-08-30 Salesforce.Com, Inc. Extracting a portion of a document, such as a web page
CN103336690B (en) * 2013-06-28 2017-02-08 优视科技有限公司 HTML (Hypertext Markup Language) 5-based text-element drawing method and device
CN105373567B (en) * 2014-09-01 2019-12-20 北京奇虎科技有限公司 Page generation method and client
CN106980497A (en) * 2017-02-10 2017-07-25 九次方大数据信息集团有限公司 Webpage and website performance optimization method and device

Also Published As

Publication number Publication date
CN108446136A (en) 2018-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231115

Address after: 100190 901-1, Floor 9, Building 3, No. 2 Academy South Road, Haidian District, Beijing

Patentee after: Beijing Bodian Zhihe Technology Co.,Ltd.

Address before: 100086 20 / F, block C, No.2, south academy of Sciences Road, Haidian District, Beijing

Patentee before: BEIJING JIAODIAN XINGANXIAN INFORMATION TECHNOLOGY CO.,LTD.

TR01 Transfer of patent right

Effective date of registration: 20240812

Address after: Room 1201, 12th Floor, Building 3, No. 2 Science Academy South Road, Haidian District, Beijing, 100084

Patentee after: BEIJING SOHU NEW MEDIA INFORMATION TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: 100190 901-1, Floor 9, Building 3, No. 2 Academy South Road, Haidian District, Beijing

Patentee before: Beijing Bodian Zhihe Technology Co.,Ltd.

Country or region before: China
