CN110020236B - Webpage parsing method, device, storage medium, processor and equipment - Google Patents

Info

Publication number
CN110020236B
Authority
CN
China
Prior art keywords
template
analysis
webpage
url
analyzed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710758003.5A
Other languages
Chinese (zh)
Other versions
CN110020236A (en
Inventor
袁园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201710758003.5A
Publication of CN110020236A
Application granted
Publication of CN110020236B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention discloses a webpage parsing method, device, storage medium, processor and equipment. The webpage parsing method comprises the following steps: acquiring a webpage parsing request, wherein the request carries the Uniform Resource Locator (URL) of the webpage to be parsed and the service scenario in which the webpage is to be parsed; searching the pre-configured templates for a template that matches both the service scenario and the URL, wherein the template content of each template comprises a parsing rule, and different templates have different parsing rules; and parsing the webpage using the parsing rule in the template found, to obtain a parsing result. With the invention, parsing rules can be configured without restarting the online program, which improves working efficiency.

Description

Webpage parsing method, device, storage medium, processor and equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a storage medium, a processor, and a device for web page parsing.
Background
Webpage parsing refers to analyzing the source code of a webpage and extracting the information that is actually wanted. Webpage parsing is a very important part of search engine development.
Different websites and different pages usually require different parsing rules. To parse webpages from different websites and with different layouts on the same platform, the method currently adopted is as follows: before each webpage is parsed, the parsing rule corresponding to that webpage is configured; the webpage is then parsed using that rule, and the next webpage is handled only after the current one has been parsed. To configure a new parsing rule, the rule is first written in, and the online program is then restarted so that the newly written rule takes effect.
However, because the online program must be restarted every time a new parsing rule is configured, a large number of new parsing rules means restarting the online program again and again, which inevitably reduces working efficiency.
Disclosure of Invention
In view of the above, the present invention is proposed to provide a web page parsing method, apparatus, storage medium, processor and device that overcome or at least partially solve the above problems, and the solution is as follows:
a webpage parsing method comprises the following steps:
acquiring a webpage analysis request, wherein the webpage analysis request carries a Uniform Resource Locator (URL) of a webpage to be analyzed and a service scene where the webpage to be analyzed is located when the webpage to be analyzed is analyzed;
searching templates which are matched with the service scene and the URL at the same time from all pre-configured templates, wherein the template contents of the templates comprise analysis rules, and different templates have different analysis rules;
and analyzing the webpage to be analyzed by utilizing the searched analysis rule in the template to obtain an analysis result.
Optionally, before searching for a template that matches the service scene and the URL at the same time from pre-configured templates, the method for webpage parsing further includes: the method comprises the steps that all templates are uniformly configured in a database in advance in a preset storage format, wherein the storage format of each template in the database adopts a column type storage format supporting a nested structure, the storage columns of the templates are divided into domain names, service scenes and template objects, and the template objects specifically comprise URL regular matching rules of the templates and template contents.
Wherein, the searching the template matched with the service scene and the URL at the same time from each pre-configured template comprises:
taking the domain name in the URL as a keyword, retrieving the template in the database, and screening out the template corresponding to the domain name in the URL;
taking the service scene as a keyword, carrying out secondary retrieval on the template corresponding to the domain name in the screened URL, and screening out the template corresponding to the service scene;
and matching the URL with the URL regular matching rule of the template corresponding to the screened service scene, and finding out the template successfully matched.
Optionally, after the templates are uniformly configured in the database in advance in the preset storage format, the webpage parsing method further includes:
a cache pool is created locally in advance, and a background thread is started at the back end; the background thread is used for periodically updating the template in the database to the cache pool.
Optionally, the template content further includes a call instruction;
correspondingly, after the webpage to be analyzed is analyzed by using the searched analysis rule in the template to obtain an analysis result, the webpage analysis method further comprises the following steps:
calling a pre-configured public analysis component according to the found calling instruction in the template, and processing the field needing secondary analysis in the analysis result by using the public analysis component to obtain a secondary analysis result;
the common analysis component refers to an analyzer, and different common analysis components have different analysis capabilities.
A web page parsing apparatus, comprising:
a preprocessing unit, configured to pre-configure each template, wherein the template content of each template comprises a parsing rule, and different templates have different parsing rules;
an acquisition unit, configured to acquire a webpage parsing request, wherein the webpage parsing request carries the URL of the webpage to be parsed and the service scenario in which the webpage is to be parsed;
the searching unit is used for searching templates which are matched with the service scene and the URL at the same time from all the pre-configured templates;
and the first analysis unit is used for analyzing the webpage to be analyzed by utilizing the analysis rule in the template searched by the search unit to obtain an analysis result.
Optionally, the template content further includes a call instruction;
correspondingly, the web page parsing device further comprises: the second analysis unit is used for calling a pre-configured public analysis component according to the found calling instruction in the template, and processing the field needing secondary analysis in the analysis result by using the public analysis component to obtain a secondary analysis result;
the common analysis component refers to an analyzer, and different common analysis components have different analysis capabilities.
A storage medium having stored thereon a program which, when executed by a processor, implements any of the web page parsing methods disclosed above.
A processor for running a program which when run performs any of the web page parsing methods disclosed above.
An apparatus comprising a processor, a memory, and a program stored on the memory and executable on the processor, the processor implementing any of the web page parsing methods disclosed above when executing the program.
By means of the technical scheme, the webpage analysis method, the webpage analysis device, the storage medium, the processor and the equipment provided by the invention can be used for configuring the analysis rules in advance, so that when the webpages of different websites and different layouts are analyzed on the same platform, the analysis rule matched with each webpage can be directly called from the preset analysis rules to analyze the webpage. The invention can complete the configuration of the matched analysis rule without restarting the on-line program, thereby improving the working efficiency.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a method for parsing a web page according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a webpage parsing method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a further webpage parsing method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram illustrating a web page parsing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram illustrating another web page parsing apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, a method for web page parsing according to an embodiment of the present invention includes:
Step S01: acquiring a webpage parsing request, wherein the webpage parsing request carries the Uniform Resource Locator (URL) of the webpage to be parsed and the service scenario in which the webpage is to be parsed.
Specifically, a URL is a compact representation of the location of, and the access method for, a resource available on the Internet; it is the standard address of a resource on the Internet, commonly called a "web address". A URL includes information such as a domain name.
For the same webpage, the content to be parsed differs between service scenarios. For example, when a Baidu search results page is parsed, service scenario A requires all advertisement content promoted by Baidu search to be extracted, whereas service scenario B requires the ordinary search results to be extracted. In the webpage parsing request, the service scenario can be represented by an identifier that uniquely corresponds to it.
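As an illustration of such a request, the following is a minimal sketch in Python. The class name `ParseRequest`, the field names, the scenario identifiers and the example URL are all assumptions made for illustration; the text only requires that the request carry a URL and an identifier of the service scenario.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ParseRequest:
    """Hypothetical request object: the text only says it carries a URL and a scenario."""
    url: str          # URL of the webpage to be parsed
    scenario_id: str  # identifier that uniquely names the service scenario (assumed format)

    @property
    def domain(self) -> str:
        # Host part of the URL; the domain name is used later as a lookup key.
        return urlparse(self.url).hostname or ""


# The same page requested under two different service scenarios.
ads_request = ParseRequest("https://search.example.com/s?wd=camera", "scenario-A")
organic_request = ParseRequest("https://search.example.com/s?wd=camera", "scenario-B")
```

The two requests share a URL but differ in scenario, so they can be resolved to different templates.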
Step S02: searching the pre-configured templates for a template that matches both the service scenario and the URL, wherein the template content of each template comprises a parsing rule, and different templates have different parsing rules.
Specifically, according to the URL of the web page to be analyzed and the service scenario in which the web page to be analyzed is located when the web page to be analyzed is analyzed, the analysis rule corresponding to the web page to be analyzed can be uniquely determined. Based on this, in this embodiment, parsing rules corresponding to different websites and different pages are configured in advance (for example, parsing rules corresponding to different websites and different pages are configured in a database in a unified manner in advance), the parsing rules are stored in different templates, and when a current webpage needs to be parsed, a template corresponding to the current webpage can be directly found from the pre-configured templates according to the URL of the current webpage and a service scenario in which the current webpage is parsed, so as to obtain the parsing rules in the template.
During preprocessing, in order to configure the templates uniformly in the database, the parsing rule of a template must be defined and the meaning of each node name must be fixed. For example, the parsing rule of a template may be defined as a set of key-value string rules in JSON (JavaScript Object Notation), where the key represents a field type and the value is the XPath (XML Path Language) expression used when parsing the webpage. The keys of the JSON nodes fall into two types: node attribute value types and service field types. Node attribute value types mainly express the level of the current node and flags indicating that the parsed content needs special processing; service field types are tied to the specific fields of each webpage to be parsed. The two types complement each other, and neither can be omitted.
Next, an application example is given in which the above example is used to define the parsing rule of the template and to determine the meaning of each node name.
Taking the web page to be analyzed as the video web page as an example, the analysis rule in the template corresponding to the web page may adopt the following example 1-1.
[Example 1-1: template parsing rule, reproduced as an image in the original publication]
The node names in example 1-1 have the following meanings: Xpath is a key of the node attribute value type; Video, Title and ViewCount are keys of the service field type; and the value matched for Video is an array of the content that conforms to the XPath.
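To make the key-value rule format concrete, the sketch below applies a small JSON rule to a page using Python's standard library. The rule layout, the XPath expressions and the sample page are assumptions made for illustration and are not the listing of example 1-1; only the node names Xpath, Video, Title and ViewCount come from the text. The standard-library parser supports only a subset of XPath and needs well-formed markup, so a real HTML page would call for a dedicated HTML parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical rule: "Xpath" locates the repeated node, the other keys are service fields.
TEMPLATE_JSON = """
{
  "Video": {
    "Xpath": ".//div[@class='video']",
    "Title": "./a",
    "ViewCount": "./span[@class='views']"
  }
}
"""

# Hypothetical, well-formed page fragment.
PAGE = """
<html><body>
  <div class="video"><a> NBA star competition </a><span class="views">23 times</span></div>
  <div class="video"><a> badminton world competition </a><span class="views">45 times</span></div>
</body></html>
"""


def apply_rule(node, rule):
    """Evaluate a rule against an element and return the extracted fields."""
    result = {}
    for field, spec in rule.items():
        if isinstance(spec, dict):
            # Nested rule: "Xpath" selects the nodes, the result is an array.
            children = {k: v for k, v in spec.items() if k != "Xpath"}
            matches = node.findall(spec["Xpath"])
            result[field] = [apply_rule(m, children) for m in matches]
        else:
            # Leaf rule: the value is an XPath whose first match supplies the text.
            result[field] = node.findtext(spec)
    return result


parsed = apply_rule(ET.fromstring(PAGE), json.loads(TEMPLATE_JSON))
```

Here `parsed["Video"]` is a list with one dictionary per matched node, mirroring the array-valued result described for the Video key.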
In addition, when the templates are configured in the database, the storage format of each template in the database must be determined. To make the templates easy to manage and quick to look up, each template may be stored in a column-oriented storage format that supports nested structures, with three storage columns: domain name, service scenario, and template object, where the template object comprises the template's URL regular-expression matching rule and the template content. In other words, the templates are stored in the database as Dictionary<string, Dictionary<string, List<Template>>>, that is, Dictionary<domain name, Dictionary<scenario, List<template object>>>.
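The nested layout described above (domain name, then service scenario, then a list of template objects) can be sketched in Python as a dictionary of dictionaries. The names `Template`, `TemplateStore` and `add_template`, and the sample values, are hypothetical; the sketch models only the in-memory shape, not the database itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Template:
    url_pattern: str  # URL regular-expression matching rule of the template
    content: dict     # template content (the parsing rule)


# domain name -> service scenario -> list of template objects
TemplateStore = Dict[str, Dict[str, List[Template]]]


def add_template(store: TemplateStore, domain: str, scenario: str, template: Template) -> None:
    """Insert a template under its domain and scenario, creating levels as needed."""
    store.setdefault(domain, {}).setdefault(scenario, []).append(template)


store: TemplateStore = {}
add_template(
    store,
    "video.example.com",
    "scenario-A",
    Template(r"https://video\.example\.com/sports/\d+", {"Title": ".//h1"}),
)
```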
Based on this storage format, when searching for a template matching both the service scenario and the URL from the database, for the purpose of quickly querying the template, as shown in fig. 2, the step S02 specifically includes:
Step S021: retrieving templates from the database using the domain name in the URL carried in the webpage parsing request as the keyword, and screening out the templates corresponding to that domain name;
Step S022: performing a secondary retrieval on the screened-out templates using the service scenario carried in the webpage parsing request as the keyword, and screening out the templates corresponding to that service scenario;
Step S023: matching the URL against the URL regular-expression matching rules of the templates screened out for the service scenario, and finding the template that matches successfully.
In short, the search method shown in fig. 2 uses, in turn, the domain name of the webpage to be parsed, the service scenario in which it is parsed, and its URL: the set of templates under the domain name is first screened out of the database, the templates under the service scenario are then screened out of that set, and finally the uniquely matching template is selected from the second set according to the URL.
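The three-stage search of steps S021 to S023 can be sketched as follows, assuming the templates are held in a nested dictionary keyed by domain name and then by service scenario. The function name `find_template`, the sample patterns, and the choice of `re.fullmatch` (rather than a partial match) are assumptions made for illustration.

```python
import re
from typing import Dict, List, NamedTuple, Optional
from urllib.parse import urlparse


class Template(NamedTuple):
    url_pattern: str  # URL regular-expression matching rule
    content: dict     # template content (the parsing rule)


def find_template(store: Dict[str, Dict[str, List[Template]]],
                  url: str, scenario: str) -> Optional[Template]:
    domain = urlparse(url).hostname or ""
    by_scenario = store.get(domain, {})         # S021: filter by domain name
    candidates = by_scenario.get(scenario, [])  # S022: filter by service scenario
    for template in candidates:                 # S023: match the URL against each rule
        if re.fullmatch(template.url_pattern, url):
            return template
    return None


# Hypothetical store with two templates under one domain and scenario.
STORE = {
    "video.example.com": {
        "scenario-A": [
            Template(r"https://video\.example\.com/sports/\d+", {"name": "sports"}),
            Template(r"https://video\.example\.com/news/\d+", {"name": "news"}),
        ],
    },
}

found = find_template(STORE, "https://video.example.com/news/42", "scenario-A")
```

Because the first two stages are dictionary lookups, only the templates left after both filters are tested against the URL with regular expressions.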
On the basis of the search method shown in fig. 2, the template query can be made faster still by creating a cache pool locally in advance and starting a background thread at the back end that periodically (for example, once every minute) updates the cache pool with the templates in the database. Because the cache pool is stored locally, lookups in it are fast. The background thread keeps the templates in the cache pool up to date: the templates are stored in the database, but the templates used by external callers are those in the cache pool, so any modification to a template in the database is propagated to the cache pool at the next periodic update. The templates in the cache pool use the same storage format as in the database.
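A local cache pool refreshed by a background thread might look like the sketch below. The class `TemplateCache` and the `load_from_db` callable are hypothetical stand-ins: the callable is assumed to return the full set of templates from the database in the same nested format, and the database access itself is not modeled.

```python
import threading


class TemplateCache:
    """Local cache pool that a background thread refreshes at a fixed interval."""

    def __init__(self, load_from_db, interval_seconds=60.0):
        self._load = load_from_db          # hypothetical callable returning all templates
        self._interval = interval_seconds  # e.g. once every minute
        self._lock = threading.Lock()
        self._pool = {}
        self._stop = threading.Event()
        self.refresh()                     # fill the pool once before serving lookups
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def refresh(self):
        fresh = self._load()               # read outside the lock, swap inside it
        with self._lock:
            self._pool = fresh

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval):
            self.refresh()

    def snapshot(self):
        with self._lock:
            return self._pool

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Callers read templates through `snapshot()`, so a template added to the database becomes visible after the next refresh without the calling program being restarted.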
Step S03: parsing the webpage to be parsed using the parsing rule in the template found, to obtain a parsing result.
For example, when an existing sports-program video webpage is parsed using the template of example 1-1, the parsing result output is as shown in example 1-2.
[Example 1-2: parsing result, reproduced as an image in the original publication]
After the webpage to be parsed has been parsed using the parsing rule in the template found, the parsing result is fed back directly to the caller. The webpage parsing method disclosed in this embodiment is a stateless service; that is, its behavior does not change from one caller to another.
As can be seen from the above description of the embodiment, the webpage parsing method provided in this embodiment can pre-configure each parsing rule, so that when different websites and different layouts of webpages are parsed on the same platform, for each webpage, the parsing rule matched with each webpage can be directly called out from the pre-configured parsing rules to parse the webpage, and an online program does not need to be restarted to complete configuration of the parsing rule matched with the webpage, thereby improving the work efficiency.
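The effect described above, a newly configured rule taking effect while the program keeps running, can be illustrated with a short sketch. The class `ParsingService`, the shared in-memory `store` and the sample rule are assumptions made for illustration; in the embodiment the templates live in a database rather than in a Python dictionary.

```python
import re
from urllib.parse import urlparse


class ParsingService:
    """Hypothetical long-running service that looks rules up on every request."""

    def __init__(self, store):
        self.store = store  # shared template store: domain -> scenario -> [(pattern, rule)]

    def rule_for(self, url, scenario):
        host = urlparse(url).hostname or ""
        for pattern, rule in self.store.get(host, {}).get(scenario, []):
            if re.fullmatch(pattern, url):
                return rule
        return None


store = {}
service = ParsingService(store)
url = "https://news.example.com/article/7"

before = service.rule_for(url, "scenario-A")  # no template configured yet

# Configure a new template while the service object keeps running.
store.setdefault("news.example.com", {}).setdefault("scenario-A", []).append(
    (r"https://news\.example\.com/article/\d+", {"Title": ".//h1"})
)

after = service.rule_for(url, "scenario-A")   # the new rule is found immediately
```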
It should be noted that both the webpage parsing method of the prior art and the webpage parsing method disclosed above may encounter the following problem: after a webpage has been parsed using the parsing rule, some of the parsed fields may not yet be the content finally wanted, and special processing may be needed to obtain the final desired parsing result. In the prior art, this is handled by hard-coding: each time a webpage is parsed, a dedicated parser has to be written for every field in the parsing result that needs special processing, which places too heavy a workload on the developers. To obtain the final desired parsing result without imposing that workload, an embodiment of the present invention provides another webpage parsing method based on the method disclosed above. As shown in fig. 3, the method specifically includes:
Step S01: acquiring a webpage parsing request, wherein the webpage parsing request carries the Uniform Resource Locator (URL) of the webpage to be parsed and the service scenario in which the webpage is to be parsed.
Step S02: searching the pre-configured templates for a template that matches both the service scenario and the URL, wherein the template content of each template comprises a parsing rule and a call instruction, and different templates have different parsing rules.
Step S03: parsing the webpage to be parsed using the parsing rule in the template found, to obtain a parsing result.
Step S04: calling a pre-configured common parsing component according to the call instruction in the template found, and processing the fields in the parsing result that require secondary parsing using the common parsing component, to obtain a secondary parsing result. A common parsing component is a parser, and different common parsing components have different parsing capabilities. The common parsing components can be configured in the same database as the templates.
The webpage parsing method shown in fig. 3 builds on the method disclosed above. The improvement is that the template content also comprises a call instruction, and each time a webpage has been parsed using the parsing rule, the corresponding common parsing component is called to process each field in the parsing result that needs special processing, so as to obtain the final desired parsing result. In this embodiment the common parsing components are configured in advance, and any common parsing component can be called by multiple templates rather than being dedicated to one field in the parsing result of one webpage, which greatly reduces the developers' workload.
For example, in the parsing result shown in example 1-2, "NBA star competition" and "badminton world competition" contain blank spaces, whereas the desired result is a field without them. For this case, a common parsing component is defined in advance whose capability is TrimTransformation, that is, removing blank spaces from the parsed field. Similarly, for "23 times" and "45 times" in the parsing result shown in example 1-2, a common parsing component is defined in advance whose capability is IntegerExtractTransformation, that is, matching the numeric part with a regular expression, so that "23" and "45" are extracted as the final output. The resulting template content and final output are shown in example 1-3.
The newly obtained template content is as follows:
[template content, reproduced as an image in the original publication]
The final output result is as follows:
[output result, reproduced as an image in the original publication]
Example 1-3
It is clear that the output results in examples 1-3 are the final desired parsing results.
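The secondary parsing step can be sketched as a registry of reusable components that a template refers to by name. The registry `COMPONENTS`, the function names, the shape of the `calls` mapping, and the spelling TrimTransformation for the whitespace-removing component are assumptions made for illustration; the way call instructions are actually encoded in the template is shown only in the original images.

```python
import re


def trim_transformation(value: str) -> str:
    """Remove leading and trailing blank spaces from a parsed field."""
    return value.strip()


def integer_extract_transformation(value: str) -> str:
    """Keep the first run of digits found by a regular expression, if any."""
    match = re.search(r"\d+", value)
    return match.group(0) if match else value


# Common parsing components, shared by all templates and looked up by name.
COMPONENTS = {
    "TrimTransformation": trim_transformation,
    "IntegerExtractTransformation": integer_extract_transformation,
}


def apply_calls(record: dict, calls: dict) -> dict:
    """Apply the named components to the fields listed in `calls`, in order."""
    out = dict(record)
    for field, names in calls.items():
        for name in names:
            out[field] = COMPONENTS[name](out[field])
    return out


first_pass = {"Title": " NBA star competition ", "ViewCount": "23 times"}
calls = {"Title": ["TrimTransformation"], "ViewCount": ["IntegerExtractTransformation"]}
second_pass = apply_calls(first_pass, calls)
```

Because each component is registered once and referenced by name, several templates can reuse it without a field-specific parser being written for each page.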
Corresponding to the embodiment of the method, the invention also provides a webpage analyzing device.
As shown in fig. 4, an apparatus for parsing a web page according to an embodiment of the present invention includes:
a preprocessing unit 100, configured to pre-configure each template, where the template content of the template includes an analysis rule, and different templates have different analysis rules;
an obtaining unit 200, configured to obtain a web page parsing request, where the web page parsing request carries a URL of a web page to be parsed and a service scenario in which the web page to be parsed is located when the web page to be parsed is parsed;
a searching unit 300, configured to search templates that are matched with the service scene and the URL at the same time from pre-configured templates;
a first parsing unit 400, configured to parse the webpage to be parsed using the parsing rule in the template found by the searching unit 300, so as to obtain a parsing result.
Optionally, the preprocessing unit 100 is specifically configured to uniformly configure each template in a database according to a preset storage format, where the storage format of each template in the database adopts a column-type storage format supporting a nested structure, a storage column of each template is divided into a domain name, a service scene, and a template object, and the template object specifically includes a URL regular matching rule of the template and a template content.
Optionally, the searching unit 300 is specifically configured to search the templates in the database by using the domain name in the URL as a keyword, and filter out the template corresponding to the domain name in the URL; secondly, with the service scene as a keyword, carrying out secondary retrieval on the template corresponding to the domain name in the screened URL, and screening out the template corresponding to the service scene; and matching the URL with the URL regular matching rule of the template corresponding to the screened service scene, and finding out the template successfully matched.
Optionally, the preprocessing unit 100 is further configured to create a cache pool locally in advance, and start a background thread at the back end at the same time; the background thread is used for periodically updating the cache pool with the templates in the database.
Optionally, the template content further includes a call instruction; correspondingly, as shown in fig. 5, the web page parsing apparatus further includes: and a second parsing unit 500, configured to call a pre-configured public parsing component according to the found call instruction in the template, and process a field that needs to be secondarily parsed in the parsing result by using the public parsing component, so as to obtain a secondary parsing result. The common analysis component refers to an analyzer, and different common analysis components have different analysis capabilities.
The webpage analyzing device comprises a processor and a memory, the preprocessing unit 100, the obtaining unit 200, the searching unit 300, the first analyzing unit 400, the second analyzing unit 500 and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor contains one or more cores, and a core calls the corresponding program unit from the memory. Webpage parsing is carried out by adjusting the core parameters.
The memory may include forms of computer-readable media such as volatile memory, Random Access Memory (RAM), and/or non-volatile memory such as Read-Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The above-described device embodiments are described relatively simply, since they correspond substantially to the method embodiments, and reference may be made to the relevant description of the method embodiments for the relevant points.
An embodiment of the present invention provides a storage medium, on which a program is stored, and when the program is executed by a processor, the program implements any of the above-disclosed web page parsing methods.
The embodiment of the invention provides a processor, which is used for running a program, wherein any one of the disclosed webpage analyzing methods is executed when the program runs.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:
acquiring a webpage analysis request, wherein the webpage analysis request carries a Uniform Resource Locator (URL) of a webpage to be analyzed and a service scene where the webpage to be analyzed is located when the webpage to be analyzed is analyzed;
searching templates which are matched with the service scene and the URL at the same time from all pre-configured templates, wherein the template contents of the templates comprise analysis rules, and different templates have different analysis rules;
and analyzing the webpage to be analyzed by utilizing the searched analysis rule in the template to obtain an analysis result.
Optionally, before searching for a template that matches the service scene and the URL at the same time from pre-configured templates, the method for webpage parsing further includes: the method comprises the steps that all templates are uniformly configured in a database in advance in a preset storage format, wherein the storage format of each template in the database adopts a column type storage format supporting a nested structure, the storage columns of the templates are divided into domain names, service scenes and template objects, and the template objects specifically comprise URL regular matching rules of the templates and template contents.
Optionally, the searching for a template that matches the service scene and the URL at the same time from pre-configured templates specifically includes:
taking the domain name in the URL as a keyword, retrieving the template in the database, and screening out the template corresponding to the domain name in the URL;
taking the service scene as a keyword, carrying out secondary retrieval on the template corresponding to the domain name in the screened URL, and screening out the template corresponding to the service scene;
and matching the URL with the URL regular matching rule of the template corresponding to the screened service scene, and finding out the template successfully matched.
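The three screening stages above (by domain name, then by service scene, then by URL regular matching rule) can be sketched in Python as below. The in-memory list `TEMPLATES` stands in for the database, and all names and sample templates are hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical templates; a plain list stands in for the database.
TEMPLATES = [
    {"domain": "example.com", "scene": "news",
     "url_regex": r"^https?://example\.com/news/?$", "content": {"name": "news-list"}},
    {"domain": "example.com", "scene": "news",
     "url_regex": r"^https?://example\.com/news/\d+$", "content": {"name": "news-detail"}},
    {"domain": "example.com", "scene": "shop",
     "url_regex": r"^https?://example\.com/item/\d+$", "content": {"name": "shop-item"}},
    {"domain": "example.org", "scene": "news",
     "url_regex": r"^https?://example\.org/.*$", "content": {"name": "org-news"}},
]


def find_template(url, scene, templates):
    """Return the first template matching domain, scene and URL regex, or None."""
    domain = urlparse(url).hostname
    # Stage 1: screen by the domain name taken from the URL.
    by_domain = [t for t in templates if t["domain"] == domain]
    # Stage 2: secondary retrieval by service scene.
    by_scene = [t for t in by_domain if t["scene"] == scene]
    # Stage 3: match the URL against each remaining template's regex.
    for template in by_scene:
        if re.search(template["url_regex"], url):
            return template
    return None


found = find_template("https://example.com/news/42", "news", TEMPLATES)
print(found["content"]["name"])
```

Running the cheap equality filters first means the comparatively expensive regular-expression match is only attempted on the few templates that share the request's domain and scene.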
Optionally, after the templates are uniformly configured in the database in advance in the preset storage format, the method further includes: a cache pool is created locally in advance, and a background thread is started at the back end; the background thread is used for periodically updating the template in the database to the cache pool.
Optionally, the template content further includes a call instruction;
correspondingly, after the webpage to be analyzed is analyzed by using the searched analysis rule in the template to obtain an analysis result, the method further comprises the following steps:
calling a pre-configured public analysis component according to the found calling instruction in the template, and processing the field needing secondary analysis in the analysis result by using the public analysis component to obtain a secondary analysis result;
the public analysis component refers to an analyzer, and different public analysis components have different analysis capabilities.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
An embodiment of the present invention further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
acquiring a webpage analysis request, wherein the webpage analysis request carries a Uniform Resource Locator (URL) of a webpage to be analyzed and a service scene where the webpage to be analyzed is located when the webpage to be analyzed is analyzed;
searching templates which are matched with the service scene and the URL at the same time from all pre-configured templates, wherein the template contents of the templates comprise analysis rules, and different templates have different analysis rules;
and analyzing the webpage to be analyzed by utilizing the searched analysis rule in the template to obtain an analysis result.
Optionally, before searching the pre-configured templates for a template that matches both the service scene and the URL, the webpage parsing method further includes: uniformly configuring all templates in a database in advance in a preset storage format, wherein each template is stored in the database in a column-oriented storage format that supports nested structures, the storage columns of a template are divided into domain name, service scene and template object, and the template object specifically comprises the URL regular matching rule of the template and the template content.
Optionally, the searching for a template that matches the service scene and the URL at the same time from pre-configured templates specifically includes:
taking the domain name in the URL as a keyword, retrieving the template in the database, and screening out the template corresponding to the domain name in the URL;
taking the service scene as a keyword, carrying out secondary retrieval on the template corresponding to the domain name in the screened URL, and screening out the template corresponding to the service scene;
and matching the URL with the URL regular matching rule of the template corresponding to the screened service scene, and finding out the template successfully matched.
Optionally, after the templates are uniformly configured in the database in advance in the preset storage format, the method further includes: a cache pool is created locally in advance, and a background thread is started at the back end; the background thread is used for periodically updating the template in the database to the cache pool.
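The local cache pool with a periodically refreshing background thread might look like the following sketch. It assumes the database can be read through a caller-supplied function; `TemplateCache`, `load_templates` and the very short refresh interval are illustrative choices, not details taken from the text.

```python
import threading
import time


class TemplateCache:
    """Local cache pool refreshed from the template store by a background thread."""

    def __init__(self, load_templates, interval_seconds):
        self._load = load_templates          # hypothetical database reader
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._pool = list(load_templates())  # initial fill
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._refresh_loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _refresh_loop(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            fresh = list(self._load())
            with self._lock:
                self._pool = fresh

    def snapshot(self):
        """Return a copy of the cached templates."""
        with self._lock:
            return list(self._pool)


fake_db = [{"domain": "example.com", "scene": "news"}]
cache = TemplateCache(lambda: fake_db, interval_seconds=0.05)
cache.start()
fake_db.append({"domain": "example.org", "scene": "shop"})  # template added at run time
time.sleep(0.5)  # give the background thread time to refresh
refreshed = cache.snapshot()
cache.stop()
print(len(refreshed))
```

In this sketch a template added to the store becomes visible to lookups after the next refresh, without restarting the program that serves parsing requests.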
Optionally, the template content further includes a call instruction;
correspondingly, after the webpage to be analyzed is analyzed by using the searched analysis rule in the template to obtain an analysis result, the method further comprises the following steps:
calling a pre-configured public analysis component according to the found calling instruction in the template, and processing the field needing secondary analysis in the analysis result by using the public analysis component to obtain a secondary analysis result;
the public analysis component refers to an analyzer, and different public analysis components have different analysis capabilities.
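The secondary analysis step can be sketched as a registry of shared parsers selected by the template's call instructions. The instruction format (a mapping from field name to parser name), the two example parsers and the sample values are all assumptions made for illustration.

```python
from datetime import datetime


def parse_date(value):
    """Normalize a day/month/year string to ISO format."""
    return datetime.strptime(value, "%d/%m/%Y").date().isoformat()


def parse_number(value):
    """Convert a number with thousands separators to a float."""
    return float(value.replace(",", ""))


# Hypothetical registry of shared parsing components, each with its own capability.
COMMON_PARSERS = {"date": parse_date, "number": parse_number}


def secondary_parse(result, call_instructions):
    """Run the named component on each field that needs secondary analysis."""
    refined = dict(result)
    for field, parser_name in call_instructions.items():
        if refined.get(field) is not None:
            refined[field] = COMMON_PARSERS[parser_name](refined[field])
    return refined


first_pass = {"title": "Blue Kettle", "price": "1,299.50", "published": "29/08/2017"}
calls = {"price": "number", "published": "date"}  # assumed call-instruction format
second_pass = secondary_parse(first_pass, calls)
print(second_pass)
```

Fields without a call instruction, such as `title` here, pass through unchanged.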
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A method for webpage parsing, comprising:
acquiring a webpage analysis request, wherein the webpage analysis request carries a Uniform Resource Locator (URL) of a webpage to be analyzed and a service scene where the webpage to be analyzed is located when the webpage to be analyzed is analyzed;
searching templates which are matched with the service scene and the URL at the same time from all pre-configured templates, wherein the template contents of the templates comprise analysis rules, and different templates have different analysis rules; based on the preset analysis rules, when different webpages are analyzed on the same platform, for each webpage, the analysis rule matched with the webpage is directly called out from the preset analysis rules to analyze the webpage without restarting an online program to complete the configuration of the analysis rule matched with the webpage;
analyzing the webpage to be analyzed by utilizing the searched analysis rule in the template to obtain an analysis result;
before searching the pre-configured templates for a template that matches both the service scene and the URL, the webpage parsing method further comprises: uniformly configuring the templates in a database in advance in a preset storage format, wherein each template is stored in the database in a column-oriented storage format that supports nested structures, the storage columns are divided into domain name, service scene and template object, and the template object specifically comprises the URL regular matching rule of the template and the template content;
wherein, the searching the template matched with the service scene and the URL at the same time from each pre-configured template comprises:
taking the domain name in the URL as a keyword, retrieving the template in the database, and screening out the template corresponding to the domain name in the URL;
taking the service scene as a keyword, carrying out secondary retrieval on the template corresponding to the domain name in the screened URL, and screening out the template corresponding to the service scene;
and matching the URL with the URL regular matching rule of the template corresponding to the screened service scene, and finding out the template successfully matched.
2. The method for webpage parsing according to claim 1, wherein after the templates are uniformly configured in the database in advance in a preset storage format, the method further comprises:
a cache pool is created locally in advance, and a background thread is started at the back end; the background thread is used for periodically updating the template in the database to the cache pool.
3. The web page parsing method according to any one of claims 1-2, wherein the template content further comprises a call instruction;
correspondingly, after the webpage to be analyzed is analyzed by using the searched analysis rule in the template to obtain an analysis result, the webpage analysis method further comprises the following steps:
calling a pre-configured public analysis component according to the found calling instruction in the template, and processing the field needing secondary analysis in the analysis result by using the public analysis component to obtain a secondary analysis result;
the public analysis component refers to an analyzer, and different public analysis components have different analysis capabilities.
4. A web page parsing apparatus, comprising:
a preprocessing unit, configured to configure each template in advance, wherein the template content of each template comprises an analysis rule, and different templates have different analysis rules; based on the preset analysis rules, when different webpages are analyzed on the same platform, for each webpage, the analysis rule matched with the webpage is directly called out from the preset analysis rules to analyze the webpage without restarting an online program to complete the configuration of the analysis rule matched with the webpage;
an acquisition unit, configured to acquire a webpage analysis request, wherein the webpage analysis request carries a URL of a webpage to be analyzed and a service scene where the webpage to be analyzed is located when the webpage to be analyzed is analyzed;
a searching unit, configured to search all the pre-configured templates for a template that matches both the service scene and the URL;
a first analysis unit, configured to analyze the webpage to be analyzed by utilizing the analysis rule in the template found by the searching unit to obtain an analysis result;
wherein, before the pre-configured templates are searched for a template that matches both the service scene and the URL, the templates are uniformly configured in a database in advance in a preset storage format, wherein each template is stored in the database in a column-oriented storage format that supports nested structures, the storage columns are divided into domain name, service scene and template object, and the template object specifically comprises the URL regular matching rule of the template and the template content;
wherein, the searching the template matched with the service scene and the URL at the same time from each pre-configured template comprises:
taking the domain name in the URL as a keyword, retrieving the template in the database, and screening out the template corresponding to the domain name in the URL;
taking the service scene as a keyword, carrying out secondary retrieval on the template corresponding to the domain name in the screened URL, and screening out the template corresponding to the service scene;
and matching the URL with the URL regular matching rule of the template corresponding to the screened service scene, and finding out the template successfully matched.
5. The web page parsing apparatus according to claim 4, wherein the template content further comprises a call instruction;
correspondingly, the web page parsing device further comprises: the second analysis unit is used for calling a pre-configured public analysis component according to the found calling instruction in the template, and processing the field needing secondary analysis in the analysis result by using the public analysis component to obtain a secondary analysis result;
the public analysis component refers to an analyzer, and different public analysis components have different analysis capabilities.
6. A storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the web page parsing method of any one of claims 1 to 3.
7. A processor for executing a program, wherein the program executes to perform the web page parsing method of any one of claims 1 to 3.
8. An electronic device comprising a processor, a memory, and a program stored on the memory and executable on the processor, wherein the processor implements the web page parsing method of any one of claims 1 to 3 when executing the program.
CN201710758003.5A 2017-08-29 2017-08-29 Webpage parsing method, device, storage medium, processor and equipment Active CN110020236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710758003.5A CN110020236B (en) 2017-08-29 2017-08-29 Webpage parsing method, device, storage medium, processor and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710758003.5A CN110020236B (en) 2017-08-29 2017-08-29 Webpage parsing method, device, storage medium, processor and equipment

Publications (2)

Publication Number Publication Date
CN110020236A CN110020236A (en) 2019-07-16
CN110020236B true CN110020236B (en) 2021-11-30

Family

ID=67186156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710758003.5A Active CN110020236B (en) 2017-08-29 2017-08-29 Webpage parsing method, device, storage medium, processor and equipment

Country Status (1)

Country Link
CN (1) CN110020236B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125565A (en) * 2019-11-01 2020-05-08 上海掌门科技有限公司 Method and equipment for inputting information in application
CN112597410A (en) * 2020-12-10 2021-04-02 北京明朝万达科技股份有限公司 Method and device for performing structured extraction on webpage content based on rule configuration library
CN113867881B (en) * 2021-10-19 2023-01-03 创优数字科技(广东)有限公司 Application home page dynamic display method, device, equipment and medium
CN114692050A (en) * 2022-03-30 2022-07-01 北京金堤科技有限公司 Page parsing method and device, computer readable medium and electronic device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630839A (en) * 2014-11-07 2016-06-01 阿里巴巴集团控股有限公司 Webpage information acquisition method and device
CN106055585A (en) * 2016-05-20 2016-10-26 北京神州绿盟信息安全科技股份有限公司 Log analysis method and apparatus

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101916285B (en) * 2010-08-20 2016-06-08 北京新岸线移动多媒体技术有限公司 A kind of method for analyzing internet web page contents and device
CN102254046A (en) * 2011-08-18 2011-11-23 深圳市融创天下科技股份有限公司 Webpage data acquiring method and system
CN103793461B (en) * 2013-12-02 2017-05-31 北京奇虎科技有限公司 The analysis method and device of info web
CN103761330A (en) * 2014-02-10 2014-04-30 赛特斯信息科技股份有限公司 System and method for achieving automatic Internet information extraction based on template configuration
US9747556B2 (en) * 2014-08-20 2017-08-29 Vertafore, Inc. Automated customized web portal template generation systems and methods
CN104572874B (en) * 2014-12-19 2019-03-05 北京锐安科技有限公司 A kind of abstracting method and device of webpage information
WO2017120360A1 (en) * 2016-01-05 2017-07-13 Quixey, Inc. Computer-automated generation of application deep links

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
《2009 International Conference on Management and Service Science》;J. Hu 等;《2009 International Conference on Management and Service Science》;20091030;1-4 *
基于模板化网络爬虫技术的Web网页信息抽取;乔峰;《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》;20130115(第1期);I139-246 *
网络舆情分析中网页信息预处理方案的实现;李舒晨 等;《电脑与电信》;20081010(第10期);30-33 *

Also Published As

Publication number Publication date
CN110020236A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN107038207B (en) Data query method, data processing method and device
CN107562467B (en) Page rendering method, device and equipment
CN109582909B (en) Webpage automatic generation method and device, electronic equipment and storage medium
CN110020236B (en) Webpage parsing method, device, storage medium, processor and equipment
EP3172680B1 (en) Fast rendering of websites containing dynamic content and stale content
CN110442330B (en) List component conversion method and device, electronic equipment and storage medium
CN106909361B (en) Web development method and device based on template engine
US20130185429A1 (en) Processing Store Visiting Data
CN110969022B (en) Semantic determining method and related equipment
CN109977312B (en) Knowledge base recommendation system based on content tags
JP2020170538A (en) Method, apparatus and program for processing search data
CN110968314A (en) Page generation method and device
CN107391528A (en) Front end assemblies Dependency Specification searching method and equipment
CN104899217A (en) Method and apparatus for implementing customized function
CN103914479A (en) Resource request matching method and device
CN108121712B (en) Keyword storage method and device
CN111125087B (en) Data storage method and device
CN110851746B (en) Crawler seed generation method and device
CN115437930A (en) Identification method of webpage application fingerprint information and related equipment
CN109710833B (en) Method and apparatus for determining content node
CN109635175B (en) Page data splicing method and device, readable storage medium and electronic equipment
CN108009171B (en) Method and device for extracting content data
CN105635236A (en) Page rendering method, device and system
CN111078905A (en) Data processing method, device, medium and equipment
CN110955429B (en) Data analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant