CN112231481A

CN112231481A - Website classification method and device, computer equipment and storage medium

Info

Publication number: CN112231481A
Application number: CN202011155971.5A
Authority: CN
Inventors: 邹安宁
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-10-26
Filing date: 2020-10-26
Publication date: 2021-01-15

Abstract

The application relates to a website classification method, a website classification device, computer equipment and a storage medium. The method comprises the following steps: acquiring a website to be classified, and performing field segmentation on the website to be classified to obtain a target field corresponding to the first node relation; acquiring a target clustering structure tree; matching each target field with each node in the target clustering structure tree layer by layer according to the first node relation; when each target field is matched with each node in the target path, determining the website category of the website to be classified according to the target path; the target path is a path from a root node to a leaf node in the target clustering structure tree. According to the scheme, parallel computation can be directly realized based on the target clustering structure tree, so that the websites to be classified are classified, the website classification result of the websites to be classified is rapidly output, and the website classification efficiency is effectively guaranteed.

Description

Website classification method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of network technologies, and in particular, to a method and an apparatus for classifying websites, a computer device, and a storage medium.

Background

To continuously improve the speed and user experience of Web (World Wide Web) applications, users' HTTP (HyperText Transfer Protocol) requests and similar data are generally collected and analyzed, so that weaknesses in the applications can be identified more precisely and targeted optimization can then be carried out.

For this purpose, websites such as URLs (Uniform Resource Locators) need to be classified. In the conventional technology, websites are clustered by means of algorithms such as K-means clustering. Generally, the websites are converted into a matrix, similarities are calculated on that matrix, and the websites are clustered according to the similarities.

However, when the number of websites is very large, this conversion incurs a large performance cost in the conventional technology, which makes website classification inefficient.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device and a storage medium for classifying websites, which can improve the efficiency of website classification.

A method for classifying a web address, the method comprising:

acquiring a website to be classified, and performing field segmentation on the website to be classified to obtain a target field corresponding to a first node relation;

acquiring a target clustering structure tree; the target clustering structure tree is obtained by combining nodes of a first target level, the number of which meets the condition, in a sample structure tree, wherein the sample structure tree is a structure tree constructed according to fields corresponding to sample websites;

matching each target field with each node in the target clustering structure tree layer by layer according to the first node relation;

when each target field is matched with each node in a target path, determining the website category of the website to be classified according to the target path; the target path is a path from a root node to a leaf node in the target clustering structure tree.

An apparatus for classifying web addresses, the apparatus comprising:

the to-be-classified website acquisition module is used for acquiring a website to be classified and performing field segmentation on the website to be classified to obtain a target field corresponding to a first node relation;

the structure tree acquisition module is used for acquiring a target clustering structure tree; the target clustering structure tree is obtained by combining nodes of a first target level, the number of which meets the condition, in a sample structure tree, wherein the sample structure tree is a structure tree constructed according to fields corresponding to sample websites;

the layer-by-layer matching module is used for matching each target field with each node in the target clustering structure tree layer by layer according to the first node relation;

the website category determining module is used for determining the website category of the website to be classified according to the target path when each target field is matched with each node in the target path; the target path is a path from a root node to a leaf node in the target clustering structure tree.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the above website classification method when executing the computer program.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the above website classification method.

According to the website classification method, the website classification device, the computer equipment and the storage medium, the target field corresponding to the first node relation is obtained after the field segmentation is carried out on the website to be classified; acquiring a target clustering structure tree; matching each target field with each node in the target clustering structure tree layer by layer according to the first node relation; and when each target field is matched with a target path from the root node to the leaf node in the target clustering structure tree, determining the website category of the website to be classified according to the target path. According to the technical scheme, the websites to be classified can be directly classified based on the pre-constructed target clustering structure tree, the website classification result of the websites to be classified is rapidly output, and the website classification efficiency is effectively guaranteed.

Drawings

FIG. 1 is a diagram illustrating an exemplary environment in which the method for classifying web addresses may be implemented;

FIG. 2 is a flowchart illustrating a method for classifying web addresses according to an embodiment;

FIG. 3 is a flowchart illustrating web site clustering according to an embodiment;

FIG. 4 is a schematic diagram illustrating a process of website clustering in another embodiment;

FIG. 5 is a flowchart illustrating web site clustering in yet another embodiment;

FIG. 6 is a flowchart illustrating website clustering in yet another embodiment;

FIG. 7 is a flowchart illustrating a process for a server to obtain a web address according to an embodiment;

FIG. 8 is a flowchart illustrating website clustering in accordance with yet another embodiment;

FIG. 9 is a diagram illustrating an exemplary window for retrieving website search information;

FIG. 10 is a diagram illustrating an interface of analysis results of website address response status in one embodiment;

FIG. 11 is a diagram illustrating an interface of analysis results of website address response status in another embodiment;

FIG. 12 is a block diagram showing the structure of a website address classifying device according to an embodiment;

FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The website classification method, device, computer equipment and storage medium can be implemented based on cloud technology. Cloud technology is a hosting technology that unifies resources such as hardware, software and networks within a wide area network or a local area network to realize the computation, storage, processing and sharing of data. The background services of network systems, such as video websites, picture websites and web portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may come to have its own identifier that needs to be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data require strong backend system support, which can only be realized through cloud computing.

Various embodiments of the present application may further be implemented based on cloud computing. Cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space and information services as needed. The network that provides the resources is referred to as the "cloud". To the user, resources in the "cloud" appear to be infinitely expandable, available at any time, used on demand and paid for according to use.

A basic capability provider of cloud computing establishes a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform for short), in which multiple types of virtual resources are deployed for external clients to select and use.

The website classification method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 101 communicates with the server 102 via a network. The server 102 receives the websites to be classified sent by the terminal 101, classifies the websites to be classified based on a target clustering structure tree which is constructed in advance, and outputs the website categories of the websites to be classified. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

In an embodiment, as shown in fig. 2, a method for classifying web addresses is provided. This embodiment is illustrated by applying the method to a server; it is to be understood that the method may also be applied to a terminal, or to a system including the terminal and the server, in which case it is implemented through interaction between the terminal and the server. The method comprises the following steps:

S201, acquiring a website to be classified, and performing field segmentation on the website to be classified to obtain a target field corresponding to a first node relation.

The website address, also called a web page address, refers to the address of a web page on the internet. In one embodiment, the website address may be a Uniform Resource Locator (URL). A uniform resource locator is the address of a standard resource on the internet, much like a house number on a network. It was originally invented by Tim Berners-Lee as the addressing scheme of the World Wide Web and has since been specified in RFC 1738.

For the website category, a character string formed by aggregating a plurality of websites together may be referred to as a clustered website, and one clustered website may be referred to as one website category. A web site category may correspond to at least one web site that matches it. When the website of a certain category needs to be analyzed, the matched website corresponding to the website can be found out through the website category, and the responsiveness analysis is performed on the matched websites, so that the responsiveness analysis result of the category can be obtained.

The website to be classified may refer to a website that has not yet been classified. It may be a website extracted from a request for accessing a certain application platform in a historical time period or at the current time. There may be one website to be classified or more than one. When there is more than one, the websites to be classified can be classified synchronously and their classification results output together. An application platform may refer to a platform capable of implementing specific application functions, such as a video playing platform.

In one embodiment, the structure of the website to be classified may be as follows:

[protocol type]://[server address]:[port number]/[resource-level UNIX file path][file name]?[query]#[fragment ID]

Here, [protocol type] may be a protocol, such as HTTP, that conforms to the above structure. [query] is optional and is used to pass parameters to a dynamic web page; more than one parameter may be passed, in which case the parameters are separated by the "&" symbol and the name and value of each parameter are separated by the "=" symbol.

In one embodiment, the obtaining of the target field corresponding to the node relationship after performing field segmentation on the website to be classified includes: acquiring a segmentation identifier; and segmenting the website to be classified into at least one field by taking the segmentation identifier as a separator to obtain the target field. Namely, the website to be classified is divided into at least one character string by separators, and each character string is used as a target field. When the number of the websites to be classified is more than one, the websites to be classified can be respectively subjected to field segmentation to obtain a target field corresponding to each website to be classified.

The segmentation identifier may be "/", "&" or "?". The resulting target fields may form a list of target fields in which the fields are separated by commas.

Taking https://demo.qq.com/api/adddata/1?key1=abc&key2=def as an example, the list of target fields obtained after segmentation is: https:, demo.qq.com, api, adddata, 1, key1=abc, key2=def. Here, "https:" is the first field, i.e., the field of the first level, "demo.qq.com" is the second field, i.e., the field of the second level, and so on.
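The segmentation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `segment` is hypothetical, the example URL is taken to be https://demo.qq.com/api/adddata/1?key1=abc&key2=def, and empty strings produced by the double slash after the protocol are assumed to be discarded.

```python
import re

# Separators named in the text: "/", "&" and "?".
SPLIT_PATTERN = re.compile(r"[/&?]")


def segment(url: str) -> list[str]:
    """Split a URL into its ordered, non-empty fields (hypothetical helper)."""
    # Assumption: empty pieces (e.g. between the two slashes of "//") are dropped.
    return [field for field in SPLIT_PATTERN.split(url) if field]


fields = segment("https://demo.qq.com/api/adddata/1?key1=abc&key2=def")
print(", ".join(fields))
```

The position of each field in the returned list is its field level, so the list order carries the node relation used in the later matching steps.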

In computer science, a tree is an abstract data type, or a data structure that implements such an abstract data type, used to model a collection of data that has a tree-like structure. It is a set with hierarchical relationships composed of n (n > 0) finite nodes. It is called a "tree" because it looks like an inverted tree, with the root at the top and the leaves at the bottom. The structure tree refers to a tree-based data structure, and may be a binary tree.

In one embodiment, each target field may have a correspondence with a node in the structure tree, which is referred to as a node relationship. For example: according to the position of the field (which can be the front-back sequencing position in the website), the first field corresponds to the root node in the structure tree, the subsequent fields sequentially correspond to the child nodes of the existing node, and the last field corresponds to the leaf node in the structure tree. Therefore, based on the node relationship, the target field can be mapped into the structure tree and the matching judgment can be performed with the structure tree.

S202, acquiring a target clustering structure tree; the target clustering structure tree is obtained by combining the nodes of the first target level with the number of the nodes meeting the conditions in the sample structure tree, and the sample structure tree is a structure tree constructed according to the fields corresponding to the sample websites.

The sample web address refers to a web address used to construct the sample structure tree. In one embodiment, the sample web address may be a web address extracted based on a request to access an application platform within a historical period of time.

The web address extracted from the access request may be stored in a pre-constructed set of web addresses. When the sample structure tree needs to be constructed, a set number of websites in the website set can be determined as sample websites. In one embodiment, the remaining websites can be sequentially used as the websites to be classified from the website collection, and the websites to be classified are classified according to the matching relationship between the target field of the websites to be classified and the target clustering structure tree. The process of adding websites to the website collection may be: the front-end monitoring system monitors websites aiming at a specific application platform in real time and sends the websites to the server; the server stores the information into the website collection according to a specific structure.

As described above, the sample structure tree is the structure tree constructed from the sample fields, i.e., the fields obtained by segmenting the sample websites. The sample fields are taken in turn as nodes of the structure tree according to the node relation, and the data structure formed by these nodes is the sample structure tree.
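One possible way to build such a tree is shown below as a minimal sketch, not as the implementation of the application. It assumes the tree is stored as nested dictionaries, where each key is a field and each value holds that node's children; the outermost dictionary is a container for the first-level nodes, and a node with an empty dictionary is a leaf. The names `add_fields` and `build_sample_tree` are hypothetical.

```python
def add_fields(tree: dict, fields: list[str]) -> None:
    """Insert one list of fields as a path, creating missing nodes on the way."""
    node = tree
    for field in fields:
        # setdefault reuses an existing node or creates a new empty one.
        node = node.setdefault(field, {})


def build_sample_tree(sample_fields: list[list[str]]) -> dict:
    """Build a sample structure tree from the field lists of the sample URLs."""
    tree: dict = {}
    for fields in sample_fields:
        add_fields(tree, fields)
    return tree


sample_tree = build_sample_tree([
    ["https:", "demo.qq.com", "api", "adddata", "1"],
    ["https:", "demo.qq.com", "api", "adddata", "2"],
])
print(sample_tree)
```

Sample websites that share leading fields share the corresponding nodes, so the tree branches only where the websites start to differ.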

In one embodiment, the sample structure tree includes at least one level, and each level has at least one node. According to the embodiment of the invention, the target clustering structure tree is obtained by merging the nodes of the first target level in the sample structure tree. There may be more than one first target level. Specifically, the sample structure tree may be scanned downward layer by layer from its root node. When a level L1 is scanned, the number M1 of nodes in L1 is determined; when M1 satisfies the condition, L1 is determined to be a first target level, the nodes in L1 are merged into a node P, and the subsequent nodes are adjusted by taking the child nodes of every node in L1 as child nodes of P. The node merging of level L1 is then complete. The next level L2 is scanned and merged in the same way, and so on until all levels in the sample structure tree have been scanned. It should be noted that, in the embodiment of the invention, the process of merging nodes may be referred to as a pruning process. After pruning is complete, a certain (often larger) number of websites can be regarded as having been clustered into another (often smaller) number of clustered websites.
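The level-by-level merge can be illustrated with the following sketch. It assumes the nested-dictionary tree representation (field mapped to its children, empty dictionary for a leaf), a fixed node number threshold, and a hypothetical marker `ANY` standing for the merged node P. All names are illustrative and not taken from the application.

```python
ANY = "*"  # hypothetical marker for the merged node P


def deep_merge(dst: dict, src: dict) -> None:
    """Recursively copy the subtree src into dst, sharing equal fields."""
    for field, children in src.items():
        deep_merge(dst.setdefault(field, {}), children)


def merge_levels(tree: dict, threshold: int) -> dict:
    """Scan the tree level by level and merge every level whose node count
    exceeds the threshold into a single node P (stored under ANY)."""
    level = [tree] if tree else []  # child maps of the nodes one level up
    while level:
        node_count = sum(len(children) for children in level)
        if node_count > threshold:
            merged: dict = {}  # will hold the children of P
            for children in level:
                for grandchildren in children.values():
                    deep_merge(merged, grandchildren)
                children.clear()
                children[ANY] = merged  # every parent now points to P
            next_level = [merged]
        else:
            next_level = [sub for children in level for sub in children.values()]
        # Leaves have no children and are not scanned further.
        level = [children for children in next_level if children]
    return tree


tree = {"https:": {"demo.qq.com": {"api": {"adddata": {"1": {}, "2": {}, "3": {}}}}}}
print(merge_levels(tree, threshold=2))
```

In this example only the fifth level holds more than two nodes, so the three numeric fields collapse into one node and the three sample websites become a single clustered website.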

In one embodiment, the condition that the number of nodes of the first target level needs to satisfy may be that the number of nodes is greater than a node number threshold. The node number threshold may be a fixed value, e.g., 2, 5 or 10. Alternatively, the node number threshold may not be fixed but positively correlated with the number of sample websites, i.e., it changes as the number of sample websites changes. Specifically, the product of the number of sample websites and a specific scaling factor is determined as the node number threshold, and when the number of nodes of the first target level is greater than or equal to this product, the number of nodes of the first target level is considered to satisfy the condition.
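The two threshold variants can be written as one small predicate. This is a sketch only; the function name and parameter names are hypothetical, and the comparison operators follow the wording above (strictly greater for a fixed threshold, greater than or equal for the proportional one).

```python
from typing import Optional


def level_meets_condition(node_count: int,
                          fixed_threshold: Optional[int] = None,
                          sample_count: Optional[int] = None,
                          ratio: Optional[float] = None) -> bool:
    """Decide whether a level holds enough nodes to be merged."""
    if fixed_threshold is not None:
        # Fixed variant: the node count must exceed the threshold.
        return node_count > fixed_threshold
    if sample_count is None or ratio is None:
        raise ValueError("give fixed_threshold, or sample_count and ratio")
    # Proportional variant: the threshold grows with the number of samples.
    return node_count >= sample_count * ratio


print(level_meets_condition(6, fixed_threshold=5))
print(level_meets_condition(5, sample_count=10, ratio=0.5))
```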

The target clustering structure tree is a structure tree obtained by carrying out node clustering processing on a pre-constructed structure tree. Specifically, the structure tree obtained after the node clustering process is performed on the sample structure tree is determined as a target clustering structure tree.

In one embodiment, the target clustering structure tree may be acquired in at least one of the following ways: 1. a pre-constructed structure tree is directly determined as the target clustering structure tree; 2. after the websites to be classified are obtained, clustering is performed based on the sample structure tree to obtain the target clustering structure tree.

S203, matching each target field with each node in the target clustering structure tree layer by layer according to the first node relation.

In this step, the nodes matching the target fields are determined in the target clustering structure tree in sequence according to the first node relation, and matching proceeds layer by layer. Here, matching may refer to determining whether a field is identical to the character string corresponding to a node, or whether the field contains the character string corresponding to the node. The embodiment of the invention operates directly on the original character strings and does not need to construct a matrix to calculate similarity, so node matching performs very well.

In one embodiment, a field hierarchy corresponding to each target field (a hierarchy in which the target field is located may be referred to as a field hierarchy) may be determined, a node of a structure tree hierarchy corresponding to the field hierarchy (each hierarchy in the structure tree may be referred to as a structure tree hierarchy) is determined in the target clustering structure tree, whether the corresponding target field and the node match or not is determined, and when the corresponding target field and the node match, a matching determination of a next hierarchy is performed until a target field of a last field hierarchy is reached. It should be noted that the field hierarchy and the structure tree hierarchy may be collectively referred to as a hierarchy, and in some embodiments, the field hierarchy and the structure tree hierarchy are not distinguished for convenience of description, and it should be understood that the hierarchy corresponding to a field refers to a field hierarchy, and the hierarchy associated with a structure tree refers to a structure tree hierarchy.

In one embodiment, if there are a plurality of websites to be classified, matching judgment is performed on a target field corresponding to each website to be classified and a target clustering structure tree in sequence, so as to obtain matching judgment results of the websites to be classified, and then website categories corresponding to the websites to be classified are obtained based on the matching judgment results.

S204, when each target field is matched with each node in a target path, determining the website category of the website to be classified according to the target path; the target path is a path from a root node to a leaf node in the target clustering structure tree.

When each target field can be sequentially matched with a leaf node from a root node of the target clustering structure tree, determining the path from the root node to the leaf node as a target path, judging that each target field is matched with each node in the target path at the moment, and further determining the website category of the website to be classified according to the target path.

In one embodiment, the website category of the website to be classified may be determined according to the target path as follows: the target clustered website corresponding to the target path is determined, and this target clustered website is determined as the website category of the website to be classified.

In the website classification method, a target field corresponding to a first node relation is obtained after field segmentation is carried out on the website to be classified; acquiring a target clustering structure tree; matching each target field with each node in the target clustering structure tree layer by layer according to the first node relation; and when each target field is matched with a target path from the root node to the leaf node in the target clustering structure tree, determining the website category of the website to be classified according to the target path.

In one embodiment, the target clustering structure tree is a structure tree clustered in advance on the basis of the sample structure tree. Conventional clustering methods suffer from catastrophic forgetting: a clustering algorithm processes data in batches, i.e., all the training data is given at once, so when new data arrives after model training is completed, all data must be merged and the model retrained. The new model thereby loses the information of the previously trained old model, and performance is lost. The embodiment of the invention adopts a progressive classification strategy: the websites to be classified are classified directly on the basis of the target clustering structure tree, so that the historical clustering result takes part directly in the current classification, which reduces the amount of computation and effectively guarantees the efficiency of website classification.

In an embodiment, the step of matching, layer by layer, each of the target fields with each of the nodes in the target clustering structure tree according to the first node relationship includes: determining nodes matched with the target fields layer by layer in the target clustering structure tree according to the first node relation to obtain target nodes; and when each target node forms a path from a root node to a leaf node in the target clustering structure tree, judging that each target field is matched with each node in the target path.

In one embodiment, the process of layer-by-layer matching may be: acquiring a first field to be matched in the target field; determining a first to-be-matched structure tree level corresponding to the level of the first to-be-matched field in the target clustering structure tree according to the first node relation; when the first field to be matched is determined to be matched with a node in the first structure tree hierarchy to be matched, determining a second structure tree hierarchy to be matched corresponding to the hierarchy of a second field to be matched in the target clustering structure tree according to the first node relation; the second field to be matched is a field next to the first field to be matched; and when the second field to be matched is matched with a node in the second structure tree level to be matched and the second structure tree level to be matched is a level where a leaf node is located, judging that each target field is matched with each node in a target path.

Specifically, the first-level matching judgment may proceed as follows: a first target field is acquired and matched against the root node of the target clustering structure tree. When the first target field matches the root node, the second-level matching judgment is performed: a second target field is acquired and matched against the child nodes of the root node in the target clustering structure tree. When the second target field matches a child node of the root node, the third-level matching judgment is performed, and so on until the target field of the last level has been judged. When the matching judgment of a certain level fails, the matching judgment of the next level is no longer performed. For example, when the second target field does not match any child node of the root node, the matching judgment process ends; at this time, the website to be classified can be used as a new website category, and indication information that the classification of the website to be classified was unsuccessful is output.
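The layer-by-layer judgment with early termination can be sketched as follows. The sketch assumes the nested-dictionary tree representation (field mapped to its children, empty dictionary for a leaf) and uses exact string equality as the matching rule, which is only one of the options the description allows. The name `match_path` is hypothetical.

```python
from typing import Optional


def match_path(tree: dict, fields: list[str]) -> Optional[list[str]]:
    """Return the matched path if the fields run from the first level down
    to a leaf, otherwise None."""
    node = tree
    path: list[str] = []
    for field in fields:
        if field not in node:
            return None  # this level failed, deeper levels are not examined
        path.append(field)
        node = node[field]
    # A target path must end at a leaf, i.e. a node without children.
    return path if path and not node else None


tree = {"https:": {"demo.qq.com": {"api": {"adddata": {}}}}}
print(match_path(tree, ["https:", "demo.qq.com", "api", "adddata"]))
print(match_path(tree, ["https:", "other.example", "api", "adddata"]))
```

A return value of `None` corresponds to the "classification unsuccessful" case described above.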

According to the embodiment, the target fields are matched with the nodes of the target clustering structure tree layer by layer, the target paths matched with all the target fields can be accurately determined, the websites to be classified are classified into the website categories corresponding to the target paths, and accurate website classification results can be obtained.

In an embodiment, after determining, layer by layer, nodes matched with the target fields in the target clustering structure tree according to the first node relationship, and obtaining target nodes, the method further includes: when each target node cannot form a path from a root node to a leaf node in the target clustering structure tree, each target field is superposed into the target clustering structure tree according to the first node relation to obtain a structure tree to be merged; when a clustering instruction is obtained, merging nodes of a second target level of the structure tree to be merged, wherein the number of the nodes meets the condition, so as to obtain a clustering updating structure tree; and the clustering updating structure tree is used for classifying the new website to be classified when the new website to be classified is obtained.

As described in the foregoing embodiment, when a node path (i.e., a target path) matching each target field exists in the target clustering structure tree, the website category of the website to be classified is obtained according to the website category corresponding to the node path. For another situation, part of the target fields cannot be matched with the nodes in the target clustering structure tree, for example, the first d target fields are matched with the nodes in the target clustering structure tree, and the (d + 1) th target field is not matched with the nodes in the target clustering structure tree, at this time, the subsequent level matching judgment is not performed, but each target field is added into the target clustering structure tree, and the structure tree to be merged is obtained. The structure tree to be merged may have at least one more branch path (i.e., a path formed by a target field that cannot match a node in the target clustering structure tree) compared to the target clustering structure tree.
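A minimal sketch of this fallback is given below, again assuming the nested-dictionary tree representation and exact string matching. The function name `classify_or_superpose` is hypothetical. When the (d + 1)-th field fails to match, the remaining fields are attached as a new branch below the last matched node.

```python
def classify_or_superpose(tree: dict, fields: list[str]) -> bool:
    """Return True if the fields match a path ending at a leaf; otherwise
    superpose the unmatched fields onto the tree and return False."""
    node = tree
    for depth, field in enumerate(fields):
        if field not in node:
            # The first `depth` fields matched; add the rest as a new branch.
            for rest in fields[depth:]:
                node = node.setdefault(rest, {})
            return False
        node = node[field]
    # All fields matched; classification succeeds only at a leaf.
    return not node


tree = {"https:": {"demo.qq.com": {"api": {}}}}
print(classify_or_superpose(tree, ["https:", "demo.qq.com", "img", "1"]))
print(tree)
```

After the call, the tree holds one more branch than before and plays the role of the structure tree to be merged.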

Further, when the clustering instruction is obtained, the structure tree to be merged may be clustered in the same way as the target clustering structure tree is obtained from the sample structure tree. That is, the nodes of the second target level in the structure tree to be merged are merged, and the structure tree obtained after the nodes of all levels have been merged is the cluster update structure tree.

In one embodiment, the clustering instruction may be a trigger instruction that is automatically generated by the server when a clustering condition is met. Specifically, the server may generate the clustering instruction under at least one of the following conditions: 1. when the number of websites corresponding to nodes in the cluster update structure tree that have not undergone node merging is greater than a preset website number threshold (which can be determined according to actual conditions and is not limited in this embodiment of the invention), the clustering condition is considered to be met and a clustering instruction is generated; 2. when the time elapsed since the last clustering exceeds a preset time period threshold (which can be determined according to actual conditions and is not limited in this embodiment of the invention), the clustering condition is considered to be met and a clustering instruction is generated; 3. when the number of websites in the website set that have not undergone clustering is greater than a preset website number threshold, the clustering condition is considered to be met and a clustering instruction is generated. In some embodiments, the clustering instruction may not be generated by the server. In that case, the clustering instruction may be obtained as follows: the server receives a clustering instruction sent by an input device connected to it. For example, the input device is a user terminal, and the user terminal sends the clustering instruction to the server when it receives a clustering trigger operation performed by a user through an interface.
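The automatic trigger can be sketched as a small predicate that combines a count-based condition with the time-based condition. The class name, field names and threshold values below are hypothetical, and the two count-based conditions of the description are represented by a single pending-website counter for brevity.

```python
from dataclasses import dataclass


@dataclass
class ClusterTrigger:
    max_unclustered: int   # hypothetical threshold on websites awaiting clustering
    max_interval_s: float  # hypothetical threshold on time since the last clustering

    def should_cluster(self, unclustered: int, last_run: float, now: float) -> bool:
        """Return True when at least one clustering condition is met."""
        too_many = unclustered > self.max_unclustered
        too_long = (now - last_run) > self.max_interval_s
        return too_many or too_long


trigger = ClusterTrigger(max_unclustered=100, max_interval_s=3600.0)
print(trigger.should_cluster(unclustered=101, last_run=0.0, now=10.0))
print(trigger.should_cluster(unclustered=5, last_run=0.0, now=10.0))
```

Timestamps are passed in explicitly so that the check stays deterministic; a caller could supply values from a clock of its choice.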

In one embodiment, when each target node cannot form a path from the root node to a leaf node in the target clustering structure tree, the website to be classified may be added to the website set, but not to the target clustering structure tree. And when the number of the un-clustered websites in the website set meets the condition, the server considers that the clustering condition is met, and generates a clustering instruction.

According to the embodiment, when the classification of the websites to be classified fails, the websites to be classified are recorded, the next clustering is waited for, the websites to be classified can be classified as much as possible, and the clustering structure tree is continuously updated, so that the accurate classification of the follow-up websites to be classified is ensured.

In one embodiment, before the obtaining the target clustering structure tree, the method further includes: acquiring a sample website, and performing field segmentation on the sample website to obtain a sample field corresponding to a second node relation; constructing a structure tree for each sample field according to the second node relation to obtain the sample structure tree; performing node merging operation on the sample structure tree layer by layer so as to merge nodes in the first target level into any matching node when the first target level is accessed, and merge sub-nodes of each node in the first target level into sub-nodes of the any matching node; the arbitrary matching node can be matched with an arbitrary node; and taking the sample structure tree after the nodes are combined as the target clustering structure tree.

The arbitrary matching node in the target clustering structure tree can be represented by a symbol such as "*".

In one embodiment, when the website to be classified is classified according to the target clustering structure tree and it needs to be judged whether a certain target field matches an arbitrary matching node in the target clustering structure tree, the match is directly judged to succeed, and the matching of the next level can then be performed.
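As a non-limiting illustration, layer-by-layer matching against a tree that contains arbitrary matching nodes may be sketched as follows. The nested-dictionary representation of the tree, the names `match_path` and `EXAMPLE_TREE`, and the choice to try the exact node before the arbitrary matching node are assumptions of this sketch; for brevity the example tree contains path fields only and omits query parameters.

```python
from typing import Dict, List, Optional

Tree = Dict[str, "Tree"]  # field -> child nodes; an empty dict is a leaf
WILDCARD = "*"            # symbol assumed for the arbitrary matching node

# Hypothetical clustered tree with one root-to-leaf path.
EXAMPLE_TREE: Tree = {"https:": {"demo.qq.com": {"api": {"adddata": {"*": {}}}}}}


def match_path(tree: Tree, fields: List[str]) -> Optional[List[str]]:
    """Return the root-to-leaf path matched by `fields`, or None on failure."""
    if not fields:
        # All fields are consumed: succeed only if the last node was a leaf.
        return [] if not tree else None
    head, rest = fields[0], fields[1:]
    # An arbitrary matching node matches any field, so it is always a candidate.
    for key in (head, WILDCARD):
        if key in tree:
            tail = match_path(tree[key], rest)
            if tail is not None:
                return [key] + tail
    return None
```

With this sketch, the fields of a website ending in ".../adddata/42" yield the path ending in "*", whereas a website that stops at "api" does not reach a leaf node and is therefore not classified.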

In an embodiment, the field segmentation of the sample website may be similar to the segmentation of the website to be classified, that is, the field segmentation of the sample website to obtain the sample field includes: acquiring a segmentation identifier; and cutting the sample website into at least one field by taking the cutting identifier as a separator to obtain a sample field corresponding to the second node relation.
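As a non-limiting illustration, the field segmentation may be sketched as follows. The function name `split_url` and the default set of segmentation identifiers ("/", "?" and "&") are assumptions of this sketch; empty fields, such as the one between the two slashes after the protocol, are discarded here.

```python
import re
from typing import List


def split_url(url: str, delimiters: str = "/?&") -> List[str]:
    """Cut a website into fields, using each delimiter character as a separator."""
    pattern = "[" + re.escape(delimiters) + "]"
    # Drop the empty strings produced by adjacent delimiters such as "//".
    return [field for field in re.split(pattern, url) if field]
```

The order of the returned fields reflects the node relation: each field is the child of the field that precedes it, except for the query fields, which are parallel to one another.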

In an embodiment, the constructing the structure tree for each sample field according to the second node relationship to obtain the sample structure tree includes: sequentially configuring each sample field as a node in the sample structure tree according to the second node relationship; wherein the same sample field shares a node in the sample structure tree.

Specifically, for the sample fields of a certain sample website W1, the first sample field may be used as the root node of a sample structure tree, the second sample field may be used as a child node of the root node, and the subsequent sample fields are sequentially added to the sample structure tree, each as a child node of the preceding node, to obtain an existing sample structure tree T. The first sample field of another sample website W2 is then compared with the root node of the existing sample structure tree T. If the two are inconsistent, the first sample field is used as a new node, and the subsequent sample fields are respectively used as child nodes under the new node. If the two are consistent, the root node is shared, and the second sample field is then compared with the nodes of the second level in the existing sample structure tree T: if it is inconsistent with all of them, the second sample field is used as a new node of the second level; if it is consistent with one of them, that existing node of the second level is shared. These steps are repeated until the sample fields corresponding to all the sample websites have been added to the sample structure tree.
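As a non-limiting illustration, the construction of a sample structure tree in which identical fields share a node may be sketched with nested dictionaries. The representation and the names `add_fields` and `build_sample_tree` are assumptions of this sketch.

```python
from typing import Dict, List

Tree = Dict[str, "Tree"]  # field -> child nodes


def add_fields(tree: Tree, fields: List[str]) -> None:
    """Add one website's fields to the tree, sharing nodes that already exist."""
    node = tree
    for field in fields:
        # setdefault reuses the existing node when the field is consistent,
        # and creates a new node at this level otherwise.
        node = node.setdefault(field, {})


def build_sample_tree(all_fields: List[List[str]]) -> Tree:
    """Build the sample structure tree from the fields of every sample website."""
    tree: Tree = {}
    for fields in all_fields:
        add_fields(tree, fields)
    return tree
```

Two websites that differ only in their last path field therefore share every node down to the parent of that field and contribute two sibling nodes beneath it.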

In an embodiment, the constructing the structure tree for each sample field according to the second node relationship to obtain the sample structure tree includes: and when the sample field contains the query field, taking each parameter value of the query field as a leaf node parallel to the sample structure tree.

In one embodiment, the query field is used to deliver parameters to the dynamic web page, and the parameters are often in a parallel relationship; therefore, all parameter names in the query field can be set as child nodes of the last path node, and the parameter values can be set as child nodes of the corresponding parameter names.
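As a non-limiting illustration, the handling of the query field may be sketched as follows. The function name `add_url`, the nested-dictionary tree and the assumption that each query field has the form "name=value" are introduced for this sketch only.

```python
from typing import Dict, List

Tree = Dict[str, "Tree"]  # field -> child nodes


def add_url(tree: Tree, path_fields: List[str], query_fields: List[str]) -> None:
    """Add path fields as a chain, then attach query parameters in parallel."""
    node = tree
    for field in path_fields:
        node = node.setdefault(field, {})
    # `node` is now the last path node: every parameter name becomes one of its
    # child nodes, and every parameter value a child node of its parameter name.
    for pair in query_fields:
        name, _, value = pair.partition("=")
        node.setdefault(name, {}).setdefault(value, {})
```

Because the parameter names are siblings rather than a chain, the order in which the parameters appear in the query does not change the resulting tree.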

In one embodiment, the following sample web addresses are assumed:

https://demo.qq.com/api/adddata/1?key1=abc&key2=def

https://demo.qq.com/api/adddata/2?key1=ghi&key2=jkm

Assume further that nodes are merged only when the number of nodes reaches the node number threshold, which is set to 2 in this example.

The implementation process of constructing the target clustering structure tree based on the sample website is described as follows:

1. the sample website is segmented to obtain sample fields, all the sample fields are arranged together and separated by commas, and the obtained results are as follows:

https:，demo.qq.com，api，adddata，1，key1＝abc，key2＝def

https:，demo.qq.com，api，adddata，2，key1＝ghi，key2＝jkm

2. and constructing a structure tree based on the sample field, wherein the obtained sample structure tree is shown in FIG. 3.

3. Pruning stage and derivation stage

The sample structure tree is traversed layer by layer. The levels "https:", "demo.qq.com", "api" and "adddata" each contain only one node, which is fewer than the node number threshold 2, so no pruning is needed.

When traversing to the child nodes "1" and "2" of the "adddata" node, it is found that "adddata" has two child nodes, and the number of child nodes is greater than or equal to the node number threshold 2, so pruning can be performed: the two nodes "1" and "2" are merged into one node "*", and the child nodes of the two nodes are adjusted (identical child nodes are integrated into one child node). The obtained first pruning result is shown in fig. 4.

Continuing to scan the subsequent levels of the sample structure tree, it is found that the number of "key1" and "key2" nodes is 2, which is greater than or equal to the node number threshold 2, so pruning can be performed: the "key1" and "key2" nodes are merged into one node "*", and the obtained second pruning result is shown in fig. 5.

Continuing to scan the sample structure tree, it is found that the number of nodes in the last level is 4, which is greater than or equal to the node number threshold 2, so pruning can be performed: the nodes are merged into one node "*", and the obtained third pruning result is shown in fig. 6.

4. Determining the structure tree in fig. 6 as a target clustering structure tree, and according to the target clustering structure tree, determining that the above process clusters two sample websites into one clustering website (website category):

https://demo.qq.com/api/adddata/*?*=*

in the above embodiment, the sample structure tree is constructed based on the sample website, and the nodes in the sample structure tree are pruned layer by layer to obtain the clustered target clustering structure tree.
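As a non-limiting illustration, the pruning of the worked example above may be sketched as follows. The nested-dictionary tree, the names `merge_into`, `prune` and `SAMPLE_TREE`, and the decision to count the child nodes of each parent node are assumptions of this sketch; `SAMPLE_TREE` is a hand-written rendering of the two sample websites, with parameter names and values placed as described for the query field.

```python
from typing import Dict

Tree = Dict[str, "Tree"]  # field -> child nodes
WILDCARD = "*"            # symbol assumed for the arbitrary matching node

SAMPLE_TREE: Tree = {"https:": {"demo.qq.com": {"api": {"adddata": {
    "1": {"key1": {"abc": {}}, "key2": {"def": {}}},
    "2": {"key1": {"ghi": {}}, "key2": {"jkm": {}}},
}}}}}


def merge_into(target: Tree, source: Tree) -> None:
    """Union `source` into `target`; identical fields are integrated into one node."""
    for field, children in source.items():
        merge_into(target.setdefault(field, {}), children)


def prune(tree: Tree, max_child_count: int) -> Tree:
    """Return a pruned copy in which crowded levels collapse into one '*' node."""
    if tree and len(tree) >= max_child_count:
        merged: Tree = {}
        for children in tree.values():
            merge_into(merged, children)  # children of merged nodes are combined
        return {WILDCARD: prune(merged, max_child_count)}
    return {field: prune(children, max_child_count) for field, children in tree.items()}
```

Calling `prune(SAMPLE_TREE, 2)` collapses the three levels below "adddata" in turn and leaves a single root-to-leaf path, which corresponds to the single website category obtained in the example.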

In one embodiment, the obtaining a sample website includes: receiving a website for a target dynamic webpage sent by a front-end monitoring system; adding the website for the target dynamic webpage to a website set, and removing duplicate websites from the website set; and when the number of websites in the website set is greater than or equal to a website number threshold, determining that a clustering condition is reached, and determining the websites in the website set as the sample websites.

The front-end monitoring system (which may be referred to as a monitoring system for short) may be a website monitoring platform. The front-end monitoring system can acquire the access requests of a specific application platform in real time. In some embodiments, the front-end monitoring system may be hosted in the application platform; alternatively, it may be independent of the application platform and obtain the access requests of the application platform in a bypass mode, so as to obtain the corresponding websites. In one embodiment, the front-end monitoring system may be implemented by a terminal or a server.

A dynamic web page may refer to a web page that is capable of interacting with a user and responding in a targeted manner based on the user's interaction, for example: when a video playing trigger instruction of a user is received, the corresponding video is played in a new window. In one embodiment, the dynamic web page is a web page produced by using CGI (Common Gateway Interface), ISAPI (Internet Server Application Programming Interface), PHP (Hypertext Preprocessor), JSP (Java Server Pages), ASP (Active Server Pages), ASP.

In an embodiment, the manner of acquiring the website by the server may be as shown in fig. 7, and the specific implementation process is as follows: the terminal sends an interactive request to the application platform so as to access a certain dynamic webpage in the application platform; the front-end monitoring system extracts a website from the interactive request sent to the application platform and sends the extracted website to the server; and the server stores the received website into a website set positioned in the cache database.

In one embodiment, the web site collection may store the acquired web sites in a Set data structure. Wherein, the Set data structure is a common data structure in computer programming. This data structure uses the same mathematical concept as the finite set and is applied to the data structure of a computer. It consists of a set of unordered and unique items, typically providing functions to add items, delete items, count data for items, etc. Furthermore, the website is stored in the website set at the server, and the website set can be automatically subjected to deduplication processing so as to ensure the uniqueness of the website in the website set.

In one embodiment, the server monitors the number of websites in the website set in real time, and when the number of websites is greater than or equal to the website number threshold, it is determined that the clustering condition is reached, and at this time, all websites in the website set can be used as new sample websites for clustering operation. If a clustered target clustering structure tree exists in the cache database, the sample fields corresponding to the new sample websites can be added into the target clustering structure tree to obtain a new clustering structure tree, and nodes of the hierarchy meeting the conditions in the new clustering structure tree are merged to update the clustering structure tree, so that the obtained clustering structure tree carries more website information as much as possible, and the websites to be classified are classified more quickly and effectively.

The website number threshold (maxURLCount) may be an empirical value, and an appropriate value is selected according to different application scenarios. In the scenario of front-end monitoring, the number of URLs requested by an item will not exceed 1000, so maxURLCount may take 1000 in this scenario.
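As a non-limiting illustration, a website set that removes duplicates automatically and reports when the clustering condition is reached may be sketched as follows. The class name `UrlSet` and its interface are assumptions of this sketch; only the threshold name maxURLCount and the value 1000 come from the description above.

```python
class UrlSet:
    """In-memory website set; duplicates are dropped by the underlying set."""

    def __init__(self, max_url_count: int = 1000) -> None:
        self.max_url_count = max_url_count  # corresponds to maxURLCount
        self.urls = set()

    def add(self, url: str) -> bool:
        """Store `url` and return True when the clustering condition is reached."""
        self.urls.add(url)  # adding an existing website leaves the set unchanged
        return len(self.urls) >= self.max_url_count
```

Because repeated websites do not increase the size of the set, the clustering condition depends on the number of distinct websites rather than on the number of requests.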

Traditional clustering of URLs uses the K-means algorithm. The purpose of the K-means algorithm is to divide n points into k clusters so that each point belongs to the cluster whose mean (namely, the cluster center) is nearest to it, and this is taken as the clustering criterion. The main drawback of this algorithm is that the final number of clusters k must be specified in advance. In a monitoring system, the number of URLs is generally not known to the monitoring platform, so the platform cannot know the final number of clusters in advance; to achieve the best clustering effect, different values of k have to be tried repeatedly in search of the optimal number of clusters, which wastes performance.

The embodiment stores the acquired websites into the website set, automatically triggers clustering according to the number of the websites in the website set, does not need to appoint the number of clustering results in advance, and automatically explores the number of the clustering results in clustering.

In one embodiment, said merging nodes in said first target hierarchy into an arbitrary matching node upon accessing said first target hierarchy comprises: acquiring identification carrying information of each sample field; the identifier carrying information is used for indicating whether the corresponding sample field carries a merging prohibition identifier or not; when it is determined that the sample field corresponding to each node in the current access level does not carry the merge inhibition identifier according to the identifier carrying information, determining the number of nodes in the current access level; the current access level is a node level in the target clustering structure tree which is currently accessed; when the number of the nodes of the current access level is larger than or equal to a threshold of the number of the nodes, the current access level is judged to be the first target level, and all the nodes in the current access level are combined into an arbitrary matching node.

When clustering is performed on the sample structure tree, the nodes of each level are visited layer by layer. For the current access level, obtaining sample fields corresponding to each node in the level and obtaining identifier carrying information corresponding to the sample fields, determining whether the node carries a merging prohibition identifier according to the identifier carrying information, and if the node carries the identifier (namely, the node is marked as being non-clustered, or marked as being non-pruned), indicating that the node cannot be merged.

In one embodiment, if there are more than one nodes in a hierarchy and only some of the nodes are marked as non-clusterable, other nodes (nodes not marked as non-clusterable) in the hierarchy may be clustered, and when the number of nodes is greater than or equal to the threshold number of nodes, the nodes are merged into an arbitrary matching node. And then, combining the subsequent levels layer by layer, and so on. When merging the subsequent levels, the child nodes of the nodes carrying the merging prohibition identifications may or may not participate.

Important fields may be marked as non-clusterable. On the one hand, an important field may be determined according to the number of occurrences of a field across the websites: when the number of occurrences of a field is greater than a set number threshold, the field is considered an important field, for example: the "https" field appears in all websites accessed over HTTPS and is therefore an important field; conversely, when the number of occurrences is less than or equal to the set number threshold, the field is considered a non-important field. Of course, if the websites corresponding to infrequent fields are to be analyzed, the fields whose number of occurrences is less than or equal to the set number threshold may instead be determined as important fields. On the other hand, an important field may be a field that is important from the viewpoint of website analysis: 1. fields that identify the web page, such as the protocol type, server address and port number, may be important fields; 2. when the response to a user request needs to be analyzed, the fields corresponding to the parameter names, parameter values and the like related to the user interaction are important fields. In one embodiment, for a URL containing a query field, since the query field is often an important indicator of whether the current operation request is a query operation, the query field is an important field if the query request of the user is to be analyzed.

In an embodiment, before the obtaining of the identifier carrying information of each sample field, the method further includes: acquiring a historical website set in a preset historical time period; determining the access amount ratio of each website in the historical website set; when a target website with the access amount ratio larger than an access amount ratio threshold exists in the historical website set, determining a first target sample website corresponding to the target website in the sample websites; and determining the identifier carrying information corresponding to each sample field of the first target sample website as carrying the merging prohibition identifier.
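As a non-limiting illustration, the selection of target websites by access amount ratio may be sketched as follows. The function name `high_traffic_urls` and the assumption that the historical website set is available as a list with one entry per access are introduced for this sketch only.

```python
from collections import Counter
from typing import Iterable, Set


def high_traffic_urls(history: Iterable[str], ratio_threshold: float) -> Set[str]:
    """Return websites whose share of all historical accesses exceeds the threshold."""
    counts = Counter(history)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {url for url, n in counts.items() if n / total > ratio_threshold}
```

The sample fields of every website returned by this sketch would then be marked as carrying the merge prohibition identifier, so that the corresponding path is retained during pruning.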

The inventor finds that when the URL access amount is high, clustering the URL through the traditional URL similarity loses the data. The embodiment of the invention marks all nodes of the path corresponding to the target website with the access amount ratio larger than the specific threshold value as non-pruneable, thereby realizing the retention of the data. Namely, the embodiment of the invention can independently classify the websites with high enough access amount to ensure that the websites cannot be submerged in other websites in the website analysis process, and can more accurately analyze the response state of the websites in such a way, so as to timely process the websites in case of problems and better ensure the stable operation of the whole system.

There are two main ways for a URL to introduce parameters:

1. distinguished by the value of the parameter in [ query ]. Such as:

www.getWeather.com/today/temperature?cityName=Beijing

www.getWeather.com/today/temperature?cityName=Shenzhen

Parameter distinction is carried out through parameter values of 'Beijing' and 'Shenzhen'.

2. Distinguished by [ resource level UNIX file path ]. Such as:

www.getWeather.com/20201013/temperature?cityName=Beijing

www.getWeather.com/20201014/temperature?cityName=Beijing

Parameter differentiation is done by [ resource level UNIX file path ].

Where the [ resource level UNIX file path ] and [ query ] are built together into a URL structure tree. In practice, the format of [ query ] and UNIX file paths are different.

For scenarios that require website analysis based on parameters: resource level UNIX file path parameters are introduced through the path, and the position of a parameter in the path determines the key (parameter name) of the parameter, so it may not be necessary to mark these parameters as non-clusterable. In the query, by contrast, the key of a parameter determines its actual meaning; the key of a query fragment is given directly and is not affected by the order, so the key of the query fragment needs to be set as non-clusterable, so that the obtained website category contains the field of the parameter.

In an embodiment, before the obtaining the identifier carrying information, the method further includes: determining a website containing the query field in the sample website as a second target sample website; acquiring a target sample field corresponding to the query field of the second target sample website in each sample field; and determining the identification carrying information of the target sample field as carrying the merging prohibition identification.

The query field is often an important field for representing the user interaction parameters, the query field is marked as non-clustering, the important information can be effectively retained, and further the reliability of the website classification result is ensured, namely, the website is analyzed based on the interaction operation of the user, the response state of the application platform to the user operation request can be determined in a targeted manner, and the orderliness of the interaction with the user can be effectively ensured.

In one embodiment, some nodes in the sample structure tree are marked as non-clusterable; for example, in the sample structure tree shown in FIG. 3, the sample fields corresponding to the "key1" and "key2" nodes are marked as non-clusterable. The process of clustering based on the sample structure tree may be as follows:

When traversing to the child nodes of the "adddata" node, it is found that the "adddata" node has two child nodes, and the number of child nodes is greater than or equal to the node number threshold 2, so pruning can be performed: the two nodes "1" and "2" are merged into one node "*", and the child nodes of the two nodes are adjusted. The obtained first pruning result is shown in fig. 4.

Continuing to scan the subsequent levels of the sample structure tree, the number of "key1" and "key2" nodes is 2, so the level could otherwise be clustered; however, the sample fields corresponding to the two nodes are marked as non-clusterable, so the level is not clustered.

Continuing to scan the subsequent levels of the sample structure tree, it is found that the "key1" and "key2" nodes each have 2 child nodes, so pruning can be performed: the child nodes of the "key1" node and of the "key2" node are respectively merged into one node "*", and the obtained second pruning result is shown in fig. 8.

Continuing to scan the sample structure tree, finding that no child node exists, determining the structure tree in fig. 8 as a target clustering structure tree at this time, wherein the website categories corresponding to the target clustering structure tree are as follows:

https://demo.qq.com/api/adddata/*?key1=*&key2=*

in the embodiment, the sample structure tree is constructed, the pruning processing is performed on the sample structure tree layer by layer, some important performance data are brought into the index range of clustering, the important nodes marked as non-clustering nodes are reserved during clustering, the weighted convergence pruning strategy is realized, the simplified target clustering structure tree is obtained, and the clustering result has more practical reference significance.
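As a non-limiting illustration, pruning that retains nodes marked as non-clusterable may be sketched as follows. The nested-dictionary tree, the names `merge_into`, `prune_with_flags` and `SAMPLE_TREE`, and the simplification of recording the merge prohibition identifier per field name (rather than per node) are assumptions of this sketch.

```python
from typing import Dict, FrozenSet

Tree = Dict[str, "Tree"]  # field -> child nodes
WILDCARD = "*"            # symbol assumed for the arbitrary matching node

SAMPLE_TREE: Tree = {"https:": {"demo.qq.com": {"api": {"adddata": {
    "1": {"key1": {"abc": {}}, "key2": {"def": {}}},
    "2": {"key1": {"ghi": {}}, "key2": {"jkm": {}}},
}}}}}


def merge_into(target: Tree, source: Tree) -> None:
    """Union `source` into `target`; identical fields are integrated into one node."""
    for field, children in source.items():
        merge_into(target.setdefault(field, {}), children)


def prune_with_flags(tree: Tree, max_child_count: int, protected: FrozenSet[str]) -> Tree:
    """Prune a copy of `tree`, never merging fields listed in `protected`."""
    result: Tree = {}
    mergeable = {f: c for f, c in tree.items() if f not in protected}
    # Nodes carrying the merge prohibition identifier are kept under their own name.
    for field, children in tree.items():
        if field in protected:
            result[field] = prune_with_flags(children, max_child_count, protected)
    if mergeable and len(mergeable) >= max_child_count:
        merged: Tree = {}
        for children in mergeable.values():
            merge_into(merged, children)
        result[WILDCARD] = prune_with_flags(merged, max_child_count, protected)
    else:
        for field, children in mergeable.items():
            result[field] = prune_with_flags(children, max_child_count, protected)
    return result
```

Calling `prune_with_flags(SAMPLE_TREE, 2, frozenset({"key1", "key2"}))` merges the nodes "1" and "2" and the parameter values, but keeps the "key1" and "key2" nodes, which corresponds to the website category obtained in the example.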

In one embodiment, the degree of clustering is controlled by a node number threshold (maxChlidCount): if the number of nodes of the currently accessed level is greater than or equal to the node number threshold, the level is considered to have too many nodes, and the nodes are merged into an arbitrary matching node.

The determination manner of the node number threshold may be implemented by the following embodiments:

In one embodiment, before determining that the current access level is the first target level when the number of nodes of the current access level is greater than or equal to the node number threshold, the method further includes: acquiring a preset proportionality coefficient; acquiring the number of sample websites, and calculating the product of the number of websites and the proportionality coefficient; and determining the product as the node number threshold.

The scaling factor may be set according to the requirement of the website clustering degree, for example, when the websites need to be clustered into as few website categories as possible, the scaling factor may be set to a larger value, for example, 20%, and when the websites need to be clustered into as accurate website categories as possible, the scaling factor may be set to a smaller value, for example, 10%.

In the former embodiment, the node number threshold varies with the number of sample websites. In some embodiments, the node number threshold may also be a fixed value, such as 20.
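As a non-limiting illustration, the proportional node number threshold and the resulting merge decision may be sketched as follows. The function names are assumptions of this sketch, and the product is used as computed, without rounding, since the embodiments do not specify how a non-integer product is handled.

```python
def node_count_threshold(sample_url_count: int, scale: float) -> float:
    """Node number threshold = number of sample websites x proportionality coefficient."""
    return sample_url_count * scale


def should_merge(level_node_count: int, threshold: float) -> bool:
    """A level is merged when its node count reaches the threshold."""
    return level_node_count >= threshold
```

For example, with 2 sample websites and a proportionality coefficient of 1, the threshold is 2, so a level with two nodes is merged and a level with one node is not.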

In the above embodiment, the node number threshold is set, when a node of a certain hierarchy is not marked as non-clustering and the node number is greater than or equal to the node number threshold, the nodes in the hierarchy are aggregated into one node, and in the subsequent website classification process, the clustered node can be matched with all the nodes.

In one embodiment, the obtaining the target clustering structure tree includes: determining the expiration time of a pre-constructed candidate clustering structure tree; when the expiration time is not reached, determining the candidate clustering structure tree as the target clustering structure tree; and when the expiration time is reached, acquiring a new sample structure tree, and merging the nodes of a third target level, whose number of nodes meets the condition, in the new sample structure tree, so as to obtain the target clustering structure tree.

In one embodiment, when the server obtains the target clustering structure tree through clustering, the target clustering structure tree can be stored in the cache database, and multi-machine sharing is facilitated. At this time, expiration time can be set for the target clustering structure tree, the cache database stores the target clustering structure tree when the expiration time is not reached, and the target clustering structure tree is released when the expiration time is reached. If the website to be classified is received and the expiration time is not reached, the server can classify the website based on the target clustering structure tree in the cache database, and if the website to be classified is received and the expiration time is reached, the clustering structure tree can be regenerated. The method for regenerating the clustering structure tree is the same as the method for generating the target clustering structure tree, and is not described herein again.
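As a non-limiting illustration, the expiration handling may be sketched with an in-process cache. The class name `TreeCache`, the injected `build_tree` callable and the injectable clock are assumptions of this sketch; the embodiments store the tree in a cache database shared by multiple machines, which is not modeled here.

```python
import time
from typing import Callable, Dict, Optional


class TreeCache:
    """Keeps a clustering structure tree until its expiration time is reached."""

    def __init__(
        self,
        ttl_seconds: float,
        build_tree: Callable[[], Dict],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._build_tree = build_tree  # regenerates the tree from sample websites
        self._clock = clock
        self._tree: Optional[Dict] = None
        self._expires_at = 0.0

    def get(self) -> Dict:
        """Return the cached tree, regenerating it once the expiration time is reached."""
        now = self._clock()
        if self._tree is None or now >= self._expires_at:
            self._tree = self._build_tree()
            self._expires_at = now + self._ttl
        return self._tree
```

Regenerating the tree lazily, on the first request after expiry, means that no clustering work is done while no website is waiting to be classified.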

The embodiment sets the expiration time for the target clustering structure tree, and can effectively reduce the storage pressure of the cache database.

In one embodiment, the process of generating the clustering structure tree, the website classification, and the website analysis may be performed in parallel. For example: and the server receives the websites sent by the front-end monitoring system and stores the websites into a website set. During clustering, the server firstly acquires the websites, and clusters the websites to obtain a target clustering structure tree. Then, website classification and website analysis can be performed based on the target clustering structure tree: on one hand, if new websites exist in the website set, the websites are read one by one and serve as the websites to be classified, and the websites to be classified are matched with the target clustering structure tree to obtain website categories corresponding to the websites to be classified. On the other hand, when receiving a website response state analysis instruction sent by the terminal, the server may analyze the matching websites of the corresponding website category according to the target clustering structure tree to output a corresponding website response state analysis result.

The embodiment can quickly finish the clustering of the websites in a few minutes. After the clustering is finished, the rest websites to be classified can be immediately classified according to the clustering structure tree, and meanwhile, the clustering structure tree can participate in the statistics and analysis of data. The efficiency of website classification and website analysis can be greatly improved.

In one embodiment, after determining the website category of the website to be classified according to the target path, the method further includes: when a website response state analysis instruction for a target common gateway interface is received, determining a path identifier corresponding to the target common gateway interface; determining a target website category corresponding to the path identifier; determining a target matching website matched with the target website category; acquiring response state data corresponding to the target matching website; and generating a website response state analysis result of the target common gateway interface according to the response state data. The common gateway interface is a CGI interface.

In one embodiment, the website response status may refer to a response status of a request corresponding to a website, and may refer to a request success rate, a request speed, a response time consumption, and the like.
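As a non-limiting illustration, the generation of a website response state analysis result for one website category may be sketched as follows. The record type `ResponseRecord`, its fields, the function name `analyze` and the chosen metrics (request success rate and average response time) are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ResponseRecord:
    category: str      # website category assigned during classification
    success: bool      # whether the request succeeded
    elapsed_ms: float  # response time consumed, in milliseconds


def analyze(records: List[ResponseRecord], target_category: str) -> Optional[Dict[str, float]]:
    """Aggregate response state data over the websites of one category."""
    matched = [r for r in records if r.category == target_category]
    if not matched:
        return None
    return {
        "count": len(matched),
        "success_rate": sum(r.success for r in matched) / len(matched),
        "avg_elapsed_ms": sum(r.elapsed_ms for r in matched) / len(matched),
    }
```

Because each record already carries its website category, the matching websites are selected by a single comparison per record rather than by re-examining every website.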

In one embodiment, the terminal displays a window as shown in fig. 9 in the interface, and the user can input website search information in the window. Specifically, the user may enter a search keyword for the URL in the "search URL" option, such as: Beijing weather; the user may also directly enter a website category, such as: www.getWeather.com/today/temperature?cityName=Beijing. In addition, the user can select a time range and a time granularity in the window, so that a fine-grained analysis can be performed on the websites within a specific time period. When the terminal receives the website search information input by the user through the window, the website search information is carried in the website response state analysis instruction and sent to the server. The server determines a path identifier corresponding to the Beijing weather interface according to the website search information; determines a target website category corresponding to the path identifier; determines a target matching website matched with the target website category; acquires response state data corresponding to the target matching website; and generates a website response state analysis result of the target common gateway interface according to the response state data and returns the website response state analysis result to the terminal.

The server can control the terminal to display the website response state analysis result on the terminal page in various visual graph modes. The display may be in the form of a fluctuating graph, a bar graph, a pie graph, or the like.

Specifically, a schematic diagram of gradually clustering and classifying the websites and outputting the website response state analysis result may be as shown in fig. 10. In fig. 10, the number of samples 1001 is gradually increased, that is, the websites are gradually read from the website set and analyzed; in addition, the average speed 1002 of response to these web sites remains in a more stable state.

In addition, the website response status analysis result may also be displayed by a graph as shown in fig. 11, specifically, the distribution of the request speed is displayed in a form of a bar graph, the distribution of the response return code is displayed in a form of a pie graph, and the request success rate may also be displayed in an interface.

It should be noted that, along with the change of the read website, the website response state analysis result in the interface may change in real time, so that the user can obtain the corresponding website response state analysis result in time.

When a website response state analysis instruction for a certain CGI interface is received, the target matching websites are obtained in a targeted manner based on the previous website classification result, without searching through the websites one by one, so that the analysis efficiency of the website response state can be greatly improved.

The application also provides an application scene, and the application scene applies the classification method of the website. Specifically, taking the application of the method to the server in fig. 1 as an example for explanation, the application of the web address classification method to the application scenario is as follows:

1. For requests sent to an application platform to acquire weather information of a specific city, the introduction of the variable parameter cityName causes the website to report 1 million URLs with different parameters in one day. If these 1 million URLs are analyzed directly, the amount of computation is very large; therefore, the URLs need to be classified first and then analyzed on the basis of the classification.

2. Assume that the following two sample URLs are stored in the website set using a Set data structure:

www.getWeather.com/today/temperature?cityName=Beijing

www.getWeather.com/today/wind?cityName=Shenzhen

maxURLCount is set to 2 (a small value chosen for ease of description; a larger value may be used in an actual application scenario). In addition, the preset scaling factor is 1, and the product of the scaling factor and the number of sample URLs, namely 2, is defined as maxChildCount.

When the number of websites in the website set is greater than or equal to maxURLCount, it is determined that URL clustering analysis is required, and the next stage is entered.
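
As a minimal illustrative sketch of the trigger condition above (not part of the original disclosure), the following Python snippet deduplicates the collected URLs with a set, compares the count with maxURLCount, and derives maxChildCount as the product of the scaling factor and the number of sample URLs. The function name `clustering_thresholds` and its parameter names are hypothetical.

```python
def clustering_thresholds(urls, max_url_count, scale_factor=1):
    """Return (sample set, whether to cluster, maxChildCount).

    Hypothetical helper; the names are illustrative assumptions.
    """
    samples = set(urls)  # a set drops repeated URLs
    should_cluster = len(samples) >= max_url_count
    max_child_count = scale_factor * len(samples)
    return samples, should_cluster, max_child_count


samples, should_cluster, max_child_count = clustering_thresholds(
    [
        "www.getWeather.com/today/temperature?cityName=Beijing",
        "www.getWeather.com/today/wind?cityName=Shenzhen",
        "www.getWeather.com/today/wind?cityName=Shenzhen",  # duplicate
    ],
    max_url_count=2,
)
```

With the two distinct sample URLs and a scaling factor of 1, clustering is triggered and maxChildCount evaluates to 2, as in the scenario above.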

3. URL structure tree construction stage

All URLs in the website set are taken out, and each URL is split using "/", "&" and "?" as delimiters, resulting in the following sample fields:

www.getWeather.com, today, temperature, cityName=Beijing

www.getWeather.com, today, wind, cityName=Shenzhen
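
The segmentation step can be sketched in Python as follows. This is only an illustration under the assumption that the URL carries no scheme prefix (as in the samples above); the helper name `split_url` is hypothetical.

```python
import re


def split_url(url):
    """Split a URL on '/', '&' and '?' and drop empty pieces."""
    return [field for field in re.split(r"[/&?]", url) if field]


fields = split_url("www.getWeather.com/today/temperature?cityName=Beijing")
# fields == ["www.getWeather.com", "today", "temperature", "cityName=Beijing"]
```

A URL that includes a scheme such as "http://" would yield an extra "http:" field under this simple rule, so a real implementation would probably strip the scheme first.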

After segmentation, the resulting fields are traversed to construct the URL structure tree. The first field is the root node, and each subsequent field is a child node of the field before it. If a field is the same as an existing node at that position, no new node is generated, and the following fields are attached directly as child nodes of the existing node. For the query string (the cityName field), every parameter name is set as a child node of the last path field, each parameter value is set as a child node of its parameter name, and the parameter name node "cityName" is marked as non-prunable.
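
One possible way to sketch this construction in Python is shown below. The nested-dictionary representation, the function name `add_url`, and the use of a separate set of protected field names to record the non-prunable mark are all assumptions made for illustration; the disclosure does not prescribe a concrete data layout.

```python
def add_url(tree, protected, url):
    """Insert one URL into a nested-dict structure tree.

    tree:      dict mapping a field to the dict of its child fields
    protected: set of field names marked as non-prunable (assumption:
               the mark is tracked by name rather than per node)
    """
    path, _, query = url.partition("?")
    node = tree
    # Path fields: each field becomes a child of the previous one;
    # setdefault reuses an existing node instead of creating a new one.
    for field in filter(None, path.split("/")):
        node = node.setdefault(field, {})
    # Query string: parameter names hang under the last path field,
    # parameter values hang under their parameter name.
    for pair in filter(None, query.split("&")):
        name, _, value = pair.partition("=")
        protected.add(name)
        node.setdefault(name, {}).setdefault(value, {})


tree, protected = {}, set()
add_url(tree, protected, "www.getWeather.com/today/temperature?cityName=Beijing")
add_url(tree, protected, "www.getWeather.com/today/wind?cityName=Shenzhen")
```

After both insertions the two URLs share the "www.getWeather.com" and "today" nodes and diverge only at "temperature" and "wind".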

4. Pruning stage

After the tree is built successfully, a depth-first scan is performed on the structure tree. When the number of child nodes of a node is greater than or equal to maxChildCount and those child nodes can be pruned, all child nodes of that node are merged into a single node marked with "*", and the next-level child nodes of the merged child nodes are likewise merged under the "*" node.
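
The pruning rule can be illustrated with the following Python sketch. It assumes the structure tree is held as nested dictionaries and that non-prunable nodes are identified by a set of protected field names; both are illustrative assumptions, as are the names `prune` and `merge_into`.

```python
WILDCARD = "*"


def merge_into(target, source):
    """Recursively merge the subtree `source` into `target`."""
    for name, sub in source.items():
        merge_into(target.setdefault(name, {}), sub)


def prune(children, max_child_count, protected=frozenset()):
    """Depth-first merge of prunable sibling nodes into one '*' node."""
    mergeable = [name for name in children if name not in protected]
    if mergeable and len(children) >= max_child_count:
        merged = {}
        for name in mergeable:
            # The children of every merged node move under the '*' node.
            merge_into(merged, children.pop(name))
        children[WILDCARD] = merged
    for sub in children.values():
        prune(sub, max_child_count, protected)
    return children


tree = {
    "www.getWeather.com": {
        "today": {
            "temperature": {"cityName": {"Beijing": {}}},
            "wind": {"cityName": {"Shenzhen": {}}},
        }
    }
}
prune(tree, max_child_count=2, protected={"cityName"})
# tree == {"www.getWeather.com": {"today": {"*": {"cityName": {"*": {}}}}}}
```

Because each level is merged before its children are visited, a single top-down pass already handles subtrees that only reach the threshold after their parents have been merged.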

5. Derivation phase

Step 4 is repeated until the structure tree contains no level whose number of nodes exceeds maxChildCount. At this point the clustered URLs can be obtained, giving the following website category:

www.getWeather.com/today/*?cityName=*

The clustering structure tree is then stored in a cache database so that it can be shared among multiple machines; an expiration time is set, and the tree is recalculated after it expires.
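
The expiration behaviour can be sketched as below. The disclosure does not name a particular cache database, so this Python example uses a small in-memory class as a stand-in; the class name `TreeCache`, its methods, and the injectable clock are hypothetical.

```python
import time


class TreeCache:
    """In-memory stand-in for the cache database (an assumption)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._tree = None
        self._expires_at = None

    def put(self, tree):
        """Store the clustering structure tree and restart its lifetime."""
        self._tree = tree
        self._expires_at = self._clock() + self._ttl

    def get(self):
        """Return the cached tree, or None if absent or expired."""
        if self._tree is None or self._clock() >= self._expires_at:
            return None
        return self._tree
```

A caller that receives `None` from `get` would rebuild the clustering structure tree from a fresh sample set and store it again with `put`.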

6. Classification phase

After clustering is finished, when new URLs need to be classified, the clustering structure tree is synchronized from the cache database at regular intervals. A new URL is split by the delimiters, and the resulting fields are matched directly against the clustering structure tree, where a "*" node matches any field. When the last field matches a leaf node of the structure tree, the matching is considered successful and the classification result is obtained; otherwise the classification fails, the new URL is inserted into the tree under construction, and the next clustering is awaited.
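
A minimal Python sketch of this matching step follows. It assumes the clustering structure tree is a nested dictionary in which parameter names and parameter values are separate nodes, so the new URL is also split on "=" here; that choice, the preference for an exact node over the "*" node, and the names `match` and `classify` are illustrative assumptions.

```python
import re


def match(tree, fields):
    """Return the matched root-to-leaf path, or None if there is none."""
    if not fields:
        # Success only if the last field landed on a leaf node.
        return () if not tree else None
    head, rest = fields[0], fields[1:]
    for name in (head, "*"):  # exact node first, then the wildcard
        if name in tree:
            tail = match(tree[name], rest)
            if tail is not None:
                return (name,) + tail
    return None


def classify(tree, url):
    """Classify a URL; on failure, add its fields to the tree."""
    fields = [f for f in re.split(r"[/&?=]", url) if f]
    path = match(tree, fields)
    if path is None:
        node = tree
        for field in fields:  # keep the URL for the next clustering
            node = node.setdefault(field, {})
    return path


tree = {"www.getWeather.com": {"today": {"*": {"cityName": {"*": {}}}}}}
path = classify(tree, "www.getWeather.com/today/humidity?cityName=Beijing")
# path == ("www.getWeather.com", "today", "*", "cityName", "*")
```

Each lookup is a dictionary access per field, so the cost of classifying one URL grows with the number of fields rather than with the number of sample URLs.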

In this embodiment, based on the tree data structure, a structure tree is constructed and pruned for the user request URLs collected by the front-end monitoring system, thereby clustering the URLs; after clustering is completed, subsequent URLs can be matched and classified directly against the structure tree.

Taking Web front-end monitoring as an example, in order to continuously optimize the speed and experience of Web applications, a monitoring platform is used to collect and analyze the HTTP request data of users. Taking CGI requests as an example, after all original CGI request data of the users are collected, the request data are analyzed along the URL dimension, so that weak points in the application can be better identified and targeted optimization can be performed. The specific implementation process may be as follows:

1. The terminal sends various types of CGI requests to the weather website. The front-end monitoring system monitors the CGI requests in real time, selects the CGI requests for obtaining weather information, extracts their websites, and sends the extracted websites to the server. The server stores the received websites in a website set using a Set data structure. Analysis of the CGI requests shows that the number of requests for Beijing is much larger than that for other cities, so it is desired to analyze the performance data for the cityName=Beijing scenario separately.

2. Assume that the following four sample URLs are stored in the website set:

www.getWeather.com/today/temperature?cityName=Beijing

www.getWeather.com/today/temperature?cityName=Shenzhen

www.getWeather.com/today/wind?cityName=Shanghai

www.getWeather.com/today/wind?cityName=Guangzhou

Here maxURLCount is set to 2 and maxChildCount is set to 2.

The number of websites in the website set (4) is greater than or equal to maxURLCount (2), so it is determined that URL clustering analysis is required, and the next stage is entered.

3. URL structure tree construction stage

All URLs in the website set are taken out, and each URL is split by the delimiters, resulting in the following sample fields:

www.getWeather.com, today, temperature, cityName=Beijing

www.getWeather.com, today, temperature, cityName=Shenzhen

www.getWeather.com, today, wind, cityName=Shanghai

www.getWeather.com, today, wind, cityName=Guangzhou

After segmentation, the resulting fields are traversed to construct the clustering structure tree, in which "cityName" and "Beijing" are marked as non-prunable.

4. Pruning stage

After the structure tree is built successfully, a depth-first scan is performed on it. When the number of child nodes of a node is greater than or equal to 2, the child nodes of that node that are not marked as non-prunable are merged into a single node marked with "*", and the next-level child nodes of the merged child nodes are likewise merged under that "*" node.

5. Derivation phase

Step 4 is repeated until the structure tree contains no level whose number of nodes exceeds maxChildCount. At this point the clustering structure tree can be obtained through a depth traversal. The clustering structure tree is then stored in the cache database.

The website categories corresponding to the clustering structure tree may be as follows:

www.getWeather.com/today/*?cityName=Beijing

www.getWeather.com/today/*?cityName=*
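
Reading the website categories off the clustering structure tree amounts to enumerating its root-to-leaf paths. The Python sketch below illustrates this under the assumptions that the tree is a nested dictionary and that the set of query parameter names is known; the function names `leaf_paths` and `render` are hypothetical.

```python
def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path of a nested-dict tree, depth first."""
    for name, sub in tree.items():
        path = prefix + (name,)
        if sub:
            yield from leaf_paths(sub, path)
        else:
            yield path


def render(path, param_names):
    """Join a path back into URL-pattern form (one parameter per path)."""
    if len(path) >= 2 and path[-2] in param_names:
        return "/".join(path[:-2]) + f"?{path[-2]}={path[-1]}"
    return "/".join(path)


tree = {
    "www.getWeather.com": {
        "today": {"*": {"cityName": {"Beijing": {}, "*": {}}}}
    }
}
categories = [render(p, {"cityName"}) for p in leaf_paths(tree)]
# categories == ["www.getWeather.com/today/*?cityName=Beijing",
#                "www.getWeather.com/today/*?cityName=*"]
```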

6. Classification phase

After clustering is finished, when a new URL needs to be classified, the clustering structure tree is synchronized from the cache database at regular intervals.

Assume that the new URL is:

www.getWeather.com/today/humidity?cityName=Beijing

After field segmentation, the following target fields are obtained:

www.getWeather.com, today, humidity, cityName=Beijing

Each target field is then matched layer by layer against the nodes in the clustering structure tree. According to the matching result, the target fields are determined to match the nodes on the path "www.getWeather.com → today → * → cityName=Beijing".

The new URL can therefore be classified into the website category "www.getWeather.com/today/*?cityName=Beijing".

7. URL analysis phase

If a website response state analysis instruction for Beijing weather sent by a terminal is received, the target matching websites that match "www.getWeather.com/today/*?cityName=Beijing" are acquired. The request success rate, request speed and response time corresponding to these target matching websites are then determined and returned to the terminal as the website response state analysis result, so that the terminal displays the result on an interface.
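
As a rough illustration of this analysis step, the Python sketch below aggregates request records that have already been labelled with a website category. The record layout (the keys `category`, `status` and `elapsed_ms`), the function name `analyze`, and the rule that status codes from 200 to 399 count as successful are assumptions made for the example and are not specified by the disclosure.

```python
from statistics import mean


def analyze(records, category):
    """Summarize response state for one website category."""
    hits = [r for r in records if r["category"] == category]
    if not hits:
        return None
    ok = [r for r in hits if 200 <= r["status"] < 400]
    return {
        "count": len(hits),
        "success_rate": len(ok) / len(hits),
        "avg_elapsed_ms": mean(r["elapsed_ms"] for r in hits),
    }


BEIJING = "www.getWeather.com/today/*?cityName=Beijing"
records = [
    {"category": BEIJING, "status": 200, "elapsed_ms": 100},
    {"category": BEIJING, "status": 500, "elapsed_ms": 300},
    {"category": "www.getWeather.com/today/*?cityName=*",
     "status": 200, "elapsed_ms": 50},
]
result = analyze(records, BEIJING)
# result == {"count": 2, "success_rate": 0.5, "avg_elapsed_ms": 200}
```

Because the records are selected by category rather than by scanning raw URLs, only the requests belonging to the requested interface are touched.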

The above embodiment, which classifies new URLs based on the pre-constructed clustering structure tree, has at least the following effects:

1) The timing of clustering can be determined automatically, without parameters needing to be configured in advance.

2) Because a progressive clustering strategy is adopted, when facing massive URL time-series data the scheme can finish constructing the URL classification tree within a short cold-start time, begin clustering subsequently arriving URLs, and continuously refine the URL structure tree, thereby improving classification precision.

3) Performance is very good when facing a large number of URL classification tasks. For example, in a test on a Tencent Cloud virtual machine with 8 cores and 16 GB of memory, once the URL structure tree has been built the scheme can classify more than 500,000 URLs within 1 second.

4) Since user access to a Web application is continuous, the scheme is particularly suitable for user monitoring in front-end projects.

It should be understood that, although the steps in the above flowcharts are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed one after another but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same idea as the website classification method in the above embodiment, the present invention further provides a website classification device, which can be used to execute the above website classification method. For convenience of explanation, the schematic structural diagram of the embodiments of the website classification device shows only an illustrative structure, and those skilled in the art will understand that the illustrated structure does not constitute a limitation on the device, which may include more or fewer components than those illustrated, combine some components, or arrange the components differently.

In one embodiment, as shown in fig. 12, there is provided an apparatus 1200 for classifying web addresses, which may be implemented as a software module, a hardware module, or a combination of the two, forming part of a computer device, and specifically includes: a to-be-classified website acquisition module 1201, a structure tree obtaining module 1202, a layer-by-layer matching module 1203 and a website category determining module 1204, wherein:

A to-be-classified website acquisition module 1201, configured to acquire a to-be-classified website, and perform field segmentation on the to-be-classified website to obtain target fields corresponding to the first node relationship.

A structure tree obtaining module 1202, configured to obtain a target clustering structure tree; the target clustering structure tree is obtained by merging the nodes of a first target level, whose number of nodes meets a condition, in the sample structure tree, and the sample structure tree is a structure tree constructed according to the fields corresponding to the sample websites.

A layer-by-layer matching module 1203, configured to perform layer-by-layer matching on each target field and each node in the target clustering structure tree according to the first node relationship.

A website category determining module 1204, configured to determine, according to the target path, a website category of the website to be classified when each target field matches each node in the target path; the target path is a path from a root node to a leaf node in the target clustering structure tree.

According to the website classification device, the websites to be classified can be directly classified based on the pre-constructed target clustering structure tree, the website classification result of the websites to be classified is rapidly output, and the website classification efficiency is effectively guaranteed.

In one embodiment, a layer-by-layer matching module comprises: the layer-by-layer matching submodule is used for determining nodes matched with the target fields in the target clustering structure tree layer by layer according to the first node relation to obtain target nodes; and the matching judgment sub-module is used for judging that each target field is matched with each node in the target path when each target node forms a path from a root node to a leaf node in the target clustering structure tree.

In one embodiment, the layer-by-layer matching module further comprises: a field superposition submodule, configured to, when each target node cannot form a path from a root node to a leaf node in the target clustering structure tree, superpose each target field to the target clustering structure tree according to the first node relationship, so as to obtain a structure tree to be merged; the structure tree updating submodule is used for merging the nodes of the second target level, the number of which meets the condition, in the structure tree to be merged when the clustering instruction is obtained, so as to obtain a clustering updating structure tree; and the clustering updating structure tree is used for classifying the new website to be classified when the new website to be classified is obtained.

In one embodiment, the apparatus for classifying web addresses further comprises: a sample website acquisition module, configured to acquire a sample website and perform field segmentation on the sample website to obtain sample fields corresponding to a second node relationship; a structure tree construction module, configured to construct a structure tree from the sample fields according to the second node relationship to obtain the sample structure tree; a node merging module, configured to perform a node merging operation on the sample structure tree layer by layer, so as to merge the nodes in the first target level into an arbitrary matching node when the first target level is accessed, and merge the child nodes of each node in the first target level into child nodes of the arbitrary matching node, where the arbitrary matching node can match any node; and a structure tree determining module, configured to take the sample structure tree after node merging as the target clustering structure tree.

In one embodiment, the sample website acquisition module includes: the website receiving submodule is used for receiving a website aiming at the target dynamic webpage sent by the front-end monitoring system; the website adding submodule is used for adding the website of the target dynamic webpage into a website set and performing repeated website removing processing on the website set; and the sample website determining submodule is used for judging that the clustering condition is reached when the number of the websites in the website set is greater than or equal to the website number threshold value, and determining the websites in the website set as the sample websites.

In an embodiment, the structure tree construction module is further configured to sequentially configure each sample field as a node in the sample structure tree according to the second node relationship; wherein the same sample field shares a node in the sample structure tree.

In one embodiment, the node merge module includes: the information acquisition submodule is used for acquiring the identification carrying information of each sample field; the identifier carrying information is used for indicating whether the corresponding sample field carries a merging prohibition identifier or not; the node number obtaining submodule is used for determining the number of the nodes of the current access level when the sample fields corresponding to the nodes in the current access level are determined not to carry the merging prohibition identification according to the identification carrying information; the current access level is a node level in the target clustering structure tree which is currently accessed; and the node merging submodule is used for judging that the current access level is the first target level and merging each node in the current access level into any matching node when the number of the nodes in the current access level is greater than or equal to a threshold of the number of the nodes.

In one embodiment, the node merging module further includes: the website set acquisition submodule is used for acquiring a historical website set in a preset historical time period; the access amount ratio obtaining sub-module is used for determining the access amount ratio of each website in the historical website set; a first target website determining sub-module, configured to determine, when a target website whose access amount proportion is greater than an access amount proportion threshold exists in the historical website set, a first target sample website corresponding to the target website in the sample websites; and the first information determining submodule is used for determining the identifier carrying information corresponding to each sample field of the first target sample website as carrying the merging prohibition identifier.
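
The access-ratio check performed by these submodules might be sketched in Python as follows. The function names `hot_urls` and `merge_prohibited_fields`, the way fields are split, and the representation of the merging prohibition identifier as a plain set of field names are assumptions for illustration only.

```python
import re
from collections import Counter


def hot_urls(history, ratio_threshold):
    """Return the URLs whose share of all accesses exceeds the threshold."""
    counts = Counter(history)
    total = sum(counts.values())
    return {url for url, n in counts.items() if n / total > ratio_threshold}


def merge_prohibited_fields(urls):
    """Collect every field of the given URLs as merge-prohibited."""
    fields = set()
    for url in urls:
        fields.update(f for f in re.split(r"[/&?=]", url) if f)
    return fields


history = (
    ["www.getWeather.com/today/temperature?cityName=Beijing"] * 6
    + ["www.getWeather.com/today/wind?cityName=Shanghai"] * 2
    + ["www.getWeather.com/today/wind?cityName=Guangzhou"] * 2
)
hot = hot_urls(history, ratio_threshold=0.5)
protected = merge_prohibited_fields(hot)
```

Here only the Beijing URL accounts for more than half of the historical accesses, so only its fields would carry the merging prohibition identifier.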

In one embodiment, the node merging module further includes: a second target website determining submodule, configured to determine a website containing the query field in the sample website as a second target sample website; a field determination submodule, configured to obtain, in each sample field, a target sample field corresponding to a query field of the second target sample website; and the second information determining submodule is used for determining the identification carrying information of the target sample field as carrying the merging prohibition identification.

In one embodiment, the node merging module further includes: the scale factor obtaining submodule is used for obtaining a preset scale factor; the product calculation submodule is used for acquiring the website number of the sample website and calculating the product of the website number and the proportionality coefficient; a number threshold determination submodule for determining the product as the node number threshold.

In one embodiment, the structure tree acquisition module includes: the expiration time determining submodule is used for determining the expiration time of a pre-constructed clustering structure tree to be selected; a first structure tree determining submodule, configured to determine the to-be-selected clustering structure tree as the target clustering structure tree when the expiration time is not reached; and the second structure tree determining submodule is used for acquiring a new sample structure tree when the expiration time is reached, and merging nodes of a third target level in the new sample structure tree, wherein the number of the nodes meets the condition, so as to obtain the target clustering structure tree.

In one embodiment, the apparatus for classifying web addresses further comprises: the path identifier determining module is used for determining a path identifier corresponding to a target public gateway interface when a website response state analysis instruction aiming at the target public gateway interface is received; the target website category determining module is used for determining a target website category corresponding to the path identifier; the matching website determining module is used for determining a target matching website matched with the target website category; the state data acquisition module is used for acquiring response state data corresponding to the target matching website; and the analysis result determining module is used for generating a website response state analysis result of the target public gateway interface according to the response state data.

For specific limitations on the website classification device, reference may be made to the limitations on the website classification method above, and details are not described herein again. All or part of the modules in the website classification device may be implemented by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing data such as a structure tree, a website set and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for classifying web addresses.

Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered to be within the scope of the present specification.

The above-mentioned embodiments only express several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for classifying web addresses, the method comprising:

2. The method of claim 1, wherein said matching each of the target fields with each of the nodes in the target clustering structure tree layer by layer according to the first node relationship comprises:

determining nodes matched with the target fields layer by layer in the target clustering structure tree according to the first node relation to obtain target nodes;

and when each target node forms a path from a root node to a leaf node in the target clustering structure tree, judging that each target field is matched with each node in the target path.

3. The method according to claim 2, wherein after determining the nodes matching the target fields in the target clustering structure tree layer by layer according to the first node relationship to obtain target nodes, the method further comprises:

when each target node cannot form a path from a root node to a leaf node in the target clustering structure tree, each target field is superposed into the target clustering structure tree according to the first node relation to obtain a structure tree to be merged;

when a clustering instruction is obtained, merging nodes of a second target level of the structure tree to be merged, wherein the number of the nodes meets the condition, so as to obtain a clustering updating structure tree; and the clustering updating structure tree is used for classifying the new website to be classified when the new website to be classified is obtained.

4. The method of claim 1, wherein before obtaining the target clustering structure tree, the method further comprises:

acquiring a sample website, and performing field segmentation on the sample website to obtain a sample field corresponding to a second node relation;

constructing a structure tree for each sample field according to the second node relation to obtain the sample structure tree;

performing node merging operation on the sample structure tree layer by layer so as to merge nodes in the first target level into any matching node when the first target level is accessed, and merge sub-nodes of each node in the first target level into sub-nodes of the any matching node; the arbitrary matching node can be matched with an arbitrary node;

and taking the sample structure tree after the nodes are combined as the target clustering structure tree.

5. The method of claim 4, wherein obtaining the sample website comprises:

receiving a website aiming at a target dynamic webpage sent by a front-end monitoring system;

adding the website aiming at the target dynamic webpage into a website set, and performing repeated website removal processing on the website set;

and when the number of the websites in the website set is greater than or equal to a website number threshold value, judging that a clustering condition is reached, and determining the websites in the website set as the sample websites.

6. The method according to claim 4, wherein the constructing the structure tree for each sample field according to the second node relationship to obtain the sample structure tree comprises:

sequentially configuring each sample field as a node in the sample structure tree according to the second node relationship; wherein the same sample field shares a node in the sample structure tree.

7. The method of claim 4, wherein said merging nodes in said first target hierarchy into an arbitrary matching node upon accessing said first target hierarchy, comprises:

acquiring identification carrying information of each sample field; the identifier carrying information is used for indicating whether the corresponding sample field carries a merging prohibition identifier or not;

when it is determined that the sample field corresponding to each node in the current access level does not carry the merge inhibition identifier according to the identifier carrying information, determining the number of nodes in the current access level; the current access level is a node level in the target clustering structure tree which is currently accessed;

when the number of the nodes of the current access level is larger than or equal to a threshold of the number of the nodes, the current access level is judged to be the first target level, and all the nodes in the current access level are combined into an arbitrary matching node.

8. The method according to claim 7, wherein before obtaining the identifier carrying information of each of the sample fields, further comprising:

acquiring a historical website set in a preset historical time period;

determining the access amount ratio of each website in the historical website set;

when a target website with the access amount ratio larger than an access amount ratio threshold exists in the historical website set, determining a first target sample website corresponding to the target website in the sample websites;

and determining the identifier carrying information corresponding to each sample field of the first target sample website as carrying the merging prohibition identifier.

9. The method of claim 7, wherein before the obtaining the identifier carrying information, further comprising:

determining a website containing the query field in the sample website as a second target sample website;

acquiring a target sample field corresponding to the query field of the second target sample website in each sample field;

and determining the identification carrying information of the target sample field as carrying the merging prohibition identification.

10. The method of claim 7, wherein before determining that the current access level is the first target level when the number of nodes of the current access level is greater than or equal to the node number threshold, the method further comprises:

acquiring a preset proportionality coefficient;

acquiring the website number of the sample website, and calculating the product of the website number and the proportionality coefficient;

determining the product as the node number threshold.

11. The method according to any one of claims 1 to 10, wherein the obtaining the target clustering structure tree comprises:

determining the expiration time of a pre-constructed clustering structure tree to be selected;

when the expiration time is not reached, determining the clustering structure tree to be selected as the target clustering structure tree;

and when the expiration time is reached, acquiring a new sample structure tree, and merging nodes of a third target level in the new sample structure tree, wherein the number of the nodes meets the condition, so as to obtain the target clustering structure tree.

12. The method according to any one of claims 1 to 10, wherein after determining the website category of the website to be classified according to the target path, the method further comprises:

when a website response state analysis instruction aiming at a target public gateway interface is received, determining a path identifier corresponding to the target public gateway interface;

determining a target website category corresponding to the path identifier;

determining a target matching website matched with the target website category;

acquiring response state data corresponding to the target matching website;

and generating a website response state analysis result of the target public gateway interface according to the response state data.

13. An apparatus for classifying web addresses, the apparatus comprising:

14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 12.

15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 12.