CN117792700A - Interface asset classification method, device, electronic equipment and medium - Google Patents

Interface asset classification method, device, electronic equipment and medium

Info

Publication number
CN117792700A
Authority
CN
China
Prior art keywords
information
combination
url
piece
url addresses
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311685294.1A
Other languages
Chinese (zh)
Inventor
段璨然
常力元
佟欣哲
薛萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Safety Technology Co Ltd
Original Assignee
Tianyi Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Safety Technology Co Ltd
Priority to CN202311685294.1A
Publication of CN117792700A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an interface asset classification method, an interface asset classification device, electronic equipment and a medium, and relates to the technical field of network security. The interface asset classification method comprises the following steps: acquiring a plurality of Uniform Resource Locator (URL) addresses from network traffic at an interface; for each combination of any two of the URL addresses, determining the similarity between the URL addresses in the combination according to each piece of information in those URL addresses; and, if the similarity between the URL addresses in the combination exceeds a preset value, determining that the interfaces corresponding to the URL addresses in the combination are of the same type. According to the embodiment of the invention, the similarity of URL addresses is determined from each piece of information in the URL addresses, and whether the corresponding interfaces are of the same type is determined from that similarity, which improves working efficiency.

Description

Interface asset classification method, device, electronic equipment and medium
Technical Field
The present invention relates to the field of network security technologies, and in particular, to a method and apparatus for classifying interface assets, an electronic device, and a medium.
Background
In API asset management, problems such as a huge number of APIs, unclear ownership, and inconsistent standards currently exist, which places high demands on the inventorying of API assets. API assets can be inventoried by referring to the API specification documents, but such documents are difficult to obtain and slow to be updated. In actual API security products, API assets are therefore typically identified from network traffic. The key to distinguishing APIs is to identify which parts of a network request are API parameters, so that network traffic can be aggregated to form API assets. As specified by OpenAPI, API parameters may be located in the request parameters, the HTTP header, the URL path, or the cookie. Request parameters, HTTP headers and cookies have an obvious format, such as XX="AA", so the value AA after the equals sign can be extracted directly as the API parameter, thereby forming the API asset.
However, when the API parameter is located in the URL path, there is no equals-sign format, so the API parameter cannot be identified by matching that format and the APIs cannot be classified. Human intervention is then required in practice, which lowers the efficiency of distinguishing APIs.
Disclosure of Invention
The invention provides an interface asset classification method, an interface asset classification device, electronic equipment and a medium, which can determine the similarity of URL addresses through each piece of information in the URL addresses, and determine whether interfaces corresponding to the URL addresses are of the same type or not through the similarity of the URL addresses, so that the working efficiency is improved.
In a first aspect, an embodiment of the present invention provides an interface asset classification method, including:
acquiring a plurality of Uniform Resource Locator (URL) addresses from network traffic at an interface;
for each combination, determining the similarity between the URL addresses in the combination according to each piece of information in the URL addresses in the combination; each combination consists of any two URL addresses in a plurality of URL addresses; each piece of information in the URL address is obtained by splitting a field in the URL address;
if the similarity between the URL addresses in the combination exceeds a preset value, determining that the interfaces corresponding to the URL addresses in the combination are of the same type.
According to the method, the URL addresses are obtained from the network traffic at the interfaces, the similarity of the URL addresses is determined according to each piece of information in the URL addresses, whether the interfaces corresponding to the URL addresses are of the same type or not is determined according to the similarity of the URL addresses, and the working efficiency is improved.
In one possible implementation, determining the similarity between URL addresses in the combination from each piece of information in the URL addresses in the combination includes:
determining the similarity between every two pieces of information in each piece of information in the URL addresses in the combination;
and determining the similarity between the URL addresses in the combination according to the similarity between every two pieces of information in the URL addresses in the combination.
According to the method, the similarity between every two pieces of information in the URL address is determined, and the similarity between the two addresses is determined, so that the accuracy of the determination is improved.
In one possible implementation, determining the similarity between every two pieces of information in the URL addresses in the combination includes:
determining an array corresponding to each piece of information in the URL addresses in the combination;
and taking the cosine distance between arrays corresponding to every two pieces of information in each piece of information in the URL address in the combination as the similarity between every two pieces of information in each piece of information in the URL address in the combination.
According to the method, the array corresponding to each piece of information in the URL address is determined, the cosine distance between the arrays is used as the similarity between the arrays, and the similarity between texts is determined through the cosine distance, so that the operation efficiency is improved.
In one possible implementation manner, determining the array corresponding to each piece of information in the URL address includes:
determining an initial array corresponding to each piece of information in the URL addresses in the combination according to the position of each piece of information within its URL address and the data type to which each piece of information belongs;
determining a target index of an initial array corresponding to each piece of information in the URL address in the combination, and taking the target index of the initial array corresponding to each piece of information in the URL address in the combination as an array corresponding to each piece of information in the URL address in the combination; the target index of the initial array corresponding to one piece of information in the URL address characterizes the importance degree of the one piece of information in the URL address to the URL address.
According to the method, the initial array corresponding to each piece of information in the URL address can be determined through the position and the data type of each piece of information in the URL address, the array corresponding to each piece of information in the URL address is formed based on the importance degree of each piece of information in the URL address, so that the importance degree of each piece of information in the URL address can be determined to determine the similarity between the pieces of information, and the accuracy of similarity calculation is improved.
In one possible implementation manner, determining an initial array corresponding to each piece of information in the URL addresses in the combination according to the position of each piece of information within its URL address and the data type to which each piece of information belongs includes:
determining a generalization value corresponding to each piece of information in the URL addresses in the combination according to the position of each piece of information within its URL address;
for each piece of information in the URL addresses in the combination, if the data type of the information in the URL addresses in the combination is a preset type, the generalized value corresponding to the information in the URL addresses in the combination and a first preset value are used as an initial array corresponding to the information in the URL addresses in the combination;
if the data type of the information in the URL address in the combination is not the preset type, the generalized value corresponding to the information in the URL address in the combination and the second preset value are used as an initial array corresponding to the information in the URL address in the combination; wherein the second preset value is greater than the first preset value.
According to the method, the initial array corresponding to the information is determined through the position of each piece of information in the URL address and the type of the information, and the content of the information is better expressed through two factors of the position and the type of the data.
In one possible implementation manner, after determining that the interfaces corresponding to URL addresses in the combination are of the same type, the method further includes:
according to whether the interfaces corresponding to the URL addresses in each combination are of the same type, re-dividing the plurality of combinations to obtain a plurality of sets; wherein the set consists of at least two URL addresses;
according to the editing distance of every two URL addresses in each set, determining the URL address to be deleted in each set; deleting the URL addresses to be deleted in each set;
and forming interface assets according to each deleted set.
According to the method, the plurality of URL addresses can be divided again to form the interface asset, and the formed interface asset is more accurate.
In one possible implementation manner, determining URL addresses to be deleted in each set according to the editing distance of every two URL addresses in each set includes:
determining a generalization value corresponding to a data type to which each piece of information in every two URL addresses in each set belongs, and converting a symbol in each piece of information in every two URL addresses in each set into a mathematical symbol;
determining the editing distance of the pairwise URL addresses in each set according to the generalized value corresponding to the data type to which each piece of information in the pairwise URL addresses in each set belongs and the mathematical symbol after each piece of information in the pairwise URL addresses in each set is converted;
for each set, taking as the URL address to be deleted a target URL address whose edit distance to any other URL address in the set does not meet the preset condition; wherein the other URL addresses in a set are the URL addresses in that set other than the target URL address.
According to the method, the editing distance can be determined for generalization processing of the URL address, the URL address to be deleted is determined based on the editing distance, and the accuracy of determination is improved.
In a second aspect, an embodiment of the present invention provides an interface asset classification device, including:
the acquisition module is used for acquiring a plurality of Uniform Resource Locator (URL) addresses from network traffic at the interface;
a first determining module, configured to determine, for each combination, a similarity between URL addresses in the combination according to each piece of information in URL addresses in the combination; each combination consists of any two URL addresses in a plurality of URL addresses; each piece of information in the URL address is obtained by splitting a field in the URL address;
and the second determining module is used for determining that the interfaces corresponding to the URL addresses in the combination are of the same type if the similarity between the URL addresses in the combination exceeds a preset value.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a memory;
a processor for executing a computer program or instructions in the memory such that the interface asset classification method according to any of the first aspects is performed.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium storing a computer program or instructions which, when executed by a processor, cause the processor to perform the interface asset classification method according to any one of the first aspects.
In a fifth aspect, embodiments of the present invention provide a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the interface asset classification method as described in any one of the first aspects above.
In addition, the technical effects caused by any implementation manner of the second aspect to the fifth aspect may refer to the technical effects caused by different implementation manners of the first aspect, which are not described herein.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
FIG. 1 is a flow chart of an interface asset classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for determining similarity between two URL addresses according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method of forming an interface asset provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a method for forming an interface asset from a plurality of URL addresses according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an interface asset classification device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Explanation of terms:
TF-IDF (term frequency-inverse document frequency) is a statistical method for evaluating how important a word is to one document in a document set or corpus. TF-IDF has two components: term frequency (TF) and inverse document frequency (IDF), with TF-IDF = TF × IDF. It is a common model in NLP.
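As an illustration of the TF × IDF product described above, the following is a minimal sketch in plain Python. It assumes TF is the term count divided by the document length and IDF is log(N / document frequency); other variants (smoothing, normalization) exist, and the function name `tf_idf` and the sample documents are illustrative assumptions rather than part of the described scheme.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: TF = count / document length,
    IDF = log(N / number of documents containing the term).
    Every document is assumed to be a non-empty list of terms.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Two URL paths, already split into pieces, treated as two "documents".
docs = [["v1", "app", "1017", "show"], ["v1", "app", "query", "1234"]]
scores = tf_idf(docs)
```

Under this variant, a term that appears in every document (such as v1 here) receives a weight of zero, while a term unique to one document receives a positive weight.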
NLP (Natural Language Processing) is a machine learning technique that enables a computer to interpret, process and understand human language. It generally proceeds through preprocessing, training, deployment and inference. Common approaches include supervised NLP, unsupervised NLP, natural language understanding, and natural language generation. Unsupervised NLP uses a statistical language model to predict the patterns that occur when unlabeled input is provided. In the field of natural language processing, gensim is a very important toolkit. It is an open-source third-party Python toolkit that learns latent topic vector representations of text from raw unstructured text in an unsupervised manner. It supports various topic model algorithms including TF-IDF, LSA and LDA, supports streaming training, and provides API (Application Programming Interface) implementations of common tasks such as similarity calculation and information retrieval.
The edit distance is a measure of the similarity of two sequences. It can be used to determine the similarity between two words w1 and w2.
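The edit distance between two words can be sketched as follows. This is the standard Levenshtein distance (insertions, deletions and substitutions each cost 1), which is assumed here because the text does not say which edit distance variant is used; the function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

A smaller distance means the two words w1 and w2 are more similar; identical words have distance 0.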
Current API security protection has begun to rely on a new type of API firewall. An API firewall depends on accurately identifying the API assets in the network, but API asset management currently suffers from problems such as a huge number of APIs, unclear ownership, and inconsistent standards, which places high demands on the inventorying of API assets. API assets can be inventoried by referring to the API specification documents, but such documents are difficult to obtain and slow to be updated. In actual API security products, API assets are therefore typically identified from network traffic. The key to distinguishing APIs is to identify which parts of a network request are API parameters, so that network traffic can be aggregated to form API assets. As specified by OpenAPI, API parameters may be located in the request parameters, the HTTP header, the URL path, or the cookie. Request parameters, HTTP headers and cookies have an obvious format, such as XX="AA", so the value AA after the equals sign can be extracted directly as the API parameter, thereby forming the API asset. However, when the API parameter is located in the URL path, there is no equals-sign format, and existing technical solutions cannot handle this case; human intervention is required in practice, so the efficiency of distinguishing APIs is low.
To address this situation and overcome the shortcomings of existing algorithms, the invention provides an API asset identification scheme based on TF-IDF and the edit distance algorithm. The scheme uses an algorithm model that first converts the API requests in network traffic into topic vectors, and then uses a document similarity algorithm to group API requests that share many identical path components into one class. The similarity is then checked with the edit distance algorithm to confirm whether the API requests can be aggregated into a unified template, thereby achieving API asset identification. The invention reduces human intervention in practical applications and achieves efficient and accurate identification.
Referring to fig. 1, an embodiment of the present invention provides an interface asset classification method, including:
S100: acquiring a plurality of Uniform Resource Locator (URL) addresses from network traffic at an interface;
The interface may be an API (Application Programming Interface).
In detail, the extracted network traffic is part of the content of an HTTP/HTTP2 request, which includes the URL address. The URL address contains API parameters. The URL addresses obtained are, for example, /v1/app/1017/show, /v1/app/query/1234, and /v1/app/12-33-44/show.
S101: for each combination, determining the similarity between the URL addresses in the combination according to each piece of information in the URL addresses in the combination; each combination consists of any two URL addresses in a plurality of URL addresses;
each piece of information in the URL address is obtained by splitting a field in the URL address;
For example, if the URL addresses obtained are /v1/app/1017/show, /v1/app/query/1234 and /v1/app/12-33-44/show, the combinations may include a first combination: /v1/app/1017/show and /v1/app/query/1234; a second combination: /v1/app/query/1234 and /v1/app/12-33-44/show; and a third combination: /v1/app/1017/show and /v1/app/12-33-44/show;
for the first combination, the similarity between /v1/app/1017/show and /v1/app/query/1234 is determined from each piece of information in /v1/app/1017/show and /v1/app/query/1234;
for the second combination, the similarity between /v1/app/query/1234 and /v1/app/12-33-44/show is determined from each piece of information in /v1/app/query/1234 and /v1/app/12-33-44/show;
for the third combination, the similarity between /v1/app/1017/show and /v1/app/12-33-44/show is determined from each piece of information in /v1/app/1017/show and /v1/app/12-33-44/show.
As for each piece of information in a URL address: for example, if the URL address is /v1/app/1017/show, the pieces of information in the URL address are v1, app, 1017, show.
Similarly, if the URL address is /v1/app/query/1234, the pieces of information in the URL address are v1, app, query, 1234;
if the URL address is /v1/app/12-33-44/show, the pieces of information in the URL address are v1, app, 12-33-44, show.
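The splitting and pairing described above can be sketched as follows. The sketch assumes that the pieces of information are obtained by splitting the URL path on "/" and that the combinations are all unordered pairs of URL addresses; the names `split_url`, `pieces` and `pairs` are illustrative.

```python
from itertools import combinations


def split_url(url):
    """Split a URL path on '/' and drop the empty piece before the leading slash."""
    return [piece for piece in url.split("/") if piece]


urls = ["/v1/app/1017/show", "/v1/app/query/1234", "/v1/app/12-33-44/show"]

# Pieces of information for every URL address.
pieces = {url: split_url(url) for url in urls}

# Every combination consists of any two of the URL addresses.
pairs = list(combinations(urls, 2))
```

For the three example URL addresses this yields three combinations, matching the first, second and third combinations listed above (in a different order).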
S102: if the similarity between the URL addresses in the combination exceeds a preset value, determining that the interfaces corresponding to the URL addresses in the combination are of the same type.
The interface corresponding to the URL address is the interface accessed by the URL address.
For example, if the similarity between /v1/app/1017/show and /v1/app/query/1234 exceeds the preset value, it is determined that the interface corresponding to /v1/app/1017/show and the interface corresponding to /v1/app/query/1234 are of the same type;
if the similarity between /v1/app/query/1234 and /v1/app/12-33-44/show exceeds the preset value, it is determined that the interface corresponding to /v1/app/query/1234 and the interface corresponding to /v1/app/12-33-44/show are of the same type;
if the similarity between /v1/app/1017/show and /v1/app/12-33-44/show exceeds the preset value, it is determined that the interface corresponding to /v1/app/1017/show and the interface corresponding to /v1/app/12-33-44/show are of the same type.
Of course, if the similarity between the URL addresses in the combination does not exceed the preset value, it is determined that the interfaces corresponding to the URL addresses in the combination are not of the same type.
The embodiment of the invention provides a specific implementation method for determining the similarity between URL addresses in the combination, which is shown in fig. 2, wherein the specific implementation method comprises the following steps:
S200: determining the similarity between every two pieces of information among the pieces of information in the URL addresses in the combination;
In detail, an array corresponding to each piece of information in the URL addresses in the combination is determined first; the cosine distance between the arrays corresponding to every two pieces of information is then taken as the similarity between those two pieces of information.
Determining the array corresponding to each piece of information in the URL addresses in the combination comprises: determining an initial array corresponding to each piece of information according to the position of that piece of information within its URL address and the data type to which it belongs; determining a target index of the initial array corresponding to each piece of information, and taking that target index as the array corresponding to the piece of information; the target index of the initial array corresponding to a piece of information in a URL address characterizes the importance degree of that piece of information to the URL address.
The process of determining the initial array corresponding to each piece of information in the URL addresses in the combination is as follows: determining a generalization value corresponding to each piece of information according to the position of that piece of information within its URL address;
for each piece of information in the URL addresses in the combination, if the data type to which the information belongs is a preset type, the generalization value corresponding to the information and a first preset value are used as the initial array corresponding to the information;
if the data type to which the information belongs is not the preset type, the generalization value corresponding to the information and a second preset value are used as the initial array corresponding to the information; wherein the second preset value is greater than the first preset value.
For example, taking the URL /v1/app/1017/show, generalization is performed using the position index of each piece of information in the original data and its number of words.
The position index and calculation formula for the original data are shown in table 1:
TABLE 1
The first piece of information in the URL address is v1: its position index is 0, its number of words is 1, and its normalization result is 1. The second is app: position index 1, number of words 1, normalization result 2. The third is 1017: position index 2, number of words 1, normalization result 3. The fourth is show: position index 3, number of words 1, normalization result 4. The normalization results of v1, app, 1017 and show are therefore 1, 2, 3 and 4, respectively.
The preset type may be, for example, pure numbers; the first preset value may be 1 and the second preset value may be 2;
when an initial array corresponding to each piece of information is determined, the first element is a normalization result of each piece of information, and the second element is a weight of each piece of information. In order to reduce the sensitivity of the pure numbers (1017), the weight of the pure numbers is reduced in the preprocessing stage.
For example, v1 in url is not a preset type, and the initial array corresponding to v1 is (1, 2); the app is not of a preset type, and an initial array corresponding to the app is (2, 2); 1017 is a preset type, and the initial array corresponding to 1017 is (3, 1); the show is not of a preset type, and the initial array corresponding to the show is (4, 2).
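The construction of the initial arrays in this example can be sketched as follows. The sketch assumes that the normalization result equals the position index plus the number of words (which reproduces the values 1, 2, 3, 4 of the example, but is an inferred formula since Table 1 is not reproduced here), that the preset type is "pure numbers", and that the weights are 1 for pure numbers and 2 otherwise. The function name `initial_arrays` and its parameters are illustrative.

```python
def initial_arrays(pieces, preset_weight=1, other_weight=2):
    """Build one (normalization result, weight) pair per piece of information.

    Assumptions: normalization result = position index + number of words,
    and the preset type is a piece consisting only of digits.
    """
    arrays = []
    for index, piece in enumerate(pieces):
        word_count = 1                    # each piece counts as one word, as in the example
        generalized = index + word_count  # assumed formula, consistent with the example
        # Pure numbers get the lower weight to reduce their influence.
        weight = preset_weight if piece.isdigit() else other_weight
        arrays.append((generalized, weight))
    return arrays


arrays = initial_arrays(["v1", "app", "1017", "show"])
```

For the pieces v1, app, 1017, show this produces (1, 2), (2, 2), (3, 1), (4, 2), the same initial arrays as in the example.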
The importance degree of v1 in /v1/app/1017/show is determined from (1, 2) together with (1, 2), (2, 2), (3, 1), (4, 2), and the obtained importance degree is used as the array corresponding to v1;
the importance degree of app in /v1/app/1017/show is determined from (2, 2) together with (1, 2), (2, 2), (3, 1), (4, 2), and the obtained importance degree is used as the array corresponding to app;
the importance degree of 1017 in /v1/app/1017/show is determined from (3, 1) together with (1, 2), (2, 2), (3, 1), (4, 2), and the obtained importance degree is used as the array corresponding to 1017;
the importance degree of show in /v1/app/1017/show is determined from (4, 2) together with (1, 2), (2, 2), (3, 1), (4, 2), and the obtained importance degree is used as the array corresponding to show.
The initial array corresponding to each piece of information is converted into the array corresponding to that piece of information through a TF-IDF model; that is, the importance degree of each piece of information to the URL address that contains it is calculated through the TF-IDF model and used as the array corresponding to that piece of information.
Taking the similarity calculation between URL address A, /v1/app/1017/show, and URL address B, /v1/app/12-33-44/show, as an example, the cosine distance is calculated between each of the four pieces of information v1, app, 1017, show of URL address A and each of the four pieces of information v1, app, 12-33-44, show of URL address B.
S201: and determining the similarity between the URL addresses in the combination according to the similarity between every two pieces of information in the URL addresses in the combination.
Illustratively, the maximum value in the similarity between every two pieces of information in the URL addresses in the combination is taken as the similarity between the URL addresses in the combination; or alternatively
And taking the average value of the similarity between every two pieces of information in the URL addresses in the combination as the similarity between the URL addresses in the combination.
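The pairwise comparison and the two aggregation options (maximum or average) can be sketched as follows. The sketch computes the cosine of the angle between two arrays as the similarity value, which is an assumed reading of "cosine distance" as used in the text, and it compares every array of one URL address with every array of the other. The names `cosine` and `url_similarity` are illustrative, and the arrays passed in stand in for the TF-IDF-weighted arrays.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric arrays."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero arrays
    return dot / (norm_u * norm_v)


def url_similarity(arrays_a, arrays_b, mode="mean"):
    """Aggregate the piece-to-piece similarities of two URL addresses.

    mode="max" takes the maximum pairwise value,
    any other mode takes the average.
    """
    scores = [cosine(a, b) for a in arrays_a for b in arrays_b]
    return max(scores) if mode == "max" else sum(scores) / len(scores)
```

Two URL addresses would then be treated as the same type when `url_similarity` exceeds the preset value; the choice between maximum and average changes how strongly a single matching piece influences the result.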
After determining that the interfaces corresponding to the URL addresses in the combination are of the same type, in conjunction with fig. 3, the method further includes:
S300: according to whether the interfaces corresponding to the URL addresses in each combination are of the same type, re-dividing the plurality of combinations to obtain a plurality of sets; wherein each set consists of at least two URL addresses;
In detail, combinations whose URL addresses correspond to interfaces of the same type and which share a URL address are merged to obtain a set. Illustratively, suppose the first combination comprises /v1/app/query/1234 and /v1/app/query/2345, and the second combination comprises /v1/app/query/2345 and /v1/app/query/7894. The interface corresponding to /v1/app/query/1234 and the interface corresponding to /v1/app/query/2345 in the first combination are of the same type, the interface corresponding to /v1/app/query/2345 and the interface corresponding to /v1/app/query/7894 in the second combination are of the same type, and the first combination and the second combination contain the same URL address, /v1/app/query/2345. The first combination and the second combination are therefore merged into one set: /v1/app/query/1234, /v1/app/query/2345, /v1/app/query/7894.
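Merging combinations that share a URL address can be sketched with a union-find (disjoint-set) structure. This is one possible way to realize the merge and is an assumption, not a technique named in the text; the function name `merge_pairs` is illustrative, and its input is the list of combinations already judged to be of the same type.

```python
def merge_pairs(same_type_pairs):
    """Merge same-type URL pairs that share a URL address into sets."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if it is new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in same_type_pairs:
        parent[find(a)] = find(b)  # join the two groups

    groups = {}
    for url in list(parent):
        groups.setdefault(find(url), set()).add(url)
    return list(groups.values())


pairs = [
    ("/v1/app/query/1234", "/v1/app/query/2345"),
    ("/v1/app/query/2345", "/v1/app/query/7894"),
]
sets = merge_pairs(pairs)
```

For the two combinations of the example, the shared address /v1/app/query/2345 links them, so a single set containing all three URL addresses is produced.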
S301: Determine the URL address to be deleted in each set according to the edit distance between every two URL addresses in each set, and delete the URL addresses to be deleted from each set.
In detail, determine the generalization value corresponding to the data type to which each piece of information in every two URL addresses in each set belongs, and convert the symbols in each piece of information in every two URL addresses in each set into mathematical symbols;
determine the edit distance between every two URL addresses in each set according to the generalization value corresponding to the data type to which each piece of information belongs and the mathematical symbols obtained by converting each piece of information;
take as the URL address to be deleted the target URL address in each set whose edit distance to any one other URL address in that set does not satisfy the preset condition, wherein the other URL addresses in a set are the URL addresses in the set other than the target URL address.
The edit distance between URL addresses in the combination satisfying the preset condition may mean that the ratio of the edit distance between the URL addresses in the combination to the data length is smaller than a threshold value.
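The deletion rule of S301 can be sketched as follows. This is a hedged illustration only: the function `urls_to_delete`, the `ratio` callable (edit distance divided by data length for a pair) and the `require_all` switch are hypothetical. The text says a target is deleted when its edit distance to "any one other" URL address fails the preset condition; since that wording can be read in two ways, the sketch exposes both readings rather than asserting one.

```python
def urls_to_delete(urls, ratio, threshold, require_all=False):
    """Pick the URL addresses to delete from one set.

    urls:        the URL addresses of one set.
    ratio:       hypothetical callable giving edit distance / data length for a pair.
    threshold:   the preset condition holds when ratio(a, b) < threshold.
    require_all: False -> delete a target that fails against any other member;
                 True  -> delete it only if it fails against every other member.
    """
    quantifier = all if require_all else any
    doomed = set()
    for target in urls:
        others = [u for u in urls if u != target]
        if others and quantifier(ratio(target, o) >= threshold for o in others):
            doomed.add(target)
    return doomed

# Toy ratio for illustration only: "a" and "b" are close, "c" is far from both.
def toy_ratio(x, y):
    return 0.0 if {x, y} == {"a", "b"} else 0.5

print(urls_to_delete(["a", "b", "c"], toy_ratio, 0.3))
print(urls_to_delete(["a", "b", "c"], toy_ratio, 0.3, require_all=True))
```

Under the first reading a single outlier causes every member that is compared with it to be deleted as well, while the second reading removes only the outlier; the embodiment does not state which behaviour is intended.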
For example, the generalization value corresponding to letters and digits in a URL address is set to 1, and a symbol in the URL address, for example "-", is converted into the minus sign, as shown in Table 3:
Table 3
URL address | Normalization result
/v1/app/query/1234 | /1/1/1/1
/v1/app/query/2345 | /1/1/1/1
/v1/app/18sj-uujf-4308/show | /1/1/1-1-1/1
/v1/app/1g28-d83h-5f43-yuir/show | /1/1/1-1-1-1/1
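The normalization shown in Table 3 can be reproduced with a short sketch. It assumes that every run of letters and/or digits has the generalization value 1 and that "/" and "-" are kept as they are (the "-" then standing for the minus sign); the function name `normalize` is hypothetical.

```python
import re

def normalize(url):
    """Replace every run of letters and/or digits with 1; keep '/' and '-'."""
    return re.sub(r"[A-Za-z0-9]+", "1", url)

for url in (
    "/v1/app/query/1234",
    "/v1/app/query/2345",
    "/v1/app/18sj-uujf-4308/show",
    "/v1/app/1g28-d83h-5f43-yuir/show",
):
    print(url, normalize(url))
```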
As shown in Table 3, for the URL address /v1/app/query/1234, the data type to which v1 belongs is a character string composed of letters and digits, and the generalization value corresponding to v1 is 1; the data type to which app belongs is letters, and the generalization value corresponding to app is 1; the data type to which query belongs is letters, and the generalization value corresponding to query is 1; the data type to which 1234 belongs is digits, the generalization value corresponding to 1234 is 1, and the normalization result is /1/1/1/1.
For the URL address /v1/app/query/2345, the data type to which v1 belongs is a character string composed of letters and digits, and the generalization value corresponding to v1 is 1; the data type to which app belongs is letters, and the generalization value corresponding to app is 1; the data type to which query belongs is letters, and the generalization value corresponding to query is 1; the data type to which 2345 belongs is digits, the generalization value corresponding to 2345 is 1, and the normalization result is /1/1/1/1.
/v1/app/query/1234 and /v1/app/query/2345 form a combination. If the similarity between /v1/app/query/1234 and /v1/app/query/2345 exceeds the preset value, the edit distance between the normalization result /1/1/1/1 of /v1/app/query/1234 and the normalization result /1/1/1/1 of /v1/app/query/2345 is calculated, giving an edit distance of 0.
The data length of /v1/app/query/1234 is 4, and the data length of /v1/app/query/2345 is 4. The edit distance between them is divided by the data length, i.e. 0/4 = 0. If 0 is smaller than the threshold value, it is determined that the edit distance between them satisfies the preset condition and that the interfaces corresponding to them are of the same type.
As shown in Table 3, for the URL address /v1/app/18sj-uujf-4308/show, the data type to which v1 belongs is a character string combining letters and digits, and the generalization value corresponding to v1 is 1; the data type to which app belongs is letters, and the generalization value corresponding to app is 1; 18sj-uujf-4308 comprises one string combining letters and digits (18sj), one string of letters (uujf), and one string of digits (4308), and the generalization value corresponding to 18sj-uujf-4308 is 1-1-1; the data type to which show belongs is letters, the generalization value corresponding to show is 1, and the normalization result is /1/1/1-1-1/1. Here, each "-" in 18sj-uujf-4308 is converted into the mathematical minus sign.
For the URL address /v1/app/1g28-d83h-5f43-yuir/show, the data type to which v1 belongs is a character string combining letters and digits, and the generalization value corresponding to v1 is 1; the data type to which app belongs is letters, and the generalization value corresponding to app is 1; 1g28-d83h-5f43-yuir comprises four strings separated by "-", each consisting of letters, digits, or a combination of both, and the generalization value corresponding to 1g28-d83h-5f43-yuir is 1-1-1-1; the data type to which show belongs is letters, the generalization value corresponding to show is 1, and the normalization result is /1/1/1-1-1-1/1. Here, each "-" in 1g28-d83h-5f43-yuir is converted into the mathematical minus sign.
/v1/app/18sj-uujf-4308/show and /v1/app/1g28-d83h-5f43-yuir/show form a combination. If the similarity between them exceeds the preset value, the edit distance between the normalization result /1/1/1-1-1/1 of /v1/app/18sj-uujf-4308/show and the normalization result /1/1/1-1-1-1/1 of /v1/app/1g28-d83h-5f43-yuir/show is calculated, giving an edit distance of 2.
The data length of /v1/app/18sj-uujf-4308/show is 4, and the data length of /v1/app/1g28-d83h-5f43-yuir/show is 4. The edit distance between them is divided by the data length, i.e. 2/4 = 0.5. If 0.5 is smaller than the threshold value, it is determined that the edit distance between them satisfies the preset condition and that the interfaces corresponding to them are of the same type. If the threshold value is set lower, so that 0.5 is larger than the threshold value, it is determined that the edit distance between them does not satisfy the preset condition and that the interfaces corresponding to them are not of the same type.
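The edit distances in the two worked examples can be checked with a short sketch. It assumes the edit distance is the character-level Levenshtein distance between normalization results, which is consistent with the values 0 and 2 given in the text but is not stated explicitly; the names `edit_distance` and `meets_condition` and the threshold values are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]

def meets_condition(norm_a, norm_b, data_length, threshold):
    """Preset condition: edit distance / data length is smaller than the threshold."""
    return edit_distance(norm_a, norm_b) / data_length < threshold

print(edit_distance("/1/1/1/1", "/1/1/1/1"))              # 0
print(edit_distance("/1/1/1-1-1/1", "/1/1/1-1-1-1/1"))    # 2
print(meets_condition("/1/1/1-1-1/1", "/1/1/1-1-1-1/1", 4, 0.6))
```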
S302: Form interface assets according to each set after deletion.
For example, when the interfaces corresponding to the URL addresses in a combination are determined to be of the same type, the corresponding API assets are aggregated into one API asset; e.g., /v1/app/query/1234 and /v1/app/query/2345 are aggregated into /v1/app/query/{number}.
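The aggregation of a set into one API asset can be sketched as follows. This is a minimal illustration, assuming that all URL addresses in the set have the same number of pieces; the function name `aggregate` and the `{string}` placeholder are hypothetical, and only `{number}` appears in the text.

```python
def aggregate(urls, separator="/"):
    """Collapse same-type URL addresses into one template."""
    split = [u.strip(separator).split(separator) for u in urls]
    if len({len(parts) for parts in split}) != 1:
        raise ValueError("URL addresses must have the same number of pieces")
    template = []
    for column in zip(*split):
        if len(set(column)) == 1:
            template.append(column[0])       # identical piece is kept verbatim
        elif all(piece.isdigit() for piece in column):
            template.append("{number}")
        else:
            template.append("{string}")      # hypothetical placeholder, not from the source
    return separator + separator.join(template)

print(aggregate(["/v1/app/query/1234", "/v1/app/query/2345"]))
```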
Based on the above technical solution, referring to fig. 4, an embodiment of the present invention provides a classification method, including:
S400: Intercept a plurality of URL addresses in HTTP/HTTP2 requests from network traffic at an interface.
Since the URL addresses are encoded, they need to be decoded after the plurality of URL addresses are acquired.
S401: Decode each URL address according to its specific encoding format.
For example, for URL encoding, hex encoding, and the like, a URL-encoded address is URL-decoded and a hex-encoded address is hex-decoded.
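The decoding in S401 can be sketched with the Python standard library. How the encoding format of a captured address is recognised is not described in the text, so the heuristic below (treating a string made up only of hex digit pairs as hex-encoded, and everything else as URL-encoded) is purely an assumption, as is the function name `decode_url`.

```python
import re
from urllib.parse import unquote

# Assumed heuristic: the whole string consists of pairs of hex digits.
HEX_RE = re.compile(r"(?:[0-9A-Fa-f]{2})+")

def decode_url(raw):
    """Decode a captured URL address according to its (guessed) encoding format."""
    if HEX_RE.fullmatch(raw):
        return bytes.fromhex(raw).decode("utf-8", errors="replace")
    return unquote(raw)  # percent-decoding; plain text passes through unchanged

print(decode_url("/v1/app/my%20app/show"))
print(decode_url("2f76312f617070"))
```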
S402: Split each URL address using a preset separator.
S403: Normalize each URL address.
In detail, determine the generalization value corresponding to each piece of information in each URL address according to the position of that piece of information in the URL address;
the generalization values constitute the normalization result.
S404: Represent each URL address as a sparse vector according to the normalization result of each URL address.
Specifically, for each piece of information in each URL address, if the data type to which the piece of information belongs is a preset type, the generalization value corresponding to the piece of information and a first preset value are used as the initial array corresponding to the piece of information;
if the data type to which the piece of information belongs is not the preset type, the generalization value corresponding to the piece of information and a second preset value are used as the initial array corresponding to the piece of information, wherein the second preset value is greater than the first preset value.
The initial arrays corresponding to the pieces of information in a URL address are combined to represent the URL address as a sparse vector. For example, for the URL address /v1/app/1017/show, v1 corresponds to the initial array (1, 2), app corresponds to the initial array (2, 2), 1017 corresponds to the initial array (3, 1), and show corresponds to the initial array (4, 2), so the URL address is expressed as the sparse vector [(1, 2), (2, 2), (3, 1), (4, 2)].
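Steps S402 to S404 can be sketched as follows. The sketch assumes, based only on the example above, that the generalization value of a piece of information is its 1-based position, that the preset type is "purely numeric", and that the first and second preset values are 1 and 2; the function name `to_sparse_vector` and its parameters are hypothetical.

```python
def to_sparse_vector(url, separator="/", first_preset=1, second_preset=2):
    """Represent a URL address as a list of (position, preset value) arrays."""
    pieces = [p for p in url.split(separator) if p]   # S402: split by the separator
    vector = []
    for position, piece in enumerate(pieces, start=1):
        # Assumed preset type: a piece made up of digits only.
        preset = first_preset if piece.isdigit() else second_preset
        vector.append((position, preset))
    return vector

print(to_sparse_vector("/v1/app/1017/show"))
```

Giving the smaller value to pieces of the preset type lowers the weight of variable parts such as identifiers, so that URL addresses differing only in those parts remain close to each other.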
S405: Convert the sparse vector representation of each URL address into the TFIDF vector of each URL address through a TF-IDF model.
Here, a TF-IDF model is built using NLP technology with the Gensim tool, and each piece of information in each URL address is converted into a TFIDF vector.
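The conversion in S405 can be illustrated without any third-party dependency. The text names the Gensim tool; the block below is not Gensim but a plain-Python stand-in, and its weighting (count multiplied by log2 of the number of URL addresses over the document frequency, followed by L2 normalization, with zero weights dropped) is an assumption made for illustration. The function name `tfidf_vectors` and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """corpus: list of sparse vectors [(term_id, count), ...], one per URL address."""
    n_docs = len(corpus)
    # Document frequency: in how many sparse vectors each term id occurs.
    doc_freq = Counter(term_id for doc in corpus for term_id, _ in doc)
    result = []
    for doc in corpus:
        weighted = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        weighted = [(t, w) for t, w in weighted if w > 0.0]   # drop zero weights
        norm = math.sqrt(sum(w * w for _, w in weighted))
        result.append([(t, w / norm) for t, w in weighted] if norm else [])
    return result

corpus = [
    [(1, 2), (2, 2), (3, 1), (4, 2)],   # e.g. /v1/app/1017/show
    [(1, 2), (2, 2), (3, 2), (4, 2)],   # a four-piece address without numeric pieces
    [(1, 2), (2, 2)],                   # a two-piece address
]
for vector in tfidf_vectors(corpus):
    print(vector)
```

Under this weighting, term ids that occur in every URL address receive a weight of zero, so only the positions that are not shared by the whole corpus contribute to the resulting vectors.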
S406: Determine the similarity between every two URL addresses according to the TFIDF vector of each URL address.
Specifically, the cosine distance is calculated from the TFIDF vectors of the URL addresses, that is, the cosine of the angle between two vectors is calculated and used as the similarity between the two URL addresses.
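The cosine calculation in S406 can be sketched for vectors stored in sparse form. The representation as a list of (id, weight) pairs and the function name `sparse_cosine` are assumptions made for illustration.

```python
import math

def sparse_cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b[t] for t, w in a.items() if t in b)   # only shared ids contribute
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

print(sparse_cosine([(3, 1.0), (4, 2.0)], [(3, 2.0), (4, 2.0)]))
```

A value close to 1 indicates that the two URL addresses point in nearly the same direction in the vector space; the result is then compared with the preset value.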
S407: Re-divide the URL addresses whose similarity exceeds the preset value to obtain a plurality of sets.
S408: Normalize the URL addresses in each set, and determine the edit distance between every two URL addresses in each set according to the normalization results of every two URL addresses in each set.
In detail, determine the generalization value corresponding to the data type to which each piece of information in every two URL addresses in each set belongs, and convert the symbols in each piece of information in every two URL addresses in each set into mathematical symbols, thereby obtaining the normalization result of the URL addresses in each set.
Determine the edit distance between every two URL addresses in each set according to the generalization value corresponding to the data type to which each piece of information belongs and the mathematical symbols obtained by converting each piece of information.
S409: Determine the URL address to be deleted in each set according to the edit distance between every two URL addresses in each set, and delete the URL address to be deleted from each set.
S410: Synthesize each set, after deletion, into an API asset.
Based on the same inventive concept, an embodiment of the present invention provides an interface asset classification device, as shown in connection with fig. 5, including:
an obtaining module 500, configured to obtain a plurality of URL addresses from network traffic at an interface;
a first determining module 501, configured to determine, for each combination, a similarity between URL addresses in the combination according to each piece of information in URL addresses in the combination; each combination consists of any two URL addresses in a plurality of URL addresses; each piece of information in the URL address is obtained by splitting a field in the URL address;
the second determining module 502 is configured to determine that the interfaces corresponding to the URL addresses in the combination are of the same type if the similarity between the URL addresses in the combination exceeds a preset value.
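The interplay of the first and second determining modules can be sketched as follows. This is only an illustration: the function `classify_pairs`, the placeholder `toy_similarity` and the sample URL addresses are hypothetical, and any real similarity measure would replace the toy one.

```python
from itertools import combinations

def classify_pairs(urls, similarity, preset_value):
    """Return the two-URL combinations whose similarity exceeds the preset value."""
    return [(a, b) for a, b in combinations(urls, 2)
            if similarity(a, b) > preset_value]

# Placeholder similarity for illustration only: 1.0 when everything
# before the last piece of information is identical, otherwise 0.0.
def toy_similarity(a, b):
    return 1.0 if a.rsplit("/", 1)[0] == b.rsplit("/", 1)[0] else 0.0

urls = ["/v1/app/query/1234", "/v1/app/query/2345", "/v2/user/list"]
print(classify_pairs(urls, toy_similarity, 0.8))
```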
Optionally, the first determining module 501 is specifically configured to:
determining the similarity between every two pieces of information in each piece of information in the URL addresses in the combination;
And determining the similarity between the URL addresses in the combination according to the similarity between every two pieces of information in the URL addresses in the combination.
Optionally, the first determining module 501 is specifically configured to:
determining an array corresponding to each piece of information in the URL addresses in the combination;
and taking the cosine distance between arrays corresponding to every two pieces of information in each piece of information in the URL address in the combination as the similarity between every two pieces of information in each piece of information in the URL address in the combination.
Optionally, the first determining module 501 is specifically configured to:
determining an initial array corresponding to each piece of information in the URL address in the combination according to the position of each piece of information within the URL address in the combination and the data type to which each piece of information in the URL address in the combination belongs;
determining a target index of an initial array corresponding to each piece of information in the URL address in the combination, and taking the target index of the initial array corresponding to each piece of information in the URL address in the combination as an array corresponding to each piece of information in the URL address in the combination; the target index of the initial array corresponding to one piece of information in the URL address characterizes the importance degree of the one piece of information in the URL address to the URL address.
Optionally, the first determining module 501 is specifically configured to:
determining a generalization value corresponding to each piece of information in the URL addresses in the combination according to the position of each piece of information within the URL addresses in the combination;
for each piece of information in the URL addresses in the combination, if the data type of the information in the URL addresses in the combination is a preset type, the generalized value corresponding to the information in the URL addresses in the combination and a first preset value are used as an initial array corresponding to the information in the URL addresses in the combination;
if the data type of the information in the URL address in the combination is not the preset type, the generalized value corresponding to the information in the URL address in the combination and the second preset value are used as an initial array corresponding to the information in the URL address in the combination; wherein the second preset value is greater than the first preset value.
Optionally, the apparatus further includes:
a post-processing module, configured to re-divide the plurality of combinations according to whether the interfaces corresponding to the URL addresses in each combination are of the same type, to obtain a plurality of sets, wherein each set consists of at least two URL addresses;
determine the URL address to be deleted in each set according to the edit distance between every two URL addresses in each set, and delete the URL addresses to be deleted from each set;
and form interface assets according to each set after deletion.
Optionally, the post-processing module is specifically configured to:
determining a generalization value corresponding to a data type to which each piece of information in every two URL addresses in each set belongs, and converting a symbol in each piece of information in every two URL addresses in each set into a mathematical symbol;
determining the editing distance of the pairwise URL addresses in each set according to the generalized value corresponding to the data type to which each piece of information in the pairwise URL addresses in each set belongs and the mathematical symbol after each piece of information in the pairwise URL addresses in each set is converted;
taking as the URL address to be deleted the target URL address in each set whose edit distance to any one other URL address in that set does not satisfy the preset condition, wherein the other URL addresses in a set are the URL addresses in the set other than the target URL address.
In addition, the interface asset classification method and apparatus of the embodiments of the present invention described in connection with fig. 1-5 may be implemented by an electronic device.
An electronic device, comprising: a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the interface asset classification method according to any of the foregoing descriptions.
Based on the above description, the electronic device structure of fig. 6 is proposed by way of example.
The electronic device may include a processor 610 and a memory 620 storing computer program instructions.
In particular, the processor 610 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present invention.
Memory 620 may include mass storage for data or instructions. By way of example, and not limitation, memory 620 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disc, magneto-optical disc, magnetic tape, or universal serial bus (USB) drive, or a combination of two or more of the foregoing. Memory 620 may include removable or non-removable (or fixed) media, where appropriate. Memory 620 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 620 is a non-volatile solid state memory. In a particular embodiment, the memory 620 includes read-only memory (ROM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate.
The processor 610 implements the interface asset classification method of any of the above embodiments by reading and executing the computer program instructions stored in the memory 620.
In one example, the electronic device may also include a communication interface 630 and a bus 640. As shown in fig. 6, the processor 610, the memory 620, and the communication interface 630 are connected to each other by a bus 640 and perform communication with each other.
The communication interface 630 is mainly used to implement communication between each module, device, unit and/or apparatus in the embodiment of the present invention.
Bus 640 includes hardware, software, or both that couple components of the electronic device to one another. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or other suitable bus, or a combination of two or more of the above. Bus 640 may include one or more buses, where appropriate. Although embodiments of the invention have been described and illustrated with respect to a particular bus, the invention contemplates any suitable bus or interconnect.
In addition, in combination with the electronic device in the above embodiment, an embodiment of the present invention may provide a storage medium storing instructions which, when executed by a processor of the electronic device, enable the electronic device to perform the interface asset classification method according to any one of the above.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. An interface asset classification method, comprising:
Acquiring a plurality of Uniform Resource Locator (URL) addresses from network traffic at an interface;
for each combination, determining the similarity between the URL addresses in the combination according to each piece of information in the URL addresses in the combination; each combination consists of any two URL addresses in a plurality of URL addresses; each piece of information in the URL address is obtained by splitting a field in the URL address;
if the similarity between the URL addresses in the combination exceeds a preset value, determining that the interfaces corresponding to the URL addresses in the combination are of the same type.
2. The interface asset classification method of claim 1, wherein determining the similarity between URL addresses in the combination according to each piece of information in the URL addresses in the combination comprises:
determining the similarity between every two pieces of information in each piece of information in the URL addresses in the combination;
and determining the similarity between the URL addresses in the combination according to the similarity between every two pieces of information in the URL addresses in the combination.
3. The interface asset classification method of claim 2, wherein determining the similarity between two-by-two information in each of the URL addresses in the combination comprises:
Determining an array corresponding to each piece of information in the URL addresses in the combination;
and taking the cosine distance between arrays corresponding to every two pieces of information in each piece of information in the URL address in the combination as the similarity between every two pieces of information in each piece of information in the URL address in the combination.
4. The interface asset classification method of claim 3, wherein determining an array for each piece of information in the URL address comprises:
determining an initial array corresponding to each piece of information in the URL address in the combination according to the position of each piece of information within the URL address in the combination and the data type to which each piece of information in the URL address in the combination belongs;
determining a target index of an initial array corresponding to each piece of information in the URL address in the combination, and taking the target index of the initial array corresponding to each piece of information in the URL address in the combination as an array corresponding to each piece of information in the URL address in the combination; the target index of the initial array corresponding to one piece of information in the URL address characterizes the importance degree of the one piece of information in the URL address to the URL address.
5. The interface asset classification method of claim 4, wherein determining an initial array corresponding to each piece of information in the URL addresses in the combination according to the position of each piece of information within the URL addresses in the combination and the data type to which each piece of information in the URL addresses in the combination belongs comprises:
determining a generalization value corresponding to each piece of information in the URL addresses in the combination according to the position of each piece of information within the URL addresses in the combination;
for each piece of information in the URL addresses in the combination, if the data type of the information in the URL addresses in the combination is a preset type, the generalized value corresponding to the information in the URL addresses in the combination and a first preset value are used as an initial array corresponding to the information in the URL addresses in the combination;
if the data type of the information in the URL address in the combination is not the preset type, the generalized value corresponding to the information in the URL address in the combination and the second preset value are used as an initial array corresponding to the information in the URL address in the combination; wherein the second preset value is greater than the first preset value.
6. The interface asset classification method according to any one of claims 1-5, wherein after determining that interfaces corresponding to URL addresses in the combination are of the same type, the method further comprises:
according to whether the interfaces corresponding to the URL addresses in each combination are of the same type, re-dividing the plurality of combinations to obtain a plurality of sets; wherein the set consists of at least two URL addresses;
According to the editing distance of every two URL addresses in each set, determining the URL address to be deleted in each set; deleting the URL addresses to be deleted in each set;
and forming interface assets according to each deleted set.
7. The interface asset classification method of claim 6, wherein determining URL addresses to be deleted in each set according to edit distances of URL addresses in each set, comprises:
determining a generalization value corresponding to a data type to which each piece of information in every two URL addresses in each set belongs, and converting a symbol in each piece of information in every two URL addresses in each set into a mathematical symbol;
determining the editing distance of the pairwise URL addresses in each set according to the generalized value corresponding to the data type to which each piece of information in the pairwise URL addresses in each set belongs and the mathematical symbol after each piece of information in the pairwise URL addresses in each set is converted;
taking as the URL address to be deleted the target URL address in each set whose edit distance to any one other URL address in that set does not satisfy the preset condition, wherein the other URL addresses in a set are the URL addresses in the set other than the target URL address.
8. An interface asset classification device, comprising:
the acquisition module is used for acquiring a plurality of Uniform Resource Locator (URL) addresses from network traffic at the interface;
a first determining module, configured to determine, for each combination, a similarity between URL addresses in the combination according to each piece of information in URL addresses in the combination; each combination consists of any two URL addresses in a plurality of URL addresses; each piece of information in the URL address is obtained by splitting a field in the URL address;
and the second determining module is used for determining that the interfaces corresponding to the URL addresses in the combination are of the same type if the similarity between the URL addresses in the combination exceeds a preset value.
9. An electronic device, comprising:
a memory for storing a computer program or instructions;
a processor for executing a computer program or instructions in the memory, such that the method of any of claims 1-7 is performed.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-7.
CN202311685294.1A 2023-12-08 2023-12-08 Interface asset classification method, device, electronic equipment and medium Pending CN117792700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311685294.1A CN117792700A (en) 2023-12-08 2023-12-08 Interface asset classification method, device, electronic equipment and medium


Publications (1)

Publication Number Publication Date
CN117792700A true CN117792700A (en) 2024-03-29

Family

ID=90380712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311685294.1A Pending CN117792700A (en) 2023-12-08 2023-12-08 Interface asset classification method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117792700A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination