CN112733898A

CN112733898A - Data identification method and device based on characteristic weight, electronic equipment and medium

Info

Publication number: CN112733898A
Application number: CN202011614147.1A
Authority: CN
Inventors: 金锐文; 赵俊; 单夏烨; 任新新
Original assignee: Guangtong Tianxia Network Technology Co ltd
Current assignee: Guangtong Tianxia Network Technology Co ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-04-30

Abstract

The invention discloses a data identification method and device based on feature weight, an electronic device, and a medium. It relates to the technical field of data processing and addresses the problem in the related art that some IP addresses in a total data set carry multiple type labels and are not screened, which lowers output reliability. The method comprises the following steps: acquiring a total data set, recording the IP addresses that have two or more type labels in the total data set as second IP addresses, and recording the type labels of the second IP addresses as second type labels; querying the data source corresponding to each second type label, acquiring the sub-feature weight table of that data source, and recording the feature with the maximum weight in the sub-feature weight table as an undetermined feature; and acquiring a total feature weight table of the total data set, recording the undetermined feature with the maximum weight in the total feature weight table as the selected feature, and taking the second type label corresponding to the selected feature as the output label of the second IP address.

Description

Data identification method and device based on characteristic weight, electronic equipment and medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for data identification based on feature weight, an electronic device, and a medium.

Background

Informatization has greatly changed human social life, but with its rapid development the network security situation has also become more severe. Although attack methods are gradually becoming simpler and more integrated, attack types are becoming more diverse and complex.

In the related art, multiple data sources are merged during data analysis to obtain a total data set. Because some IP addresses in the total data set carry multiple type labels and are not screened, the output reliability of the total data set is low.

At present, no effective solution has been proposed for the problem in the related art that output reliability is low because some IP addresses in a total data set carry multiple type labels and are not screened.

Disclosure of Invention

In order to overcome the disadvantages of the related art, an object of the present invention is to provide a data identification method, apparatus, electronic device and medium based on feature weight, which improve the output reliability of a total data set.

One of the purposes of the invention is realized by adopting the following technical scheme:

a data identification method based on feature weight, the method comprising:

acquiring a total data set, recording the IP addresses that have two or more type labels in the total data set as second IP addresses, and recording the type labels of the second IP addresses as second type labels;

querying the data source corresponding to each second type label, acquiring the sub-feature weight table of that data source, and recording the feature with the maximum weight in the sub-feature weight table as an undetermined feature;

and acquiring a total feature weight table of the total data set, recording the undetermined feature with the maximum weight in the total feature weight table as the selected feature, and taking the second type label corresponding to the selected feature as the output label of the second IP address.

In some embodiments, the method further comprises, for a sub-feature weight table for any data source:

constructing a decision tree of the data source;

calculating the weight of each feature based on a decision tree of the data source and a feature weight calculation formula, and generating the sub-feature weight table according to the corresponding relation between the feature and the weight;

wherein the feature weight calculation formula is $A_i = D \times \mathrm{GINI}(D) - c_i \times \mathrm{GINI}(c_i)$, where $A_i$ is the weight of the $i$-th feature, $D$ is the number of sample data in the parent set, $\mathrm{GINI}(D)$ is the Gini coefficient of the parent set, $c_i$ is the number of sample data in the subset corresponding to the $i$-th feature, and $\mathrm{GINI}(c_i)$ is the Gini coefficient of the subset corresponding to the $i$-th feature.

In some of these embodiments, for the sub-feature weight table of any data source, the features are arranged by weight in descending order.

In some embodiments, for the sub-feature weight table of any data source, the decision tree of the data source is constructed by using the CART algorithm.

In some of these embodiments, for a table of total feature weights for the total data set, the method further comprises:

constructing a decision tree for the total data set;

calculating the weight of each feature based on the decision tree of the total data set and the feature weight calculation formula, and generating the total feature weight table according to the corresponding relationship between the features and the weights.

In some of these embodiments, the method further comprises:

acquiring two or more data sources, and combining the data sources to obtain the total data set;

and calculating an optimal feature group of the total data set by adopting a k-means clustering algorithm, wherein the features of the optimal feature group are used as the features in the total feature weight table.

In some embodiments, before said merging said data sources into said total data set, said method further comprises, for any data source:

summarizing according to a preset format by taking the IP address as a fixed quantity to obtain sample data;

and judging whether the type label of the IP address exists in any sample data, and if not, deleting the sample data.

The second purpose of the invention is realized by adopting the following technical scheme:

an apparatus for feature weight based data authentication, the apparatus comprising:

an acquisition module, configured to acquire a total data set, record the IP addresses that have two or more type labels in the total data set as second IP addresses, and record the type labels of the second IP addresses as second type labels;

the query module is used for respectively querying the data sources corresponding to the second type tags, acquiring a sub-feature weight table of the data sources, and recording the features with the maximum weight as the features to be determined based on the sub-feature weight table;

and the processing module is used for acquiring a total characteristic weight table of the total data set, recording the undetermined characteristic with the maximum weight as a selected characteristic based on the total characteristic weight table, and taking a second type label corresponding to the selected characteristic as an output label of the second IP address.

The third object of the invention is to provide an electronic device for carrying out the first object of the invention, comprising a memory in which a computer program is stored and a processor arranged to carry out the method described above when executing the computer program.

The fourth object of the invention is to provide a computer-readable storage medium for carrying out the first object of the invention, having stored thereon a computer program which, when executed by a processor, implements the method described above.

Compared with the related art, the invention has the following beneficial effects: the undetermined feature of each second type label is determined, the selected feature is then chosen from the undetermined features according to the total feature weight table, and the second type label corresponding to the selected feature is used as the output label of the corresponding second IP address, so that second IP addresses and output labels are in one-to-one correspondence and the output reliability of the total data set is improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flow chart of a method for feature weight-based data authentication according to an embodiment of the present application;

FIG. 2 is a schematic block diagram illustrating a method for feature weight-based data authentication according to an embodiment of the present application;

FIG. 3 is a flowchart of the step of generating the sub-feature weight table according to the second embodiment of the present application;

FIG. 4 is a block diagram illustrating a feature weight-based data identification apparatus according to a fourth embodiment of the present application;

FIG. 5 is a block diagram of an electronic device according to a fifth embodiment of the present application.

Reference numerals: 41, acquisition module; 42, query module; 43, processing module.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.

It will be appreciated that such a development effort might be complex and tedious, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure, and is not intended to limit the scope of this disclosure.

Example one

The embodiment provides a data identification method based on feature weight, and aims to solve the problem that in the related art, part of IP addresses in a total data set have multiple types of tags and are not screened, so that the output reliability is low.

It is worth mentioning that the steps of the method are performed by an execution device. The execution device may be a server, a cloud server, a client, a processor, or the like, but is not limited to these types.

Fig. 1 is a flowchart of a data authentication method based on feature weights according to an embodiment of the present disclosure, and fig. 2 is a functional block diagram of the data authentication method based on feature weights according to an embodiment of the present disclosure. Referring to fig. 1, the method includes steps S101 to S103.

Step S101, acquiring a total data set, recording the IP addresses that have two or more type labels in the total data set as second IP addresses, and recording the type labels of the second IP addresses as second type labels. It will be appreciated that the total data set is a combination of two or more data sources, and that within any single data source each IP address has exactly one type label. In the total data set, therefore, one IP address may correspond to several type labels.

Step S102, querying the data source corresponding to each second type label, acquiring the sub-feature weight table of that data source, and recording the feature with the maximum weight in the sub-feature weight table as an undetermined feature. It is understood that data sources and sub-feature weight tables are in one-to-one correspondence, i.e. the number of data sources equals the number of sub-feature weight tables, and each sub-feature weight table contains the correspondence between features and weights. It should be noted that step S102 yields two or more undetermined features for any second IP address.

Step S103, acquiring a total feature weight table of the total data set, recording the undetermined feature with the maximum weight in the total feature weight table as the selected feature, and taking the second type label corresponding to the selected feature as the output label of the second IP address. It is understood that the total feature weight table contains the correspondence between features and weights; the feature group in the total feature weight table and the feature groups in the sub-feature weight tables may be the same or different, and may be adjusted according to the merging requirements. The output label in this step is the most trusted label in FIG. 2.

In summary, the undetermined feature of each second type label is determined, the selected feature is then chosen from the undetermined features according to the total feature weight table, and the second type label corresponding to the selected feature is used as the output label of the corresponding second IP address, so that second IP addresses and output labels are in one-to-one correspondence and the output reliability of the total data set is improved.
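As an illustration of step S101 only, the following minimal Python sketch scans a merged record list and keeps the IP addresses that carry two or more distinct type labels. The record layout `(ip, label, source)` and the names `total_data_set` and `find_second_ips` are assumptions made for this sketch and are not specified by the present application.

```python
from collections import defaultdict

# Hypothetical merged records: (ip, type_label, data_source_name).
total_data_set = [
    ("1.0.0.0", "CC attack", "source_1"),
    ("1.0.0.0", "spam", "source_2"),
    ("2.0.0.0", "Trojan attack", "source_1"),
]


def find_second_ips(records):
    """Return {ip: {label: source}} for IPs with two or more distinct type labels."""
    labels_by_ip = defaultdict(dict)
    for ip, label, source in records:
        # Remember the first data source that contributed each label.
        labels_by_ip[ip].setdefault(label, source)
    return {ip: labels for ip, labels in labels_by_ip.items() if len(labels) >= 2}


second_ips = find_second_ips(total_data_set)
print(second_ips)
```

Keeping the data source next to each label is a design choice of the sketch: step S102 needs to know which sub-feature weight table to consult for each second type label.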

The steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer executable instructions, and while a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.

It should be noted that the data source may be a data source obtained through a web crawler, a data source obtained through a firewall, a data source obtained through a high-security scan, a historical data source, and the like; the specific manner of obtaining it is not limited herein. The feature may be a URL, a geographic location, a port, a number of attacks, a log, or the like, and is not limited herein. The type label may be a CC attack, a virus attack, a Trojan attack, spam, a DDoS attack, or the like.

For example, suppose a second IP address is 1.0.0.0. Table 1 is the sub-feature weight table of data source 1; the type label corresponding to the second IP address in data source 1 is CC attack, and the undetermined feature is the geographic location.

TABLE 1

Table 2 is the sub-feature weight table of data source 2; the type label corresponding to the second IP address in data source 2 is spam, and the undetermined feature is the port.

Feature name	Weight
Port	0.6
Number of attacks	0.2
Geographic location	0.1
Log	0.05

TABLE 2

Table 3 is the total feature weight table of the total data set. In the total feature weight table, the port weight is greater than the geographic location weight, so the port is the selected feature and, accordingly, the output label of the second IP address is spam.

Feature name	Weight
Port	0.6
Number of attacks	0.4
Geographic location	0.3
Log	0.2
Duration of attack	0.1

TABLE 3
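The worked example can be traced with a short Python sketch of steps S102 and S103. The weights for data source 2 and for the total data set are taken from Tables 2 and 3; the weights for data source 1 are assumed values, chosen only so that the geographic location has the maximum weight as the text states, because the contents of Table 1 are not reproduced here. The function name `resolve_label` and the dictionary layout are likewise hypothetical.

```python
# Sub-feature weight tables, one per data source.
sub_tables = {
    # ASSUMED values: only "geographic location is the maximum" comes from the text.
    "source_1": {"geographic location": 0.5, "port": 0.3, "number of attacks": 0.1},
    # Values from Table 2.
    "source_2": {"port": 0.6, "number of attacks": 0.2,
                 "geographic location": 0.1, "log": 0.05},
}

# Total feature weight table, values from Table 3.
total_table = {"port": 0.6, "number of attacks": 0.4, "geographic location": 0.3,
               "log": 0.2, "duration of attack": 0.1}


def resolve_label(label_to_source, sub_tables, total_table):
    """Pick one output label for a second IP address (steps S102 and S103)."""
    # S102: for each second type label, take the max-weight feature of its data source.
    undetermined = {}
    for label, source in label_to_source.items():
        table = sub_tables[source]
        undetermined[label] = max(table, key=table.get)
    # S103: among the undetermined features, the one with the largest weight in the
    # total table is the selected feature; its label becomes the output label.
    output = max(undetermined, key=lambda lab: total_table.get(undetermined[lab], 0.0))
    return output, undetermined


output_label, undetermined = resolve_label(
    {"CC attack": "source_1", "spam": "source_2"}, sub_tables, total_table)
print(undetermined)
print(output_label)
```

With these inputs the undetermined features are the geographic location (CC attack) and the port (spam), and the port wins in the total table, so the sketch returns spam, matching the text. Ties are broken by insertion order in this sketch; the application does not say how ties should be handled.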

Example two

The second embodiment provides a data identification method based on the feature weight, and the second embodiment is performed on the basis of the first embodiment.

The method may further include a sub-feature weight table generating step. FIG. 3 is a flowchart of the sub-feature weight table generating step in the second embodiment of the present application. Referring to FIGS. 1 to 3, for the sub-feature weight table of any data source, the sub-feature weight table generating step may include steps S201 to S203.

Step S201, constructing a decision tree of the data source. The method of generating the decision tree is not specifically limited herein; algorithms such as ID3, C4.5, C5.0, and CART may be used.

Step S202, calculating the weight of each feature based on the decision tree of the data source and a feature weight calculation formula. That is, the correspondence between features and weights is obtained in this step.

Step S203, generating the sub-feature weight table according to the correspondence between the features and the weights.

The feature weight calculation formula is $A_i = D \times \mathrm{GINI}(D) - c_i \times \mathrm{GINI}(c_i)$, where $A_i$ is the weight of the $i$-th feature, $D$ is the number of sample data in the parent set, $\mathrm{GINI}(D)$ is the Gini coefficient of the parent set, $c_i$ is the number of sample data in the subset corresponding to the $i$-th feature, and $\mathrm{GINI}(c_i)$ is the Gini coefficient of the subset corresponding to the $i$-th feature. It can be understood that the parent set can be divided into N subsets by the decision tree, where $i$ takes values from 1 to N; the Gini coefficient can be calculated as in the prior art and is not described in detail here.
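To make the formula concrete, the following Python sketch evaluates $A_i$ for one parent set and one subset. It assumes that the Gini coefficient is the usual Gini impurity, $1 - \sum_k p_k^2$ over the class proportions $p_k$, since the text defers its calculation to the prior art; the function names and the sample labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels (assumed definition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def feature_weight(parent_labels, subset_labels):
    """A_i = D * GINI(D) - c_i * GINI(c_i), with D and c_i taken as sample counts."""
    d = len(parent_labels)
    c_i = len(subset_labels)
    return d * gini(parent_labels) - c_i * gini(subset_labels)


# Hypothetical type labels of the parent set and of the subset for one feature.
parent = ["spam", "spam", "CC attack", "CC attack"]
subset = ["spam", "spam", "CC attack"]

weight = feature_weight(parent, subset)
print(round(weight, 4))
```

Here the parent set contributes 4 × 0.5 = 2 and the subset contributes 3 × 4/9 = 4/3, so the weight is 2/3.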

In this technical solution, a decision tree of the data source is first constructed to determine how type labels are distinguished within the data source. The corresponding type label can then be determined when a sample is added to the data source, or updated when a sample in the data source is updated, which facilitates subsequent data processing; a time interval may also be set for adjusting the decision tree during data processing. Second, the weight of each feature is obtained from the decision tree to determine how strongly the feature influences the type label. Accordingly, the undetermined feature has the largest influence on the type label of the second IP address within the data source, and the selected feature has the largest influence on the type label of the second IP address within the total data set, so the second type label corresponding to the selected feature is the trusted label, which improves the output reliability of the total data set.

In an alternative embodiment, for the sub-feature weight table of any data source, the features are arranged by weight in descending order, which makes it quick to obtain each undetermined feature; see Table 1 and Table 2 above.

In an alternative embodiment, for the sub-feature weight table of any data source, the decision tree of the data source is constructed by using the CART algorithm. Specifically, the impurity index of every feature is calculated first; the feature with the best impurity index is then selected for branching, and the impurity indexes of the child nodes are calculated in turn after each branching, until the decision tree stops growing because no features remain available. The optimal decision tree is obtained through multiple iterations.
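The branching step described above can be sketched in Python as a search for the binary split with the lowest weighted Gini impurity, which is the criterion CART commonly uses for classification. This is a single-node sketch under assumptions, not a full tree builder: the row layout (a dictionary per sample), the equality test `feature == value`, and the sample data are all hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(rows, labels):
    """Return (feature, value, weighted_gini) of the best 'feature == value' split."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            right = [lab for row, lab in zip(rows, labels) if row[feature] != value]
            if not left or not right:
                continue  # a split must send samples to both children
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[2]:
                best = (feature, value, score)
    return best


# Hypothetical samples of one data source.
rows = [
    {"port": 25, "region": "CN/ZJ"},
    {"port": 25, "region": "US/CA"},
    {"port": 80, "region": "CN/ZJ"},
    {"port": 80, "region": "US/CA"},
]
labels = ["spam", "spam", "CC attack", "CC attack"]

split = best_binary_split(rows, labels)
print(split)
```

In this toy data the port separates the two labels perfectly, so the split on the port has impurity 0 and is chosen over the region. A full CART implementation would apply the same search recursively to each child node.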

Because the calculation of the characteristic weight is also based on the CART algorithm, the CART algorithm is preferably adopted for the construction of the decision tree, so that the credibility of the characteristic weight is improved, and related parameters in a characteristic weight calculation formula can be obtained in the construction process of the decision tree.

In an optional embodiment, the method may further include a total feature weight table generating step. For the total feature weight table of the total data set, this step may include: constructing a decision tree of the total data set; calculating the weight of each feature based on the decision tree of the total data set and the feature weight calculation formula; and generating the total feature weight table according to the correspondence between the features and the weights. The feature weight calculation formula is the one described above in this second embodiment, which improves the output reliability of the total data set. The decision tree of the total data set is also preferably constructed with the CART algorithm, which is not described in detail here.

Example three

The third embodiment provides a data authentication method based on feature weights, and the third embodiment is performed on the basis of the first embodiment and/or the second embodiment.

The method may further comprise the steps of: acquiring two or more data sources, and combining the data sources to obtain a total data set; and calculating an optimal feature group of the total data set by using a k-means clustering algorithm, wherein the features of the optimal feature group are used as the features in the total feature weight table.

It can be understood that the total data set is obtained by merging the data of different data sources, and different data sources may depend on different feature columns. When similar IP data is merged, the number of feature columns in the merged database therefore increases, and an IP that originally lacked a given feature column is automatically filled with NaN for that column, as shown in Table 4. IP data is considered similar when an IP in data source 2 falls within an IP network segment in data source 1, or when an IP in data source 2 exactly matches an IP in data source 1. When the dependent feature columns do not differ too much, an optimal feature combination can be obtained by iterating over different feature combinations, which effectively avoids using features that have little influence on the determination of an IP's type label.

IP address	Port	Number of attacks	Geographic location	Log	Attack update time	Label
1.0.0.0	RJ11	2	NaN	XXX	480	Spam
1.0.0.0	RJ45	50	CN/ZJ	NaN	10	CC attack

TABLE 4
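The NaN filling shown in Table 4 can be reproduced with a small Python sketch that takes the union of the feature columns of all data sources and fills the columns a row lacks with NaN. The column keys and the function name `merge_sources` are assumptions for this sketch; the cell values are those of Table 4.

```python
import math


def merge_sources(sources):
    """Merge lists of row dicts; every merged row gets every column, missing -> NaN."""
    columns = []
    for rows in sources:
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)  # keep first-seen column order
    merged = [
        {col: row.get(col, math.nan) for col in columns}
        for rows in sources
        for row in rows
    ]
    return columns, merged


# Rows modelled on Table 4; the short column keys are hypothetical.
source_2 = [{"ip": "1.0.0.0", "port": "RJ11", "attacks": 2,
             "log": "XXX", "update_time": 480, "label": "Spam"}]
source_1 = [{"ip": "1.0.0.0", "port": "RJ45", "attacks": 50,
             "geo": "CN/ZJ", "update_time": 10, "label": "CC attack"}]

columns, merged = merge_sources([source_2, source_1])
for row in merged:
    print(row)
```

Both rows for 1.0.0.0 are kept side by side rather than collapsed into one, which is what leaves this IP address with two type labels in the total data set.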

Specifically, to obtain the most suitable k value and thus the most accurate weights, the Ordered Multiple Runs of K-means (OMRk) algorithm is used here. Its main purpose is to obtain the optimal k value, and its principle can be summarized as follows:

Input: training data set, number of runs, maximum k value (k_max)

Output: ideal value k* and the partition result

Here V is the silhouette coefficient, K* is the optimized value of k, and P* is the optimized partition result.
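The algorithm listing itself is not reproduced in the text, so the following Python sketch is only one plausible reading of OMRk: for each k from 2 to k_max, k-means is run several times from random initial centres, each partition is scored by its mean silhouette coefficient V, and the best-scoring k and partition are kept. The function names, the seed, and the sample points are assumptions, and how clusters map to feature groups is not shown because the application does not specify it.

```python
import math
import random


def kmeans(points, k, rng, max_iter=50):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = rng.sample(points, k)
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def silhouette(points, assign):
    """Mean silhouette coefficient V; singleton clusters contribute 0."""
    clusters = {}
    for p, a in zip(points, assign):
        clusters.setdefault(a, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, a in zip(points, assign):
        own = clusters[a]
        if len(own) == 1:
            continue
        a_dist = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b_dist = min(sum(math.dist(p, q) for q in other) / len(other)
                     for c, other in clusters.items() if c != a)
        if max(a_dist, b_dist) > 0:
            total += (b_dist - a_dist) / max(a_dist, b_dist)
    return total / len(points)


def omrk(points, n_runs, k_max, seed=0):
    """Return (k*, P*, V*) over ordered runs k = 2..k_max, n_runs runs per k."""
    rng = random.Random(seed)
    best_v, best_k, best_p = -1.0, None, None
    for k in range(2, k_max + 1):
        for _ in range(n_runs):
            partition = kmeans(points, k, rng)
            v = silhouette(points, partition)
            if v > best_v:
                best_v, best_k, best_p = v, k, partition
    return best_k, best_p, best_v


# Hypothetical training data: two well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
best_k, best_p, best_v = omrk(points, n_runs=5, k_max=4)
print(best_k, best_p)
```

For this toy data the silhouette coefficient is highest when the two groups are kept as two clusters, so the sketch settles on k* = 2.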

In an optional embodiment, before merging the data sources to obtain the total data set, for any data source, the method may further include: summarizing according to a preset format, with the IP address as the fixed quantity, to obtain sample data; and judging, for any sample data, whether the type label of the IP address exists, and if not, deleting the sample data. For the preset format, refer to Table 4.

It should be noted that, in any data source, one sample data corresponds to one IP address; accordingly, features such as the port and the number of attacks serve as the summary variables, and the type label serves as the result of the sample data. In the total data set, the sample data of the same IP address is not summarized, so as to avoid affecting the accuracy of the total feature weight table.
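The preprocessing of a single data source can be sketched in Python as follows. The sketch groups raw records by IP address into one sample per IP and then drops the samples that have no type label. The record keys, the function name `preprocess`, and the policy that later non-empty values overwrite earlier ones are assumptions; the application does not state how conflicting values are summarized.

```python
def preprocess(raw_records, label_field="label"):
    """Summarize one data source by IP address and drop samples without a type label."""
    samples = {}
    for record in raw_records:
        sample = samples.setdefault(record["ip"], {"ip": record["ip"]})
        for key, value in record.items():
            if value is not None:
                # Assumed merge policy: a later non-empty value overwrites an earlier one.
                sample[key] = value
    # Keep only the samples whose type label exists.
    return [s for s in samples.values() if s.get(label_field)]


# Hypothetical raw records of one data source.
raw = [
    {"ip": "1.0.0.0", "port": "RJ45", "label": None},
    {"ip": "1.0.0.0", "attacks": 50, "label": "CC attack"},
    {"ip": "3.0.0.0", "port": "RJ11", "label": None},
]

samples = preprocess(raw)
print(samples)
```

The two records for 1.0.0.0 collapse into one labelled sample, while 3.0.0.0 is deleted because no record supplies its type label.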

Example four

The fourth embodiment provides a data identification device based on feature weights, which is the virtual device structure of the foregoing embodiments. Fig. 4 is a block diagram illustrating a structure of a data authentication apparatus based on feature weights according to a fourth embodiment of the present application, and referring to fig. 4, the apparatus may include: an acquisition module 41, a query module 42, and a processing module 43.

The acquisition module 41 is configured to acquire a total data set, record the IP addresses that have two or more type labels in the total data set as second IP addresses, and record all the type labels of the second IP addresses as second type labels;

The query module 42 is configured to query the data sources corresponding to the second type tags, acquire a sub-feature weight table of the data sources, and mark a feature with the largest weight as an undetermined feature based on the sub-feature weight table;

and the processing module 43 is configured to obtain a total feature weight table of the total data set, record the undetermined feature with the largest weight as the selected feature based on the total feature weight table, and use a second type tag corresponding to the selected feature as an output tag of the second IP address.

The modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.

Example five

The fifth embodiment provides an electronic device. FIG. 5 is a block diagram of the structure of the electronic device according to the fifth embodiment of the present application. As shown in FIG. 5, the electronic device includes a memory and a processor; the memory stores a computer program, and the processor is configured to run the computer program to execute the data identification method based on feature weight. For specific examples, refer to the examples described in the foregoing embodiments and optional embodiments, which are not repeated here.

Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

In addition, in combination with the data identification method based on feature weight in the foregoing embodiments, the fifth embodiment of the present application may provide a storage medium for implementing the method. The storage medium has a computer program stored thereon; when executed by a processor, the computer program implements the data identification method based on feature weight of any of the above embodiments, the method comprising:

acquiring a total data set, recording the IP addresses that have two or more type labels in the total data set as second IP addresses, and recording the type labels of the second IP addresses as second type labels;

querying the data source corresponding to each second type label, acquiring the sub-feature weight table of that data source, and recording the feature with the maximum weight in the sub-feature weight table as an undetermined feature;

and acquiring a total feature weight table of the total data set, recording the undetermined feature with the maximum weight in the total feature weight table as the selected feature, and taking the second type label corresponding to the selected feature as the output label of the second IP address.

As shown in fig. 5, taking a processor as an example, the processor, the memory, the input device and the output device in the electronic device may be connected by a bus or other means, and fig. 5 takes the connection by a bus as an example.

The memory, as a computer-readable storage medium, may include a high-speed random access memory, a non-volatile memory, and the like. It may be used to store an operating system, software programs, computer-executable programs, and a database, such as the program instructions/modules corresponding to the feature weight-based data identification method of the embodiments of the present invention, and may further include an internal memory that provides a running environment for the operating system and the computer programs. In some examples, the memory may further include memory located remotely from the processor, which may be connected to the electronic device over a network.

The processor, which provides computing and control capabilities, may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application. By running the computer-executable programs, software programs, instructions, and modules stored in the memory, the processor executes the various functional applications and data processing of the electronic device, that is, it implements the feature weight-based data identification method of the first embodiment.

The output device of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

The electronic device may further include a network interface/communication interface, the network interface of the electronic device being for communicating with an external terminal through a network connection. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Those skilled in the art will appreciate that the structure shown in fig. 5 is a block diagram of only a portion of the structure relevant to the present disclosure, and does not constitute a limitation on the electronic device to which the present disclosure applies, and that a particular electronic device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It should be noted that, in the embodiment of the data identification method based on the feature weight, each included unit and each included module are only divided according to the functional logic, but are not limited to the above division as long as the corresponding function can be implemented; in addition, the specific names of the functional units are only for convenience of distinguishing from each other and are not used for limiting the protection scope of the present invention.

Unless otherwise defined, technical or scientific terms referred to herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The terms "comprises," "comprising," "including," "has," "having," and any variations thereof, as referred to herein, are intended to cover a non-exclusive inclusion. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of the associated objects, indicating that three relationships may exist. The character "/" generally indicates that the contextual objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.

The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for data authentication based on feature weights, the method comprising:

acquiring a total data set, recording IP addresses with more than two types of labels in the total data set as second IP addresses, and recording the types of labels of the second IP addresses as second type labels;

2. The method of claim 1, wherein for a sub-feature weight table for any data source, the method further comprises:

constructing a decision tree of the data source;

3. The method of claim 2, wherein the features are arranged by weight from large to small for a sub-feature weight table of any data source.

4. The method of claim 2, wherein for the sub-feature weight table of any data source, the decision tree of the data source is constructed by using a CART algorithm.

5. The method of claim 2, wherein for a total feature weight table for the total data set, the method further comprises:

constructing a decision tree for the total data set;

6. The method according to any one of claims 1 to 5, further comprising:

7. The method of claim 6, wherein prior to said merging of said data sources into said total data set, for any data source, said method further comprises:

8. An apparatus for feature weight based data authentication, the apparatus comprising:

the query module is used for respectively querying the data sources corresponding to the second type tags, acquiring a sub-feature weight table of the data sources, and recording the feature with the maximum weight as an undetermined feature based on the sub-feature weight table;

9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to carry out the method of any one of claims 1 to 7 when the computer program is executed.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.