CN111027048B

CN111027048B - Operating system identification method and device, electronic equipment and storage medium

Info

Publication number: CN111027048B
Application number: CN201911271563.3A
Authority: CN
Inventors: 谢鹏程
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2019-12-11
Filing date: 2019-12-11
Publication date: 2022-09-16
Anticipated expiration: 2039-12-11
Also published as: CN111027048A

Abstract

The application provides an operating system identification method, an operating system identification device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a plurality of communication messages generated by communication between devices in a first preset time period; extracting characteristic parameters corresponding to the target device from the plurality of communication messages, and obtaining a feature vector according to the characteristic parameters; analyzing the feature vector by using an identification model to obtain the operating system type corresponding to the target device; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types. According to the embodiment of the application, the operating system type of the device is identified through the identification model obtained through clustering training, and the identification accuracy is improved.

Description

Operating system identification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to an operating system identification method and apparatus, an electronic device, and a storage medium.

Background

Identifying the operating system of the remote host is of great significance in the field of network security. Information collection of the remote host is required from the viewpoint of network security attack and protection, and identification of the operating system of the host is a crucial step in information collection.

The existing operating system identification technologies are mainly divided into two types: active operating system identification and passive operating system identification. Both take captured network data packets as their data source. Part of the information in the data packets is extracted to generate operating system fingerprints or features, which are then either matched against a fingerprint database or passed to a machine learning classification algorithm to determine the type and version of the operating system.

The existing passive operating system identification technologies include techniques based on fingerprint library matching and techniques based on various machine learning classification algorithms, such as decision trees, SVM, or the RIPPER algorithm. With fingerprint database matching, if the fingerprint of the device to be identified matches no entry in the fingerprint database, no judgment can be made, that is, the operating system type of the device cannot be given. Techniques based on machine learning classification algorithms give poor classification results when there is not enough labeled data.

Disclosure of Invention

An embodiment of the present application provides an operating system identification method, an operating system identification device, an electronic device, and a storage medium, so as to solve the problem in the prior art that identification of an operating system type is inaccurate.

In a first aspect, an embodiment of the present application provides an operating system identification method, including: acquiring a plurality of communication messages generated by communication between devices in a first preset time period; extracting characteristic parameters corresponding to the target equipment from the plurality of communication messages, and obtaining characteristic vectors according to the characteristic parameters; analyzing the characteristic vector by using an identification model to obtain an operating system type corresponding to the target equipment; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types.

According to the embodiment of the application, the operating system type of the device is identified through the identification model obtained through clustering training, and the identification accuracy is improved.

Further, the extracting feature parameters corresponding to the target device from the plurality of communication messages and obtaining feature vectors according to the feature parameters includes: extracting a first preset field from communication messages whose source IP address is the IP address of the target device, and extracting a second preset field from communication messages whose destination IP address is the IP address of the target device; and obtaining the feature vector according to the first preset field and the second preset field.

According to the method and the device, the characteristic parameters of the target device are extracted from different communication messages respectively, representative characteristic parameters can be obtained, and therefore the type of the operating system can be identified accurately.

Further, the first preset field comprises a time to live (TTL) field, a Don't Fragment (DF) flag field, and a SACK Permitted field in the selective acknowledgment option, and the second preset field comprises a window size (WS) field; the obtaining the feature vector according to the first preset field and the second preset field includes: constructing the feature vector according to the TTL field, the DF field, the SACK Permitted field and the WS field.

According to the method and the device, the characteristic vector is constructed through the fields, the characteristics of the operating system corresponding to the target device can be better reflected, and therefore the type of the operating system of the target device can be accurately identified.

Further, the constructing the feature vector according to the TTL field, the DF field, the SACK Permitted field, and the WS field includes: calculating the proportions of the TTL field values that fall within the intervals (0, 64), (64, 128) and (128, 255) respectively to obtain a first TTL ratio, a second TTL ratio and a third TTL ratio; calculating the ratios of the minimum value, the maximum value, the lower quartile, the median, the upper quartile and the average value of the WS field values to the theoretical maximum value of the WS field respectively to obtain a first WS ratio, a second WS ratio, a third WS ratio, a fourth WS ratio, a fifth WS ratio and a sixth WS ratio; calculating the ratio of the number of DF fields whose value is 1 to the total number of DF fields to obtain the DF ratio; calculating the ratio of the number of SACK Permitted fields whose value is 4 to the total number of SACK Permitted fields to obtain the SACK Permitted ratio; and constructing the feature vector according to the first TTL ratio, the second TTL ratio, the third TTL ratio, the first WS ratio, the second WS ratio, the third WS ratio, the fourth WS ratio, the fifth WS ratio, the sixth WS ratio, the DF ratio and the SACK Permitted ratio.

Further, the recognition model comprises a plurality of operating system types and a clustering center corresponding to each operating system type; analyzing the feature vector by using the recognition model to obtain the operating system type corresponding to the target device, including: and calculating the distance from the feature vector to the clustering center corresponding to each operating system type, and determining the operating system type corresponding to the clustering center closest to the feature vector as the operating system type corresponding to the target device.

According to the method and the device, the operating system type is determined by calculating the distance from the feature vector to each cluster center, and the situation that the corresponding operating system type cannot be found is avoided.

Further, before acquiring the data packet generated by the inter-device communication within the first preset time period, the method further includes: acquiring a plurality of historical communication messages generated by communication among a plurality of devices within a second preset time period; extracting training characteristic parameters corresponding to each device from the multiple historical communication messages according to the IP address corresponding to each device, and obtaining training characteristic vectors according to the training characteristic parameters; performing clustering training by using the training feature vectors corresponding to the devices to obtain the recognition model; the identification model comprises a plurality of operating system types and a cluster center corresponding to each operating system type.

Further, performing clustering training by using the training feature vectors corresponding to the devices to obtain a recognition model, including: determining K clustering centers, wherein K is a positive integer; based on the training feature vectors and the K clustering centers, performing center clustering in at least one iteration cycle to obtain the final K clustering centers and training feature vectors corresponding to each clustering center; obtaining the recognition model according to the final K clustering centers and the training feature vector corresponding to each clustering center; wherein an iteration cycle comprises: calculating the distance from each training feature vector to K clustering centers respectively, and distributing each training feature vector to the clustering center closest to the training feature vector to obtain K clusters, wherein each cluster comprises at least one training feature vector; and calculating a corresponding new cluster center according to the training feature vector corresponding to the cluster.

Further, the obtaining the recognition model according to the final K clustering centers and the training feature vector corresponding to each clustering center includes: counting the number of training characteristic vectors corresponding to each operating system type belonging to the same clustering center, and determining the operating system type with the largest number of characteristic vectors as the operating system type corresponding to the clustering center; and obtaining the identification model according to the final K clustering centers and the operating system type corresponding to each clustering center.

Compared with supervised training in the prior art, the recognition model obtained through clustering training does not need to carry out a large amount of statistics on the labels in the data set in advance.

In a second aspect, an embodiment of the present application provides an operating system identification apparatus, including: the data acquisition module is used for acquiring a plurality of communication messages generated by communication among the devices within a first preset time period; the characteristic extraction module is used for extracting characteristic parameters corresponding to the target equipment from the plurality of communication messages and obtaining characteristic vectors according to the characteristic parameters; the identification module is used for analyzing the characteristic vector by utilizing an identification model to obtain an operating system type corresponding to the target equipment; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor being capable of performing the method of the first aspect when invoked by the program instructions.

In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium, including:

the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform the method of the first aspect.

Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a schematic diagram of a training process of a recognition model according to an embodiment of the present disclosure;

Fig. 2 is a schematic flow chart of a clustering method provided in the embodiment of the present application;

Fig. 3 is a schematic flowchart of an operating system identification method according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an apparatus according to an embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

Prior to the present application, the identification of the operating system type of a device (which may also be referred to as a host) has primarily involved the following methods:

The passive operating system identification technique based on fingerprint library matching works as follows: fields related to the operating system are extracted from the data packets, an operating system fingerprint of the current host is established and matched against the fingerprint library, and the operating system of the current host is reported when the fingerprint completely matches a fingerprint in the library. This requires the fingerprint library to hold complete fingerprint information; if the fingerprint cannot be matched, no judgment can be made, that is, the operating system of the host cannot be given.

The passive operating system identification technique based on machine learning classification algorithms works as follows: the data sets of the various fingerprint libraries are collected, together with network data sets, to build a dedicated data set. Fields related to the operating system are then extracted from the data set to establish a feature matrix, on which the classification model is trained. Finally, the operating system fingerprint (or features) of a host whose operating system is unknown is input into the trained classification model to determine the type of the operating system. Building a sufficiently large labeled data set for this approach is laborious and time-consuming, and without enough labeled data the classification results are poor.

In order to solve the foregoing technical problem, an embodiment of the present application provides an operating system identification method, which identifies a feature vector associated with an operating system of a device through an identification model obtained through clustering training, so as to obtain an operating system type of the device. Before describing the recognition method, the present application first describes a training process of a recognition model used in the recognition method.

Fig. 1 is a schematic view of a training process of a recognition model provided in an embodiment of the present application, and as shown in fig. 1, the method includes:

step 101: acquiring a plurality of historical communication messages generated by communication among a plurality of devices within a second preset time period;

step 102: extracting training characteristic parameters corresponding to each device from the multiple historical communication messages according to the IP address corresponding to each device, and obtaining training characteristic vectors according to the training characteristic parameters;

step 103: performing clustering training by using training feature vectors corresponding to the devices to obtain the recognition model; the identification model comprises a plurality of operating system types and a cluster center corresponding to each operating system type.

The following describes steps 101 to 103 in detail.

Devices in the network exchange data in real time, and tens of thousands of communication messages, or even more, can be generated in one second. In step 101, therefore, the historical communication messages from which the training feature vectors are later derived are collected only within a bounded time period. The historical communication messages may be TCP/IP protocol messages.

The applicant captured 15 minutes of traffic on one device; the capture contains 2,000,000 communication messages exchanged among 3000 devices. The second preset time period is therefore set manually in advance, for example to 5 seconds, 1 minute or 10 minutes, and should be neither too long nor too short. If it is too short, not enough data is obtained for the devices, so that too few operating system types are represented; if it is too long, too many communication messages are generated, some of which then have to be discarded, which reduces the efficiency of training the recognition model.

Because the devices send communication messages to one another, the historical communication messages can be captured at a switch.

In step 102, training feature parameters corresponding to each device are extracted from the multiple historical communication messages according to the IP address corresponding to each device, and a training feature vector is obtained according to the training feature parameters.

Each device has a unique IP address, and each historical communication message has a source IP address and a destination IP address, so the IP addresses show whether a historical communication message relates to a given device and whether it was sent or received by that device. In addition, one historical communication message contains a plurality of fields: some fields can represent the operating system type of the device corresponding to the source IP address, and some can represent the operating system type of the device corresponding to the destination IP address. These fields are called characteristic parameters.

For example, for each data packet, the source IP address and the destination IP address are read, and at the same time the time to live (TTL) field and the Don't Fragment (DF) flag field are extracted from the IP header, and the window size (WS) field and the SACK Permitted field from the TCP header. The TTL field, the DF field and the SACK Permitted field are attributes of the source IP address, and the WS field is an attribute of the destination IP address. To extract the characteristic parameters of one device, both the historical communication messages in which the device is the source IP address and those in which it is the destination IP address are therefore used; only together do they provide all four attributes for the device. When the feature vector corresponding to each IP address is constructed from the historical communication messages, the TTL field, the DF field and the SACK Permitted field in the selective acknowledgment option are filled into the corresponding positions of the feature vector of the source IP address, and the WS field is filled into the corresponding position of the feature vector of the destination IP address. In this way, the characteristic parameters in a data packet are correctly allocated to the device corresponding to the source IP address or the device corresponding to the destination IP address.
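The attribution rule described above can be sketched as follows. This is only an illustrative sketch: the packet records, their key names and the function name `attribute_fields` are assumptions made for the example and are not part of the application.

```python
from collections import defaultdict

# Hypothetical, already-parsed packet records (key names are illustrative).
packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "ttl": 64, "df": 1, "sack_permitted": 4, "ws": 29200},
    {"src": "10.0.0.2", "dst": "10.0.0.1", "ttl": 128, "df": 1, "sack_permitted": 0, "ws": 8192},
]

def attribute_fields(packets):
    """Collect raw field values per IP address.

    TTL, DF and SACK Permitted are credited to the source IP address;
    WS is credited to the destination IP address, as the text describes.
    """
    per_ip = defaultdict(lambda: {"ttl": [], "df": [], "sack": [], "ws": []})
    for p in packets:
        per_ip[p["src"]]["ttl"].append(p["ttl"])
        per_ip[p["src"]]["df"].append(p["df"])
        per_ip[p["src"]]["sack"].append(p["sack_permitted"])
        per_ip[p["dst"]]["ws"].append(p["ws"])
    return dict(per_ip)

fields = attribute_fields(packets)
```

In this toy capture, 10.0.0.1 obtains its TTL, DF and SACK Permitted values from the first packet, which it sent, and its WS value from the second packet, which it received.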

When the feature matrix is established, the TTL field is represented by a three-dimensional vector and the WS field by a six-dimensional vector using a statistical method, according to the value characteristics of the two fields; that is, the original attribute values of the two fields are replaced by their statistical characteristics. The default initial TTL value of an operating system is typically one of three values: 64, 128 or 255. Actual statistics show, however, that the TTL value observed for each IP address is not unique but varies within a certain range, so the three dimensions record the proportions of observed TTL values falling within (0, 64), (64, 128) and (128, 255). For each IP address, the value of the WS field is represented by a six-dimensional vector according to the value characteristics of the WS field in the historical communication messages, each dimension being the ratio of the minimum value, the maximum value, the lower quartile, the median, the upper quartile and the mean of the WS values, respectively, to the theoretical maximum value, which may be 65535.

According to the above description, the process of establishing the feature vector can be summarized as follows: for each IP address, the values of the above four attributes are represented by a 1 × 11 vector. The first element of the vector is the proportion of communication messages of the IP address whose DF field value is 1 to the total number of communication messages of the IP address; the second, third and fourth elements are the proportions of the TTL field values of the IP address that fall within (0, 64), (64, 128) and (128, 255), respectively; the fifth element is the proportion of communication messages of the IP address whose SACK Permitted value is 4 to the total number of communication messages of the IP address; and the sixth to eleventh elements are the ratios of the minimum value, the maximum value, the lower quartile, the median, the upper quartile and the average value of the WS values of the IP address, respectively, to the WS theoretical maximum value 65535.

In step 103, performing clustering training by using training feature vectors corresponding to the devices to obtain the recognition model; the identification model comprises a plurality of operating system types and a cluster center corresponding to each operating system type.

After the training feature vectors corresponding to the respective devices are obtained, clustering training may be performed by using them. Fig. 2 is a schematic flow chart of the clustering process provided in the embodiment of the present application; as shown in fig. 2, the process includes:

step 201: determining K clustering centers, wherein K is a positive integer;

step 202: based on the training feature vectors and the K clustering centers, performing center clustering in at least one iteration cycle to obtain the final K clustering centers and the training feature vectors corresponding to each clustering center;

step 203: and obtaining the recognition model according to the final K clustering centers and the training feature vector corresponding to each clustering center.

For step 201, K cluster centers are determined, K being a positive integer. The number of operating system types of devices in the historical communication message may be counted in advance, for example: all devices have 10 operating system types, and K may take the value of 10. If the number of operating system types cannot be counted in advance, it can be determined from a priori knowledge, for example: if there are 15 os types on the market, K may take the value of 15.

For step 202, based on the training feature vectors and the K clustering centers, through center clustering in at least one iteration cycle, the final K clustering centers and the training feature vector corresponding to each clustering center are obtained.

K training feature vectors may be randomly selected from the plurality of training feature vectors as the initial clustering centers. Alternatively, K training feature vectors belonging to K different operating system types may be selected as the clustering centers; this presupposes that the operating system types of the devices are known, and has the advantage of accelerating convergence of the clustering.
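The two ways of choosing the initial clustering centers might be sketched as below, assuming NumPy and a training matrix `X` with one row per device. The function names and the rule of taking the first vector seen for each operating system type are assumptions of the sketch.

```python
import numpy as np

def init_random(X, k, seed=0):
    """Pick k distinct training vectors at random as initial centers."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].copy()

def init_one_per_os(X, os_labels):
    """Pick one training vector per known OS type (here: the first one seen)."""
    first_index = {}
    for i, label in enumerate(os_labels):
        first_index.setdefault(label, i)
    return X[list(first_index.values())].copy()

# Toy two-dimensional vectors; real vectors would have 11 elements.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
os_labels = ["Linux", "Linux", "Windows", "Windows"]

centers_a = init_random(X, k=2)
centers_b = init_one_per_os(X, os_labels)
```

With the second strategy, K equals the number of distinct operating system types among the labeled devices, and each initial center already lies near one type, which is consistent with the faster convergence mentioned in the text.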

The center clustering of one iteration cycle is as follows. After the K clustering centers are obtained, the distance from each training feature vector to each of the K clustering centers is calculated. The distance represents the degree of similarity between the training feature vector and the feature vector of a clustering center, and may be the Euclidean distance, the Manhattan distance, or another measure capable of representing that similarity; this is not specifically limited in this embodiment of the present application. The training feature vector is then assigned to the nearest clustering center. Performing this operation for every training feature vector partitions the training feature vectors: each clustering center is assigned at least one training feature vector, and the training feature vectors assigned to one clustering center form a cluster. A new cluster center is then calculated for each cluster as the average of all training feature vectors in the cluster.
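One iteration cycle, that is, the assignment step followed by the center update, can be sketched as follows with NumPy. The function names are hypothetical, and keeping the old center for a cluster that receives no vectors is an assumption of the sketch (the text assumes every cluster holds at least one vector).

```python
import numpy as np

def pairwise_distance(X, centers, metric="euclidean"):
    """Distance from every row of X to every center, shape (n, k)."""
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unsupported metric: {metric}")

def iterate_once(X, centers, metric="euclidean"):
    """Assign each vector to its nearest center, then average each cluster."""
    assignment = pairwise_distance(X, centers, metric).argmin(axis=1)
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = X[assignment == j]
        if len(members) > 0:  # assumption: an empty cluster keeps its old center
            new_centers[j] = members.mean(axis=0)
    return assignment, new_centers

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
assignment, new_centers = iterate_once(X, centers)
```

In this toy data the first two vectors fall into cluster 0 and the last two into cluster 1, so the updated centers are (0, 1) and (10, 11).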

The termination condition of the iterative training may be any of the following: a preset number of iterations is reached; the training feature vectors assigned to each cluster no longer change; or the cluster center of each cluster no longer changes.
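The termination conditions can be combined in a single training loop, sketched below with NumPy and the Euclidean distance. The function name `train_centers`, the default `max_iter` and the use of `np.allclose` to decide that the centers "no longer change" are assumptions of the sketch.

```python
import numpy as np

def train_centers(X, centers, max_iter=100):
    """Iterate until the preset count is reached, the cluster membership
    stops changing, or the cluster centers stop changing."""
    centers = np.asarray(centers, dtype=float).copy()
    assignment = None
    iteration = 0
    for iteration in range(1, max_iter + 1):  # condition 1: preset iteration count
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dist.argmin(axis=1)
        new_centers = np.array([
            X[new_assignment == j].mean(axis=0) if np.any(new_assignment == j) else centers[j]
            for j in range(len(centers))
        ])
        # condition 2: membership of every cluster is unchanged
        same_members = assignment is not None and np.array_equal(new_assignment, assignment)
        # condition 3: every cluster center is unchanged
        same_centers = np.allclose(new_centers, centers)
        assignment, centers = new_assignment, new_centers
        if same_members or same_centers:
            break
    return centers, assignment, iteration

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
final_centers, final_assignment, n_iter = train_centers(X, [[0.0, 0.0], [0.0, 2.0]])
```

Even with the deliberately poor initial centers used here, the loop settles on the two natural groups well before the iteration cap is reached.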

In step 203, the recognition model is obtained according to the final K clustering centers and the training feature vector corresponding to each clustering center.

After the iteration is stopped, the final K clustering centers and the training feature vectors corresponding to each cluster are obtained. Each cluster comprises the training feature vectors of a plurality of devices. To determine the operating system type corresponding to a cluster, the number of devices of each operating system type in the cluster is counted, and the operating system type with the largest number is taken as the operating system type of the cluster. When a cluster contains a large number of devices, not all of them need to be counted: a sample may be drawn instead, with the sampling ratio determined according to the situation, and the operating system type with the largest number of devices in the sample is taken as the operating system type of the cluster. For example, if 10 devices belong to a certain cluster, of which 8 have operating system type A, 1 has type B and 1 has type C, then the operating system type of the cluster is type A.
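The majority vote, with optional sampling, can be sketched using only the standard library. The function name `label_clusters` and the `sample_ratio` and `seed` parameters are hypothetical; how ties are broken is not specified in the text, and here the type encountered first wins.

```python
import random
from collections import Counter

def label_clusters(assignment, os_labels, sample_ratio=1.0, seed=0):
    """Map each cluster id to the most frequent OS type among its devices.

    If sample_ratio < 1.0, only a random sample of each cluster is counted.
    """
    rng = random.Random(seed)
    members = {}
    for cluster_id, os_type in zip(assignment, os_labels):
        members.setdefault(cluster_id, []).append(os_type)

    labels = {}
    for cluster_id, os_types in members.items():
        if sample_ratio < 1.0:
            n = max(1, round(len(os_types) * sample_ratio))
            os_types = rng.sample(os_types, n)
        labels[cluster_id] = Counter(os_types).most_common(1)[0][0]
    return labels

# Cluster 0 reproduces the example in the text: 8 devices of type A, 1 of B, 1 of C.
assignment = [0] * 10 + [1] * 3
os_labels = ["A"] * 8 + ["B", "C"] + ["B", "B", "A"]
cluster_os = label_clusters(assignment, os_labels)
```

Counting all devices gives type A for cluster 0 and type B for cluster 1; with a sampling ratio below 1 the result depends on the sample drawn.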

After the operating system type corresponding to each cluster is obtained, the recognition model can be obtained according to the cluster center of each cluster and the operating system type of the cluster.

In the embodiment of the present application, the recognition model is obtained by clustering training on historical communication messages, so there is no need to label devices in advance to obtain a very rich label set, and the recognition model can accurately identify the operating system type of the target device according to its characteristic parameters.

Fig. 3 is a schematic flow chart of an operating system identification method provided in an embodiment of the present application, and as shown in fig. 3, the method is applied to an identification device, where the identification device may be an intelligent electronic device such as a desktop computer, a notebook computer, a tablet computer, a smart phone, and an intelligent wearable device. The method comprises the following steps:

step 301: acquiring a plurality of communication messages generated by communication between devices in a first preset time period;

step 302: extracting characteristic parameters corresponding to the target equipment from the plurality of communication messages, and obtaining characteristic vectors according to the characteristic parameters;

step 303: analyzing the characteristic vector by using an identification model to obtain an operating system type corresponding to the target equipment; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types.

Details regarding steps 301-303 are described below.

In step 301, a plurality of communication packets generated by inter-device communication within a first preset time period are obtained.

The first preset time period may be set manually in advance, for example to 5 seconds, 1 minute or 10 minutes, and should be neither too long nor too short. If it is too short, not enough communication messages corresponding to the target device are obtained, so that the feature parameters cannot be extracted in the subsequent steps. If it is too long, too many communication messages are generated: either all of them are used in the subsequent calculation, which increases the computational cost, or some of them have to be discarded, which adds operation steps and reduces identification efficiency. The plurality of communication messages may include only messages sent by the target device to other devices and messages received by the target device from other devices, or may also include communication messages between other devices.
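Selecting the communication messages of the first preset time period, optionally restricted to those sent or received by the target device, might look like the sketch below. The record layout with `time`, `src` and `dst` keys, the half-open window [start, start + duration) and the function name are assumptions made for illustration.

```python
def select_messages(messages, start, duration, target_ip=None):
    """Keep messages captured in [start, start + duration).

    If target_ip is given, keep only messages whose source or destination
    IP address is that of the target device.
    """
    end = start + duration
    selected = []
    for m in messages:
        if not (start <= m["time"] < end):
            continue
        if target_ip is not None and target_ip not in (m["src"], m["dst"]):
            continue
        selected.append(m)
    return selected

# Hypothetical capture; times are seconds since the start of the capture.
messages = [
    {"time": 0.5, "src": "10.0.0.1", "dst": "10.0.0.2"},
    {"time": 30.0, "src": "10.0.0.3", "dst": "10.0.0.4"},
    {"time": 59.9, "src": "10.0.0.2", "dst": "10.0.0.1"},
    {"time": 61.0, "src": "10.0.0.1", "dst": "10.0.0.3"},
]

window = select_messages(messages, start=0.0, duration=60.0)
for_target = select_messages(messages, start=0.0, duration=60.0, target_ip="10.0.0.1")
```

With a 60-second window, three messages are retained in total and two of them involve the device at 10.0.0.1.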

For step 302, extracting feature parameters corresponding to the target device from the plurality of communication messages, and obtaining feature vectors according to the feature parameters.

When different operating system manufacturers implement the TCP/IP protocol, the default values of some parameters differ. These differences are called operating system fingerprints and can be used to distinguish different operating systems. Furthermore, each device has a unique IP address, and each communication message includes a source IP address and a destination IP address, where the source IP address indicates the device that sent the communication message and the destination IP address indicates the device to which it is sent. Each communication message also carries feature parameters corresponding to the source IP address and feature parameters corresponding to the destination IP address. Therefore, the communication messages whose source IP address or destination IP address is the IP address of the target device can be selected from the plurality of communication messages, and the feature parameters of the target device can then be extracted from them. A feature parameter is field information in a communication message that can characterize the operating system type. A feature vector can be constructed from the acquired feature parameters.
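The selection of communication messages by IP address described above might be sketched as follows. This is a minimal sketch under stated assumptions: the `Message` record and the function name `messages_for_target` are hypothetical and do not appear in the application.

```python
from typing import List, NamedTuple


class Message(NamedTuple):
    """Hypothetical record holding only the two addresses of a message."""
    src_ip: str
    dst_ip: str


def messages_for_target(messages: List[Message], target_ip: str) -> List[Message]:
    """Keep the messages whose source or destination IP is the target's IP."""
    return [m for m in messages if target_ip in (m.src_ip, m.dst_ip)]
```

Messages exchanged only between other devices are dropped, so the later feature extraction sees only traffic that involves the target device.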

For step 303, analyzing the feature vector by using an identification model to obtain an operating system type corresponding to the target device; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types.

The feature vector is input into the recognition model, which analyzes it to obtain the operating system type corresponding to the target device. It can be understood that the recognition model is obtained by the clustering training described in the above embodiments, which is not repeated here.

In the embodiment of the present application, the feature parameters of the target device are extracted from the communication messages, the corresponding feature vector is constructed, and the feature vector is then analyzed by the recognition model obtained through clustering training, so that the operating system type of the target device is obtained. Compared with a fingerprint library matching method, this avoids the situation in which the operating system type cannot be determined because no matching entry exists in the fingerprint library. Compared with a model obtained by supervised learning, it avoids a large amount of label statistics. Therefore, the embodiment of the present application can improve both the accuracy and the efficiency of identification.

On the basis of the above embodiment, the extracting the feature parameters corresponding to the target device from the plurality of communication messages, and obtaining the feature vector according to the feature parameters includes:

extracting a first preset field from the communication messages whose source IP address is the IP address corresponding to the target device, and extracting a second preset field from the communication messages whose destination IP address is the IP address corresponding to the target device;

and obtaining the feature vector according to the first preset field and the second preset field.

In a specific implementation, a communication message includes at least the following fields: a TTL field, a DF field, a WS field, and a SACK Permitted field. The TTL field, the DF field, and the SACK Permitted field are attributes of the source IP address, and the WS field is an attribute of the destination IP address. Therefore, to extract the complete feature parameters of the target device, a first preset field, which includes at least the TTL field, the DF field, and the SACK Permitted field, may be extracted from the communication messages whose source IP address is the IP address of the target device, and a second preset field, which includes at least the WS field, may be extracted from the communication messages whose destination IP address is the IP address of the target device. The feature vector of the target device is then constructed from the TTL field, the DF field, the SACK Permitted field, and the WS field.
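The extraction of the first and second preset fields might be sketched as follows. This is a minimal sketch: the `Message` record, its field names, and the use of `None` for an absent SACK Permitted field are assumptions made for the example, and the direction from which each field is taken simply follows the description above.

```python
from typing import List, NamedTuple, Optional, Tuple


class Message(NamedTuple):
    """Hypothetical parsed message; field names are illustrative only."""
    src_ip: str
    dst_ip: str
    ttl: int
    df: int
    sack_permitted: Optional[int]  # None when the field is absent (assumption)
    ws: int


def extract_preset_fields(
    messages: List[Message], target_ip: str
) -> Tuple[List[Tuple[int, int, Optional[int]]], List[int]]:
    """Return (first preset fields, second preset fields) for the target device."""
    # First preset field: TTL, DF, SACK Permitted from messages sent by the target.
    first = [(m.ttl, m.df, m.sack_permitted) for m in messages if m.src_ip == target_ip]
    # Second preset field: WS from messages whose destination is the target.
    second = [m.ws for m in messages if m.dst_ip == target_ip]
    return first, second
```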

The default initial TTL value of an operating system typically takes one of three values (64, 128, 255). However, actual statistical data show that the TTL value corresponding to a source IP address is not unique across communication messages, but varies within a certain range. According to the distribution of the WS field values over the plurality of communication messages, the WS field is represented by a six-dimensional vector whose dimensions are the ratios of the minimum value, the maximum value, the lower quartile, the median, the upper quartile, and the mean of the WS values to the theoretical maximum value, which is 65535.

According to the above description, the process of establishing the feature vector can be summarized as follows: the values corresponding to the above four attributes are represented by a 1 × 11-dimensional vector. The first element of the vector represents the proportion of the communication messages whose DF value is 1 among all the communication messages corresponding to the IP address of the target device. The second, third, and fourth elements represent the proportions of the TTL values corresponding to the IP address of the target device that fall within (0, 64), (64, 128), and (128, 255), respectively. The fifth element represents the proportion of the communication messages whose SACK Permitted value is 4 among all the communication messages corresponding to the IP address of the target device. The sixth to eleventh elements represent the ratios of the minimum value, the maximum value, the lower quartile, the median, the upper quartile, and the mean of the WS values corresponding to the IP address of the target device to the WS theoretical maximum value 65535.

In the embodiment of the present application, the feature parameters corresponding to each device are extracted from the historical communication messages. Because different operating system types set these feature parameters to different values, performing clustering training on them yields a recognition model that can improve the accuracy of operating system type identification.
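The construction of the 1 × 11 feature vector might be sketched as follows. This is a minimal sketch under several assumptions that the text does not settle: the TTL intervals are treated as closed on the right, i.e. (0, 64], (64, 128], (128, 255]; quartiles are computed by linear interpolation, which is only one of several common conventions; every input list is non-empty; and the function and parameter names are hypothetical.

```python
from typing import List, Optional, Sequence

WS_MAX = 65535  # theoretical maximum of the WS field, as stated in the text


def quantile(sorted_values: Sequence[float], q: float) -> float:
    """Quantile by linear interpolation between neighbouring sorted values."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def build_feature_vector(
    dfs: Sequence[int],
    ttls: Sequence[int],
    sacks: Sequence[Optional[int]],
    wss: Sequence[int],
) -> List[float]:
    """Build the 11 elements in the order given in the description."""
    ws = sorted(wss)
    ws_stats = [
        ws[0],               # minimum
        ws[-1],              # maximum
        quantile(ws, 0.25),  # lower quartile
        quantile(ws, 0.5),   # median
        quantile(ws, 0.75),  # upper quartile
        sum(ws) / len(ws),   # mean
    ]
    return [
        sum(1 for d in dfs if d == 1) / len(dfs),            # 1: DF ratio
        sum(1 for t in ttls if 0 < t <= 64) / len(ttls),     # 2: first TTL ratio
        sum(1 for t in ttls if 64 < t <= 128) / len(ttls),   # 3: second TTL ratio
        sum(1 for t in ttls if 128 < t <= 255) / len(ttls),  # 4: third TTL ratio
        sum(1 for s in sacks if s == 4) / len(sacks),        # 5: SACK Permitted ratio
        *[value / WS_MAX for value in ws_stats],             # 6-11: WS ratios
    ]
```

Because every element is a proportion or a ratio to 65535, all eleven components lie in [0, 1], so no single field dominates a distance computed between two such vectors.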

On the basis of the above embodiment, the recognition model includes a plurality of operating system types and a clustering center corresponding to each operating system type; analyzing the feature vector by using the recognition model to obtain the operating system type corresponding to the target device, including:

and calculating the distance from the feature vector to the clustering center corresponding to each operating system type, and determining the operating system type corresponding to the clustering center closest to the feature vector as the operating system type corresponding to the target device.

In a specific implementation, a cluster center is itself a feature vector. When the operating system type corresponding to the target device is determined, the distance between the feature vector corresponding to the target device and each cluster center is calculated, and the operating system type corresponding to the cluster center closest to the feature vector of the target device is taken as the operating system type of the target device.

In the embodiment of the present application, the operating system type of the target device is determined by calculating the distance between its feature vector and each cluster center. This avoids the situation in which the operating system type cannot be determined because no entry in a fingerprint library matches, requires no large set of labels to be collected in advance, and can improve the accuracy of operating system type identification.
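The nearest-center decision might be sketched as follows. This is a minimal sketch: the text only speaks of a "distance", so the use of Euclidean distance here is an assumption, and the function name `identify_os` and the list-of-pairs model layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def identify_os(
    feature_vector: Sequence[float],
    labelled_centers: List[Tuple[str, Sequence[float]]],
) -> str:
    """Return the OS type of the cluster center closest to the feature vector."""
    # Euclidean distance is assumed; several centers may share one OS type.
    os_type, _ = min(
        labelled_centers,
        key=lambda item: math.dist(feature_vector, item[1]),
    )
    return os_type
```

Since the decision only needs a set of labelled centers, some operating system type is always returned, which is the contrast with fingerprint library matching drawn in the description.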

In addition, two data packets are captured separately, each comprising a plurality of communication messages. Clustering training is performed on the first data packet to obtain a recognition model; the label of each cluster center is then obtained through label collection and assigned to the corresponding class. Compared with the true label of each data point (device), the accuracy of the recognition model was found to be 95%.

Using the second data packet, features are extracted to obtain the feature vector of each device; the feature vectors are input into the recognition model, which assigns an operating system type to each device according to its feature vector. Compared with the true labels, the identification accuracy for devices with unknown operating systems is 91%.

Fig. 4 is a schematic structural diagram of an apparatus provided in an embodiment of the present application, where the apparatus may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus corresponds to the above-mentioned embodiment of the method of fig. 3, and can perform various steps related to the embodiment of the method of fig. 3, and the specific functions of the apparatus can be referred to the above description, and the detailed description is appropriately omitted here to avoid redundancy. The device includes: a data acquisition module 401, a feature extraction module 402, and an identification module 403, wherein:

the data acquisition module 401 is configured to acquire a plurality of communication messages generated by inter-device communication within a first preset time period; the feature extraction module 402 is configured to extract feature parameters corresponding to the target device from the multiple communication messages, and obtain feature vectors according to the feature parameters; the identification module 403 is configured to analyze the feature vector by using an identification model, and obtain an operating system type corresponding to the target device; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types.

On the basis of the foregoing embodiment, the feature extraction module 402 is specifically configured to:

On the basis of the above embodiment, the first preset field includes a time to live TTL field, a flag bit DF field for fragmentation or not, and a SACK Permitted field, and the second preset field includes a window size WS field;

the feature extraction module 402 is specifically configured to:

and constructing the feature vector according to the TTL field, the DF field, the SACK Permitted field and the WS field.

calculating the proportions of the TTL field values falling within (0, 64), (64, 128) and (128, 255), respectively, to obtain a first TTL ratio, a second TTL ratio and a third TTL ratio;

calculating ratios of the minimum value, the maximum value, the lower quartile, the median, the upper quartile and the average value in the values of the WS fields to the theoretical maximum value of the WS fields respectively to obtain a first WS ratio, a second WS ratio, a third WS ratio, a fourth WS ratio, a fifth WS ratio and a sixth WS ratio;

calculating the ratio of the number of DF fields whose value is 1 to the total number of DF fields to obtain the DF ratio;

calculating the ratio of the number of SACK Permitted fields whose value is 4 to the total number of SACK Permitted fields to obtain the SACK Permitted ratio;

and constructing the feature vector according to the first TTL ratio, the second TTL ratio, the third TTL ratio, the first WS ratio, the second WS ratio, the third WS ratio, the fourth WS ratio, the fifth WS ratio, the sixth WS ratio, the DF ratio and the SACK Permitted ratio.

On the basis of the above embodiment, the recognition model includes a plurality of operating system types and a clustering center corresponding to each operating system type; the identification module 403 is specifically configured to:

On the basis of the above embodiment, the apparatus further includes a model training module configured to:

acquiring a plurality of historical communication messages generated by communication among a plurality of devices within a second preset time period;

extracting training characteristic parameters corresponding to each device from the multiple historical communication messages according to the IP address corresponding to each device, and obtaining training characteristic vectors according to the training characteristic parameters;

performing clustering training by using training feature vectors corresponding to the devices to obtain the recognition model; the identification model comprises a plurality of operating system types and a cluster center corresponding to each operating system type.

On the basis of the above embodiment, the model training module is specifically configured to:

determining K clustering centers, wherein K is a positive integer;

based on the training feature vectors and the K clustering centers, performing center clustering in at least one iteration cycle to obtain the final K clustering centers and the training feature vectors corresponding to each clustering center;

obtaining the recognition model according to the final K clustering centers and the training feature vector corresponding to each clustering center;

wherein an iteration cycle comprises:

calculating the distance from each training feature vector to K clustering centers respectively, and distributing each training feature vector to the clustering center closest to each training feature vector to obtain K clusters, wherein each cluster comprises at least one training feature vector;

and calculating a corresponding new clustering center according to the training feature vector corresponding to the clustering.
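The iteration cycle above corresponds to the familiar K-means procedure and might be sketched as follows. This is a minimal sketch under stated assumptions: Euclidean distance, the mean of a cluster as its new center, stopping when the centers no longer change, and keeping the old center for a cluster that receives no vector are all choices made for the example; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def assign(vectors: Sequence[Sequence[float]], centers: Sequence[Vector]) -> List[List[Sequence[float]]]:
    """Distribute each training feature vector to its nearest cluster center."""
    clusters: List[List[Sequence[float]]] = [[] for _ in centers]
    for vector in vectors:
        nearest = min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))
        clusters[nearest].append(vector)
    return clusters


def update(clusters: Sequence[Sequence[Sequence[float]]], centers: Sequence[Vector]) -> List[Vector]:
    """Compute each new center as the component-wise mean of its cluster."""
    return [
        [sum(column) / len(cluster) for column in zip(*cluster)] if cluster else list(old)
        for cluster, old in zip(clusters, centers)
    ]


def kmeans(
    vectors: Sequence[Sequence[float]],
    initial_centers: Sequence[Sequence[float]],
    max_iterations: int = 100,
) -> Tuple[List[Vector], List[List[Sequence[float]]]]:
    """Run iteration cycles until the K centers stop moving or the limit is hit."""
    centers = [list(center) for center in initial_centers]
    for _ in range(max_iterations):
        new_centers = update(assign(vectors, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(vectors, centers)
```

How the K initial centers are chosen is left to the caller here, since the passage above only states that K clustering centers are determined.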

counting the number of training characteristic vectors corresponding to each operating system type belonging to the same clustering center, and determining the operating system type with the largest number of characteristic vectors as the operating system type corresponding to the clustering center;

and obtaining the identification model according to the final K clustering centers and the type of the operating system corresponding to each clustering center.
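The majority-vote labelling of cluster centers might be sketched as follows. This is a minimal sketch: the function name `label_centers`, the index-based representation of cluster membership, and returning `None` for a cluster with no labelled device are assumptions introduced for the example.

```python
from collections import Counter
from typing import List, Optional, Sequence


def label_centers(
    assignments: Sequence[int],
    os_types: Sequence[str],
    k: int,
) -> List[Optional[str]]:
    """For each of the K clusters, return the most frequent known OS type.

    assignments[i] is the cluster index of training vector i and
    os_types[i] is the known operating system type of the same device.
    """
    votes = [Counter() for _ in range(k)]
    for cluster_index, os_type in zip(assignments, os_types):
        votes[cluster_index][os_type] += 1
    return [vote.most_common(1)[0][0] if vote else None for vote in votes]
```

Pairing each returned label with the corresponding final cluster center gives a model of labelled centers of the kind the description refers to.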

In summary, the embodiment of the present application can determine the operating system of a device as long as the feature parameters of the device are available: whenever a device feature vector exists, inputting it into the trained recognition model outputs an operating system type. By contrast, an operating system identification technique based on fingerprint library matching cannot determine the operating system type when no completely matching entry is found in the fingerprint library.

Furthermore, the embodiment of the present application uses a clustering algorithm, which belongs to unsupervised learning: no laborious label statistics are needed before the training process, and only a small amount of label statistics is needed after the model is obtained, to determine the label of each cluster center for subsequently identifying devices whose operating system is unknown. Therefore, compared with operating system identification techniques based on machine learning classification algorithms, the efficiency of the embodiment of the present application is significantly improved.

Fig. 5 is a schematic diagram of the physical structure of an electronic device provided in an embodiment of the present application. As shown in Fig. 5, the electronic device includes: a processor 501, a memory 502, and a bus 503; wherein,

the processor 501 and the memory 502 communicate with each other through the bus 503;

the processor 501 is configured to call program instructions in the memory 502 to perform the methods provided by the above-mentioned method embodiments, for example, including: acquiring a plurality of communication messages generated by communication between devices in a first preset time period; extracting characteristic parameters corresponding to the target equipment from the plurality of communication messages, and obtaining characteristic vectors according to the characteristic parameters; analyzing the characteristic vector by using an identification model to obtain an operating system type corresponding to the target equipment; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types.

The processor 501 may be an integrated circuit chip having signal processing capabilities. The processor 501 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 502 may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising: acquiring a plurality of communication messages generated by communication between devices in a first preset time period; extracting characteristic parameters corresponding to the target equipment from the plurality of communication messages, and obtaining characteristic vectors according to the characteristic parameters; analyzing the characteristic vector by using an identification model to obtain an operating system type corresponding to the target equipment; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring a plurality of communication messages generated by communication between devices in a first preset time period; extracting characteristic parameters corresponding to the target equipment from the plurality of communication messages, and obtaining characteristic vectors according to the characteristic parameters; analyzing the characteristic vector by using an identification model to obtain an operating system type corresponding to the target equipment; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. An operating system identification method, comprising:

acquiring a plurality of communication messages generated by communication between devices in a first preset time period;

extracting characteristic parameters corresponding to the target equipment from the plurality of communication messages, and obtaining characteristic vectors according to the characteristic parameters;

analyzing the characteristic vector by using an identification model to obtain an operating system type corresponding to the target equipment; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types;

the identification model comprises a plurality of operating system types and a clustering center corresponding to each operating system type; the analyzing the feature vector by using the recognition model to obtain the operating system type corresponding to the target device includes:

calculating the distance from the feature vector to the clustering center corresponding to each operating system type, and determining the operating system type corresponding to the clustering center closest to the feature vector as the operating system type corresponding to the target device;

the extracting the feature parameters corresponding to the target device from the plurality of communication messages and obtaining the feature vectors according to the feature parameters include:

obtaining the feature vector according to the first preset field and the second preset field;

the first preset field comprises a Time To Live (TTL) field, a Don't Fragment (DF) flag bit field, and a SACK Permitted field of the selective acknowledgment option, and the second preset field comprises a window size (WS) field;

the obtaining the feature vector according to the first preset field and the second preset field includes:

constructing the feature vector according to the TTL field, the DF field, the SACK Permitted field and the WS field;

constructing the feature vector according to the TTL field, the DF field, the SACK Permitted field, and the WS field, including:

calculating the ratio of the number of DF fields whose value is 1 to the total number of DF fields to obtain the DF ratio;

2. The method of claim 1, wherein before obtaining the data packet generated by the inter-device communication within the first preset time period, the method further comprises:

3. The method of claim 2, wherein performing cluster training using the training feature vectors corresponding to the devices to obtain the recognition model comprises:

determining K clustering centers, wherein K is a positive integer;

wherein an iteration cycle comprises:

4. The method according to claim 3, wherein the obtaining the recognition model according to the final K cluster centers and the training feature vector corresponding to each cluster center comprises:

5. An operating system identification apparatus, comprising:

the data acquisition module is used for acquiring a plurality of communication messages generated by communication among the devices within a first preset time period;

the characteristic extraction module is used for extracting characteristic parameters corresponding to the target equipment from the plurality of communication messages and obtaining characteristic vectors according to the characteristic parameters;

the identification module is used for analyzing the characteristic vector by utilizing an identification model to obtain the type of the operating system corresponding to the target equipment; the identification model is obtained by performing clustering training by using feature vectors corresponding to a plurality of devices with known operating system types;

the identification model comprises a plurality of operating system types and a clustering center corresponding to each operating system type; the identification module is specifically configured to:

the feature extraction module is specifically configured to:

6. An electronic device, comprising: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any one of claims 1-4.

7. A non-transitory computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-4.