CN110502677B

CN110502677B - Equipment identification method, device and equipment, and storage medium

Info

Publication number: CN110502677B
Application number: CN201910312754.3A
Authority: CN
Inventors: 王滨; 万里; 王星; 何承润; 姚铮; 刘松
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2019-04-18
Filing date: 2019-04-18
Publication date: 2022-09-16
Anticipated expiration: 2039-04-18
Also published as: CN110502677A

Abstract

The invention provides a device identification method, apparatus, device and storage medium. The method comprises: acquiring source code information of a target webpage, wherein the target webpage is related to a target device; extracting, from the source code information, webpage features that characterize the target webpage; classifying the target device according to the webpage features; and, when the target device is determined to belong to a target category according to the webpage features, identifying attribute information of the target device according to the webpage features of the target device and preset attribute tag data. Device identification can still be achieved when the webpage source code of the device changes to a certain extent.

Description

Equipment identification method, device and equipment, and storage medium

Technical Field

The present invention relates to the field of information technologies, and in particular, to a device identification method, apparatus, device, and storage medium.

Background

With the rapid development of network technologies, devices in various systems need to be deployed in a network, and perform corresponding work based on the network. Taking a network video monitoring system as an example, the deployment mode of the network video monitoring system is gradually changed from a traditional local area network or private network-based mode to an internet-based mode, and the change gradually exposes the security problem of the network video monitoring system. Security management and safeguards of the network video devices themselves are becoming increasingly important. In order to uniformly and effectively monitor and manage various video devices, it is a problem to be solved to accurately discover video devices existing in a network and identify attribute information of the devices, such as brand information.

In a related device identification approach, the first step is to build a device fingerprint library: a specific probe packet is sent to an open port of a device, the content of the packet returned by the device is obtained, a characteristic string that distinguishes the device from devices of other brands is manually extracted from that content, and a regular expression corresponding to the characteristic string is constructed as the device fingerprint; fingerprints accumulated in this way form the device fingerprint library. The second step is to identify devices using the library: for a device to be identified, the probe packets corresponding to the fingerprints in the library are sent to the device in turn, and the content of each returned packet is matched, fuzzily or exactly, against the corresponding fingerprint to identify the device.

This approach has the following disadvantage: even a slight change in the content of the returned packet prevents the regular expression from matching, so identification fails because the device fingerprint no longer applies, and fingerprint changes caused by version upgrades, customized development and the like cannot be handled.

Disclosure of Invention

In view of this, the present invention provides a device identification method, apparatus, device, and storage medium, which can still realize device identification when the web page source code of the device changes to a certain extent.

A first aspect of the present invention provides a device identification method, including:

acquiring source code information of a target webpage, wherein the target webpage is related to target equipment;

extracting webpage features used for representing the target webpage from the source code information;

classifying the target equipment according to the webpage characteristics;

and when the target equipment is determined to belong to the target category according to the webpage characteristics, identifying the attribute information of the target equipment according to the webpage characteristics of the target equipment and preset attribute tag data.

According to an embodiment of the present invention, classifying the target device according to the web page feature includes: classifying the target device into a video device or a non-video device according to the webpage characteristics;

when the target device is determined to belong to the target category according to the webpage features, identifying the attribute information of the target device according to the webpage features of the target device and preset attribute tag data comprises: and when the target equipment is determined to be video equipment according to the webpage characteristics, identifying the attribute information of the video equipment according to the webpage characteristics of the video equipment and preset attribute tag data.

In accordance with one embodiment of the present invention,

the extracted webpage features are divided into M feature categories, wherein M is equal to 1 or more than 1; classifying the target device as a video device or a non-video device according to the web page feature comprises:

determining a feature vector corresponding to the webpage features belonging to each feature category;

and inputting each feature vector into a trained equipment category classifier, and identifying the equipment category of the target equipment as video equipment or non-video equipment by the equipment category classifier according to the input feature vector.

In accordance with one embodiment of the present invention,

identifying the attribute information of the target device according to the webpage characteristics of the target device and preset attribute tag data comprises:

and respectively inputting each feature vector to a corresponding attribute classifier, and calculating attribute information of the target equipment by each attribute classifier according to the input feature vector and preset attribute label data, wherein the attribute classifier input by the feature vector is related to the feature class corresponding to the feature vector.

According to one embodiment of the invention, the device class classifier is trained by:

acquiring sample source code information of S sample webpages, wherein S is larger than 1, and the S sample webpages are respectively related to S sample devices;

extracting sample characteristics of M characteristic categories from each sample source code information, wherein the sample characteristics represent corresponding sample webpages;

selecting target sample characteristics from all sample characteristics belonging to the same characteristic category, and determining a target sample characteristic vector corresponding to the target sample characteristics belonging to each characteristic category of each sample device;

and training by using all target sample feature vectors to obtain the equipment category classifier.

According to an embodiment of the present invention, the selecting the target sample feature from all sample features belonging to the same feature class includes:

acquiring category variables composed of the device categories of the S sample devices;

constructing reference characteristic variables corresponding to the same sample characteristic in all sample equipment, and calculating the correlation between the reference characteristic variables and the category variables;

and sequencing the sample characteristics according to the correlation, and selecting N sample characteristics with the top correlation as target sample characteristics, wherein N is more than 1.
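The selection of target sample features described above can be sketched as follows. This is a minimal illustration only: the text does not name a correlation measure, so the absolute Pearson correlation is an assumption, as are the function names `pearson` and `select_top_features` and the representation of each sample device as a dictionary of feature values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_features(samples, labels, n):
    """Rank features by |correlation| with the category variable and keep the top n.

    samples: one dict per sample device, mapping feature name -> numeric value.
    labels:  the category variable, e.g. 1 for video device, 0 for non-video device.
    """
    names = sorted({name for sample in samples for name in sample})
    scored = []
    for name in names:
        # Reference feature variable: the same feature observed across all sample devices.
        reference = [sample.get(name, 0) for sample in samples]
        scored.append((abs(pearson(reference, labels)), name))
    # Highest correlation first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:n]]
```

A feature that has the same value on every sample device carries no information about the category, so it receives a correlation of 0 and is ranked last.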

According to one embodiment of the invention, the attribute classifier is trained by:

and aiming at each feature class, training and optimizing at least two initial attribute classifiers by using the target sample feature vector corresponding to the feature class, and selecting the attribute classifier with the optimal attribute identification performance to determine the attribute classifier which is trained and corresponds to the feature class.

A second aspect of the present invention provides an apparatus for identifying a device, including:

the source code information acquisition module is used for acquiring source code information of a target webpage, and the target webpage is related to target equipment;

the webpage feature extraction module is used for extracting webpage features used for representing the target webpage from the source code information;

the target equipment classification module is used for classifying the target equipment according to the webpage characteristics;

and the attribute identification module is used for identifying the attribute information of the target equipment according to the webpage characteristics of the target equipment and preset attribute label data when the target equipment is determined to belong to the target category according to the webpage characteristics.

According to an embodiment of the present invention, when the target device classification module classifies the target device according to the web page feature, the target device classification module is specifically configured to: classifying the target device into a video device or a non-video device according to the webpage characteristics;

when determining that the target device belongs to the target category according to the web page feature, the attribute identification module is specifically configured to, when identifying the attribute information of the target device according to the web page feature of the target device and preset attribute tag data: and when the target equipment is determined to be video equipment according to the webpage characteristics, identifying the attribute information of the video equipment according to the webpage characteristics of the video equipment and preset attribute tag data.

According to one embodiment of the invention, the extracted web page features are divided into M feature categories, M being equal to 1 or greater than 1; the target device classification module includes:

the characteristic vector determining unit is used for determining a characteristic vector corresponding to the webpage characteristics belonging to each characteristic category;

and the target equipment classification unit is used for inputting each feature vector to the trained equipment classification classifier so as to identify the equipment classification to which the target equipment belongs as video equipment or non-video equipment by the equipment classification classifier according to the input feature vector.

According to one embodiment of the invention, the attribute identification module comprises:

and the attribute information determining unit is used for respectively inputting the feature vectors to the corresponding attribute classifiers so as to calculate the attribute information of the target equipment by the attribute classifiers according to the input feature vectors and preset attribute label data, wherein the attribute classifiers input by the feature vectors are related to the feature classes corresponding to the feature vectors.

According to an embodiment of the present invention, the attribute information includes attribute probabilities, the attribute probabilities represent probability values that the target device belongs to each designated attribute, and the number of the designated attributes is greater than 1;

after the attribute information determining unit, the attribute identifying module further includes:

the reference probability value calculation unit is used for multiplying each probability value output by each attribute classifier by the preset weight corresponding to the attribute classifier to obtain the reference probability value of each designated attribute of the attribute classifier;

the target probability value calculating unit is used for adding the reference probability values of the same designated attribute to obtain a target probability value of the designated attribute;

and the target attribute determining unit is used for determining the specified attribute with the maximum target probability value as the target attribute to which the target equipment belongs.
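The weighted combination performed by these three units can be sketched as follows. The function name `fuse_attribute_probabilities`, the classifier names and the weights in the usage note are hypothetical; only the multiply, sum and take-the-maximum steps come from the text.

```python
def fuse_attribute_probabilities(classifier_outputs, weights):
    """Combine per-classifier attribute probabilities into one target attribute.

    classifier_outputs: classifier name -> {designated attribute: probability value}
    weights:            classifier name -> preset weight for that classifier
    Returns (target attribute, {designated attribute: target probability value}).
    """
    target_probabilities = {}
    for name, probabilities in classifier_outputs.items():
        weight = weights[name]
        for attribute, probability in probabilities.items():
            # Reference probability value = probability * preset weight;
            # reference values of the same attribute are summed.
            reference = weight * probability
            target_probabilities[attribute] = target_probabilities.get(attribute, 0.0) + reference
    # The designated attribute with the largest target probability value wins.
    target_attribute = max(target_probabilities, key=target_probabilities.get)
    return target_attribute, target_probabilities
```

For instance, with three attribute classifiers weighted 0.5, 0.25 and 0.25 that output 0.6, 0.3 and 0.8 for one brand, that brand's target probability value is 0.5 × 0.6 + 0.25 × 0.3 + 0.25 × 0.8 = 0.575.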

According to one embodiment of the invention, the device class classifier is trained by a first training module comprising:

a sample source code information unit, configured to acquire sample source code information of S sample webpages, wherein S is greater than 1 and the S sample webpages are respectively related to S sample devices;

the sample feature extraction unit is used for extracting sample features of M feature categories from each sample source code information, and the sample features represent corresponding sample webpages;

the target sample characteristic selecting unit is used for selecting target sample characteristics from all sample characteristics belonging to the same characteristic category and determining a target sample characteristic vector corresponding to the target sample characteristics belonging to each characteristic category of each sample device;

and the first training unit is used for training by using all the target sample feature vectors to obtain the equipment category classifier.

According to an embodiment of the present invention, the target sample feature selecting unit includes:

a category variable acquiring subunit, configured to acquire a category variable composed of device categories of the S sample devices;

the correlation calculation subunit is used for constructing reference characteristic variables corresponding to the same sample characteristic in all the sample devices and calculating the correlation between the reference characteristic variables and the category variables;

and the target sample characteristic determining subunit is used for sequencing the sample characteristics according to the correlation, and selecting N sample characteristics with the top correlation as target sample characteristics, wherein N is greater than 1.

According to one embodiment of the invention, the attribute classifier is trained by a second training module comprising:

and the second training unit is used for training and optimizing at least two initial attribute classifiers by using the target sample feature vector corresponding to each feature class, and selecting the attribute classifier with the optimal attribute identification performance to determine the attribute classifier which is trained and corresponds to the feature class.

A third aspect of the invention provides an electronic device comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein, when the processor executes the program, the device identification method according to the foregoing embodiment is implemented.

A fourth aspect of the present invention provides a machine-readable storage medium on which a program is stored, the program, when executed by a processor, implementing the device identification method according to the foregoing embodiment.

Compared with the prior art, the embodiment of the invention at least has the following beneficial effects:

In the embodiment of the invention, webpage features that characterize the webpage are extracted from the source code information of a webpage related to the target device. The extracted webpage features serve as features related to the target device, and the device category and attribute information of the target device are identified from them. Unlike fingerprint-based identification, this does not require features that exactly match a device fingerprint, so the device can still be identified when its webpage source code changes to a certain extent.

In addition, device identification is performed with two-layer classification. The first layer only needs to focus on the features that distinguish whether the target device belongs to the target category, and the second layer identifies attribute information only when the first layer has determined that the device belongs to the target category. This eliminates interference from non-target categories in attribute identification and makes identification more accurate.

Drawings

FIG. 1 is a schematic flowchart of a device identification method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a device identification apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of training to obtain a device category classifier according to an embodiment of the present invention;

FIG. 4 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

In order to make the description of the embodiments of the present invention more clear and concise, some technical terms are explained below.

Supervised learning: a machine learning method that learns a model from a training sample set and uses the model to make predictions on test samples. Its characteristic is that the input training data contain both the features of each sample and the expected output for that sample (e.g., the class labels of samples in a classification problem).

Unsupervised learning: a machine learning method whose training data, unlike those of supervised learning, need not contain the expected output for each sample. A common form of unsupervised learning is data clustering.

In a binary classification problem, samples under test can be divided into the following 4 classes according to their true class and predicted class: a true positive (TP) is a positive sample correctly predicted as "positive"; a true negative (TN) is a negative sample correctly predicted as "negative"; a false positive (FP) is a negative sample wrongly predicted as "positive", i.e., a false alarm; a false negative (FN) is a positive sample wrongly predicted as "negative", i.e., a miss.

In a binary classification problem, the following performance indexes can be used as criteria for evaluating a trained classifier:

Precision: Precision = NTP / (NTP + NFP);

Recall: Recall = NTP / (NTP + NFN);

F1 score (F1-score): the harmonic mean of Precision and Recall, F1-score = (2 × Precision × Recall) / (Precision + Recall);

Accuracy (ACC): ACC = (NTP + NTN) / (NTP + NTN + NFP + NFN);

wherein NTP refers to the number of TPs, NTN refers to the number of TNs, NFP refers to the number of FPs, and NFN refers to the number of FNs.
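As a small worked illustration of these four indexes, the sketch below computes them from the counts NTP, NTN, NFP and NFN. The function name `binary_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for the example.

```python
def binary_metrics(n_tp, n_tn, n_fp, n_fn):
    """Compute precision, recall, F1-score and accuracy from the four counts."""
    predicted_positive = n_tp + n_fp
    actual_positive = n_tp + n_fn
    total = n_tp + n_tn + n_fp + n_fn
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = n_tp / predicted_positive if predicted_positive else 0.0
    recall = n_tp / actual_positive if actual_positive else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0.0
    accuracy = (n_tp + n_tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

For example, with NTP = 6, NTN = 10, NFP = 2 and NFN = 2, precision and recall are both 6/8 = 0.75, so the F1-score is also 0.75, and the accuracy is 16/20 = 0.8.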

Convolutional Neural Network (CNN): a neural network model commonly used for image recognition and natural language processing.

One-vs-rest SVM classifier (OvR-SVM): handles a multi-class problem with SVM classifiers. For example, given k classes (k greater than 1), one SVM classifier is trained for each class in turn to separate that class from all other classes; the resulting k SVM classifiers together form an OvR-SVM.

Bag-of-words model: a text representation method in which a text is represented as a vector of the number of occurrences of each word it contains.
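A bag-of-words vector can be sketched in a few lines. The tokenization rule (lowercase runs of letters and digits) and the fixed vocabulary passed in by the caller are assumptions made for the example, not something the text specifies.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in text, in vocabulary order."""
    # Assumed tokenization: lowercase alphanumeric runs.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [counts[word] for word in vocabulary]
```

With the vocabulary `["camera", "login", "video"]`, the text "Live view: camera login, camera settings" is represented as `[2, 1, 0]`.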

The following describes the device identification method according to the embodiment of the present invention more specifically, but not limited thereto.

In one embodiment, referring to FIG. 1, a method of device identification is shown, the method comprising the steps of:

S100: acquiring source code information of a target webpage, wherein the target webpage is related to a target device;

S200: extracting, from the source code information, webpage features that characterize the target webpage;

S300: classifying the target device according to the webpage features;

S400: when the target device is determined to belong to the target category according to the webpage features, identifying the attribute information of the target device according to the webpage features of the target device and preset attribute tag data.
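The control flow of steps S100 to S400 can be sketched as below. All four callables and the name `identify_device` are hypothetical placeholders supplied by the caller; the sketch only shows that attribute identification runs when, and only when, the first classification yields the target category.

```python
def identify_device(fetch_source, extract_features, classify_category,
                    identify_attribute, target_category="video"):
    """Run the two-layer identification flow with caller-supplied steps."""
    source = fetch_source()                     # S100: acquire source code information
    if not source:
        return None                             # nothing returned: identification ends
    features = extract_features(source)         # S200: extract webpage features
    category = classify_category(features)      # S300: first-layer classification
    if category != target_category:
        # Not in the target category: attribute identification is skipped.
        return {"category": category, "attribute": None}
    attribute = identify_attribute(features)    # S400: second-layer identification
    return {"category": category, "attribute": attribute}
```

Passing the steps in as functions keeps the sketch independent of any particular classifier or feature extractor.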

The device identification method of the embodiment of the invention can be applied to an electronic device. The electronic device may be a computer device, a mobile device or the like, and only needs to have information processing capability. The electronic device is preferably connected to the target device via a network.

In step S100, the source code information of the target web page may be obtained by sending a web page source code request to the target device. The target webpage is related to the target device, and may be a webpage for describing the target device, a webpage for managing the target device, and the like. The source code information may be code of the target web page, such as CSS code or HTML code.

The target device may be any device in the video monitoring network, that is, a device in the network where the video monitoring system is located, and may be a video device or a non-video device. In this embodiment of the present invention, a video device may also be referred to as a video monitoring device, and may include: network cameras (IPC), video monitoring platform devices, Network Video Recorders (NVR), Digital Video Recorders (DVR), and the like. Of course, the target device may be a device in other networks, depending on the particular desired identified device.

If the target webpage exists on the server, the target device can obtain the source code of the target webpage from the server and return the source code information; if the target webpage does not exist on the server, no information or an error message is returned, and identification may end.

In step S200, a web page feature for characterizing the target web page is extracted from the source code information.

The web page features may be, for example, features of tags in the web page, features of link addresses, and the like. Different web pages have different web page characteristics, and thus a web page characteristic may characterize a web page.

In this embodiment, the web page feature obtained from the source code information of the target web page may represent the target web page, and since the target web page is related to the target device, the web page feature may be used as a related feature of the target device, and the target device may be identified by using the web page feature.

In step S300, the target devices are classified according to the web page features.

The target device can be classified according to the web page features in various ways. For example, a neural network can be trained to classify the target device according to the web page features. As another example, features of known device categories can be preset; the preset features are then searched for one that matches the web page features and, if one is found, the device category corresponding to it is determined as the category of the target device. Of course, the classification method is not limited to these, as long as the target device can be classified according to the web page features.

In this embodiment, when classifying the target device, only whether the target device belongs to the target class may be distinguished, and step S400 is executed only when the target device belongs to the target class. The target category is, for example, a video device category, that is, only when the target device is a video device, step S400 is further executed, otherwise, the identification of the attribute information of the target device may not be required.

Of course, the target category is not limited to the video device category, and may be another category such as a voice device category.

It is to be understood that, when classifying the target devices, it is also possible to determine what category the target devices belong to, for example, if the target devices are video devices, it is possible to determine that the target devices belong to the video device category, if the target devices are voice devices, it is possible to determine that the target devices belong to the voice device category, and so on.

In step S400, when the target device is determined to belong to the target category according to the web page feature, the attribute information of the target device is identified according to the web page feature of the target device and preset attribute tag data.

Since the attribute information of the target device is recognized only when the target device belongs to the target category, the attribute information is attribute information of a device belonging to the target category. The preset attribute tag data may be calibrated with attributes of various devices belonging to the target class. For example, when the target device is a video device, the attribute information corresponds to attribute information of the video device, and the attribute information is brand information, for example.

The probability that the target device belongs to each item of attribute tag data may be calculated according to the web page features, in which case the attribute information is that set of probabilities. Alternatively, which item of the preset attribute tag data the target device corresponds to may be calculated according to the web page features, in which case the attribute information is the attribute tag data corresponding to the target device.

In the case where the attribute is a brand, the preset attribute tag data may include brand information such as HK, D1, T1, AX, etc., and the attribute information may be a probability that the brand of the target device is brand information such as HK, D1, T1, AX, or may be a specific brand of the several brands.

In one embodiment, the above method flow may be executed by the device identification apparatus 100. As shown in FIG. 2, the device identification apparatus 100 mainly includes 4 modules: a source code information acquisition module 101, a web page feature extraction module 102, a target device classification module 103 and an attribute identification module 104. The source code information acquisition module 101 is configured to perform step S100, the web page feature extraction module 102 is configured to perform step S200, the target device classification module 103 is configured to perform step S300, and the attribute identification module 104 is configured to perform step S400.

In one embodiment, in step S100, acquiring the source code information of the target webpage includes the following steps:

S101: sending an HTTP or HTTPS request to the target device to request the source code information of the target webpage;

S102: receiving the source code information of the target webpage returned by the target device.

In step S101, the HTTP (Hypertext Transfer Protocol) or HTTPS (Hypertext Transfer Protocol Secure, i.e. HTTP over Secure Socket Layer) port of the target device may be accessed, and an HTTP or HTTPS request may be sent to it.

On receiving the HTTP or HTTPS request, the target device may request the source code information of the target web page from the server and, after receiving the source code information issued by the server, return it to the requesting party.

In step S102, the source code information of the target web page returned by the target device is received, and the received source code information is used for feature extraction.
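Steps S101 and S102 can be sketched with the Python standard library as follows. The function names, the default ports 80 and 443, the root path and the timeout are assumptions made for the example; the `opener` parameter exists only so that the request function can be swapped out, for instance in tests.

```python
import urllib.error
import urllib.request


def candidate_urls(host, http_port=80, https_port=443, path="/"):
    """Build the HTTP and HTTPS URLs to try for a target device (assumed defaults)."""
    return [f"http://{host}:{http_port}{path}", f"https://{host}:{https_port}{path}"]


def fetch_source(host, timeout=5.0, opener=urllib.request.urlopen):
    """Return the page source from the first URL that answers, or None.

    S101: send an HTTP or HTTPS request to the target device.
    S102: receive the returned source code information.
    """
    for url in candidate_urls(host):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, OSError, ValueError):
            # No page, an error response, or an unreachable port: try the next URL.
            continue
    return None
```

A `None` result corresponds to the case in which no information or an error is returned and identification ends. Devices that present self-signed certificates would fail the default HTTPS verification in this sketch and fall through to `None`.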

In one embodiment, after step S102, the method may further include:

checking whether the received source code information has an automatic jump statement;

if so, acquiring source code information of the webpage to which the automatic jump statement jumps, and using the source code information for feature extraction.

A video device can implement an automatic jump, for example by using Javascript scripts. Automatic jump statements can be implemented in the following ways:

1) configuring the document.location variable in a Javascript script, for example:

document.location.replace('./home/monitoring.cgi')

2) configuring the window.location variable in a Javascript script, for example:

window.location.href="/doc/page/login.asp?_"+(new Date()).getTime();

3) configuring the top.location variable in a Javascript script, for example:

top.location.href="login.htm?_="+new Date().getTime()

4) configuring the HTTP-EQUIV attribute of a META tag as "Refresh" and specifying the jump through the CONTENT attribute, for example:

<META HTTP-EQUIV="Refresh" CONTENT="0; URL=/view/viewer_index.shtml?id=519">
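Checking the received source code for an automatic jump statement can be sketched with two regular expressions covering the four forms listed above. The patterns and the name `find_jump_target` are illustrative assumptions; they capture only a quoted literal target and do not evaluate Javascript, so dynamically built suffixes such as timestamps are ignored.

```python
import re

# Forms 1) to 3): document/window/top .location with .href = "..." or .replace("...")
JS_JUMP = re.compile(
    r"""(?:document|window|top)\.location(?:\.href\s*=|\.replace\s*\()\s*["']([^"']+)["']""",
    re.IGNORECASE,
)

# Form 4): <META HTTP-EQUIV="Refresh" CONTENT="0; URL=...">
META_REFRESH = re.compile(
    r"""<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*content\s*=\s*["'][^"'>]*?url\s*=\s*([^"'>\s]+)""",
    re.IGNORECASE,
)


def find_jump_target(source):
    """Return the literal jump target found in the page source, or None."""
    match = JS_JUMP.search(source) or META_REFRESH.search(source)
    return match.group(1) if match else None
```

When a target is found, it would be resolved against the URL of the current page (for example with `urllib.parse.urljoin`) and requested in turn, and the source code of the page reached is then used for feature extraction.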

In one embodiment, in step S300, classifying the target device according to the web page features includes: classifying the target device as a video device or a non-video device according to the web page features;

in step S400, when it is determined that the target device belongs to the target category according to the web page features, identifying attribute information of the target device according to the web page features of the target device and preset attribute tag data includes: when the target device is determined to be a video device according to the web page features, identifying the attribute information of the video device according to the web page features of the video device and preset attribute tag data.

As mentioned in the background art, the deployment of the network video monitoring system is gradually changed from the traditional local area network or private network based mode to the internet based mode, and this change gradually exposes the security problem of the network video monitoring system itself. Security management and safeguards of the network video devices themselves are becoming increasingly important. In order to uniformly and effectively monitor and manage various video devices, it is a problem to be solved to accurately discover the video devices existing in the network and identify attribute information of the devices, such as brand information.

In this embodiment, it is possible to discover video devices in a network by identifying whether a target device is a video device, and further identify attribute information of the video device, such as brand information, when the target device is a video device, so that the above problems can be solved, and various video devices can be effectively monitored and managed.

In one embodiment, in step S200, the extracted web page features are divided into M feature categories, where M is equal to 1 or greater than 1;

classifying the target device as a video device or a non-video device according to the web page feature comprises:

When the webpage features are extracted from the source code information, the webpage features comprise more than two feature categories, each feature category comprises at least one webpage feature, and the target webpage can be represented from different angles. The characteristics of the webpage are described by the webpage characteristics of different characteristic categories from different angles, so that the equipment is comprehensively depicted from multiple angles, and the identification accuracy can be improved.

For each feature category, the web page features of that category are converted into a vector form suitable as input to a machine learning classification method. That is, each category of web page features is converted into one feature vector, so that M feature vectors are obtained.

For example, the extracted web page features can be classified into the following three categories:

web page tag statistical features, obtained by counting the frequency of each tag in the device's web page source code;

link address statistical features, obtained by performing statistics on data related to link addresses in the device's web page source code;

text content features, obtained by counting the frequency of each word appearing within all tags in the device's web page source code.

Of course, the feature categories of the web page features are not particularly limited to the above categories. The web page features of these three feature categories are detailed below:

first, webpage tag statistical characteristics:

the number of occurrences of the web page tag determines the structure and appearance of the web page itself. Therefore, the difference of the appearance and structure of the web page of the devices with different device types or attributes can be reflected in the statistical characteristics of the web page tags.

The tags whose frequencies are to be counted can be taken from an HTML tag list, with every tag in the list used as a statistical object. The frequency of each tag in the web page source code (the number of times the tag appears) is counted. The frequency of each tag is then normalized, mapping it to the range [0, 1]. All the normalized tag frequencies together form a vector, which is the feature vector of the device's web page tag statistical feature category.
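The tag counting and normalization described above could be sketched as follows. The shortened `TAG_LIST` stands in for a full HTML tag list, and dividing by the largest count is only one assumed way of mapping frequencies to [0, 1]; the text does not specify the normalization.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed, shortened stand-in for a complete HTML tag list.
TAG_LIST = ["a", "div", "form", "img", "input", "script", "table"]


class TagCounter(HTMLParser):
    """Counts how many times each start tag appears in the source."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_frequency_vector(source, tag_list=TAG_LIST):
    parser = TagCounter()
    parser.feed(source)
    parser.close()
    raw = [parser.counts[tag] for tag in tag_list]
    peak = max(raw) if raw else 0
    # Assumption: normalize by the largest count so every value lies in [0, 1].
    return [count / peak if peak else 0.0 for count in raw]


page = "<div><a href='x'>1</a><a>2</a><img src='y'></div>"
print(tag_frequency_vector(page))
```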

Second, link address statistics:

links contained in a web page may represent the relationship of the web page to other external resources such as web pages, files, code libraries, and the like. Since the external resources on which the web pages of different device classes or properties depend are often different, the link feature can be used to distinguish between web pages of different device classes or properties.

The data associated with the link address may include HTML tags, attributes of the link, and may include at least one of:

<link> tag, its href attribute;

<a> tag, its href attribute;

<nav> tag, its href attribute;

<base> tag, its href attribute;

<base> tag, its target attribute;

<script> tag, its src attribute;

<img> tag, its src attribute;

<form> tag, its action attribute.

For example, in this embodiment, the statistical characteristics of the link addresses are as shown in Table (1):

Table (1)

The Boolean value list of the web page feature with sequence number 14 is constructed as follows: all external JavaScript files introduced by <link> tags and <script> tags in the sample web pages used during training are considered; if a given external JavaScript file is introduced by a <script> or <link> tag in the web page of a certain device, the value of the corresponding web page feature is 1, otherwise it is 0. The Boolean value list of the web page feature with sequence number 15 is constructed in the same way from the external CSS files introduced by <link> tags in the sample web pages used during training: if a given external CSS file is introduced by a <link> tag in the web page of a certain device, the value of the corresponding web page feature is 1, otherwise it is 0.
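A minimal sketch of the two Boolean value lists follows. The helper names (`collect`, `build_vocabulary`, `boolean_lists`) are hypothetical, and telling JavaScript from CSS by file extension for <link> tags is an assumption made only for illustration.

```python
from html.parser import HTMLParser


class ExternalFileCollector(HTMLParser):
    """Collects external JavaScript and CSS addresses from page source."""

    def __init__(self):
        super().__init__()
        self.js_files = set()
        self.css_files = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and attributes.get("src"):
            self.js_files.add(attributes["src"])
        elif tag == "link" and attributes.get("href"):
            href = attributes["href"]
            # Assumption: classify <link> targets by file extension.
            if href.lower().endswith(".js"):
                self.js_files.add(href)
            elif href.lower().endswith(".css"):
                self.css_files.add(href)


def collect(source):
    parser = ExternalFileCollector()
    parser.feed(source)
    parser.close()
    return parser.js_files, parser.css_files


def build_vocabulary(sample_sources):
    """Gather every external file seen across the training sample pages."""
    js_vocab, css_vocab = set(), set()
    for source in sample_sources:
        js, css = collect(source)
        js_vocab |= js
        css_vocab |= css
    return sorted(js_vocab), sorted(css_vocab)


def boolean_lists(source, js_vocab, css_vocab):
    """1 if the device page introduces the file, otherwise 0."""
    js, css = collect(source)
    return ([1 if f in js else 0 for f in js_vocab],
            [1 if f in css else 0 for f in css_vocab])


samples = ['<script src="a.js"></script><link href="s.css">',
           '<script src="b.js"></script>']
js_vocab, css_vocab = build_vocabulary(samples)
print(boolean_lists('<script src="b.js"></script>', js_vocab, css_vocab))
```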

Third, text content characteristics:

the text content of a web page often expresses the information of the target, function, attribution, etc. of the web page, so that the device category and attribute of the device can be distinguished based on the text content of the web page. The text content refers to the text content expressed by the non-code language in the source code and contains the information required to express the webpage.

When the source code information is obtained, traversing all the labels in the source code information, and sequentially obtaining text contents in all the labels; and splicing all the acquired text contents to obtain the text contents of the webpage, and analyzing the text contents to obtain the text content characteristics.

For example, for a Chinese web page, a word segmentation tool can be used to segment the obtained text content into a sequence of words; a bag-of-words model is then used to convert the word sequence into a word frequency vector, and the frequency vector is normalized to obtain the text content features of the web page. Word segmentation tools include, for example, Jieba word segmentation and Baidu Chinese word segmentation.
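The text extraction and bag-of-words steps above might look like the sketch below. Several choices here are assumptions for illustration: a simple regular-expression tokenizer replaces a Chinese word segmentation tool, <script> and <style> contents are skipped as code, the vocabulary is supplied by the caller, and counts are normalized by their sum.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text inside tags, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def page_text(source):
    parser = TextCollector()
    parser.feed(source)
    parser.close()
    # Splice the text of all tags into one string.
    return " ".join(parser.parts)


def text_feature_vector(text, vocabulary):
    # Assumption: regex tokenization stands in for a word segmentation tool.
    counts = Counter(re.findall(r"\w+", text.lower()))
    raw = [counts[word] for word in vocabulary]
    total = sum(raw)
    # Assumption: normalize so the frequencies sum to 1.
    return [count / total if total else 0.0 for count in raw]


html = ("<html><title>Camera Login</title><script>var login=1;</script>"
        "<body><p>login camera camera</p></body></html>")
print(text_feature_vector(page_text(html), ["camera", "login", "router"]))
```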

After the M feature vectors are obtained, the feature vectors can be input to a trained device class classifier, and the device class classifier identifies the device class to which the target device belongs as a video device or a non-video device according to the input feature vectors.

The device class classifier is pre-trained and may be pre-stored in the electronic device or stored in an external device and invoked when needed.

After each feature vector is input into the device type classifier, the device type classifier can identify whether the target device is a video device according to the feature vectors. And when the device type classifier identifies that the target device is the video device, performing subsequent operations.

The device class classifier may use a binary classification algorithm to identify the device class to which the target device belongs according to the feature vectors, and obtain two results, one is that the target device belongs to the video device class, and the other is that the target device belongs to the non-video device class.

The device category classifier can be implemented by using a Support Vector Machine (SVM) classification algorithm, a Logistic regression classification algorithm, a decision tree classification algorithm, or a naive bayes classification algorithm, and the specific classification algorithm is not limited.

In one embodiment, in step S400, identifying the attribute information of the target device according to the web page feature of the target device and preset attribute tag data includes:

inputting the feature vectors into their corresponding attribute classifiers respectively, each attribute classifier calculating attribute information of the target device according to the input feature vector and preset attribute tag data, wherein the attribute classifier into which a feature vector is input is determined by the feature category corresponding to that feature vector.

If the attribute classifier is to distinguish more than two attribute values, it can be implemented with a multi-class classification algorithm; if it only needs to decide whether the device has one particular attribute value, such as whether the brand of the target device is HK, a binary classification algorithm can be used. The number of preset attribute tag data in the attribute classifier corresponds to the identifiable attribute information; for example, when the classifier identifies whether the brand of the target device is HK, two attribute tag data, HK and non-HK, may be preset.

The different attribute classifiers correspond to different feature classes, the feature vector of each feature class can be input into the attribute classifier corresponding to the feature class, and each attribute classifier calculates the attribute information of the target device according to the input feature vector.

Each attribute classifier may output a plurality of attribute information. For example, the target category is a video category, the brands of video devices may be many, such as HK, D1, T1, AX, etc., and thus the attribute classifier may be a multi-classifier that can distinguish the brands. The attribute classifier may be implemented using a decision tree classifier, OvR-SVM classifier, CNN classifier, or the like.

The attribute classifier only identifies attribute information for devices belonging to the target category, which improves the attribute identification precision for devices of the target category. Each feature category corresponds to one attribute classifier, so the attribute classifier best suited to the web page features of each feature category can be chosen according to its classification performance on those features.

In one embodiment, the attribute information includes an attribute probability;

after each feature vector is input to the corresponding attribute classifier, the method further comprises the following steps:

for each probability value output by each attribute classifier, multiplying each probability value by a preset weight corresponding to the attribute classifier to obtain a reference probability value of each designated attribute of the attribute classifier;

adding the reference probability values of the same designated attribute to obtain a target probability value of the designated attribute;

and determining the specified attribute with the maximum target probability value as the target attribute to which the target equipment belongs.

In this embodiment, the attribute is taken as a brand for example, and the specified attribute is a specified brand. For example, if each attribute classifier marks four specified brands, i.e., HK, D1, T1, and AX, the attribute classifier outputs a probability that the target device belongs to the HK brand, a probability that the target device belongs to the D1 brand, a probability that the target device belongs to the T1 brand, and a probability that the target device belongs to the AX brand.

The attribute probabilities output by the attribute classifiers can be subjected to comprehensive statistics, and the target brand to which the target equipment belongs is determined according to the statistical result. The comprehensive statistical manner may be a manner of averaging probabilities of the same brand, or a manner of voting with probability weighting of the same brand, and the specific manner is not limited.

For example, there are three attribute classifiers corresponding to the feature classes, which are respectively a first attribute classifier, a second attribute classifier, and a third attribute classifier, wherein,

the first attribute classifier outputs the following result: the probabilities that the target device belongs to the HK, D1, T1 and AX brands are 70%, 10%, 10% and 10%, respectively;

the second attribute classifier outputs the following result: the probabilities that the target device belongs to the HK, D1, T1 and AX brands are 80%, 5%, 5% and 10%, respectively;

the result output by the third attribute classifier is: the probability that the target device belongs to the brand HK, D1, T1, AX is 70%, 19%, 10%, 1%.

In this embodiment, the attribute probabilities output by the attribute classifiers are integrated in a weighted voting manner, so that the importance of the web page features of various feature categories can be weighed to determine the final target brand.

For example, the preset weight corresponding to the first attribute classifier is 0.5, the preset weight corresponding to the second attribute classifier is 0.3, and the preset weight corresponding to the third attribute classifier is 0.2, and the reference probability value of each designated brand of each attribute classifier is calculated, with the following results:

for the first attribute classifier, the reference probability values for the HK, D1, T1, AX brands are 70% × 0.5 = 0.35, 10% × 0.5 = 0.05, 10% × 0.5 = 0.05, and 10% × 0.5 = 0.05, respectively;

for the second attribute classifier, the reference probability values for the HK, D1, T1, AX brands are 80% × 0.3 = 0.24, 5% × 0.3 = 0.015, 5% × 0.3 = 0.015, and 10% × 0.3 = 0.03, respectively;

for the third attribute classifier, the reference probability values for the HK, D1, T1, AX brands are 70% × 0.2 = 0.14, 19% × 0.2 = 0.038, 10% × 0.2 = 0.02, and 1% × 0.2 = 0.002, respectively.

The reference probability values of the same specified brand are then added to obtain the target probability value of that brand, with the following results:

the sum of the reference probability values of the HK brand across the attribute classifiers is 0.35 + 0.24 + 0.14 = 0.73;

the sum of the reference probability values of the D1 brand across the attribute classifiers is 0.05 + 0.015 + 0.038 = 0.103;

the sum of the reference probability values of the T1 brand across the attribute classifiers is 0.05 + 0.015 + 0.02 = 0.085;

the sum of the reference probability values of the AX brand across the attribute classifiers is 0.05 + 0.03 + 0.002 = 0.082.

In other words, the target probability value for HK brand is 0.73, the target probability value for D1 brand is 0.103, the target probability value for T1 brand is 0.085, and the target probability value for AX brand is 0.082.

The target probability value of the HK brand is highest, and thus the HK brand is determined as the target brand to which the target device belongs.
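The weighted voting in this example can be reproduced with a short sketch. The function name `weighted_vote` is hypothetical; the per-brand probabilities are those implied by the reference probability values and sums given in the example, and the weights are the preset weights 0.5, 0.3 and 0.2.

```python
BRANDS = ["HK", "D1", "T1", "AX"]


def weighted_vote(outputs, weights, brands=BRANDS):
    """Combine per-classifier brand probabilities using preset classifier weights."""
    targets = {brand: 0.0 for brand in brands}
    for probabilities, weight in zip(outputs, weights):
        for brand, probability in zip(brands, probabilities):
            # Reference probability value = probability * classifier weight;
            # values for the same brand are summed into the target probability.
            targets[brand] += probability * weight
    best = max(targets, key=targets.get)
    return best, targets


outputs = [
    [0.70, 0.10, 0.10, 0.10],  # first attribute classifier
    [0.80, 0.05, 0.05, 0.10],  # second attribute classifier
    [0.70, 0.19, 0.10, 0.01],  # third attribute classifier
]
weights = [0.5, 0.3, 0.2]

best, targets = weighted_vote(outputs, weights)
print(best, {brand: round(value, 3) for brand, value in targets.items()})
```

With these inputs the HK brand receives the largest target probability value (about 0.73), matching the result stated above.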

Before the above steps are performed, a device class classifier and an attribute classifier can be obtained through training.

In one embodiment, referring to fig. 3, the device class classifier is trained by:

a100: acquiring sample source code information of S sample webpages, wherein S is larger than 1, and the S sample webpages are respectively related to S sample devices;

a200: extracting sample characteristics of M characteristic categories from each sample source code information, wherein the sample characteristics represent corresponding sample webpages;

a300: selecting target sample characteristics from all sample characteristics belonging to the same characteristic category, and determining a target sample characteristic vector corresponding to the target sample characteristics belonging to each characteristic category of each sample device;

a400: and training by using all target sample feature vectors to obtain the equipment category classifier.

In step a100, the manner of obtaining the sample source code information of the sample web page may be the same as the manner of obtaining the web page source code in step S100, or the sample source code information may be obtained on the sample web page, and the specific manner is not limited.

The specific device classes and attributes of the sample devices are not limited. In the case where the target category is the video device category, camera devices of 24 brands such as HK, T1, D1, and AX may be prepared as sample devices belonging to the target category; meanwhile, a number of common devices hosting websites such as blogs, e-commerce sites, portals, and router management backends are prepared as sample devices belonging to non-target categories. There are S sample devices in total, and each sample device corresponds to one sample web page.

In step a200, when the sample features are extracted, the required feature types are the same as the feature types in step S200, and the extraction manner may also be the same, which may refer to the contents of the foregoing embodiments specifically, and will not be described herein again.

In step a300, after the sample features of all the feature classes of all the sample devices are obtained, sample feature screening is performed on each class, and a suitable target sample feature in each class of sample features is selected. Because the effect of each sample feature on classification in each type of sample feature is different, the target sample feature with larger effect on classification in each type of sample feature can be selected.

After the target sample features of each feature class are determined, for each sample device, a corresponding target sample feature vector is determined based on the target sample features belonging to each feature class. I.e. each type of target sample feature for each sample device is converted into one target sample feature vector, so that there may be M target sample feature vectors per sample device.

In step a400, after the feature vectors of the target samples of all the feature classes are determined, the device class classifier is obtained by training using the feature vectors of the target samples.

An initial device class classifier may be trained to obtain the trained device class classifier. During training, the M target sample feature vectors of one sample device are input to the initial device class classifier each time, with the calibrated device category of that sample device as the expected output; the initial device class classifier is trained and optimized in this way to obtain the required device class classifier.

The calibrated device types are target types and non-target types (such as video types and non-video types), and the device types of the collected sample devices can be calibrated by observing the appearance, the webpage content and the like of the sample webpage.

In one embodiment, in step a300, the selecting the target sample feature from all sample features belonging to the same feature class includes the following steps:

a301: acquiring category variables composed of the device categories of the S sample devices;

a302: constructing reference characteristic variables corresponding to the same sample characteristic in all sample equipment, and calculating the correlation between the reference characteristic variables and the category variables;

a303: and sequencing the sample characteristics according to the correlation, and selecting N sample characteristics with the top correlation as target sample characteristics, wherein N is more than 1.

The class variable is composed of the device classes of the S sample devices, and may be constructed in advance according to the device classes of the S sample devices.

Different sample devices have the same sample characteristics, the same sample characteristics in all the sample devices are used as a group of sample characteristics, corresponding reference characteristic variables can be constructed according to each group of sample characteristics, and the correlation between each reference characteristic variable and the category variable is calculated.

Correlation calculation methods include, but are not limited to, mutual information, the t-test, and the Pearson correlation coefficient. The category variable and the reference feature variable are both random variables, and mutual information measures the amount of information shared between two random variables. The larger the mutual information, the stronger the dependency between the two variables.

In the device class classification of this embodiment, the larger the mutual information between the feature variable and the class variable is, the larger the role of the sample feature on classification is likely to be, so that N sample features with the top correlation can be selected as target sample features, and further, the required target sample feature vector is determined to train and obtain the device class classifier.
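As an illustration of ranking sample features by mutual information with the category variable, the following sketch computes mutual information for discrete variables and keeps the top N features. It assumes the feature values have already been discretized (for example, to present/absent); the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log((c * n) / (count_x[x] * count_y[y]))
    return mi


def select_top_features(feature_columns, categories, n):
    """feature_columns maps a feature name to its values across all sample devices."""
    scores = {name: mutual_information(column, categories)
              for name, column in feature_columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


# Toy data: 1 = video device, 0 = non-video device.
categories = [1, 1, 0, 0]
feature_columns = {
    "video_tag": [1, 1, 0, 0],
    "div_tag": [1, 0, 1, 0],
    "form_tag": [1, 1, 1, 0],
}
print(select_top_features(feature_columns, categories, 2))
```

Continuous features such as normalized frequencies would need binning or a different estimator before this calculation applies.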

Of course, all sample characteristics of each sample device may be used as target sample characteristics of the sample device.

Optionally, after the sample features are sorted by correlation, four groups of sample feature vectors may be constructed from the top 10%, 30%, 50%, and 100% of sample features by correlation, respectively, where each group includes M target sample feature vectors for each sample device.

The initial device class classifier is then trained with each group of sample feature vectors and the calibrated device categories, yielding four trained device class classifiers. The device class classifier may employ a Logistic regression model. Based on the device class identification performance (precision, recall rate, F1-score, etc.) of the four trained device class classifiers, the percentage whose target sample feature vectors perform best is determined, and those vectors are used as the final target sample feature vectors (the attribute classifiers are trained with the final target sample feature vectors); the device class classifier with the best device class identification performance is used as the device class classifier for device class identification.

In one embodiment, the attribute classifier is trained by:

B100: for each feature class, training and optimizing at least two initial attribute classifiers using the target sample feature vectors corresponding to that feature class, and selecting the attribute classifier with the best attribute identification performance as the trained attribute classifier corresponding to the feature class.

The above step B100 may be performed after all the target sample feature vectors are determined.

On the basis of the target sample feature vectors obtained in the foregoing steps, at least two initial attribute classifiers are trained for each feature class, where the initial attribute classifier may be a multi-classifier, and includes: decision tree classifiers, OvR-SVM classifiers, CNN classifiers, etc.

After the initial attribute classifier corresponding to each feature class is trained and optimized, the attribute identification performances such as average Precision (Precision), average Recall rate (Recall), average F1 score and the like of the prediction results of the attribute classifiers can be compared through prediction, and the attribute classifier with the optimal attribute identification performance is used as the attribute classifier corresponding to the feature class. The attribute identification performance is optimal, for example, the average precision is highest, the average recall rate is highest, the average F1 score is highest, and/or the accuracy rate is highest, and the specific evaluation manner is not limited.

The average precision, average recall rate, and average F1 score are calculated in a manner similar to the precision, recall rate, and F1 score in a binary classification problem. Taking average precision as an example, suppose that one attribute classifier can classify ten attributes A1-A10. First, the precision of the attribute classifier on each attribute is calculated separately; for example, when calculating the precision on A1, A2-A10 are all regarded as non-A1, and the prediction results are substituted into the binary precision formula to obtain the precision on A1. Repeating this for every attribute yields 10 precision values. The 10 precision values are then summed, and the sum divided by 10 is taken as the average precision of the attribute classifier.
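The one-versus-rest averaging described above (often called macro-averaged precision) can be sketched as follows. The function name and the toy labels are illustrative, and treating the precision as 0 when an attribute is never predicted is an assumption not stated in the text.

```python
def average_precision(true_labels, predicted_labels, attributes):
    """Mean of the per-attribute binary precisions (one attribute versus the rest)."""
    precisions = []
    pairs = list(zip(true_labels, predicted_labels))
    for attribute in attributes:
        # Treat `attribute` as the positive class and all others as negative.
        true_positive = sum(1 for t, p in pairs if p == attribute and t == attribute)
        false_positive = sum(1 for t, p in pairs if p == attribute and t != attribute)
        predicted = true_positive + false_positive
        # Assumption: precision is 0 when the attribute is never predicted.
        precisions.append(true_positive / predicted if predicted else 0.0)
    return sum(precisions) / len(precisions)


true_labels = ["HK", "HK", "D1", "T1"]
predicted_labels = ["HK", "D1", "D1", "T1"]
print(average_precision(true_labels, predicted_labels, ["HK", "D1", "T1"]))
```

In this toy case the per-attribute precisions are 1.0 (HK), 0.5 (D1) and 1.0 (T1), so the average is 2.5 / 3.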

In the training of the device class classifier and the attribute classifier, the device class and the attributes of each sample device may be calibrated in advance. The calibration may use prior knowledge, for example: identifying keywords in the response messages of the sample device's open ports, identifying a screenshot of the sample device's home page, matching feature strings in the sample device's web page content, and the like.

Aiming at the characteristics that the web pages of equipment with different equipment types and different attributes are different in development technology and interface style, in the embodiment of the invention, the characteristics of multiple types of web pages are extracted from the source code information of the web pages, and the equipment types and attributes are identified by using a machine learning classification algorithm. From the training perspective of the classifier, compared with a related mode based on manual searching of device fingerprints, the method can automatically extract the webpage features of devices with different types and attributes, and greatly reduces the manual workload.

The present invention also provides a device identification apparatus. Referring to fig. 2, the device identification apparatus 100 includes:

a source code information obtaining module 101, configured to obtain source code information of a target web page, where the target web page is related to a target device;

a web page feature extraction module 102, configured to extract a web page feature used for representing the target web page from the source code information;

a target device classification module 103, configured to classify the target device according to the webpage features;

and the attribute identification module 104 is configured to identify attribute information of the target device according to the web page feature of the target device and preset attribute tag data when the target device is determined to belong to the target category according to the web page feature.

In an embodiment, when the target device classification module classifies the target device according to the web page feature, the target device classification module is specifically configured to: classifying the target device into a video device or a non-video device according to the webpage characteristics;

In one embodiment, the extracted web page features are divided into M feature categories, where M is equal to 1 or greater than 1; the target device classification module includes:

In one embodiment, the attribute identification module includes:

In one embodiment, the attribute information includes attribute probability, the attribute probability represents a probability value that the target device belongs to each specified attribute, and the number of the specified attributes is greater than 1;

the reference probability value calculating unit is used for multiplying each probability value output by each attribute classifier by a preset weight corresponding to the attribute classifier to obtain a reference probability value of each designated attribute of the attribute classifier;

In one embodiment, the device class classifier is trained by a first training module comprising:

In one embodiment, the target sample feature extracting unit includes:

In one embodiment, the attribute classifier is trained by a second training module comprising:

and the second training unit is used for training and optimizing at least two initial attribute classifiers by using the target sample feature vector corresponding to each feature class according to each feature class, and selecting the attribute classifier with the optimal attribute identification performance to determine the attribute classifier as the trained attribute classifier corresponding to the feature class.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units.

The invention also provides an electronic device, which comprises a processor and a memory; the memory stores a program that can be called by the processor; wherein, when the processor executes the program, the device identification method according to the foregoing embodiment is implemented.

The embodiment of the device identification device can be applied to electronic equipment. Taking a software implementation as an example, as a logical device, the device is formed by reading, by a processor of the electronic device where the device is located, a corresponding computer program instruction in the nonvolatile memory into the memory for operation. From a hardware aspect, as shown in fig. 4, fig. 4 is a hardware structure diagram of an electronic device where the device identification apparatus 100 is located according to an exemplary embodiment of the present invention, and except for the processor 510, the memory 530, the interface 520, and the nonvolatile memory 540 shown in fig. 4, the electronic device where the apparatus 100 is located in the embodiment may also include other hardware according to an actual function of the electronic device, which is not described again.

The present invention also provides a machine-readable storage medium on which a program is stored, which, when executed by a processor, implements the device identification method as described in the foregoing embodiments.

The present invention may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Machine-readable storage media include permanent and non-permanent, removable and non-removable media, and the storage of information may be accomplished by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of machine-readable storage media include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that may be used to store information accessible by a computing device.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A device identification method, comprising:

extracting, from the source code information, webpage features that characterize the target webpage; the extracted webpage features are divided into M feature categories, wherein M is greater than or equal to 1, and the M feature categories comprise webpage tag statistical features, link address statistical features and text content features;

classifying, according to the webpage features, whether the target device is a video device, including: determining a feature vector corresponding to the webpage features belonging to each feature category; and inputting each feature vector into a trained device category classifier, which identifies the device category of the target device as video device or non-video device according to the input feature vectors;

when the target device is determined to belong to the target category according to the webpage features, identifying attribute information of the target device according to the webpage features of the target device and preset attribute tag data; wherein the target category is video device.
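By way of illustration only, the feature extraction and classification steps recited above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the vocabularies `TAGS` and `WORDS`, the two link statistics, and the linear scorer `classify` with hand-set weights are all hypothetical stand-ins, since the claim fixes neither the concrete statistics nor the type of trained classifier.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical vocabularies; the claim does not fix which tags or words are counted.
TAGS = ("script", "img", "input", "video", "object")
WORDS = ("camera", "login", "video", "nvr")


class PageScanner(HTMLParser):
    """Collects tag counts, link addresses and text nodes from page source."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_feature_vectors(source):
    """Return one feature vector per feature category (here M = 3)."""
    scanner = PageScanner()
    scanner.feed(source)
    scanner.close()
    words = Counter(" ".join(scanner.text_parts).lower().split())
    external = sum(
        1 for link in scanner.links if link.startswith(("http://", "https://"))
    )
    return {
        "tag_stats": [scanner.tag_counts[t] for t in TAGS],
        "link_stats": [len(scanner.links), external],
        "text_content": [words[w] for w in WORDS],
    }


def classify(vectors, weights, bias):
    """Stand-in linear scorer; a real system would use a trained classifier."""
    flat = vectors["tag_stats"] + vectors["link_stats"] + vectors["text_content"]
    score = bias + sum(w * x for w, x in zip(weights, flat))
    return "video device" if score > 0 else "non-video device"


# Hypothetical weights: only the "video"/"object" tags and three words contribute.
WEIGHTS = [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
BIAS = -1.5

page = (
    '<html><head><script src="/js/app.js"></script></head><body>'
    "<h1>Network Camera Login</h1>"
    '<img src="http://example.com/logo.png"><input name="user">'
    '<a href="/live">Live video</a></body></html>'
)
vectors = extract_feature_vectors(page)
label = classify(vectors, WEIGHTS, BIAS)
```

In this sketch each feature category yields its own vector, and the vectors are concatenated only at the scoring step. Attribute identification against the preset attribute tag data would be attempted only for pages labelled as video devices.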

2. The device identification method of claim 1, wherein identifying the attribute information of the target device according to the web page feature of the target device and preset attribute tag data comprises:

3. The device identification method of claim 2,

the attribute information comprises attribute probabilities, which represent the probability that the target device belongs to each designated attribute, and the number of designated attributes is greater than 1;

4. An apparatus for identifying a device, comprising:

the webpage feature extraction module is used for extracting, from the source code information, webpage features that characterize the target webpage; the extracted webpage features are divided into M feature categories, wherein M is greater than or equal to 1, and the M feature categories comprise webpage tag statistical features, link address statistical features and text content features;

the target device classification module is used for classifying, according to the webpage features, whether the target device is a video device, including: determining a feature vector corresponding to the webpage features belonging to each feature category; and inputting each feature vector into a trained device category classifier, which identifies the device category of the target device as video device or non-video device according to the input feature vectors;

the attribute identification module is used for identifying the attribute information of the target device according to the webpage features of the target device and preset attribute tag data when the target device is determined to belong to the target category according to the webpage features; wherein the target category is video device.

5. The device identification apparatus according to claim 4, wherein the extracted web page features are divided into M feature categories, M is equal to 1 or greater than 1, the M feature categories include web page tag statistical features, link address statistical features, and text content features; the target device classification module includes:

6. An electronic device comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the device identification method of any one of claims 1-3.

7. A machine-readable storage medium, having stored thereon a program which, when executed by a processor, implements the device identification method according to any one of claims 1 to 3.