CN112134920B - File identification method and device - Google Patents


Publication number
CN112134920B
Authority
CN
China
Prior art keywords
network configuration
ssid
word
configuration file
keyword library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010806391.1A
Other languages
Chinese (zh)
Other versions
CN112134920A (en
Inventor
程柯楠
王浩
公鹏耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Information Technologies Co Ltd
Original Assignee
New H3C Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Technologies Co Ltd
Priority to CN202010806391.1A
Publication of CN112134920A
Application granted
Publication of CN112134920B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/2866 Architectures; Arrangements
    • H04L67/30 Profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a file identification method and device. The method comprises: acquiring the AP identifier and the SSID in each network configuration file; for each network configuration file, sequentially segmenting the AP identifiers in the file according to a first word segmentation rule set for AP identifiers, and sequentially segmenting the SSIDs in the file according to a second word segmentation rule set for SSIDs; matching each word in the word segmentation result of each AP identifier in the file against an AP identifier keyword library one by one; if no word matches the AP identifier keyword library, matching each word in the word segmentation result of each SSID in the file against an SSID keyword library one by one; and if a word matches the SSID keyword library, identifying the file as a network configuration file of an education industry user. The application can improve identification efficiency.

Description

File identification method and device
Technical Field
The present application relates to the field of technologies, and in particular, to a file identification method and apparatus.
Background
Currently, the education industry is one of the industries in which wireless networks are widely applied, and a wireless service provider needs to analyze relevant network configuration files in order to provide more efficient network operation and maintenance and higher-quality user experience.
Typically, the network configuration files of users that a technician or developer obtains are exported from a cloud platform or an Access Controller (AC). Their file names mostly follow a pattern such as SHGEXY_20200809_config_ac.cfg, mixing Chinese names, English names, and acronyms. As a result, anyone other than the person who named a file needs a long time to work out which network configuration files belong to education industry users, so identification efficiency is low.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides a file identification method and a file identification device.
According to a first aspect of embodiments of the present application, there is provided a file identification method, including:
respectively acquiring an Access Point (AP) Identifier and a Service Set Identifier (SSID) in each network configuration file from the plurality of acquired network configuration files;
for each network configuration file, sequentially segmenting the AP identification in the network configuration file according to a first segmentation rule set for the AP identification, and sequentially segmenting the SSID in the network configuration file according to a second segmentation rule set for the SSID;
matching each word in the word segmentation result of each AP identifier in the network configuration file with an AP identifier keyword library one by one;
if no words matched with the AP identification keyword library exist, matching each word in the word segmentation result of each SSID in the network configuration file with the SSID keyword library one by one;
if words matched with the SSID keyword library exist, the network configuration file is identified as a network configuration file of a user in the education industry;
the AP identification keyword library and the SSID keyword library are generated according to sample network configuration files of users in education industry and non-education industry.
According to a second aspect of embodiments of the present application, there is provided a document identification apparatus, the apparatus comprising:
the acquisition module is used for respectively acquiring the AP identification and the SSID in each network configuration file from the plurality of acquired network configuration files;
the word segmentation module is used for sequentially segmenting the AP identifications in the network configuration files according to a first word segmentation rule set for the AP identifications and sequentially segmenting the SSIDs in the network configuration files according to a second word segmentation rule set for the SSIDs aiming at each network configuration file;
the matching module is used for matching each word in the word segmentation result of each AP identifier in the network configuration file with the AP identifier keyword library one by one; if no words matched with the AP identification keyword library exist, matching each word in the word segmentation result of each SSID in the network configuration file with the SSID keyword library one by one;
the identification module is used for identifying the network configuration file as a network configuration file of a user in the education industry if the matching result of the matching module is that words matched with the SSID keyword library exist;
the AP identification keyword library and the SSID keyword library are generated according to sample network configuration files of users in education industry and non-education industry.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
in the embodiment of the application, an AP identification keyword library and an SSID keyword library are obtained according to sample network configuration files of users in the education industry and the non-education industry, then, words are only needed to be segmented for the AP identification and the SSID in each obtained network configuration file, and according to matching results of words in respective word segmentation results and respective corresponding keyword libraries, which network configuration files are the network configuration files of the users in the education industry can be identified. The identification mode can realize automatic identification of the network configuration file without manual participation, thereby greatly improving the identification efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a file identification method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a file identification apparatus according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Next, examples of the present application will be described in detail.
An embodiment of the present application provides a file identification method, which may be applied to a wireless device such as an AC or a fat AP, and as shown in fig. 1, the method may include the following steps:
S11: Acquire the AP identifier and the SSID in each network configuration file from the plurality of acquired network configuration files.
S12: For each network configuration file, sequentially segment the AP identifiers in the file according to the first word segmentation rule set for AP identifiers, and sequentially segment the SSIDs in the file according to the second word segmentation rule set for SSIDs.
S13: Match each word in the word segmentation result of each AP identifier in the network configuration file against the AP identifier keyword library one by one.
S14: If no word matches the AP identifier keyword library, match each word in the word segmentation result of each SSID in the network configuration file against the SSID keyword library one by one.
S15: If a word matches the SSID keyword library, identify the network configuration file as a network configuration file of an education industry user.
In the embodiment of the application, the AP identification keyword library and the SSID keyword library are both generated according to sample network configuration files of users in education industry and non-education industry.
Specifically, the AP identification keyword library and the SSID keyword library may be generated by:
dividing sample network configuration files of educational industry users and non-educational industry users into training sets and testing sets, wherein the training sets and the testing sets comprise sample network configuration files of the educational industry users and the non-educational industry users;
respectively acquiring an AP (access point) identifier and an SSID (service set identifier) in each sample network configuration file from the sample network configuration files of education industry users in the training set;
according to a first word segmentation rule, sequentially segmenting the obtained AP identifications, and according to a second word segmentation rule, sequentially segmenting the obtained SSID;
for the word segmentation result of each of the two identifiers (the obtained AP identifiers and the obtained SSIDs), counting the words in that result based on a Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to obtain the term frequency of each word, and sorting the obtained term frequencies from high to low;
adding to the keyword library corresponding to that identifier the words corresponding to the first M term frequencies in the sorted list, excluding the term frequencies of words that represent a designated meaning, wherein M is a positive integer;
selecting, from the words corresponding to the remaining term frequencies in the sorted list, words that have the same meaning as words already in the keyword library, and adding the selected words to the keyword library;
based on an Inverse Document Frequency (IDF) algorithm, processing the words in the current keyword library together with the sample network configuration files of non-education industry users, whose identifiers have been segmented according to the word segmentation rule set for that identifier, to obtain an IDF value for each word in the current keyword library; here, the current keyword library is the keyword library obtained after the selected words have been added;
sorting the calculated IDF values from small to large and deleting the words corresponding to the first N IDF values from the current keyword library, wherein N is a positive integer; here, the current keyword library is still the keyword library obtained after the selected words have been added;
for each sample network configuration file in the test set, according to a first word segmentation rule, sequentially segmenting the AP identification obtained from the sample network configuration file, and according to a second word segmentation rule, sequentially segmenting the SSID obtained from the sample network configuration file;
matching each word in the word segmentation result of each AP identifier in the sample network configuration file against the current keyword library corresponding to the AP identifier one by one, and stopping as soon as a matching word is found; here, the current keyword library corresponding to the AP identifier is the one obtained after the selected words have been added to it and the words corresponding to the N IDF values have been deleted;
if a word matches the current keyword library corresponding to the AP identifier, identifying the sample network configuration file as a network configuration file of an education industry user;
if no word matches the current keyword library corresponding to the AP identifier, matching each word in the word segmentation result of each SSID in the sample network configuration file against the current keyword library corresponding to the SSID one by one, and stopping as soon as a matching word is found; here, the current keyword library corresponding to the SSID is the one obtained after the selected words have been added to it and the words corresponding to the N IDF values have been deleted;
if the words matched with the current keyword library corresponding to the SSID exist, identifying the sample network configuration file as a network configuration file of the user in the education industry;
counting the total number of the identified sample network configuration files as the network configuration files of the users in the education industry;
and if the ratio of the counted total to the actual total number of network configuration files of education industry users in the test set is not greater than a set ratio, increasing the value of M and returning to the step of adding, for the word segmentation result of each of the obtained AP identifiers and the obtained SSIDs, the words corresponding to the first M term frequencies in the sorted list to the keyword library corresponding to that identifier; once the ratio is not less than the set ratio, determining the current keyword library corresponding to the AP identifier as the AP identifier keyword library and the current keyword library corresponding to the SSID as the SSID keyword library.
It should be noted that the above process of generating the AP identification keyword library and the SSID keyword library may be divided into a training process and a testing process.
In particular, the network configuration files of non-education industry users in the training set are used in the training process only when the IDF algorithm is applied. The sample network configuration files are generally divided between the training set and the test set in a ratio of 3:1, although the application does not limit the specific ratio.
In the training process, when the obtained AP identifiers are segmented in sequence according to the first segmentation rule, the specific segmentation process is as follows: for each AP identifier, segmenting the AP identifier according to the configuration rule of the AP identifier to obtain an initial segmentation result; then, for each word in the initial word segmentation result, if the word comprises a number, deleting the number in the word, and if only 1 character remains in the deleted word, deleting the word; and taking the finally obtained word segmentation result as the word segmentation result of the AP identifier.
For example, assume that the configuration rule for the identifier of a certain AP is: a case-sensitive character string of 1 to 64 characters, which may include letters, numbers, underscores, "[", "]", "/" and "-". If the AP identifier of that AP is dcda-8099-8500, then after the word segmentation process is executed, the word segmentation result of the AP identifier is dcda.
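The segmentation step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `segment_identifier` is hypothetical, and treating "_", "[", "]", "/" and "-" as separators is an assumption, since the text only lists which characters an identifier may contain. Words left empty after digit removal are dropped along with one-character words, which agrees with the dcda-8099-8500 example.

```python
import re

# Assumed separators: the source does not state which characters split an identifier.
_SEPARATORS = re.compile(r"[_\[\]/\-]+")
_DIGITS = re.compile(r"\d")


def segment_identifier(identifier):
    """Split an identifier, strip digits from each word, drop words of 0 or 1 characters."""
    words = []
    for raw in _SEPARATORS.split(identifier):
        word = _DIGITS.sub("", raw)  # delete the numbers inside the word
        if len(word) > 1:            # discard words with at most one character left
            words.append(word)
    return words


print(segment_identifier("dcda-8099-8500"))  # ['dcda']
```

Case is preserved because the configuration rule describes identifiers as case-sensitive.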
Similarly, when the obtained SSIDs are segmented in sequence according to the second segmentation rule, the specific segmentation process is as follows: for each SSID, segmenting the SSID according to the configuration rule of the SSID to obtain an initial segmentation result; then, for each word in the initial word segmentation result, if the word comprises a number, deleting the number in the word, and if only 1 character remains in the deleted word, deleting the word; and taking the finally obtained word segmentation result as the word segmentation result of the SSID.
In the training process, when the words corresponding to the first M word frequencies in the ordered word frequencies except for the word frequency of the word representing the designated meaning are added to the keyword library corresponding to the identifier, if the identifier is an AP identifier, the word representing the designated meaning may refer to a wireless term unrelated to the education industry, such as "AP", "radio", and the like; if the identification is SSID, the words that characterize the specified meaning may refer to common words unrelated to the educational industry or to acronyms in the wireless field, such as "test", "st", "cmcc", "net", and the like.
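A rough Python sketch of the top-M selection with excluded words is shown below. It assumes that "term frequency" means a raw occurrence count over all segmented identifiers; the helper name `top_m_keywords` and the sample data are hypothetical.

```python
from collections import Counter


def top_m_keywords(segmented_identifiers, m, excluded):
    """Return the m most frequent words, skipping words with a designated meaning."""
    counts = Counter(word for words in segmented_identifiers for word in words)
    keywords = []
    for word, _count in counts.most_common():  # sorted from high to low frequency
        if len(keywords) >= m:
            break
        if word not in excluded:
            keywords.append(word)
    return keywords


# Hypothetical segmentation results of three AP identifiers.
segmented = [["jxl", "ap"], ["jxl", "edu"], ["edu", "jxl", "radio"]]
print(top_m_keywords(segmented, m=2, excluded={"ap", "radio"}))  # ['jxl', 'edu']
```

Synonyms that fall outside the first M entries (for example other spellings of the same education-related word) would still need to be added separately, as the text describes.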
Subsequently, when the words in the current keyword library are processed, based on the IDF algorithm, together with the sample network configuration files of non-education industry users whose identifiers have been segmented according to the word segmentation rule set for that identifier, the specific procedure is as follows: for each word in the current keyword library, count the total number of sample network configuration files of non-education industry users (the first number) and the number of those files that include the word (the second number); then divide the first number by the second number plus 1; finally, take the logarithm of the quotient to obtain the IDF value of the word.
The larger the IDF value, the better the word distinguishes between categories; the smaller the IDF value, the weaker its ability to distinguish between categories. Therefore, to improve recognition accuracy, the words corresponding to the first N IDF values (the N smallest) are removed from the current keyword library.
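The IDF calculation and the pruning of the N lowest-scoring words might look like the following sketch. The natural logarithm is an assumption, because the text does not name a base; the function names and the sample files are hypothetical, and the set of non-education files is assumed to be non-empty.

```python
import math


def idf(word, non_edu_files):
    """log(first number / (second number + 1)), as described in the text."""
    total = len(non_edu_files)                                    # first number
    containing = sum(1 for words in non_edu_files if word in words)  # second number
    return math.log(total / (containing + 1))


def prune_lowest_idf(keywords, non_edu_files, n):
    """Drop the n keywords with the smallest IDF values."""
    ranked = sorted(keywords, key=lambda w: idf(w, non_edu_files))  # small to large
    return set(ranked[n:])


# Hypothetical segmented identifiers from four non-education files.
non_edu = [{"office", "guest"}, {"guest", "lobby"}, {"guest"}, {"edu"}]
print(prune_lowest_idf(["guest", "edu", "jxl"], non_edu, n=1))
```

Here "guest" appears in three of the four non-education files, so its IDF is log(4/4) = 0 and it is the word removed.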
In the training process, whether in the word segmentation results of the AP identifiers or of the SSIDs, some words are written differently but carry the same meaning and clearly characterize the education industry, for example jiaoxuelou, jiao, jxl, edu, education, stu and sdu. Because the spellings differ, some of these words have a high term frequency and are already in the corresponding keyword library, while others have a low term frequency and are not. To improve recognition accuracy, the words that are not yet in the corresponding keyword library are also added to it.
It should be further noted that, in the training process, the values of M and N may be uniformly set without distinguishing the AP identifier and the SSID; it is also possible to distinguish between the AP identification and the SSID, and set them separately for either one of them.
In the testing process, when the ratio of the counted total number to the total number of the network configuration files of the users in the real education industry in the test set is not greater than the set ratio, the value of M may be increased according to a preset adjustment rule, for example, according to a multiple of the value of M.
After the value of M is increased, the process returns, for the word segmentation results of both the obtained AP identifiers and the obtained SSIDs, to the step of adding the words corresponding to the first M term frequencies in the sorted list to the keyword library corresponding to the identifier; that is, training and testing continue until the ratio is not less than the set ratio, for example 0.9. The current keyword library corresponding to the AP identifier used in the last testing pass is then determined as the AP identifier keyword library, and the current keyword library corresponding to the SSID is determined as the SSID keyword library. Network configuration files of education industry users can subsequently be identified automatically from these two keyword libraries, which improves both identification efficiency and identification accuracy.
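The train-and-test loop can be sketched as below. The callables `build_library` and `identified_ratio` stand in for the training and testing passes and are hypothetical, as are the doubling factor and the `m_max` cap; the cap is added here only so that the loop always terminates and is not part of the described method.

```python
def tune_m(build_library, identified_ratio, m_start, target=0.9, factor=2, m_max=1024):
    """Grow M by a fixed multiple until the identified ratio reaches the target."""
    m = m_start
    while True:
        library = build_library(m)         # training pass with the current M
        ratio = identified_ratio(library)  # testing pass
        if ratio >= target or m * factor > m_max:
            return m, library, ratio
        m *= factor


# Toy data: words ranked by term frequency, and three education-industry test files.
ranked_words = ["jxl", "edu", "stu", "lib"]
test_files = [{"stu"}, {"lib"}, {"jxl"}]


def build(m):
    return set(ranked_words[:m])


def ratio_of_identified(library):
    hits = sum(1 for words in test_files if words & library)
    return hits / len(test_files)


best_m, best_library, best_ratio = tune_m(build, ratio_of_identified, m_start=1)
print(best_m, best_ratio)  # 4 1.0
```

In this toy run M grows from 1 to 2 to 4 before every test file is identified.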
Further, in the embodiment of the present application, in step S13, before performing matching with the AP identification keyword library one by one, the following operations may be further performed:
Sort the words in the word segmentation result of each AP identifier in the network configuration file by number of occurrences, from high to low.
In step S14, before matching with the SSID keyword library one by one, the following operations may be further performed:
Sort the words in the word segmentation result of each SSID in the network configuration file by number of occurrences, from high to low.
Of course, the two operation flows can also be applied to the test flow to improve the identification accuracy.
The subsequent matching against the AP identifier keyword library and the SSID keyword library may be performed in the same way as the corresponding matching in the testing process: whichever keyword library is being matched one by one, once a matching word is found, the remaining words are not matched.
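Sorting by occurrence count and stopping at the first hit can be sketched as follows. The function name `first_match` is hypothetical, and the keyword library is assumed to be a Python set.

```python
from collections import Counter


def first_match(words, keyword_library):
    """Try the most frequent words first and stop at the first keyword hit."""
    for word, _count in Counter(words).most_common():  # high to low occurrence
        if word in keyword_library:
            return word  # remaining words are not matched
    return None


print(first_match(["ap", "jxl", "ap", "jxl", "jxl", "edu"], {"edu", "jxl"}))  # jxl
```

Checking frequent words first means a file dominated by one education-related word is usually settled after a single lookup.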
Further, in the embodiment of the present application, after the step S13 is executed, the following operations may be further executed:
After each word in the word segmentation result of each AP identifier in the network configuration file has been matched against the AP identifier keyword library one by one, if a word matches the AP identifier keyword library, the network configuration file is identified as a network configuration file of an education industry user.
In the embodiment of the present application, after the step S14 is executed, the following operations may also be executed:
If no word matches the AP identifier keyword library and no word matches the SSID keyword library, the network configuration file is identified as a network configuration file of a non-education industry user.
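Taken together, steps S13 to S15 and the two additional outcomes form a two-stage decision, sketched below under the assumption that the identifiers have already been segmented into word lists. The function name `classify_profile` and the label strings are hypothetical.

```python
def classify_profile(ap_words, ssid_words, ap_keywords, ssid_keywords):
    """Check AP identifier words first; fall back to SSID words only if none match."""
    if any(word in ap_keywords for word in ap_words):
        return "education"
    if any(word in ssid_keywords for word in ssid_words):
        return "education"
    return "non-education"


print(classify_profile(["dcda"], ["edu", "wifi"], {"jxl"}, {"edu"}))  # education
```

Because `any` short-circuits, each stage stops at the first matching word.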
According to the technical scheme, the AP identification keyword library and the SSID keyword library are obtained according to the sample network configuration files of users in the education industry and the non-education industry, then, the words of the AP identification and the SSID in each obtained network configuration file are segmented, and according to the matching results of the words in the respective segmentation results and the respective corresponding keyword libraries, which network configuration files are the network configuration files of the users in the education industry can be identified. The identification mode can realize automatic identification of the network configuration file without manual participation, thereby greatly improving the identification efficiency.
Based on the same inventive concept, the present application further provides a file identification apparatus, which is applied to wireless devices such as an AC or fat AP, and a schematic structural diagram of the apparatus is shown in fig. 2, and specifically includes:
an obtaining module 21, configured to obtain, from the obtained multiple network configuration files, an AP identifier and an SSID in each network configuration file respectively;
a word segmentation module 22, configured to, for each network configuration file, sequentially perform word segmentation on the AP identifiers in the network configuration file according to a first word segmentation rule set for the AP identifiers, and sequentially perform word segmentation on the SSIDs in the network configuration file according to a second word segmentation rule set for the SSIDs;
the matching module 23 is configured to match each word in the word segmentation result of each AP identifier in the network configuration file with the AP identifier keyword library one by one; if no words matched with the AP identification keyword library exist, matching each word in the word segmentation result of each SSID in the network configuration file with the SSID keyword library one by one;
the identification module 24 is configured to identify the network configuration file as a network configuration file of a user in the education industry if the matching result of the matching module 23 is that a word matching with the SSID keyword library exists;
the AP identification keyword library and the SSID keyword library are generated according to sample network configuration files of users in education industry and non-education industry.
Preferably, the apparatus further comprises:
a generating module (not shown in FIG. 2) for generating the AP identification keyword library and the SSID keyword library by:
dividing the sample network configuration files of the educational industry users and the non-educational industry users into a training set and a testing set, wherein the training set and the testing set comprise the sample network configuration files of the educational industry users and the non-educational industry users;
respectively acquiring an AP (access point) identifier and an SSID (service set identifier) in each sample network configuration file from the sample network configuration files of the education industry users in the training set;
according to the first word segmentation rule, sequentially segmenting the obtained AP identification, and according to the second word segmentation rule, sequentially segmenting the obtained SSID;
for the word segmentation result of each of the two identifiers (the obtained AP identifiers and the obtained SSIDs), counting the words in that result based on a TF-IDF algorithm to obtain the term frequency of each word, and sorting the obtained term frequencies from high to low;
adding to the keyword library corresponding to that identifier the words corresponding to the first M term frequencies in the sorted list, excluding the term frequencies of words that represent a designated meaning, wherein M is a positive integer;
selecting, from the words corresponding to the remaining term frequencies in the sorted list, words that have the same meaning as words already in the keyword library, and adding the selected words to the keyword library;
based on the IDF algorithm, processing the words in the current keyword library together with the sample network configuration files of non-education industry users, whose identifiers have been segmented according to the word segmentation rule set for that identifier, to obtain an IDF value for each word in the current keyword library;
sorting the calculated IDF values from small to large and deleting the words corresponding to the first N IDF values from the current keyword library, wherein N is a positive integer;
for each sample network configuration file in the test set, according to the first word segmentation rule, sequentially segmenting the AP identification obtained from the sample network configuration file, and according to the second word segmentation rule, sequentially segmenting the SSID obtained from the sample network configuration file;
matching each word in the word segmentation result of each AP identifier in the sample network configuration file with the current keyword library corresponding to the AP identifier one by one;
if a word matching the current keyword library corresponding to the AP identifier exists, identifying the sample network configuration file as a network configuration file of a user in the education industry;
if the word matched with the current keyword library corresponding to the AP identifier does not exist, matching each word in the word segmentation result of each SSID in the sample network configuration file with the current keyword library corresponding to the SSID one by one;
if the words matched with the current keyword library corresponding to the SSID exist, identifying the sample network configuration file as a network configuration file of the user in the education industry;
counting the total number of the identified sample network configuration files as the network configuration files of users in the education industry;
if the ratio of the counted total number to the total number of network configuration files of actual education-industry users in the test set is not greater than a set ratio, increasing the value of M and returning to the step of adding the words corresponding to the first M word frequencies in the sorted word frequencies to the keyword library corresponding to the identifier type, until the ratio is not less than the set ratio; then determining the current keyword library corresponding to the AP identifier as the AP identification keyword library, and determining the current keyword library corresponding to the SSID as the SSID keyword library.
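The generation steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the application: the function names `build_keyword_library` and `tune_m`, the `excluded` set (standing in for words that represent a designated meaning), the smoothed IDF formula, and the `max_m` safety cap are all hypothetical. The step of adding words with the same meaning is omitted because it requires semantic judgment that the text does not specify.

```python
import math
from collections import Counter


def build_keyword_library(edu_docs, non_edu_docs, m, n, excluded=frozenset()):
    """Build one keyword library (for AP identifiers or for SSIDs).

    edu_docs / non_edu_docs: one list of segmented words per sample file.
    excluded: hypothetical set of words that represent a designated meaning.
    """
    # Word frequency over the education-industry samples, highest first.
    tf = Counter(word for doc in edu_docs for word in doc)
    ranked = [word for word, _ in tf.most_common() if word not in excluded]
    library = set(ranked[:m])

    # IDF of each library word over the non-education samples.
    # The +1 smoothing is an assumption; the text does not give a formula.
    total = len(non_edu_docs)
    idf = {
        word: math.log((1 + total) / (1 + sum(word in doc for doc in non_edu_docs)))
        for word in library
    }

    # Delete the n words with the smallest IDF, i.e. the words that are
    # most common in non-education files.
    for word in sorted(library, key=lambda w: (idf[w], w))[:n]:
        library.discard(word)
    return library


def tune_m(train_edu, train_non_edu, test_set, m, n, target_ratio, max_m=500):
    """Increase M until the identified ratio on the test set reaches the target.

    Training samples are (ap_words, ssid_words) pairs; test samples are
    (ap_words, ssid_words, is_education) triples.
    """
    real = sum(1 for _, _, is_edu in test_set if is_edu)
    while True:
        ap_lib = build_keyword_library(
            [s[0] for s in train_edu], [s[0] for s in train_non_edu], m, n)
        ssid_lib = build_keyword_library(
            [s[1] for s in train_edu], [s[1] for s in train_non_edu], m, n)
        # Count every test file identified as education: AP words first,
        # SSID words only if no AP word matched.
        identified = sum(
            1 for ap_words, ssid_words, _ in test_set
            if any(w in ap_lib for w in ap_words)
            or any(w in ssid_lib for w in ssid_words)
        )
        ratio = identified / real if real else 0.0
        if ratio >= target_ratio or m >= max_m:  # max_m is a safety cap
            return ap_lib, ssid_lib, m
        m += 1
```

As in the text, the ratio divides the number of files identified as education-industry files by the number of actual education-industry files in the test set, so it is not a recall value in the strict sense.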
Preferably, the apparatus further comprises:
a first sorting module (not shown in FIG. 2), configured to sort, before the matching module 23 matches with the AP identification keyword library one by one, the words in the word segmentation result of each AP identification in the network configuration file in descending order of number of occurrences;
the apparatus further comprises:
a second sorting module (not shown in FIG. 2), configured to sort, before the matching module 23 matches with the SSID keyword library one by one, the words in the word segmentation result of each SSID in the network configuration file in descending order of number of occurrences.
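The sorting performed by the two sorting modules can be illustrated with a short Python sketch. The function name `order_by_occurrence` and the sample words are hypothetical. The sketch assumes that occurrences are counted across all identifiers of the same type in one network configuration file and that each distinct word is then matched once.

```python
from collections import Counter


def order_by_occurrence(segmented_identifiers):
    """Return the distinct words of one file, most frequent first.

    segmented_identifiers: one list of words per AP identifier (or per SSID).
    Words with equal counts keep their order of first appearance.
    """
    counts = Counter(word for words in segmented_identifiers for word in words)
    return [word for word, _ in counts.most_common()]


# Hypothetical segmented AP identifiers from one configuration file.
ordered = order_by_occurrence([
    ["lib", "ap", "1"],
    ["lib", "ap", "2"],
    ["lib", "dorm", "3"],
])
```

Here `ordered` starts with "lib" and then "ap". Matching frequent words first presumably lets the one-by-one matching stop earlier when a file does contain a keyword, although the text does not state this rationale.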
Preferably, the identification module 24 is further configured to:
after the matching module 23 matches with the AP identification keyword library one by one, if the matching result of the matching module 23 is that there is a word matching with the AP identification keyword library, the network configuration file is identified as the network configuration file of the user in the education industry.
Preferably, the identification module 24 is further configured to:
after the matching module 23 matches with the SSID keyword library one by one, if the matching result of the matching module 23 is that there is no word matching with the SSID keyword library, the network configuration file is identified as the network configuration file of the non-education industry user.
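The matching order handled by the matching module and the identification module, with AP identifications checked first and SSIDs checked only when no AP word matches, can be sketched as follows. The `segment` function is a hypothetical stand-in for the first and second word segmentation rules, which this passage does not define; a single rule is used here for both identifier types purely for brevity.

```python
import re


def segment(identifier):
    """Hypothetical segmentation rule: lower-case, split on non-word characters and underscores."""
    return [word for word in re.split(r"[\W_]+", identifier.lower()) if word]


def identify(ap_identifiers, ssids, ap_library, ssid_library):
    """Classify one network configuration file as education or non-education."""
    # First match the words of every AP identifier against the AP library.
    for ap in ap_identifiers:
        if any(word in ap_library for word in segment(ap)):
            return "education"
    # Only if no AP word matched, match the SSID words against the SSID library.
    for ssid in ssids:
        if any(word in ssid_library for word in segment(ssid)):
            return "education"
    # No match in either library.
    return "non-education"


# Hypothetical keyword libraries and identifiers.
result = identify(["Office-AP-1"], ["EDU_Guest"], {"campus"}, {"edu"})
```

In this example no AP word matches, the SSID word "edu" does, and `result` is "education".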
According to the above technical solution, the AP identification keyword library and the SSID keyword library are obtained from sample network configuration files of users in the education industry and the non-education industry. The AP identifications and SSIDs in each acquired network configuration file are then segmented into words, and the network configuration file is identified according to the results of matching the words in each word segmentation result against the corresponding keyword library. This identification mode identifies network configuration files automatically, without manual participation, and thereby greatly improves identification efficiency.
An embodiment of the present application further provides an electronic device which, as shown in FIG. 3, includes a processor 31 and a machine-readable storage medium 32, where the machine-readable storage medium 32 stores machine-executable instructions executable by the processor 31, and the machine-executable instructions cause the processor 31 to implement the file identification method described above.
The machine-readable storage medium may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Alternatively, the machine-readable storage medium may be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, in which a computer program is stored, which when executed by a processor implements the steps of the above-mentioned file identification method.
The above description is only a preferred embodiment of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (8)

1. A file identification method, the method comprising:
respectively acquiring an Access Point (AP) identifier and a Service Set Identifier (SSID) in each network configuration file from the acquired network configuration files;
for each network configuration file, sequentially segmenting the AP identifications in the network configuration file according to a first segmentation rule set for the AP identifications, and sequentially segmenting the SSIDs in the network configuration file according to a second segmentation rule set for the SSIDs;
matching each word in the word segmentation result of each AP identifier in the network configuration file with an AP identifier keyword library one by one;
if no words matched with the AP identification keyword library exist, matching each word in the word segmentation result of each SSID in the network configuration file with the SSID keyword library one by one;
if words matched with the SSID keyword library exist, the network configuration file is identified as a network configuration file of a user in the education industry;
the AP identification keyword library and the SSID keyword library are both generated according to sample network configuration files of users in education industry and non-education industry;
generating the AP identification keyword library and the SSID keyword library by:
dividing the sample network configuration files of the educational industry users and the non-educational industry users into a training set and a testing set, wherein the training set and the testing set respectively comprise the sample network configuration files of the educational industry users and the non-educational industry users;
respectively acquiring an AP (access point) identifier and an SSID (service set identifier) in each sample network configuration file from the sample network configuration files of the education industry users in the training set;
according to the first word segmentation rule, sequentially segmenting the obtained AP identification, and according to the second word segmentation rule, sequentially segmenting the obtained SSID;
for the word segmentation results of the obtained AP identifiers and of the obtained SSIDs respectively, counting the words in the word segmentation results based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain the word frequency of each word, and sorting the obtained word frequencies from high to low;
adding the words corresponding to the first M word frequencies in the sorted word frequencies, excluding the word frequencies of words that represent a designated meaning, to the keyword library corresponding to the obtained AP identifiers or the obtained SSIDs respectively, wherein M is a positive integer;
selecting, from the words corresponding to the remaining word frequencies in the sorted word frequencies, words having the same meaning as words already in the keyword library, and adding the selected words to the keyword library;
based on an inverse document frequency (IDF) algorithm, processing the words in the current keyword library together with the sample network configuration files of the non-education-industry users in the training set, in which the obtained AP identifiers and the obtained SSIDs have been segmented according to the word segmentation rules set for them, to obtain the IDF value of each word in the current keyword library;
sorting the calculated IDF values from small to large, and deleting the words corresponding to the first N IDF values from the current keyword library, wherein N is a positive integer;
for each sample network configuration file in the test set, according to the first word segmentation rule, sequentially segmenting the AP identification obtained from the sample network configuration file, and according to the second word segmentation rule, sequentially segmenting the SSID obtained from the sample network configuration file;
matching each word in the word segmentation result of each AP identifier in the sample network configuration file with the current keyword library corresponding to the AP identifier one by one;
if a word matching the current keyword library corresponding to the AP identifier exists, identifying the sample network configuration file as a network configuration file of a user in the education industry;
if the word matched with the current keyword library corresponding to the AP identifier does not exist, matching each word in the word segmentation result of each SSID in the sample network configuration file with the current keyword library corresponding to the SSID one by one;
if the words matched with the current keyword library corresponding to the SSID exist, identifying the sample network configuration file as a network configuration file of the user in the education industry;
counting the total number of the identified sample network configuration files as the network configuration files of users in the education industry;
and if the ratio of the counted total number to the total number of network configuration files of actual education-industry users in the test set is not greater than a set ratio, increasing the value of M and returning to the step of adding the words corresponding to the first M word frequencies in the sorted word frequencies to the keyword library corresponding to the obtained AP identifiers or the obtained SSIDs respectively, until the ratio is not less than the set ratio; and then determining the current keyword library corresponding to the AP identifier as the AP identification keyword library, and determining the current keyword library corresponding to the SSID as the SSID keyword library.
2. The method of claim 1, wherein prior to matching with the library of AP-identifying keywords one by one, the method further comprises:
sorting the words in the word segmentation result of each AP identifier in the network configuration file in descending order of number of occurrences;
before matching with the SSID keyword library one by one, the method further comprises:
sorting the words in the word segmentation result of each SSID in the network configuration file in descending order of number of occurrences.
3. The method of claim 1, wherein after matching with the library of AP-identifying keywords one by one, the method further comprises:
and if the words matched with the AP identification keyword library exist, identifying the network configuration file as a network configuration file of the user in the education industry.
4. The method of claim 1, wherein after matching one by one to a library of SSID keywords, the method further comprises:
and if no words matched with the SSID keyword library exist, identifying the network configuration file as a network configuration file of a user in a non-education industry.
5. A file identification device, the device comprising:
the acquisition module is used for respectively acquiring an Access Point (AP) identifier and a Service Set Identifier (SSID) in each network configuration file from the plurality of acquired network configuration files;
the word segmentation module is used for sequentially segmenting the AP identifications in the network configuration files according to a first word segmentation rule set for the AP identifications and sequentially segmenting the SSIDs in the network configuration files according to a second word segmentation rule set for the SSIDs aiming at each network configuration file;
the matching module is used for matching each word in the word segmentation result of each AP identifier in the network configuration file with the AP identifier keyword library one by one; if no words matched with the AP identification keyword library exist, matching each word in the word segmentation result of each SSID in the network configuration file with the SSID keyword library one by one;
the identification module is used for identifying the network configuration file as a network configuration file of a user in the education industry if the matching result of the matching module is that words matched with the SSID keyword library exist;
the AP identification keyword library and the SSID keyword library are both generated according to sample network configuration files of users in education industry and non-education industry;
the device further comprises:
a generating module, configured to generate the AP identification keyword library and the SSID keyword library by:
dividing the sample network configuration files of the education-industry users and the non-education-industry users into a training set and a test set, wherein the training set and the test set each comprise sample network configuration files of both education-industry users and non-education-industry users;
respectively acquiring an AP (access point) identifier and an SSID (service set identifier) in each sample network configuration file from the sample network configuration files of the education industry users in the training set;
according to the first word segmentation rule, sequentially segmenting the obtained AP identification, and according to the second word segmentation rule, sequentially segmenting the obtained SSID;
for the word segmentation results of the obtained AP identifiers and of the obtained SSIDs respectively, counting the words in the word segmentation results based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain the word frequency of each word, and sorting the obtained word frequencies from high to low;
adding the words corresponding to the first M word frequencies in the sorted word frequencies, excluding the word frequencies of words that represent a designated meaning, to the keyword library corresponding to the obtained AP identifiers or the obtained SSIDs respectively, wherein M is a positive integer;
selecting, from the words corresponding to the remaining word frequencies in the sorted word frequencies, words having the same meaning as words already in the keyword library, and adding the selected words to the keyword library;
based on an inverse document frequency (IDF) algorithm, processing the words in the current keyword library together with the sample network configuration files of the non-education-industry users in the training set, in which the obtained AP identifiers and the obtained SSIDs have been segmented according to the word segmentation rules set for them, to obtain the IDF value of each word in the current keyword library;
sorting the calculated IDF values from small to large, and deleting the words corresponding to the first N IDF values from the current keyword library, wherein N is a positive integer;
for each sample network configuration file in the test set, according to the first word segmentation rule, sequentially segmenting the AP identification obtained from the sample network configuration file, and according to the second word segmentation rule, sequentially segmenting the SSID obtained from the sample network configuration file;
matching each word in the word segmentation result of each AP identifier in the sample network configuration file with the current keyword library corresponding to the AP identifier one by one;
if a word matching the current keyword library corresponding to the AP identifier exists, identifying the sample network configuration file as a network configuration file of a user in the education industry;
if the word matched with the current keyword library corresponding to the AP identifier does not exist, matching each word in the word segmentation result of each SSID in the sample network configuration file with the current keyword library corresponding to the SSID one by one;
if the words matched with the current keyword library corresponding to the SSID exist, identifying the sample network configuration file as a network configuration file of the user in the education industry;
counting the total number of the identified sample network configuration files as the network configuration files of the users in the education industry;
and if the ratio of the counted total number to the total number of network configuration files of actual education-industry users in the test set is not greater than a set ratio, increasing the value of M and returning to the step of adding the words corresponding to the first M word frequencies in the sorted word frequencies to the keyword library corresponding to the obtained AP identifiers or the obtained SSIDs respectively, until the ratio is not less than the set ratio; and then determining the current keyword library corresponding to the AP identifier as the AP identification keyword library, and determining the current keyword library corresponding to the SSID as the SSID keyword library.
6. The apparatus of claim 5, further comprising:
a first sorting module, configured to sort, before the matching module matches with the AP identification keyword library one by one, the words in the word segmentation result of each AP identifier in the network configuration file in descending order of number of occurrences;
the device further comprises:
a second sorting module, configured to sort, before the matching module matches with the SSID keyword library one by one, the words in the word segmentation result of each SSID in the network configuration file in descending order of number of occurrences.
7. The apparatus of claim 5, wherein the identification module is further configured to:
after the matching module matches with the AP identification keyword library one by one, if the matching result of the matching module is that a word matching the AP identification keyword library exists, identifying the network configuration file as a network configuration file of a user in the education industry.
8. The apparatus of claim 5, wherein the identification module is further configured to:
after the matching module matches with the SSID keyword library one by one, if the matching result of the matching module is that no word matching the SSID keyword library exists, identifying the network configuration file as a network configuration file of a user in a non-education industry.
CN202010806391.1A 2020-08-12 2020-08-12 File identification method and device Active CN112134920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010806391.1A CN112134920B (en) 2020-08-12 2020-08-12 File identification method and device

Publications (2)

Publication Number Publication Date
CN112134920A CN112134920A (en) 2020-12-25
CN112134920B CN112134920B (en) 2022-08-30

Family

ID=73851605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010806391.1A Active CN112134920B (en) 2020-08-12 2020-08-12 File identification method and device

Country Status (1)

Country Link
CN (1) CN112134920B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230619

Address after: 310052 11th Floor, 466 Changhe Road, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: H3C INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 310052 Changhe Road, Binjiang District, Hangzhou, Zhejiang Province, No. 466

Patentee before: NEW H3C TECHNOLOGIES Co.,Ltd.