CN113114679B - Message identification method and device, electronic equipment and medium - Google Patents

Publication number
CN113114679B
Authority
CN
China
Prior art keywords
request message
request
message
matrix
file
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110397527.2A
Other languages
Chinese (zh)
Other versions
CN113114679A (en
Inventor
吴鸿霖
吕博良
叶红
姜城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110397527.2A
Publication of CN113114679A
Application granted
Publication of CN113114679B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure provides a message identification method and belongs to the field of big data. The method comprises: acquiring a request message for accessing a website; and using a classifier model to identify whether the request message is a file upload message or a non-file-upload message, so that the request message can be processed according to the identification result. The classifier model is a binary classification machine learning model trained on a plurality of historical request messages. The disclosure also provides a message identification device, an electronic device and a computer-readable storage medium.

Description

Message identification method and device, electronic equipment and medium
Technical Field
The present disclosure belongs to the technical field of artificial intelligence, and more particularly relates to a message identification method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Web applications on the modern internet often provide a file upload function to improve service efficiency, but this also increases the risk of attack. If a web application has a file upload vulnerability, an attacker can exploit it and may go on to take control of the entire website or even the server. Monitoring and processing the file upload messages that access a web application is therefore important for protecting it, and this in turn requires first screening the file upload messages out of the large number of request messages that access the web application.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a message identification method, an apparatus, an electronic device, and a medium that can automatically and intelligently identify whether a request message is a file upload message, so that a website can analyze and process the message accordingly.
In one aspect of the disclosed embodiments, a message identification method is provided. The method comprises: acquiring a request message for accessing a website; and using a classifier model to identify whether the request message is a file upload message or a non-file-upload message, so that the request message can be processed according to the identification result. The classifier model is a binary classification machine learning model trained on a plurality of historical request messages.
According to the embodiment of the present disclosure, the identifying, by using the classifier model, that the request packet belongs to a file upload packet or a non-file upload packet includes obtaining a feature matrix of the request packet in any one of the following manners or a combination of the following manners: acquiring a character distribution characteristic matrix of the request message based on natural language processing of characters in the request message; acquiring an uploading behavior distribution characteristic matrix of the request message based on matching of characters in the request message with a preset behavior dictionary matrix space; or, based on the matching of the characters in the request message and a preset type dictionary matrix space, obtaining a file type feature matrix of the request message. The behavior dictionary matrix space is composed of words used for representing uploading behaviors of messages, and the type dictionary matrix space is composed of words used for representing uploaded files. Then, the feature matrix of the request message is input to the classifier model to obtain a classification result output by the classifier model.
According to the embodiment of the present disclosure, the obtaining the feature matrix of the request packet further includes splicing the character distribution feature matrix, the uploading behavior distribution feature matrix, and the file type feature matrix to obtain the feature matrix of the request packet.
According to an embodiment of the present disclosure, obtaining the character distribution feature matrix of the request message based on natural language processing of characters in the request message includes: performing n-gram word segmentation on the request message with characters as the minimum unit of segmentation to obtain a word segmentation list of the request message; and comparing the word segmentation list with a preset word segmentation matrix space to obtain the character distribution feature matrix of the request message. The word segmentation matrix space is composed of segments obtained by performing character-level n-gram word segmentation on a plurality of historical request messages and computing term frequency-inverse document frequency (TF-IDF) statistics over them.
According to an embodiment of the present disclosure, performing n-gram segmentation on the request message with characters as the minimum unit of segmentation includes performing n-gram segmentation on the characters in the request line, the request header, and the parameters of the request body of the request message.
According to an embodiment of the present disclosure, the words in the behavior dictionary matrix space include at least one of: upload, filename, boundary, data, or path; and/or the words in the type dictionary matrix space include at least one of: png, jpg, jpeg, image, gif, bmp, ffd8, or 424D.
According to an embodiment of the present disclosure, the classifier model is obtained by training as follows: obtaining a plurality of history request messages; marking each history request message as a file uploading message or a non-file uploading message; extracting the feature matrix of each historical request message in the same way as the feature matrix of the request message is obtained; and training the classifier model by taking the feature matrix of each historical request message as input and the mark of each historical request message as output reference.
In another aspect of the embodiments of the present disclosure, a message identification apparatus is provided. The apparatus comprises an acquisition module and an identification module. The acquisition module is used for acquiring a request message for accessing a website. The identification module is used for identifying, with a classifier model, whether the request message is a file upload message or a non-file-upload message, so that the request message can be processed according to the identification result. The classifier model is a binary classification machine learning model trained on a plurality of historical request messages.
According to an embodiment of the present disclosure, the recognition module includes a feature extraction sub-module, and a classification sub-module. The feature extraction submodule is used for obtaining a feature matrix of the request message in any one or combination of the following modes: acquiring a character distribution characteristic matrix of the request message based on natural language processing of characters in the request message; acquiring an uploading behavior distribution characteristic matrix of the request message based on matching of characters in the request message with a preset behavior dictionary matrix space; obtaining a file type feature matrix of the request message based on matching of characters in the request message with a preset type dictionary matrix space; the behavior dictionary matrix space is composed of words used for representing uploading behaviors related to the messages, and the type dictionary matrix space is composed of words used for representing uploaded files. And the classification submodule is used for inputting the characteristic matrix of the request message into the classifier model so as to obtain a classification result output by the classifier model.
According to an embodiment of the present disclosure, the apparatus further comprises a training module. The training module is used for training to obtain the classifier model in the following way: obtaining a plurality of history request messages; marking each history request message as a file uploading message or a non-file uploading message; extracting the feature matrix of each historical request message by using the feature extraction submodule in the same way of obtaining the feature matrix of the request message; and training the classifier model by taking the feature matrix of each historical request message as input and the mark of each historical request message as output reference.
In another aspect of the disclosed embodiments, an electronic device is provided. The electronic device includes one or more memories, and one or more processors. The memory stores executable instructions. The processor executes the executable instructions to implement the method as described above.
Another aspect of the embodiments of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of embodiments of the present disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
One or more of the above-described embodiments may provide the following advantages or benefits: a binary classification machine learning model can classify whether a request message is a file upload message, so that the type of the request message is determined intelligently using a big data method and the request message can then be given targeted subsequent processing.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates an exemplary system architecture of a method and apparatus for message identification according to an embodiment of the present disclosure;
fig. 2 schematically shows a flow chart of a method of identifying a message according to an embodiment of the present disclosure;
fig. 3 schematically shows a flow chart of a method for extracting a feature matrix of a request message in an identification method of a message according to an embodiment of the present disclosure;
fig. 4 schematically shows a flow chart of a method for training a classifier model in a message identification method according to an embodiment of the present disclosure;
fig. 5 schematically shows a block diagram of an apparatus for recognition of a message according to an embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of an apparatus for recognition of a message according to another embodiment of the present disclosure; and
fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement message identification according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that these descriptions are illustrative only and are not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
In the related art, for example during website access or testing, whether a request message is a file upload message is usually judged manually from experience before the message is processed further. This judgment of the message type depends on manual experience and is not sufficiently intelligent.
In view of this, embodiments of the present disclosure provide an intelligent and automatic message identification method and apparatus, an electronic device, and a medium. The method first acquires a request message for accessing a website, then uses a classifier model to identify whether the request message is a file upload message or a non-file-upload message, and processes the request message according to the identification result. The classifier model is a binary classification machine learning model trained on a plurality of historical request messages. In this way, the type of the request message can be identified intelligently by means of artificial intelligence, which makes it convenient to apply targeted processing to the request message.
It should be noted that the method and apparatus for identifying a packet determined in the embodiment of the present disclosure may be used in the financial field, and may also be used in any field other than the financial field.
Fig. 1 schematically illustrates an exemplary system architecture 100 of a method and apparatus for message identification according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 may include a user 101, a browser or APP (Application) 102, a recognition device 103, and a target Application server 104. The target application server 104 may provide a background management service for the user 101 through a browser or a website browsed in the APP 102. The recognition device 103 may perform the recognition method of the packet according to the embodiment of the present disclosure, and intelligently recognize the type of the request packet for accessing the website supported by the target application server 104.
The embodiment of the disclosure is suitable for a development phase, a function test phase, a safety test phase and/or a use phase in a software development life cycle. Thus, the user 101 may be a developer, a functional tester, a security tester, and/or a user, among others.
The user 101 may initiate an access request (e.g., an http request) through the browser or APP 102. The identifying device 103 may intercept the access request, then execute the method of the embodiment of the present disclosure to identify whether the request packet of the access request belongs to a file upload packet or a non-file upload packet, and process the request packet according to the identification result. For example, when the request message is identified as a file upload message in the software use stage, it may be detected whether an uploaded file in the request message is secure (e.g., whether a virus is carried, whether a file format is compliant, whether a file size meets a requirement, etc.), and the uploaded file is released when determined to be secure, and discarded if not secure, so that the target application server 104 may be protected from attack. Or for another example, in the software security testing stage, when it is identified that the request packet belongs to the file upload packet, the request packet may be modified to construct various types of vulnerability attack packets, and then the vulnerability protection performance and the like of the target application server 104 may be detected by using the various types of vulnerability attack packets. Therefore, the embodiment of the disclosure can automatically identify the type of the request message, so that the request message can be processed in a targeted manner.
The identification apparatus 103 may be implemented as any one or a combination of any plurality of the identification apparatus 500, the identification apparatus 600, the electronic device 700, a computer readable storage medium, or a computer program, which are described below, and the present disclosure is not limited thereto. The identifying means 103 may be disposed in the target application server 104, or in other devices communicating with the target application server 104 and the browser or APP102, which is not limited in this disclosure.
Fig. 2 schematically shows a flow chart of a message identification method according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 for identifying a packet according to this embodiment may include operations S210 to S220.
In operation S210, a request message for accessing a website is acquired.
In operation S220, a classifier model is used to identify whether the request message is a file upload message or a non-file-upload message, so that the request message can be processed according to the identification result. The classifier model is a binary classification machine learning model trained on a plurality of historical request messages, and it assigns a request message to one of the two classes. In this way, the type of the request message can be identified intelligently by means of machine learning, which makes it convenient to apply targeted processing to the request message.
Specifically, when the classifier model is used to identify that the request packet belongs to the file upload packet or the non-file upload packet in operation S220, the feature matrix of the request packet may be extracted first, and then the feature matrix of the request packet is input to the classifier model to obtain the classification result output by the classifier model, so as to determine the type of the request packet according to the classification result.
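The two-stage flow described above (extract a feature matrix, then feed it to the classifier) can be sketched as follows. This is only an illustrative outline: the function names, the label strings and the toy stand-ins for the feature extractor and the classifier are assumptions made for the example and are not taken from the disclosure.

```python
from typing import Callable, List

# Hypothetical label names used only in this sketch.
FILE_UPLOAD = "file_upload"
NON_FILE_UPLOAD = "non_file_upload"


def identify_message(
    raw_message: str,
    extract_features: Callable[[str], List[float]],
    classifier: Callable[[List[float]], int],
) -> str:
    """Extract the feature vector, classify it, and map the class to a label."""
    features = extract_features(raw_message)
    predicted_class = classifier(features)  # 1 = file upload, 0 = otherwise
    return FILE_UPLOAD if predicted_class == 1 else NON_FILE_UPLOAD


# Toy stand-ins so the sketch runs end to end; a real system would plug in
# the trained model and the full feature extraction instead.
def toy_extract(raw_message: str) -> List[float]:
    return [1.0 if "filename" in raw_message.lower() else 0.0]


def toy_classifier(features: List[float]) -> int:
    return 1 if features[0] > 0.5 else 0
```

Passing the extractor and the classifier as arguments keeps the identification step independent of how the features are built, which matches the way the text treats feature extraction as a separate, interchangeable stage.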
In one embodiment, the feature matrix of the request message may be extracted by extracting data of each field or some fields from the request line, the request header, and/or the request body of the request message, and processing the extracted data into the feature matrix.
In another embodiment, the feature matrix of the request message may be extracted by treating characters as the minimum unit and applying word segmentation, matching and/or other processing to the characters in the request message. In some embodiments, this character-level processing may also incorporate the semantic information of the characters and/or their frequency characteristics (for example, statistics based on term frequency-inverse document frequency, TF-IDF) to obtain the feature matrix of the message. A specific embodiment is illustrated in fig. 3 below.
Fig. 3 schematically shows an illustration of a method flow 300 for extracting a feature matrix of a request packet in an identification method of a packet according to an embodiment of the disclosure.
As shown in fig. 3, the method flow 300 may include at least one of operations S311 to S313, and operation S320.
In operation S311, a character distribution feature matrix of the request message is obtained based on natural language processing of characters in the request message.
For example, n-gram word segmentation may be performed on the request message with characters as the minimum unit of segmentation to obtain a word segmentation list of the request message, and the word segmentation list is then compared with a preset word segmentation matrix space to obtain the character distribution feature matrix of the request message. The word segmentation matrix space is composed of segments obtained by performing character-level n-gram word segmentation on a plurality of historical request messages and computing term frequency-inverse document frequency (TF-IDF) statistics over them.
Note that the file uploaded in an upload request message is attached to the request body in some encoded format, and this encoding is not intelligible as natural language. Therefore, when performing n-gram segmentation on the request message with characters as the minimum unit, the segmentation may be applied only to the characters in the request line, the request header and the parameters of the request body, while the encoded file content in the request body is left unsegmented.
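One way to keep the request line, headers and body parameters while leaving out the encoded file content is sketched below. It assumes a raw HTTP request in text form with CRLF line endings and a multipart/form-data body; the function name and the rule of detecting file parts by a `filename=` attribute are assumptions made for illustration, not details given in the disclosure.

```python
import re


def text_for_segmentation(raw_request: str) -> str:
    """Return request line, headers and body parameters, dropping file content."""
    head, _, body = raw_request.partition("\r\n\r\n")
    match = re.search(r'boundary="?([^";\r\n]+)"?', head, flags=re.IGNORECASE)
    if match is None:
        # Not multipart: treat the whole body as ordinary parameters.
        return head + "\r\n" + body if body else head

    delimiter = "--" + match.group(1)
    kept = [head]
    for part in body.split(delimiter):
        part = part.strip("\r\n")
        if not part or part == "--":  # leading empty piece or closing marker
            continue
        part_head, _, part_body = part.partition("\r\n\r\n")
        kept.append(part_head)  # part headers carry name/filename/type
        if "filename=" not in part_head.lower():
            kept.append(part_body)  # ordinary form field value is kept
    return "\r\n".join(kept)
```

The part headers are retained even for file parts because they hold parameters such as the file name and declared content type, which the later dictionary matching relies on.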
For example, assume that the request line of a request message is "POST/upload, labs/Pass-11/file/uploadName=403.jpg http/1.1", and the parameters of the request header and the request body are "uid=rc-upload-1603767422299-5, name=403.jpg, type=image/jpeg, size=3875, filenamepath="E:\403.jpg, Content-Type: text/plain". The request message can be segmented with an n-gram character window (for example, n = 3) to form the word segmentation list pos | ost | st/ | t/u | /up | upl | plo | … | pla | lai | ain. The word segmentation list is then compared with the preset word segmentation matrix space to form the character distribution feature matrix.
The character distribution feature matrix has the same layout as the word segmentation matrix space. The value at each position of the character distribution feature matrix is determined by whether the segment at the corresponding position of the word segmentation matrix space appears in the word segmentation list of the request message. If the segment appears, the value at that position is set to 1; otherwise it is set to 0. For example, the three segments pos, ost and st/ in the word segmentation list each match the segments pos, ost and st/ at their corresponding positions in the word segmentation matrix space, so the values at those positions in the character distribution feature matrix are set to 1. In other words, in this embodiment the word POST in the request line is represented, through natural language processing, by the three segments pos, ost and st/ of the word segmentation matrix space.
The word segmentation matrix space can be formed by performing character-level n-gram word segmentation on a large number of historical request messages and computing term frequency-inverse document frequency (TF-IDF) statistics for each resulting segment. For example, the segments whose TF-IDF statistic exceeds a certain threshold, or a number of the top-ranked segments by TF-IDF statistic, are selected and arranged into a matrix, thereby forming the word segmentation matrix space.
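The segmentation and 0/1 comparison described above can be sketched as follows. The function names are illustrative, and lower-casing the text before segmentation is an assumption inferred from the example, where POST yields the segment pos.

```python
from typing import List, Sequence


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Slide a window of n characters over the lower-cased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_distribution_vector(
    text: str, vocabulary: Sequence[str], n: int = 3
) -> List[int]:
    """Mark each vocabulary position 1 if its segment occurs in the text, else 0."""
    present = set(char_ngrams(text, n))
    return [1 if segment in present else 0 for segment in vocabulary]
```

For instance, `char_ngrams("POST/upload")` begins with `pos`, `ost`, `st/`, `t/u`, `/up`. The output vector always has one entry per vocabulary segment, in vocabulary order, so messages of any length map to vectors of the same size.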
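A minimal sketch of building such a vocabulary from historical messages is given below, using the top-ranked variant of the selection. The disclosure does not specify the exact TF-IDF formula or how scores are combined across messages, so the smoothed IDF and the summation over messages used here are assumptions, as are the function names and the `top_k` parameter.

```python
import math
from collections import Counter
from typing import List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(messages: List[str], n: int = 3, top_k: int = 1000) -> List[str]:
    """Rank character n-grams by summed TF-IDF and keep the top_k of them."""
    per_message = [Counter(char_ngrams(m, n)) for m in messages]
    num_messages = len(per_message)

    # Document frequency: in how many messages does each segment occur?
    doc_freq: Counter = Counter()
    for counts in per_message:
        doc_freq.update(counts.keys())

    scores: Counter = Counter()
    for counts in per_message:
        total = sum(counts.values())
        if total == 0:
            continue
        for segment, count in counts.items():
            tf = count / total
            # Smoothed IDF (assumed variant) avoids division by zero.
            idf = math.log((1 + num_messages) / (1 + doc_freq[segment])) + 1.0
            scores[segment] += tf * idf

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [segment for segment, _ in ranked[:top_k]]
```

The threshold variant mentioned in the text would replace the final slice with a filter on the score; either way the resulting ordered list fixes the layout of the character distribution feature matrix.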
In operation S312, an upload behavior distribution feature matrix of the request message is obtained by matching the characters in the request message against a preset behavior dictionary matrix space. The behavior dictionary matrix space is composed of words that characterize the upload behavior of a message. These words may be selected empirically or statistically to describe the behavior characteristics of a request message, including words relating to the upload specification, upload path, upload operation, and the like of an upload message. For example, words in the behavior dictionary matrix space include at least one of: upload, filename, boundary, data, or path.
In operation S313, a file type feature matrix of the request message is obtained based on matching of the characters in the request message with a preset type dictionary matrix space. Wherein the type dictionary matrix space is comprised of words used to characterize the uploaded files. The words in the type dictionary matrix space may be selected empirically or statistically from the format, naming characteristics, file type, etc. of the uploaded file. For example, words in the type dictionary matrix space include at least one of: png, jpg, jpeg, image, gif, bmp, ffd8, or 424D.
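The dictionary matching in operations S312 and S313 can be sketched as below, using the example words listed in the text. The disclosure does not state how a match is encoded, so the 0/1 encoding and the case-insensitive substring match are assumptions, and the names are illustrative.

```python
from typing import List, Sequence

# Example words taken from the text above.
BEHAVIOR_WORDS = ["upload", "filename", "boundary", "data", "path"]
TYPE_WORDS = ["png", "jpg", "jpeg", "image", "gif", "bmp", "ffd8", "424D"]


def dictionary_vector(text: str, words: Sequence[str]) -> List[int]:
    """Return 1 for each dictionary word found in the text, else 0.

    Matching is a case-insensitive substring test (an assumption).
    """
    lowered = text.lower()
    return [1 if word.lower() in lowered else 0 for word in words]
```

Applied to a part header such as `Content-Disposition: form-data; name="file"; filename="403.jpg"` followed by `Content-Type: image/jpeg`, the behavior vector marks filename and data, and the type vector marks jpg, jpeg and image.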
Then, in operation S320, a feature matrix of the request packet is obtained based on any one or a combination of more of the character distribution feature matrix, the upload behavior distribution feature matrix, or the file type feature matrix.
For example, in one embodiment, the character distribution feature matrix, the upload behavior distribution feature matrix, and the file type feature matrix may be concatenated to obtain the feature matrix of the request message. The resulting feature matrix reflects the file-upload-related features of a message more comprehensively, from multiple dimensions. When this feature matrix is used as the input of the classifier model, the classifier model can combine information from multiple dimensions to analyze and learn the features of file upload messages, which improves the degree of intelligence and the classification accuracy of the classifier model.
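A small sketch of operation S320 follows, covering both the full concatenation and the case where only some of the three feature matrices are used. The function and parameter names are illustrative assumptions.

```python
from typing import List, Optional, Sequence


def build_feature_vector(
    char_vec: Optional[Sequence[int]] = None,
    behavior_vec: Optional[Sequence[int]] = None,
    type_vec: Optional[Sequence[int]] = None,
) -> List[int]:
    """Concatenate whichever feature vectors are supplied, in a fixed order."""
    parts = [p for p in (char_vec, behavior_vec, type_vec) if p is not None]
    if not parts:
        raise ValueError("at least one feature vector is required")
    combined: List[int] = []
    for part in parts:
        combined.extend(part)
    return combined
```

Whatever combination is chosen, the same combination and order would need to be used for the historical messages during training and for new messages at identification time, so that each position of the input keeps the same meaning.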
Fig. 4 schematically shows a flowchart of a method 400 for training a classifier model in a recognition method of a packet according to an embodiment of the present disclosure.
As shown in fig. 4, the method for recognizing a packet according to the embodiment of the present disclosure further includes training a classifier model according to a process shown in the method 400. The method 400 may include operations S410 to S440.
In operation S410, a plurality of history request messages are acquired.
In operation S420, each historical request message is marked as a file upload message or a non-file-upload message. For example, the type of each historical request message is analyzed manually and then labeled.
In operation S430, the feature matrix of each historical request message is extracted in the same manner as the feature matrix of the request message is obtained. For details, refer to the description of method flow 300, which is not repeated here.
In operation S440, a classifier model is trained with the feature matrix of each historical request message as input and the label of each historical request message as an output reference.
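Operation S440 can be illustrated with a small training loop. The disclosure only requires a binary classification machine learning model and does not name a specific algorithm, so the logistic regression trained by stochastic gradient descent below is a stand-in chosen for brevity; the function names, learning rate and epoch count are likewise assumptions.

```python
import math
from typing import List, Sequence, Tuple


def train_logistic_regression(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    learning_rate: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit weights and bias with per-sample gradient steps on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = predicted - y  # gradient of the log loss w.r.t. z
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return 1 (file upload) if the linear score is non-negative, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0.0 else 0
```

Here each row of `features` is the feature matrix of one historical request message flattened to a vector, and `labels` holds the marks from operation S420 (1 for file upload, 0 otherwise). Any other binary classifier could take the place of this model without changing the surrounding steps.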
Fig. 5 schematically shows a block diagram of an apparatus 500 for identifying a packet according to an embodiment of the present disclosure.
As shown in fig. 5, the message recognition apparatus 500 may include two units: a message sample repository 510 and a file upload decision model 520.
The message sample library 510 stores a large number of HTTP messages from historical service functions. Keywords related to file uploading in these HTTP messages, for example "file", "file_name", "upload", "file_upload", "document", "file_submit", and "ext_name", as well as file extensions such as ".jpg", ".png", ".txt", ".doc", ".docx", ".xls", and ".xlsx", are collected and automatically classified and summarized, and a flag is set for each message to indicate whether it is a file upload message.
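One possible way to organise an entry of such a sample library is sketched below. The names `MessageSample` and `make_sample` are hypothetical, and the simple substring search is an assumption; the source only states that keywords and extensions are collected and that each message carries a flag.

```python
from dataclasses import dataclass, field
from typing import List

# Keywords and extensions listed in the description.
UPLOAD_KEYWORDS = ["file", "file_name", "upload", "file_upload",
                   "document", "file_submit", "ext_name"]
FILE_EXTENSIONS = [".jpg", ".png", ".txt", ".doc", ".docx", ".xls", ".xlsx"]


@dataclass
class MessageSample:
    raw: str                      # the stored HTTP message text
    is_file_upload: bool          # the flag set for the message
    keyword_hits: List[str] = field(default_factory=list)


def make_sample(raw: str, is_file_upload: bool) -> MessageSample:
    """Store a message together with its flag and the upload-related terms it contains."""
    lowered = raw.lower()
    hits = [term for term in UPLOAD_KEYWORDS + FILE_EXTENSIONS if term in lowered]
    return MessageSample(raw=raw, is_file_upload=is_file_upload, keyword_hits=hits)


# Hypothetical message used only for illustration.
sample = make_sample("POST /file_upload?ext_name=.png HTTP/1.1", is_file_upload=True)
```

In this sketch the flag is supplied by the caller, for example from manual labeling, and the recorded keyword hits merely summarise which upload-related terms the message contains.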
The file upload determination model 520 may determine the type of an incoming request message by using a big data method, according to the character distribution features and/or the upload type distribution features derived from that message. The specific approach is as follows.
A word segmentation matrix space can be constructed from the collected HTTP messages of historical service functions. For example, n-gram word segmentation, with the character as the minimum unit of segmentation, is performed on the parameters in the request line, the request header, and the request body of those HTTP messages, and the word segmentation matrix space is then obtained by applying the TF-IDF algorithm.
Meanwhile, upload type distribution features may be constructed; for example, a behavior dictionary space and a type dictionary space may be set. The behavior dictionary space includes, but is not limited to, content related to upload behaviors and paths, such as [upload, filename, boundary, data, path]. The type dictionary space includes, but is not limited to, content related to file extensions, file contents, and file header identifiers, such as [png, jpg, jpeg, image, gif, bmp, ffd8, 424D].
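A minimal sketch of how such a word segmentation matrix space might be built is given below. The function names `char_ngrams` and `build_ngram_space`, the lower-casing, the smoothed IDF formula, and the rule of keeping the `top_k` highest-scoring n-grams are all assumptions; the source only states that character n-grams and TF-IDF are used.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Slide a window of n characters over the lower-cased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_space(messages: List[str], n: int = 3, top_k: int = 1000) -> List[str]:
    """Rank n-grams by their highest TF-IDF score in any message and keep top_k."""
    docs = [Counter(char_ngrams(m, n)) for m in messages]

    # Document frequency: in how many messages each n-gram occurs.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())

    best: Dict[str, float] = {}
    for doc in docs:
        total = sum(doc.values())
        for gram, count in doc.items():
            tf = count / total
            idf = math.log((1 + len(docs)) / (1 + doc_freq[gram])) + 1.0  # smoothed IDF
            best[gram] = max(best.get(gram, 0.0), tf * idf)

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(best, key=lambda g: (-best[g], g))
    return ranked[:top_k]


# Hypothetical historical messages.
history = ["POST /upload?name=a.jpg", "GET /index.html"]
space = build_ngram_space(history, n=3, top_k=50)
```

The returned list fixes both the set of n-grams and their order, so it can serve as the column layout of the character distribution feature matrix.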
Then, the feature matrix of the message is extracted. Taking the message exemplified above again, the content of the request line is "POST /upload-labs/Pass-11/file/upload?name=403.jpg HTTP/1.1", and the parameters of the request header and the request body are "uid=rc-upload-1603767422299-5, name=403.jpg, type=image/jpeg, size=3875, filenamepath="E:\403.jpg", Content-Type: text/plain". The feature matrix of the message can be extracted through the following steps.
In the first step, n-gram word segmentation with a character window of n = 3 is performed on the request line, the request header, and the request body, forming a word segmentation list such as pos | ost | st/ | t/u | /up | upl | plo | … | pla | lai | ain. The word segmentation list is then compared with the word segmentation matrix space: each position is 1 if the corresponding n-gram is present and 0 otherwise. This yields the character distribution feature matrix.
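The first step can be sketched as follows. The function names and the tiny n-gram space are hypothetical; a real space would be derived from historical messages. Lower-casing is assumed because the list in the text shows "pos" for "POST".

```python
from typing import List, Sequence


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Slide a window of n characters over the lower-cased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_distribution_vector(message: str, ngram_space: Sequence[str], n: int = 3) -> List[int]:
    """Return 1 for each n-gram of the space that occurs in the message, else 0."""
    present = set(char_ngrams(message, n))
    return [1 if gram in present else 0 for gram in ngram_space]


# Hypothetical five-entry space, for illustration only.
ngram_space = ["pos", "upl", "get", "ain", "zzz"]
vector = char_distribution_vector("POST/upload Content-Type: text/plain", ngram_space)
```

The vector always has the length of the n-gram space, regardless of how long the message is.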
In the second step, the request line, the request header, and the request body are mapped against the behavior dictionary to form a feature matrix. For example, if the behavior dictionary is [upload, filename, boundary, data, path], the upload request is mapped to the upload behavior distribution feature matrix [1, 0, 1].
In the third step, the upload message is mapped against the type dictionary. For example, if the type dictionary is [png, jpg, jpeg, image, gif, bmp, ffd8, 424D], the message is mapped to the file type feature matrix [0, 1, 0, 0].
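The second and third steps share the same mechanism, which the following sketch illustrates. The function name `dictionary_vector`, the case-insensitive substring matching, and the sample message are assumptions made for illustration; the sample is not the message from the description.

```python
from typing import List, Sequence

# Dictionaries as listed in the description.
BEHAVIOR_DICT = ["upload", "filename", "boundary", "data", "path"]
TYPE_DICT = ["png", "jpg", "jpeg", "image", "gif", "bmp", "ffd8", "424D"]


def dictionary_vector(message: str, dictionary: Sequence[str]) -> List[int]:
    """Return 1 at each dictionary position whose word occurs in the message, else 0."""
    haystack = message.lower()
    return [1 if word.lower() in haystack else 0 for word in dictionary]


# Hypothetical multipart upload request.
sample = (
    "POST /files/upload HTTP/1.1\r\n"
    "Content-Type: multipart/form-data; boundary=xyz\r\n\r\n"
    "--xyz\r\n"
    'Content-Disposition: form-data; name="f"; filename="photo.jpg"\r\n'
    "Content-Type: image/jpeg\r\n"
)

behavior_vector = dictionary_vector(sample, BEHAVIOR_DICT)
type_vector = dictionary_vector(sample, TYPE_DICT)
```

Substring matching is deliberately loose: "data" also matches "form-data". Whether that is desirable depends on the dictionaries chosen, and a stricter token-based match is an equally plausible reading of the text.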
Finally, the three feature matrices are concatenated to form the feature matrix used for determining the category of the message.
Next, feature extraction and mapping are performed in the same manner on the marked upload messages and the other messages, and the resulting feature matrices are fed into a machine learning classification model, such as a gradient boosting decision tree (GBDT) model or a random forest, for classification. There are two classes: file upload messages and non-file upload messages. A classifier model is trained in this way, and the trained classifier model is then used to identify the type of a request message to be identified.
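The training and identification loop can be illustrated with a dependency-free sketch. Note the substitution: the text names GBDT and random forest, whereas this example uses a minimal perceptron purely as a stand-in so that it runs with the standard library alone. The function names and the toy training data are hypothetical.

```python
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def train_perceptron(X: Sequence[Sequence[int]], y: Sequence[int], epochs: int = 20) -> Model:
    """Fit a binary linear classifier; label 1 = file upload, 0 = non-file upload."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            score = bias + sum(w * f for w, f in zip(weights, features))
            predicted = 1 if score > 0 else 0
            error = label - predicted
            if error:
                # Move the decision boundary toward the misclassified sample.
                bias += error
                weights = [w + error * f for w, f in zip(weights, features)]
    return weights, bias


def predict(model: Model, features: Sequence[int]) -> int:
    """Return 1 for a file upload message and 0 otherwise."""
    weights, bias = model
    return 1 if bias + sum(w * f for w, f in zip(weights, features)) > 0 else 0


# Toy feature vectors and labels standing in for marked historical messages.
X_train = [[1, 1, 0], [1, 0, 1], [0, 0, 0], [0, 0, 1]]
y_train = [1, 1, 0, 0]

model = train_perceptron(X_train, y_train)
label = predict(model, [1, 0, 0])
```

In practice a library implementation of GBDT or random forest would take the place of `train_perceptron`, while the surrounding flow (extract features, fit on labeled history, predict on new messages) stays the same.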
Fig. 6 schematically shows a block diagram of an apparatus 600 for identifying a packet according to another embodiment of the present disclosure.
As shown in fig. 6, an apparatus 600 for identifying a packet according to an embodiment of the present disclosure may include an obtaining module 610 and an identifying module 620. According to other embodiments of the present disclosure, the apparatus 600 may further include a training module 630. The apparatus 600 may be used to implement the methods described with reference to fig. 2-4.
The obtaining module 610 is configured to obtain a request message for accessing a website.
The identifying module 620 is configured to identify whether the request message is a file upload message or a non-file upload message by using the classifier model, so that the request message can be processed according to the identification result. The classifier model is a binary classification machine learning model trained on a plurality of historical request messages.
According to some embodiments of the present disclosure, the identification module 620 may include a feature extraction sub-module 621, and a classification sub-module 622.
The feature extraction sub-module 621 is configured to obtain a feature matrix of the request packet in any one of the following manners or a combination of multiple manners: obtaining a character distribution characteristic matrix of the request message based on natural language processing of characters in the request message; acquiring an uploading behavior distribution characteristic matrix of the request message based on matching of characters in the request message with a preset behavior dictionary matrix space; obtaining a file type characteristic matrix of the request message based on matching of characters in the request message with a preset type dictionary matrix space; the behavior dictionary matrix space is composed of words used for representing uploading behaviors related to the messages, and the type dictionary matrix space is composed of words used for representing uploaded files.
The classification sub-module 622 is configured to input the feature matrix of the request packet into the classifier model to obtain a classification result output by the classifier model.
The training module 630 is configured to train the classifier model by: acquiring a plurality of historical request messages; marking each historical request message as a file upload message or a non-file upload message; extracting, through the feature extraction sub-module 621, the feature matrix of each historical request message in the same way as the feature matrix of the request message is obtained; and training the classifier model by taking the feature matrix of each historical request message as input and the mark of each historical request message as an output reference.
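How the identifying module 620 might compose its two sub-modules is sketched below. The class name, the callable-based interfaces, and the toy stand-ins are hypothetical; the source describes the modules functionally and does not fix an implementation.

```python
from typing import Callable, List, Sequence

FeatureExtractor = Callable[[str], List[int]]
Classifier = Callable[[Sequence[int]], int]


class IdentifyingModule:
    """Combines a feature extractor and a classifier, mirroring sub-modules 621 and 622."""

    def __init__(self, extract: FeatureExtractor, classify: Classifier) -> None:
        self.extract = extract    # role of the feature extraction sub-module 621
        self.classify = classify  # role of the classification sub-module 622

    def identify(self, request_message: str) -> str:
        label = self.classify(self.extract(request_message))
        return "file upload" if label == 1 else "non-file upload"


# Toy stand-ins so the sketch runs; real ones would use the trained classifier model.
def toy_extract(message: str) -> List[int]:
    return [1 if "upload" in message.lower() else 0]


def toy_classify(features: Sequence[int]) -> int:
    return features[0]


module = IdentifyingModule(toy_extract, toy_classify)
result = module.identify("POST /upload HTTP/1.1")
```

Passing the extractor in as a parameter lets a training module reuse exactly the same feature extraction, which the description requires for historical and incoming messages alike.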
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any of the obtaining module 610, the identifying module 620, the training module 630, the feature extraction sub-module 621, and the classification sub-module 622 may be combined into one module to be implemented, or any one of the modules may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 610, the identifying module 620, the training module 630, the feature extraction sub-module 621, and the classification sub-module 622 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of the three implementations of software, hardware, and firmware, or by a suitable combination of any of them. Alternatively, at least one of the obtaining module 610, the identifying module 620, the training module 630, the feature extraction sub-module 621, and the classification sub-module 622 may be implemented at least in part as a computer program module that, when executed, may perform a corresponding function.
Fig. 7 schematically illustrates a block diagram of an electronic device 700 suitable for implementing message identification according to an embodiment of the present disclosure. The electronic device 700 shown in fig. 7 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, an electronic device 700 according to an embodiment of the present disclosure includes a processor 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. The processor 701 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 702 and/or the RAM 703. It is noted that the programs may also be stored in one or more memories other than the ROM 702 and RAM 703. The processor 701 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 700 may also include an input/output (I/O) interface 705, which is also connected to the bus 704. The electronic device 700 may also include one or more of the following components connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed in the storage section 708 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by the processor 701, performs the above-described functions defined in the system of the embodiment of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 702 and/or the RAM 703 and/or one or more memories other than the ROM 702 and the RAM 703 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method provided by the embodiments of the present disclosure, when the computer program product is run on an electronic device, the program code being configured to cause the electronic device to implement the method for identifying a message provided by the embodiments of the present disclosure.
The computer program, when executed by the processor 701, performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via the communication section 709, and/or installed from the removable medium 711. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages. In particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (8)

1. A message identification method comprises the following steps:
acquiring a request message for accessing a website; and
identifying whether the request message belongs to a file uploading message or a non-file uploading message by using a classifier model, wherein the classifier model is a binary classification machine learning model obtained by training based on a plurality of historical request messages; and
processing the request message according to the identification result, comprising: when the website is in a test stage, when the request message is identified to belong to a file uploading message, reconstructing the request message to construct various types of vulnerability attack messages, and detecting vulnerability protection performance of a server of the website by using the various types of vulnerability attack messages;
wherein,
the identifying, by using the classifier model, that the request packet belongs to a file upload packet or a non-file upload packet includes: inputting the feature matrix of the request message into the classifier model to obtain a classification result output by the classifier model, wherein the feature matrix of the request message is obtained by combining the following modes:
based on matching characters in a request line, a request head and a request body of the request message with words in a preset behavior dictionary matrix space, distributing numerical values representing matching results according to positions of the words in the behavior dictionary matrix space to obtain an uploading behavior distribution characteristic matrix of the request message; the behavior dictionary matrix space is composed of words used for representing and relating to uploading behaviors of the messages; words in the behavioral dictionary matrix space include at least one of: upload, filename, boundary, data, or path; and
based on matching of characters in a request line, a request head and a request body of the request message with words in a preset type dictionary matrix space, arranging numerical values representing matching results according to positions of the words in the behavior type dictionary space to obtain a file type feature matrix of the request message; the type dictionary matrix space is formed by words used for representing uploaded files; the words in the type dictionary matrix space include at least one of: png, jpg, jpeg, image, gif, bmp, ffd8, or 424D.
2. The method of claim 1, wherein the obtaining the feature matrix of the request packet further comprises:
acquiring a character distribution characteristic matrix of the request message based on natural language processing of characters in the request message; and
splicing the character distribution characteristic matrix, the uploading behavior distribution characteristic matrix and the file type characteristic matrix to obtain the characteristic matrix of the request message.
3. The method of claim 2, wherein the obtaining a character distribution feature matrix of the request message based on natural language processing of characters in the request message comprises:
performing n-gram word segmentation on the request message by taking characters as the minimum unit of word segmentation to obtain a word segmentation list of the request message; and
comparing the word segmentation list with a preset word segmentation matrix space to obtain a character distribution characteristic matrix of the request message;
wherein,
the word segmentation matrix space is formed by word segments obtained by performing n-gram word segmentation on a plurality of historical request messages by taking characters as the minimum unit and performing term frequency-inverse document frequency (TF-IDF) statistics.
4. The method of claim 3, wherein the performing n-gram word segmentation on the request message by taking characters as the minimum unit of word segmentation comprises:
performing n-gram word segmentation on characters in the request line, the request head and the parameters of the request body of the request message.
5. The method of any of claims 2 to 4, wherein the classifier model is trained by:
obtaining a plurality of historical request messages;
marking each historical request message as a file uploading message or a non-file uploading message;
extracting the feature matrix of each historical request message in the same way as the feature matrix of the request message is obtained; and
training the classifier model by taking the feature matrix of each historical request message as input and the mark of each historical request message as output reference.
6. An apparatus for recognizing a packet, comprising:
the acquisition module is used for acquiring a request message for accessing a website; and
the identification module is used for identifying whether the request message belongs to a file uploading message or a non-file uploading message by using the classifier model and processing the request message according to an identification result; the classifier model is a binary classification machine learning model obtained by training based on a plurality of historical request messages;
wherein,
the identifying, by the classifier model, that the request packet belongs to a file upload packet or a non-file upload packet includes: inputting the feature matrix of the request message into the classifier model to obtain a classification result output by the classifier model, wherein the feature matrix of the request message is obtained by combining the following modes:
based on matching characters in a request line, a request head and a request body of the request message with words in a preset behavior dictionary matrix space, arranging numerical values representing matching results according to positions of the words in the behavior dictionary matrix space to obtain an uploading behavior distribution characteristic matrix of the request message; the behavior dictionary matrix space is composed of words used for representing and relating to uploading behaviors of the messages; words in the behavioral dictionary matrix space include at least one of: upload, filename, boundary, data, or path; and
based on matching of characters in a request line, a request head and a request body of the request message with words in a preset type dictionary matrix space, arranging numerical values representing matching results according to positions of the words in the behavior type dictionary space to obtain a file type feature matrix of the request message; the type dictionary matrix space is composed of words for representing uploaded files; the words in the type dictionary matrix space include at least one of: png, jpg, jpeg, image, gif, bmp, ffd8, or 424D;
the processing the request message according to the identification result comprises: when the website is in a test stage, when the request message is identified to belong to a file uploading message, the request message is reformed to construct various types of vulnerability attack messages, and the vulnerability protection performance of a server of the website is detected by using the various types of vulnerability attack messages.
7. An electronic device, comprising:
one or more memories storing executable instructions; and
one or more processors executing the executable instructions to implement the method of any one of claims 1-5.
8. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 5.
CN202110397527.2A 2021-04-13 2021-04-13 Message identification method and device, electronic equipment and medium Active CN113114679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397527.2A CN113114679B (en) 2021-04-13 2021-04-13 Message identification method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113114679A CN113114679A (en) 2021-07-13
CN113114679B true CN113114679B (en) 2023-03-24

Family

ID=76716791

Country Status (1)

Country Link
CN (1) CN113114679B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant