WO2022162655A1 - A system and method for producing specifications for fields with variable number of elements - Google Patents

Info

Publication number
WO2022162655A1
Authority
WO
WIPO (PCT)
Prior art keywords
messages
length
message
given
string
Application number
PCT/IL2022/050004
Other languages
French (fr)
Inventor
Lior KOGAN
Original Assignee
Elbit Systems C4I and Cyber Ltd.
Priority claimed from IL280433A
Priority claimed from IL280435A
Priority claimed from IL280437A
Priority claimed from IL280436A
Application filed by Elbit Systems C4I and Cyber Ltd.
Publication of WO2022162655A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22: Parsing or analysis of headers

Definitions

  • The invention relates to a system and method for producing specifications for fields with a variable number of elements.
  • US Patent application No. 2009/0006645 (Cui et al.), published on January 1, 2009, discloses a system for automatic inference of message formats from network packets.
  • Each network message from a set of network messages is split into one or more tokens based on the types of bytes in the network messages.
  • The set of network messages can then be classified into clusters based on token patterns.
  • The network messages in each cluster can then be further sub-clustered recursively based on the message formats. Further, the messages with a similar message format across the sub-clusters can be merged into a cluster.
  • The set of formatted clusters thus obtained corresponds to a set of message formats that can be used further for protocol reverse engineering.
  • US Patent application No. 2019/0296935 (HONG et al.) published on September 26, 2019, discloses a device and method for dividing a field boundary of a CAN trace.
  • The method for dividing a field boundary of a CAN trace includes: collecting a CAN trace of a CAN bus; dividing the CAN trace into multiple blocks including multiple frames of the CAN trace; performing first static field division to each of the multiple blocks; and performing second static field division based on the result of the first static field division to divide a final field boundary of the CAN trace.
  • US Patent No. 9,100,326 (Iliofotou et al.) published on August 4, 2015, discloses a method for analyzing an application protocol of a network.
  • The method includes extracting non-alphanumeric tokens from conversations of the network, selecting a frequently occurring non-alphanumeric token as a field delimiter candidate for dividing each conversation into a slice-set, analyzing slice-sets of the conversations to determine a statistical measure of matched slices for each conversation and to determine a field delimiter candidate score by aggregating the statistical measure of matched slices for all conversations, and selecting the non-alphanumeric token as the field delimiter of the protocol based on the field delimiter candidate score associated with the non-alphanumeric token.
  • US Patent No. 6,931,574 (Coupal et al.) published on August 4, 2015, discloses a protocol analyzer for interpreting data frames captured on a communications network.
  • The protocol analyzer includes a network interface connection for providing the electrical and physical connection to the communications network and for receiving data frames from the network in a particular physical layer protocol format.
  • The protocol analyzer further includes analysis software for providing an interpretation of received data frames.
  • The interpretation of a frame is based upon a series of definition constructs that are stored in a protocol definition file and a protocol database of the protocol analyzer.
  • The definition constructs collectively define the characteristics of a data frame for a given physical layer protocol.
  • The constructs provide a means for identifying any one of a number of higher-level protocols that may be embedded within the data frame. Also disclosed is a graphical user interface for use as a protocol editor for assembling the necessary definition constructs for inclusion in a protocol definition file. Further, embodiments of a graphical interface for displaying the results of interpreted frames are also disclosed.
  • US Patent application No. 2015/0363215 (Versteeg et al.) published on December 17, 2015, discloses a method of service emulation in which a plurality of messages communicated between a system under test and a target system for emulation are recorded in a computer-readable memory.
  • Ones of the messages are clustered to define a plurality of message clusters, and respective cluster prototypes are generated for the message clusters.
  • The respective cluster prototypes include a commonality among the ones of the messages of the corresponding message clusters.
  • One of the message clusters is identified as corresponding to a request from the system under test based on a comparison of the request with the respective cluster prototypes, and a response to the request for transmission to the system under test is generated based on the one of the message clusters that was identified.
  • Related computer systems and computer program products are also discussed.

GENERAL DESCRIPTION

  • a system for determining location and parameters of constant length and variable length string fields within a plurality of messages of a given type comprising a processing circuitry configured to: (a) obtain the plurality of messages of the given type, each of the messages comprised of a sequence of bytes; (b) determine, for each of the messages of the given type, an index byte, being a first byte of a sequence of bytes of each of the respective messages; (c) determine, for the index byte of each of the messages: (A) a message string plausibility score, indicating a plausibility that a part of the respective messages starting at the index byte is a string field, based on analysis of a content of the part of the respective messages, and (B) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value; (d) upon a criterio
  • The processing circuitry is further configured to: determine for the index byte of each of the messages a full string candidate length indicating a number of bytes of the respective message, starting at the index byte, having a character value or a terminator value; and upon the criterion being met, determine a type of the variable length string, utilizing at least the full string candidate length.
  • The type is one of: (a) a constant length string field not ending with the terminator value; (b) a constant length string field ending with the terminator value; (c) a constant length string field ending with padding values; (d) a constant length string field with a length prefix ending with noise values; (e) a variable length string field with the length prefix and not ending with the terminator value; or (f) a variable length string field ending with the terminator value.
  • The processing circuitry is further configured to: remove the variable length string from each of the messages; and repeat (b)-(d). In some cases, upon the criterion being met, the processing circuitry is further configured to determine one or more parameters associated with the variable length string.
  • The processing circuitry is further configured to validate, using the parameters, that the variable length string is a valid variable length string, and wherein upon the validation being unsuccessful, repeat (c)-(d) with the index byte being a byte subsequent to the index byte, if any.
  • The messages are obtained from a trace obtained from one or more computerized networks.
  • The trace includes a plurality of additional messages of one or more other types other than the given type.
  • At least two of the messages have different message length.
  • The variable length string is an alphanumeric string.
  • The processing circuitry is further configured to provide a user of the system or an external system with a list of identified variable length strings, being the variable length strings, the respective index byte of each variable length string, and the respective parameters of each variable length string.
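The two length measures described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the choice of printable ASCII as the set of character values, and 0x00 as the terminator value are all assumptions made for illustration. The helper lengths_vary is likewise hypothetical and only hints at how differing candidate lengths across messages might suggest a variable length string.

```python
TERMINATOR = 0x00  # assumption: the terminator value is a NUL byte


def is_character(value: int) -> bool:
    """Assumption: printable ASCII bytes count as character values."""
    return 0x20 <= value <= 0x7E


def string_candidate_length(message: bytes, index: int) -> int:
    """Count consecutive character bytes starting at the index byte."""
    length = 0
    while index + length < len(message) and is_character(message[index + length]):
        length += 1
    return length


def full_string_candidate_length(message: bytes, index: int) -> int:
    """Count consecutive bytes that are either character or terminator values."""
    length = 0
    while index + length < len(message):
        value = message[index + length]
        if not (is_character(value) or value == TERMINATOR):
            break
        length += 1
    return length


def lengths_vary(messages: list[bytes], index: int) -> bool:
    """Hypothetical helper: True if the candidate length differs between messages."""
    return len({string_candidate_length(m, index) for m in messages}) > 1
```

For the message b"\x01\x02HELLO\x00\x00\x07" and index byte 2, the string candidate length covers only "HELLO", while the full string candidate length also covers the two trailing terminator bytes.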
  • a system for determining parameters of variable length fields being fields with a variable number of elements, within a plurality of messages of a given type
  • The given message length is a shortest message length of the messages.
  • Identifying the plurality of possible solutions also includes testing the possible solutions against a given number of additional message lengths, being lengths of one or more other messages of the messages other than the given message, and wherein for each of the additional message lengths at least one solution exists to a given additional message length equation, wherein the given additional message length equation is: $\text{additional message length} = e_1 k_1 + e_2 k_2 + \dots + e_y k_y + f \cdot pl + m$, so that $e_i \ge 0$.
  • Identification of the plurality of possible solutions is optimized by calculating the Frobenius number for $k_1, \dots, k_y$.
  • The messages are obtained from a trace obtained from one or more computerized networks.
  • The trace includes a plurality of additional messages of one or more other types other than the given type.
  • At least two of the messages have different message length.
  • The processing circuitry is further configured to provide an output including at least the list of $(skip_j, es_j)$.
  • $k_i$ has a predetermined lower threshold and a predetermined upper threshold.
  • At least one of f and y has a predetermined upper threshold.
  • pl is one of a predetermined set of values: one byte, two bytes or four bytes.
  • pl has one of: a big-endian representation or a little-endian representation.
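As a rough illustration of the message length equation, the following Python sketch tests whether a message length can be written as a non-negative integer combination of candidate element sizes plus the constant part f·pl + m. The function names are hypothetical, and f, pl and m are simply treated as given integers, since this excerpt does not define them. The closed-form Frobenius number shown applies only to two coprime sizes. Once the sizes are coprime, every residual above the Frobenius number is representable, which is presumably why computing it reduces the number of lengths that need explicit testing.

```python
from math import gcd


def is_representable(target: int, ks: list[int]) -> bool:
    """True if target = e_1*k_1 + ... + e_y*k_y for some integers e_i >= 0."""
    if any(k <= 0 for k in ks):
        raise ValueError("element sizes must be positive")
    if target < 0:
        return False
    reachable = [False] * (target + 1)
    reachable[0] = True  # all e_i equal to zero
    for total in range(1, target + 1):
        reachable[total] = any(total >= k and reachable[total - k] for k in ks)
    return reachable[target]


def length_has_solution(message_length: int, ks: list[int], f: int, pl: int, m: int) -> bool:
    """Check the length equation after subtracting the constant part f*pl + m."""
    return is_representable(message_length - f * pl - m, ks)


def frobenius_two(k1: int, k2: int) -> int:
    """Largest integer not representable by two coprime positive sizes."""
    if gcd(k1, k2) != 1:
        raise ValueError("the Frobenius number is defined only for coprime sizes")
    return k1 * k2 - k1 - k2
```

With element sizes 3 and 5 the Frobenius number is 7, so a residual of 7 has no solution while every residual of 8 or more does.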
  • a system comprising a processing circuitry configured to: obtain a traffic trace of messages of a communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; apply one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determine, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
  • The processing circuitry is further configured to generate a relationship model, modeling a relationship between the given number of additional messages and the unobserved message types number to be observed upon obtaining the additional traffic trace.
  • The unseen species problem estimators are one or more of: Chao1 estimators, Chao2 estimators, Abundance Coverage-based Estimators (ACE), Incidence Coverage-based Estimators (ICE), coverage-duplication estimators, penalized nonparametric maximum likelihood estimators or jackknife estimators, or any combinations thereof.
  • The traffic trace is obtained from one or more computerized networks.
  • The processing circuitry is further configured to provide the at least one of the first estimation or the second estimation to a user of the system or an external system.
  • The processing circuitry is further configured to: receive a desired unobserved message types number; and recommend the given number of additional messages to be obtained in the additional traffic trace based on the second estimation.
  • The processing circuitry is further configured to provide the relationship model to a user of the system or to an external system.
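To make the unseen species idea concrete, here is a minimal Python sketch of one of the named estimators, Chao1, in its commonly used bias-corrected form: S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of message types seen exactly once and exactly twice. The function names and the representation of the trace as a list of message type labels are assumptions for illustration. The text does not say which variant of the estimator the system uses.

```python
from collections import Counter


def chao1_total_types(message_types: list[str]) -> float:
    """Bias-corrected Chao1 estimate of the total number of message types."""
    counts = Counter(message_types)
    observed = len(counts)                               # S_obs
    f1 = sum(1 for c in counts.values() if c == 1)       # types seen once
    f2 = sum(1 for c in counts.values() if c == 2)       # types seen twice
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


def chao1_unobserved_types(message_types: list[str]) -> float:
    """Estimated number of message types not yet seen in the trace."""
    return chao1_total_types(message_types) - len(set(message_types))
```

A trace with many types seen only once yields a large estimate of unobserved types, while a trace in which every type recurs several times yields an estimate close to zero.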
  • a system for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length comprising a processing circuitry configured to: (a) obtain the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) provide a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores, each representing a plausibility that a respective field-type of constant length field starts at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers; (c) utilize the classifiers to produce plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages; (d) identify the highest score based on (i) each classifier’
  • Marking is performed only if the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score does not exceed an end of the messages of the given type.
  • Marking is performed only if bytes within the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score are not already marked.
  • The messages are obtained from a trace obtained from one or more computerized networks.
  • The trace includes a plurality of additional messages of one or more other types other than the given type.
  • The constant length fields are one or more of: big-endian 16 bits integer; big-endian 32 bits integer; big-endian 64 bits integer; big-endian 32 bits floating point; big-endian 64 bits floating point; little-endian 16 bits integer; little-endian 32 bits integer; little-endian 64 bits integer; little-endian 32 bits floating point; or little-endian 64 bits floating point.
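The marking rules above can be sketched as a greedy selection loop in Python. This is an illustrative reading, not the claimed procedure: the Candidate type and the field names are hypothetical, and each candidate is assumed to already carry a single combined score, because the excerpt is truncated before it explains how plausibility scores and classifier performance scores are combined.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    score: float      # assumed to be an already combined score
    offset: int       # offset at which the field would start
    field_type: str   # hypothetical label, e.g. "be_int32"
    size: int         # number of bytes the field type occupies


def greedy_mark(candidates: list[Candidate], message_length: int) -> list[Candidate]:
    """Accept candidates from highest score down, skipping invalid ones."""
    marked = [False] * message_length
    accepted = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        end = cand.offset + cand.size
        if end > message_length:
            continue  # field would exceed the end of the message
        if any(marked[cand.offset:end]):
            continue  # some of the bytes are already marked
        for i in range(cand.offset, end):
            marked[i] = True
        accepted.append(cand)
    return sorted(accepted, key=lambda c: c.offset)
```

Visiting candidates in descending score order is equivalent to repeatedly picking the highest remaining score and disregarding those already handled.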
  • a method for determining location and parameters of constant length and variable length strings within a plurality of messages of a given type (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a sequence of bytes; (b) determining, by the processing circuitry, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective messages; (c) determining, by the processing circuitry, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, and (b) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value; (d) upon
  • The method further comprising: determining, by the processing circuitry, for the index byte of each of the messages a full string candidate length indicating a number of bytes of the respective message, starting at the index byte, having a character value or a terminator value; and upon the criterion being met, determining, by the processing circuitry, a type of the variable length string, utilizing at least the full string candidate length.
  • The type is one of: (a) a constant length string field not ending with the terminator value; (b) a constant length string field ending with the terminator value; (c) a constant length string field ending with padding values; (d) a constant length string field with a length prefix ending with noise values; (e) a variable length string field with the length prefix and not ending with the terminator value; or (f) a variable length string field ending with the terminator value.
  • The method further comprising: removing, by the processing circuitry, the variable length string from each of the messages; and repeating, by the processing circuitry, (b)-(d).
  • The method further comprising: determining, by the processing circuitry, one or more parameters associated with the variable length string.
  • The method further comprising: validating, by the processing circuitry, using the parameters, that the variable length string is a valid variable length string, and wherein upon the validation being unsuccessful, repeating, by the processing circuitry, (c)-(d) with the index byte being a byte subsequent to the index byte, if any.
  • The messages are obtained from a trace obtained from one or more computerized networks.
  • The trace includes a plurality of additional messages of one or more other types other than the given type.
  • At least two of the messages have different message length.
  • The variable length string is an alphanumeric string.
  • The method further comprising: providing, by the processing circuitry, a user of the system or an external system with a list of identified variable length strings, being the variable length strings, the respective index byte of each variable length string, and the respective parameters of each variable length string.
  • The given message length is a shortest message length of the messages.
  • Identification of the plurality of possible solutions is optimized by calculating the Frobenius number for $k_1, \dots, k_y$.
  • The messages are obtained from a trace obtained from one or more computerized networks.
  • The trace includes a plurality of additional messages of one or more other types other than the given type.
  • At least two of the messages have different message length.
  • The processing circuitry is further configured to provide an output including at least the list of $(skip_j, es_j)$.
  • $k_i$ has a predetermined lower threshold and a predetermined upper threshold.
  • At least one of f and y has a predetermined upper threshold.
  • pl is one of a predetermined set of values: one byte, two bytes or four bytes.
  • pl has one of: a big-endian representation or a little-endian representation.
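The width and byte order options listed for pl can be illustrated with a short Python sketch that reads a one, two or four byte unsigned integer from a message in either representation. Treating pl as the width of a prefix that precedes a variable number of elements is an assumption, since this excerpt does not define the symbol, and read_prefix is a hypothetical name.

```python
def read_prefix(message: bytes, offset: int, pl: int, byteorder: str) -> int:
    """Read an unsigned prefix of pl bytes at offset, big- or little-endian."""
    if pl not in (1, 2, 4):
        raise ValueError("pl must be one, two or four bytes")
    if byteorder not in ("big", "little"):
        raise ValueError("byteorder must be 'big' or 'little'")
    if offset < 0 or offset + pl > len(message):
        raise ValueError("prefix does not fit inside the message")
    return int.from_bytes(message[offset:offset + pl], byteorder)
```

The same two bytes 0x00 0x03 decode to 3 as big-endian and to 768 as little-endian, so trying both representations against the observed message lengths is one way a wrong guess could be ruled out.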
  • a method comprising: obtaining, by a processing circuitry, a traffic trace of messages of a communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; applying, by the processing circuitry, one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determining, by the processing circuitry, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
  • The method further comprising: generating, by the processing circuitry, a relationship model, modeling a relationship between the given number of additional messages and the unobserved message types number to be observed upon obtaining the additional traffic trace.
  • The unseen species problem estimators are one or more of: Chao1 estimators, Chao2 estimators, Abundance Coverage-based Estimators (ACE), Incidence Coverage-based Estimators (ICE), coverage-duplication estimators, penalized nonparametric maximum likelihood estimators or jackknife estimators, or any combinations thereof.
  • The traffic trace is obtained from one or more computerized networks.
  • The method further comprising: providing, by the processing circuitry, the at least one of the first estimation or the second estimation to a user of the system or to an external system.
  • The method further comprising: receiving, by the processing circuitry, a desired unobserved message types number; and recommending, by the processing circuitry, the given number of additional messages to be obtained in the additional traffic trace based on the second estimation.
  • The method further comprising: providing, by the processing circuitry, the relationship model to a user of the system or to an external system.
  • a method for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length comprising: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) providing, by the processing circuitry, a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores, each representing a plausibility that a respective field-type of constant length field starts at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers; (c) utilizing, by the processing circuitry, the classifier
  • The repeat is performed for each marking, each time while disregarding the scores associated with the respective subset of the marked bytes starting at the respective offset.
  • Marking is performed only if the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score does not exceed an end of the messages of the given type.
  • Marking is performed only if bytes within the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type starting at one or more offsets associated with the highest score are not already marked.
  • The messages are obtained from a trace obtained from one or more computerized networks.
  • The trace includes a plurality of additional messages of one or more other types other than the given type.
  • The constant length fields are one or more of: big-endian 16 bits integer; big-endian 32 bits integer; big-endian 64 bits integer; big-endian 32 bits floating point; big-endian 64 bits floating point; little-endian 16 bits integer; little-endian 32 bits integer; little-endian 64 bits integer; little-endian 32 bits floating point; or little-endian 64 bits floating point.
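The ten field types listed above map directly onto format strings of Python's standard struct module, which gives a compact way to see how the same bytes decode under each interpretation. The dictionary keys and the decode_field helper are hypothetical names, and treating the integers as unsigned is an assumption, since the text does not state signedness.

```python
import struct

# ">" selects big-endian, "<" little-endian; H, I, Q are unsigned 16/32/64-bit
# integers (unsigned is an assumption); f and d are 32/64-bit floating point.
FIELD_FORMATS = {
    "be_int16": ">H", "be_int32": ">I", "be_int64": ">Q",
    "be_float32": ">f", "be_float64": ">d",
    "le_int16": "<H", "le_int32": "<I", "le_int64": "<Q",
    "le_float32": "<f", "le_float64": "<d",
}


def field_size(field_type: str) -> int:
    """Number of bytes occupied by the given field type."""
    return struct.calcsize(FIELD_FORMATS[field_type])


def decode_field(message: bytes, offset: int, field_type: str):
    """Interpret the bytes at offset as the given constant length field type."""
    return struct.unpack_from(FIELD_FORMATS[field_type], message, offset)[0]
```

A classifier for a given field type could decode every message at a candidate offset in this way and then judge whether the resulting values look plausible. How that judgment is made is outside this sketch.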
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method for determining location and parameters of constant length and variable length strings within a plurality of messages of a given type: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a sequence of bytes; (b) determining, by the processing circuitry, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective messages; (c) determining, by the processing circuitry, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, and (b)
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method comprising: obtaining, by a processing circuitry, a traffic trace of messages of a communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; applying, by the processing circuitry, one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determining, by the processing circuitry, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length, the method comprising: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) providing, by the processing circuitry, a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores, each representing a plausibility that a respective field-type of constant length field starts at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifier
  • Fig. 1 is a block diagram schematically illustrating one example of a system for producing specifications for binary protocols, in accordance with the presently disclosed subject matter.
  • Fig. 2 is a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of variable length strings within a plurality of messages of a given type, in accordance with the presently disclosed subject matter;
  • Fig. 3 is a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of variable number of elements fields within a plurality of messages of a given type, in accordance with the presently disclosed subject matter;
  • Fig. 4 is a flowchart illustrating one example of a sequence of operations carried out for detecting absences in the traffic trace, in accordance with the presently disclosed subject matter;
  • Fig. 5 is a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of constant length fields within a plurality of messages of a given type, in accordance with the presently disclosed subject matter.
  • should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g., digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co-residing on a single physical machine, any other electronic computing device, and/or any combination thereof.
  • The term “non-transitory” is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
  • The phrases “for example”, “such as”, “for instance” and variants thereof describe non-limiting embodiments of the presently disclosed subject matter.
  • Reference in the specification to “one case”, “some cases”, “other cases” or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter.
  • The appearance of the phrase “one case”, “some cases”, “other cases” or variants thereof does not necessarily refer to the same embodiment(s).
  • Fig. 1 illustrates a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter.
  • Each module in Fig. 1 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein.
  • The modules in Fig. 1 may be centralized in one location or dispersed over more than one location.
  • The system may comprise fewer, more, and/or different modules than those shown in Fig. 1. Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
  • Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
  • Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
  • Fig. 1 is a block diagram schematically illustrating one example of a system for producing specifications for binary protocols, in accordance with the presently disclosed subject matter.
  • System 200 can comprise a communications interface 220 that connects the system 200 to a network, a communication channel or a radio receiver, and enables it to receive or capture data sent thereto through the network or the communication channel, including in some cases traces of binary messages sent over the one or more networks.
  • The communications interface 220 can be connected to a Local Area Network (LAN), to a Wide Area Network (WAN), to a wireless communications channel, to a wireless network, to a communication bus, to a point-to-point communication channel, to a radio link, or to the Internet.
  • The communications interface 220 can connect to a wireless network or communication channel. It is to be noted that in some cases the received information, or part thereof, can be collected from one or more networks or communication channels.
  • System 200 can further comprise or be otherwise associated with a data repository 210 (e.g., a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, a combination of ROM and RAM or any other type of memory, etc.) configured to store data, including, inter alia, binary messages, lists of message string plausibility scores (being an indication of a plausibility that a given candidate string is actually a string field, as further detailed herein with reference to Fig. 2), lists of identified variable length strings within the messages, respective index byte of each identified variable length string (wherein the index byte can point at a given byte of a message, as further detailed herein with reference to Fig.
  • data repository 210 can be further configured to enable retrieval and/or update and/or deletion of the data stored thereon. It is to be noted that in some cases, data repository 210 can be distributed. It is to be noted that in some cases, data repository 210 can be stored on cloud-based storage.
  • System 200 further comprises processing circuitry 230.
  • Processing circuitry 230 can be one or more processing circuitry units (e.g., central processing units), microprocessors, microcontrollers (e.g., microcontroller units (MCUs)) or any other computing devices or modules, including multiple and/or parallel and/or distributed processing circuitry units, which are adapted to independently or cooperatively process data for controlling relevant system 200 resources and for enabling operations related to system 200 resources.
  • the processing circuitry 230 comprises a variable length string analysis module 240, a variable number of elements fields analysis module 250, a traffic trace absences detection module 260 and a constant length fields analysis module 270.
  • the variable length string analysis module 240 is configured to perform a variable length string analysis process, as further detailed herein, inter alia with reference to Fig. 2.
  • variable number of elements fields analysis module 250 is configured to perform a variable number of elements fields analysis process, as further detailed herein, inter alia with reference to Fig. 3.
  • the traffic trace absences detection module 260 is configured to perform a traffic trace absences detection process, as further detailed herein, inter alia with reference to Fig. 4.
  • the constant length fields analysis module 270 is configured to perform a constant length fields analysis process, as further detailed herein, inter alia with reference to Fig. 5.
  • Fig. 2 is a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of variable length strings within a plurality of messages of a given type, in accordance with the presently disclosed subject matter.
  • system 200 can be configured to perform variable length string analysis process 300, e.g., utilizing the variable length string analysis module 240.
  • System 200 analyzes obtained messages to determine a specification of the communication protocol and specifically to determine the location and parameters of variable-length string fields within the obtained messages.
  • the communication protocol is a binary communication protocol.
  • the binary communication protocol can be a proprietary protocol.
  • the structure of the messages adhering to the binary communication protocol is generally unknown or is unknown to the user of system 200.
  • system 200 can be configured to obtain a plurality of messages of a given type, each of the messages comprised of a sequence of bytes (block 310). Each message is a sequence of one or more fields, wherein each field comprises one or more bytes of the bytes forming the message.
  • the messages structure adheres to the communication protocol, so that the communication protocol determines the location and parameters of each field within the message.
  • Each of the obtained messages can have a specific given type, thereby all having the same structure even if not always having the same number of bytes due to variable-length fields.
  • a message of a given type has a given structure.
  • the type of the message can be encoded in one of the fields of the message. In these cases, system 200 can identify the type of message field and use it to discern between messages of different types.
  • the obtained messages are preprocessed.
  • the messages are received as a stream of bytes that is not divided into individual messages.
  • the preprocessing can include splitting the stream of bytes into messages by using methods and/or techniques known in the art.
  • the preprocessing of the obtained messages includes identifying the types of these messages (noting that in some cases this step is not required as the plurality of messages obtained at block 310 are already of the same message type).
  • the obtained messages can include a plurality of messages of a plurality of message types, and the preprocessing can include removing messages that are of a type other than a given message type.
  • the preprocessing can include in these cases the splitting of the obtained messages into message groups according to type and the execution by system 200 of variable length string analysis process 300 for each message group.
  • the splitting of the obtained messages into message groups according to message type is achieved using methods known in the art for message type identification, for example: a correlation method correlating between values of candidate message type fields and the message length.
  • the given message type is variable-length message type, wherein the messages of the variable-length message type include one or more variable-length fields.
  • the variable-length fields can be textual string fields containing a representation of a string.
  • the string can be an alphanumeric string, e.g., comprising digits, letters, delimiters, etc.
  • the fields forming the message can include fixed-length fields, having a fixed length, or variable-length fields, wherein the length of such fields (i.e., the number of bytes comprised within the field) in one message is different than the length of such fields in a second message.
  • the preprocessing can be performed by system 200 (e.g., before or as part of the execution of variable length string analysis process 300) or by an external system.
  • the messages can be obtained by system 200 from a trace of timestamped sequence of packets captured from one or more computerized communication networks or communication channels.
  • the trace can be captured using a sniffer, a receiver, a probe, or similar tools.
  • the obtained messages, or some of them, can be part of a recording of historical messages communicated over the one or more communication networks or communication channels. In some cases, the obtained messages, or some of them, are part of real-time communication currently transferred over the one or more communication networks or communication channels.
  • the obtained messages can be a combination made of a part of a recording of historical messages communicated over the one or more communication networks or communication channels and a part of real-time communication currently transferred over the one or more communication networks or communication channels.
  • the messages can be obtained from a communication trace communicated using a common communication protocol or using a plurality of communication protocols.
  • the textual string fields can have one of several possible representations. These possible representations can include, among others:
  • (1) A constant length string field not ending with the terminator value - strings of this type have a constant length and have no terminator character. A terminator character can be, for example, a character with an American Standard Code for Information Interchange (ASCII) code value 0. Strings of this type can be with or without a length prefix. A length prefix is one or more bytes storing the length (in number of bytes) of the string. The length-prefix bytes can be sequenced in big-endian or little-endian endianness. Strings of this type can be described as follows: "[length-prefix ||] string", wherein the [] marks an optional element and the || symbol denotes concatenation between two elements.
  • (2) A constant length string field ending with the terminator value - strings of this type have a constant length and end with a terminator character. Strings of this type can be with or without a length prefix. Optionally there are one or more bytes after the terminator and up to the end of the string. These bytes are noise bytes that can be disregarded. Strings of this type can be described as follows: "[length-prefix ||] string || terminator [|| noise]", wherein the [] marks an optional element and the || symbol denotes concatenation between two elements.
  • (3) A constant length string field ending with padding values - strings of this type have a constant length and end with one or more padding bytes. Padding bytes are used to pad the string to its constant length. In some cases, the padding bytes can each be a constant padding character (for example: ASCII code value 0). Strings of this type can be with or without a length prefix. Strings of this type can be described as follows: "[length-prefix ||] string || padding*", wherein the [] marks an optional element, the || symbol denotes concatenation between two elements and the * symbol denotes repetition of one or more padding bytes.
  • (4) A constant length string field with a length prefix ending with noise values - strings of this type have a length-prefix and optionally end with one or more bytes after the bytes of the string's content and up to the length of the string. These bytes are noise bytes that can be disregarded. The noise bytes are used when it is necessary to complete the length of the string. Strings of this type can be described as follows: "length-prefix || string [|| noise]", wherein the [] marks an optional element and the || symbol denotes concatenation between two elements.
  • (5) A variable length string field with the length prefix and not ending with the terminator value - strings of this type do not have a constant length. Strings of this type have a length-prefix and have no terminator character. Strings of this type can be described as follows: "length-prefix || string", wherein the || symbol denotes concatenation between two elements.
  • (6) A variable length string field ending with the terminator value - strings of this type have a variable length and end with a terminator character. Strings of this type can be with or without a length prefix. Strings of this type can be described as follows: "[length-prefix ||] string || terminator", wherein the [] marks an optional element and the || symbol denotes concatenation between two elements.
  • Each representation can be associated with parameters describing the textual string field, as further detailed below.
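The representations above can be illustrated with a minimal Python sketch that reads one string field from a message. The function name `read_string_field` and its parameters are hypothetical and are not taken from the disclosure. The sketch also assumes that, when both a length prefix and a terminator are present, the prefix counts the terminator and any noise bytes; the disclosure treats this as a parameter rather than a fixed rule.

```python
from typing import Optional, Tuple


def read_string_field(
    data: bytes,
    offset: int,
    prefix_size: int = 0,            # 0 means the field has no length prefix
    byteorder: str = "big",          # "big" or "little" endianness of the prefix
    terminator: Optional[int] = None,
) -> Tuple[bytes, int]:
    """Return (string content, offset just past the field)."""
    pos = offset
    if prefix_size:
        # "length-prefix || string [|| terminator] [|| noise]"
        length = int.from_bytes(data[pos:pos + prefix_size], byteorder)
        pos += prefix_size
        raw = data[pos:pos + length]
        pos += length
        if terminator is not None:
            # Assumption: bytes from the first terminator onward are not content.
            cut = raw.find(bytes([terminator]))
            if cut != -1:
                raw = raw[:cut]
        return raw, pos
    if terminator is None:
        raise ValueError("a length prefix or a terminator is needed to delimit the field")
    # "string || terminator": scan forward to the first terminator byte.
    end = data.index(bytes([terminator]), pos)
    return data[pos:end], end + 1
```

For example, a terminated field such as the bytes 68, 69, 70, 0 yields the content "DEF" and an end offset of 4.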
  • the obtained messages can include for example the following two messages: M1 = {65, 66, 67, 68, 69, 0, 100} and M2 = {68, 69, 70, 0, 101}.
  • the messages are represented as a sequence of bytes. In this non-limiting example, the bytes are represented using American Standard Code for Information Interchange (ASCII) 8-bit codes.
  • the messages include a first field which is a variable-length field (with the value of "65, 66, 67, 68, 69, 0" in M1 and the value of "68, 69, 70, 0" in M2) and a second field which is a constant-length field in the length of one byte with the value 100 in M1 and 101 in M2.
  • system 200 can be further configured to determine, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective messages (block 315).
  • an index byte can point at the first byte of M1 ("65") and M2 ("68"). The index byte will be used to iterate over the bytes of each message as detailed below.
  • system 200 can be further configured to determine, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, and (b) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value (block 320).
  • System 200 calculates a message string plausibility score for each given message of the messages obtained in block 310.
  • the message string plausibility score is calculated for a candidate string, which is the part of the given message starting at the index byte.
  • the string plausibility score of a given candidate string is an indication of the plausibility that the given candidate string is actually a string field. It is to be noted that it is possible for system 200 to execute the variable length string analysis process 300 using an alternate flow in which the message string plausibility score is calculated within each message of the obtained messages for all possible index bytes.
  • the method for calculating the message string plausibility score can be, for example, based on calculating for each candidate string a matrix and performing for each candidate string an element-wise product between the calculated matrix and a score matrix and summing the results into a string plausibility score.
  • the candidate string can be the part of the given message starting at the index byte and ending at the first null character (i.e., ASCII 0 value) or at the first out-of-range character (i.e., ASCII values 1 to 31 and 127 to 255) or at the end of the given message.
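The candidate string boundary described above can be sketched as follows. This is a minimal illustration with hypothetical helper names (`is_character`, `candidate_string`); it treats every byte value from 32 to 126 as a character value, which is simply the complement of the null and out-of-range values listed in the text.

```python
def is_character(value: int) -> bool:
    # Not null (0) and not out-of-range (1 to 31, 127 to 255).
    return 32 <= value <= 126


def candidate_string(message: bytes, index: int) -> bytes:
    """Bytes from `index` up to the first null, out-of-range byte, or end of message."""
    end = index
    while end < len(message) and is_character(message[end]):
        end += 1
    return message[index:end]


M1 = bytes([65, 66, 67, 68, 69, 0, 100])
M2 = bytes([68, 69, 70, 0, 101])
print(candidate_string(M1, 0))  # b'ABCDE'
print(candidate_string(M2, 0))  # b'DEF'
```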
  • the calculated matrix can be calculated so that the value of each cell is a count of how many instances of the type of character represented by the column of the matrix is followed in the candidate string by an instance of the type of character represented by the row of the matrix.
  • the types of characters can be: digits (i.e., ASCII values 48 to 57), upper-case characters (i.e., ASCII values 65 to 90), lower-case characters (i.e., ASCII values 97 to 122), separators (i.e., ASCII values 32, 45, 46 and
  • a non-limiting example is the following calculated matrix, calculated for an exemplary string of "ArmyUnit101":
  • the score matrix can be a constant matrix giving different weights to the elements of the calculated matrix in accordance with sequences of bytes that are likely to appear in strings (for example: a digit that follows a digit is a sequence that is more likely to appear in a string) - these will have positive weights, and with sequences that are not likely to appear in strings (for example: a symbol that follows a symbol is a sequence that is less likely to appear in a string) - these will have negative weights.
  • a non-limiting example of a score matrix is the following score matrix:
  • System 200 can further determine for each of the obtained messages a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value, not including null characters (i.e., ASCII 0 value), out-of-range characters (i.e., ASCII values 1 to 31 and 127 to 255) or end of message characters.
  • the string plausibility score which is the element-wise product between the calculated matrix and the score matrix and the summation of the results is further normalized based on the corresponding string candidate length.
  • a non-limiting example of using the corresponding string candidate length to normalize the message string plausibility score is to use as a normalized message string plausibility score the ratio between the message string plausibility score and the corresponding string candidate length minus 1. When the corresponding string candidate length is smaller than two, the message string plausibility score is zero.
  • for the exemplary string "ArmyUnit101", the normalized message string plausibility score is 1.6, wherein the calculation is $(2 \cdot 2 + 2 \cdot 2 + 1 \cdot 1 + 1 \cdot (-1) + 4 \cdot 2) / (11 - 1) = 16 / 10 = 1.6$.
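The scoring steps above (counting character-type transitions, weighting them, and normalizing by the candidate length minus 1) can be sketched in Python. The score matrix is not reproduced in this text, so the `WEIGHTS` table below is hypothetical: its values were chosen only so that the arithmetic matches the worked example for "ArmyUnit101". The separator class uses only the three values that survive in the text (32, 45, 46); the full list is truncated in the source. A dictionary keyed by (preceding type, following type) stands in for the matrix.

```python
from collections import Counter


def char_class(value: int) -> str:
    if 48 <= value <= 57:
        return "digit"
    if 65 <= value <= 90:
        return "upper"
    if 97 <= value <= 122:
        return "lower"
    if value in (32, 45, 46):  # partial list; the source text is truncated here
        return "separator"
    return "symbol"


def transition_counts(candidate: bytes) -> Counter:
    """Count (preceding type, following type) pairs of adjacent bytes."""
    return Counter((char_class(a), char_class(b)) for a, b in zip(candidate, candidate[1:]))


# Hypothetical weights, not the score matrix of the disclosure.
WEIGHTS = {
    ("upper", "lower"): 2,
    ("lower", "lower"): 2,
    ("digit", "digit"): 2,
    ("lower", "upper"): 1,
    ("lower", "digit"): -1,
}


def normalized_score(candidate: bytes, weights=WEIGHTS) -> float:
    if len(candidate) < 2:
        return 0.0  # too short to contain any transition
    counts = transition_counts(candidate)
    total = sum(n * weights.get(pair, 0) for pair, n in counts.items())
    return total / (len(candidate) - 1)


print(normalized_score(b"ArmyUnit101"))  # 1.6 under the hypothetical weights
```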
  • system 200 can be further configured to test if a criterion based on (a) the message string plausibility scores and (b) the string candidate lengths is met (block 325).
  • System 200 holds the information of the determined message string plausibility scores and the corresponding string candidate lengths for the index byte for each of the obtained messages.
  • System 200 can now perform a criterion test based on that information.
  • the criterion test can be based on a normalized average of the message string plausibility scores.
  • the normalized average of the message string plausibility scores can be the average of message string plausibility scores for the obtained messages having corresponding string candidate lengths of value of more than one.
  • a non-limiting example of such a criterion test is that at least one of the following holds: (1) the 2nd percentile of the string candidate lengths is greater than or equal to 1.5, the 80th percentile of the string candidate lengths is greater than or equal to 2, and the normalized average of the message string plausibility scores is greater than or equal to 1.1; (2) the 2nd percentile of the string candidate lengths is greater than or equal to 1.5, the 80th percentile of the string candidate lengths is greater than or equal to 3, and the normalized average of the message string plausibility scores is greater than or equal to 0.97; (3) the 2nd percentile of the string candidate lengths is greater than or equal to 1.5, the 80th percentile of the string candidate lengths is greater than or equal to 4, and the normalized average of the message string plausibility scores is greater than or equal to 0.85; or (4) the 2nd percentile of the string candidate lengths is greater than or equal to 3.
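The example criterion can be expressed as a short Python sketch. The text does not state how percentiles are computed, so the linear-interpolation `percentile` helper below is an assumption, and the function names are hypothetical. The normalized average follows the earlier description: it averages only the scores of messages whose string candidate length is more than one.

```python
def percentile(values, p):
    """Percentile with linear interpolation between closest ranks (an assumption)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values")
    rank = (len(xs) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)


def criterion_met(lengths, scores):
    """lengths[i] and scores[i] belong to the same message at the current index byte."""
    p2 = percentile(lengths, 2)
    p80 = percentile(lengths, 80)
    eligible = [s for s, n in zip(scores, lengths) if n > 1]
    avg = sum(eligible) / len(eligible) if eligible else 0.0
    if p2 >= 3:              # condition (4)
        return True
    if p2 < 1.5:             # conditions (1) to (3) all require this bound
        return False
    return (
        (p80 >= 2 and avg >= 1.1)
        or (p80 >= 3 and avg >= 0.97)
        or (p80 >= 4 and avg >= 0.85)
    )
```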
  • if the criterion is not met, system 200 can determine that the candidate string is not a string field and thus be further configured to set the index byte to be a byte subsequent to the index byte, if any, and return to block 320 (block 330).
  • if the criterion is met, the candidate string may be a string field, and system 200 can be further configured to determine that the index byte is a start of a variable length string (block 335). In some cases, the determination that the index byte is a start of a variable length string occurs only after successful validation performed at block 355 below.
  • System 200 can be further configured to determine for the index byte of each of the messages a full string candidate length indicating a number of bytes of the respective message, starting at the index byte, having a character value or a terminator value (block 340). In some cases, system 200 can determine the full string candidate lengths for the index byte for each of the obtained messages as part of the determination of the string candidate length in block 320.
  • the string candidate length of the string field in message M1 = {65, 66, 67, 68, 69, 0, 100} is 5 as it includes the five bytes with character values.
  • the full string candidate length of the string field in message M1 is 6 as it also includes the one byte with terminator value of ASCII code 0.
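The difference between the two lengths can be sketched as follows. The function names are hypothetical, and the sketch assumes that the full string candidate length counts the character bytes plus any terminator bytes that directly follow them.

```python
def string_candidate_length(message: bytes, index: int) -> int:
    """Number of consecutive character bytes (values 32 to 126) starting at index."""
    n = 0
    while index + n < len(message) and 32 <= message[index + n] <= 126:
        n += 1
    return n


def full_string_candidate_length(message: bytes, index: int, terminator: int = 0) -> int:
    """Character bytes plus directly following terminator bytes (an assumption)."""
    n = string_candidate_length(message, index)
    while index + n < len(message) and message[index + n] == terminator:
        n += 1
    return n


M1 = bytes([65, 66, 67, 68, 69, 0, 100])
print(string_candidate_length(M1, 0), full_string_candidate_length(M1, 0))  # 5 6
```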
  • System 200 can be further configured to determine a type of the variable length string, utilizing at least the full string candidate length (block 345).
  • the type of the variable length string can be one of: constant length string field not ending with the terminator value, constant length string field ending with the terminator value, constant length string field ending with padding values, constant length string field with a length prefix ending with noise values, variable length string field with the length prefix and not ending with the terminator value or variable length string field ending with the terminator value.
  • System 200 can be further configured to determine one or more parameters associated with the variable length string (block 350).
  • the parameters can be one or more of: (1) field length - the length of the variable length string field, (2) string length - the actual length of the variable length string field within each of the obtained messages, (3) the character used as a terminator, (4) the character used as a padding character, or (5) length-prefix parameters - the existence of a length-prefix, the size of the length-prefix, the endianness of the length-prefix and if the length-prefix includes a terminator.
  • cpwl, cpwb and cpb are calculated for each message of the obtained messages to represent the existence of a string length field in little or big-endian endianness representation.
  • System 200 can then be configured to validate, using the parameters, that the variable length string is a valid variable length string (block 355).
  • a non-limiting example of such a validation test can be if (1) the field length parameter is greater than or equal to 5, or (2) the field length parameter is greater than or equal to 3 and the length-prefix parameter is that a length-prefix exists.
  • if the validation fails, the candidate string is determined by system 200 to not be a variable length string field and system 200 can be further configured to set the index byte to be a subsequent byte, subsequent to the index byte, if any, and return to block 320 to keep on searching for variable length string fields (block 360).
  • if the validation succeeds, the candidate string is determined by system 200 to be a string field and system 200 can be further configured to remove the variable length string from each of the messages and keep on analyzing the rest of the message for additional variable length string fields (block 365).
  • the removal of the detected variable length string field from all of the obtained messages can allow system 200 to re-iterate the process for the remaining message-parts of the obtained messages.
  • system 200 can manage a set of pointers - pointing for each of the obtained messages to the location of the byte subsequent to the identified variable length string and then re-iterate the process using the set of pointers to scan the remaining parts of the messages for variable length string fields.
  • system 200 can be further configured to check if end-of-message has been reached (block 370).
  • if end-of-message has not been reached, system 200 can be further configured to return to block 315 (block 375).
  • if end-of-message has been reached, system 200 can be further configured to provide a user of the system or an external system with a list of identified variable length strings, being the variable length strings, the respective index byte of each variable length string, and the respective parameters of each variable length string (block 380). In some cases, system 200 can provide the user of the system or an external system with a specification of the variable length string fields of the obtained messages.
  • the specification can be represented as a list of (skip, params) pairs, wherein the respective index byte number is represented by a skip - the number of bytes that are needed to be skipped from the start of the message or from the end of the previous identified variable length string field to reach the respective variable length string, and params includes the parameters of the respective constant length and variable length string.
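A list of (skip, params) pairs can be applied to a message as in the sketch below. The `params` dictionary with a single `"terminator"` key is a hypothetical simplification that covers only terminated strings; the disclosure lists further parameters such as length-prefix size and endianness.

```python
def extract_strings(message: bytes, spec):
    """Apply a list of (skip, params) pairs and return (start offset, content) tuples."""
    pos = 0
    found = []
    for skip, params in spec:
        pos += skip  # bytes to skip from message start or from the previous string's end
        end = message.index(bytes([params["terminator"]]), pos)
        found.append((pos, message[pos:end]))
        pos = end + 1  # continue right after the terminator
    return found


spec = [(0, {"terminator": 0})]
print(extract_strings(bytes([65, 66, 67, 68, 69, 0, 100]), spec))  # [(0, b'ABCDE')]
print(extract_strings(bytes([68, 69, 70, 0, 101]), spec))          # [(0, b'DEF')]
```

Expressing positions as skips rather than absolute offsets keeps the specification valid for every message of the type, since the absolute offset of a later string depends on the lengths of the strings before it.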
  • some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein (for example, block 350 can be performed before block 345, etc.). It is to be further noted that some of the blocks are optional (for example, block 380). It should be also noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
  • system 200 can be configured to perform a variable number of elements fields analysis process 400, e.g., utilizing the variable number of elements fields analysis module 250.
  • System 200 analyzes obtained messages to determine a specification of the communication protocol and specifically to determine the location and parameters of variable number of elements fields, being fields that have a varying number of elements within the obtained messages.
  • a variable number of elements field can be a field that represents an array. The number of elements in the array can vary from message to message.
  • the communication protocol is a binary communication protocol.
  • the binary communication protocol can be a proprietary protocol.
  • the structure of the messages adhering to the binary communication protocol is generally unknown or is unknown to the user of system 200.
  • system 200 can be configured to obtain a plurality of messages of a given type, each of the messages comprised of a respective sequence of bytes (block 410).
  • Each message is a sequence of one or more fields, wherein each field comprises one or more bytes of the bytes forming the message.
  • the messages structure adheres to the communication protocol, so that the communication protocol determines the location and parameters of each field within the message.
  • Each of the obtained messages can have a specific given type, thereby all having the same structure even if not always having the same number of bytes due to variable-length fields.
  • a message of a given type has a given structure.
  • the type of the message can be encoded in one of the fields of the message. In these cases, system 200 can identify the type of message field and use it to discern between messages of different types.
  • the obtained messages are preprocessed.
  • the messages are received as a stream of bytes that is not divided into individual messages.
  • the preprocessing can include splitting the stream of bytes into messages by using methods and/or techniques known in the art.
  • the preprocessing of the obtained messages includes identifying the types of these messages (noting that in some cases this step is not required as the plurality of messages obtained at block 410 are already of the same message type).
  • the obtained messages can include a plurality of messages of a plurality of message types, and the preprocessing can include removing messages that are of a type other than a given message type.
  • the preprocessing can include in these cases the splitting of the obtained messages into message groups according to type and the execution by system 200 of variable number of elements fields analysis process 400 for each message group.
  • the splitting of the obtained messages into message groups according to message type is achieved using methods known in the art for message type identification, for example: a correlation method correlating between values of candidate message type fields and the message length.
  • the given message type is variable-length message type, wherein the messages of the variable-length message type include one or more variable-length fields.
  • the variable-length fields can be variable number of elements fields containing for example a representation of an array.
  • the variable number of elements field has a prefix (which can be an integer) that represents the number of elements in the field.
  • the prefix can be a one-byte, a two-byte, or a four-byte integer.
  • the size of the prefix pl has one of a predetermined set of values.
  • the predetermined set of values can be: one byte, two bytes or four bytes.
  • When the prefixes have more than one byte, they can be represented either in little-endian representation or big-endian representation or any other fixed length representation. It is assumed that, in some cases, the prefixes of the fields with variable number of elements have the same length.
  • the value of the prefix can be zero when there are no elements in the variable number of elements field, for example, a sensor may send a periodical message that contains information of multiple readings at a given time. The number of readings can be zero or more.
  • Each element of the variable number of elements field has an element-size.
  • the element-size is not part of the message and is not transmitted. It may be known to the message creators and receivers, as it can be part of the protocol specifications.
  • the element-size is a constant positive integer (for example: eight bytes representing a geolocation, where each variable-length field is composed of a 4-byte latitude and a 4-byte longitude). For the given message type, there may be several such fields with the same element-size.
  • the fields forming the message can include fixed-length fields, having a fixed length, variable length strings or variable-length fields, wherein the length of such fields (i.e., the number of bytes comprised within the field) in one message is different than the length of such fields in a second message.
  • the preprocessing can include identifying of the strings within the messages (e.g., by using the variable length string analysis process 300) and removing the strings from the messages.
  • the preprocessing can be performed by system 200 (e.g., before or as part of the execution of variable number of elements fields analysis process 400) or by an external system.
  • the messages can be obtained by system 200 from a trace of timestamped sequence of packets captured from one or more computerized communication networks or communication channels.
  • the trace can be captured using a sniffer, a receiver, a probe, or similar tools.
  • the obtained messages, or some of them, can be part of a recording of historical messages communicated over the one or more communication networks or communication channels. In some cases, the obtained messages, or some of them, are part of real-time communication currently transferred over the one or more communication networks or communication channels.
  • the obtained messages, or some of them can be a combination made of a part of a recording of historical messages communicated over the one or more communication networks or communication channels and a part of real-time communication currently transferred over the one or more communication networks or communication channels.
  • the messages can be obtained from a communication trace communicated using a common communication protocol or using a plurality of communication protocols.
  • the obtained messages can include, for example, the following two variable number of elements messages: M1 = {3, 1, 1, 2, 2, 3, 3} and M2 = {2, 1, 1, 2, 2}.
  • the messages are represented as a sequence of bytes, wherein in this example the first byte of each message is a one-byte prefix representing the number of elements in the field, and wherein each element of the fields has a length of two bytes.
  • the first message M1 is an array with three elements and message M2 has two elements.
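The layout of M1 and M2 can be illustrated with a minimal sketch. The function name `parse_counted_field` and its parameters are hypothetical and are not taken from the source; the sketch assumes the prefix is an unsigned integer that counts elements, not bytes.

```python
from typing import List, Tuple


def parse_counted_field(message: bytes, offset: int, prefix_len: int,
                        element_size: int,
                        byteorder: str = "big") -> Tuple[List[bytes], int]:
    """Read a count prefix at `offset`, then that many fixed-size elements.

    Returns the elements and the offset of the first byte after the field.
    """
    count = int.from_bytes(message[offset:offset + prefix_len], byteorder)
    start = offset + prefix_len
    end = start + count * element_size
    if end > len(message):
        raise ValueError("field overruns the end of the message")
    elements = [message[i:i + element_size]
                for i in range(start, end, element_size)]
    return elements, end


# The two example messages from the text: one-byte prefix, two-byte elements.
M1 = bytes([3, 1, 1, 2, 2, 3, 3])
M2 = bytes([2, 1, 1, 2, 2])
elements_m1, end_m1 = parse_counted_field(M1, 0, prefix_len=1, element_size=2)
elements_m2, end_m2 = parse_counted_field(M2, 0, prefix_len=1, element_size=2)
```

In this example both messages are consumed exactly, which is what a correct (prefix length, element-size) guess is expected to produce.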
  • variable number of elements fields analysis process 400 determines the value of pl and their representation (endianness), the value of f, and for each variable number of elements field: finds the element-size and finds the location of the variable number of elements field within the message. It is to be noted that the location may not be fixed, as it may depend on the number of elements in all preceding fields with variable number of elements.
  • those $f_1$ fields cumulatively have $e_1 \geq 0$ elements
  • those $f_2$ fields cumulatively have $e_2 \geq 0$ elements
  • those $f_y$ fields cumulatively have $e_y \geq 0$ elements.
  • the values of $e_1, e_2, \dots, e_y$ may be different for each message.
  • After removing the variable number of elements fields from the obtained messages, each message has constant-length fields, wherein the combined length of all fixed-length fields of the given message type is marked as $m$, where $m \geq 0$. Therefore, for each message, the equation of the length of the message is: $l = e_1 \cdot k_1 + e_2 \cdot k_2 + \dots + e_y \cdot k_y + f \cdot pl + m$.
  • System 200 identifies the plurality of possible solutions to the equation of the length of the message.
  • additional message length $e_1 \cdot k_1 + e_2 \cdot k_2 + \dots + e_y \cdot k_y + f \cdot pl + m$, so that $e_i \geq 0$.
  • the identification of the plurality of possible solutions is optimized by calculating the Frobenius number for $k_1 \dots k_y$.
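The source does not state how the Frobenius number is calculated or used. As a hedged illustration: for element-sizes whose greatest common divisor is 1, the Frobenius number is the largest integer that cannot be written as a non-negative integer combination of them, so any larger residual length is known to be representable without a search. The sketch below uses a shortest-path computation over residues modulo the smallest size; `frobenius_number` is a hypothetical name.

```python
import heapq
from functools import reduce
from math import gcd, inf
from typing import Iterable, Optional


def frobenius_number(sizes: Iterable[int]) -> Optional[int]:
    """Largest integer not representable as sum(e_i * k_i) with e_i >= 0.

    Returns None when the sizes share a common divisor greater than 1,
    because infinitely many integers are then not representable.
    """
    ks = sorted(set(sizes))
    if not ks or ks[0] < 1 or reduce(gcd, ks) != 1:
        return None
    a = ks[0]
    # best[r] is the smallest representable total congruent to r modulo a.
    best = [inf] * a
    best[0] = 0
    heap = [(0, 0)]
    while heap:
        total, residue = heapq.heappop(heap)
        if total > best[residue]:
            continue  # stale heap entry
        for k in ks[1:]:
            nxt = total + k
            nxt_residue = nxt % a
            if nxt < best[nxt_residue]:
                best[nxt_residue] = nxt
                heapq.heappush(heap, (nxt, nxt_residue))
    return max(best) - a
```

For example, with element-sizes 3 and 5 every residual length of 8 or more can be represented, so only shorter residuals need an explicit check.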
  • a non-limiting example of a calculation to find the possible solutions can be the following algorithm, wherein $k_i$ has a predetermined lower threshold and a predetermined upper threshold and at least one of $f$ and $y$ has a predetermined upper threshold:
  • the algorithm finds the set of all $\{y, f, pl, k_1, k_2, \dots, m\}$ combinations that are valid for $l_1$, where $l_1 = e_1 \cdot k_1 + e_2 \cdot k_2 + \dots + f \cdot pl + m$, subject to $e_1 \geq 0$, $e_2 \geq 0$, ..., $k_1 \geq minElementSize$, $k_2 \geq minElementSize$, ..., $f \geq 1$, $pl \geq 1$, $m \geq 0$, and all values are integers.
  • the algorithm limits the values of: the common pl values (for example to one of {1, 2, 4}), the minElementSize value, which is the minimal element-size (for example to 2), the maxElementSize value, which is the maximal element-size (for example to 500), the maxUniqueElementSizes value, which is the maximal number of unique element-sizes (for example to 3) and the maxVarFields value, which is the maximal number of variable-length fields (for example to 7).
  • the algorithm searches for valid combinations:
  • for each $f \in \{1, \dots, maxVarFields\}$, where maxVarFields is predefined:
  • $l - f \cdot pl - m$ is represented as the sum $e_1 \cdot k_1 + e_2 \cdot k_2 + \dots + e_y \cdot k_y$ if there is at least one solution
  • the non-limiting exemplary algorithm can now make use of $l_1 \dots l_{20}$ to denote the, for example, 20 shortest message lengths of the given message type.
  • the exemplary value 20 is chosen to optimize performance. Other values can be used as well. If there are fewer than 20 message lengths, the algorithm will use all lengths.
  • the algorithm calculates the set of all solutions, hence, all $\{y, f, pl, k_1 \dots k_y, e_1 \dots e_y, m\}$ solutions are saved, where $e_1 \dots e_y$ are the number of elements with element-sizes $k_1 \dots k_y$ respectively for $l_1$.
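The search over the length equation can be sketched as a brute-force enumeration. This is a minimal illustration under stated assumptions, not the patented algorithm: the names `representable` and `valid_combinations` are hypothetical, the default bounds are deliberately much smaller than the exemplary limits in the text so that the example runs quickly, and a combination is kept only if it explains every supplied message length.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def representable(target: int, sizes: Sequence[int]) -> bool:
    """True if target == sum(e_i * k_i) for some integers e_i >= 0."""
    if target < 0:
        return False
    reachable = [True] + [False] * target
    for total in range(1, target + 1):
        reachable[total] = any(total >= k and reachable[total - k]
                               for k in sizes)
    return reachable[target]


def valid_combinations(lengths: Sequence[int],
                       prefix_lengths: Sequence[int] = (1, 2, 4),
                       min_size: int = 2, max_size: int = 8,
                       max_unique_sizes: int = 2,
                       max_var_fields: int = 3
                       ) -> Iterator[Tuple[int, int, int, Tuple[int, ...], int]]:
    """Yield (y, f, pl, (k_1..k_y), m) tuples consistent with all lengths."""
    shortest = min(lengths)
    for y in range(1, max_unique_sizes + 1):
        for ks in combinations(range(min_size, max_size + 1), y):
            # Each element-size needs at least one field, hence f >= y.
            for f in range(y, max_var_fields + 1):
                for pl in prefix_lengths:
                    for m in range(0, shortest - f * pl + 1):
                        if all(representable(l - f * pl - m, ks)
                               for l in lengths):
                            yield (y, f, pl, ks, m)
```

For message lengths 7 and 5 (the lengths of the example messages M1 and M2), the combination with one field, a one-byte prefix, element-size 2 and no fixed bytes is among the results, along with other candidates that later steps would have to rule out.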
  • each sub-solution includes a possible combination of $f_1 \dots f_y$, so that a sum of each combination is $f$, wherein: $f_i$ is a number of fields with element-size $k_i$ and each of $f_1 \dots f_y$ is a positive integer (block 430).
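Enumerating the sub-solutions amounts to listing every ordered way of writing f as a sum of y positive integers. A minimal sketch, with the hypothetical name `field_count_splits` and assuming y is at least 1:

```python
from typing import Iterator, Tuple


def field_count_splits(f: int, y: int) -> Iterator[Tuple[int, ...]]:
    """Yield every (f_1, ..., f_y) of positive integers that sums to f."""
    if y == 1:
        if f >= 1:
            yield (f,)
        return
    # Leave at least one field for each of the remaining y - 1 sizes.
    for first in range(1, f - y + 2):
        for rest in field_count_splits(f - first, y - 1):
            yield (first,) + rest
```

For f = 4 fields and y = 2 element-sizes this yields (1, 3), (2, 2) and (3, 1).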
  • System 200 now identifies one or more hypotheses wherein a content of each of the plurality of messages meets a structure corresponding to a respective hypothesis of the hypotheses, giving rise to the parameters (block 460).
  • a set of prefixes is valid if the end of the message has not been overrun after finding all the variable-length fields.
  • the algorithm can construct a {<skip, es>} list, thereby generating a list of ($skip_j$, $es_j$). It is to be noted that this part of the exemplary algorithm can be executed in parallel for different $\{y, f, pl, k_1 \dots k_y, f_1 \dots f_y, e_1 \dots e_y, m\}$ combinations.
  • system 200 can ignore some of the messages that do not adhere to the hypothesis. This can allow system 200 to disregard erroneous messages within the obtained messages.
  • an output including at least the list of ($skip_j$, $es_j$) (block 470).
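One way to read the (skip, es) list is sketched below. The interpretation is an assumption, since the source does not spell it out: skip is taken as the number of fixed bytes preceding a field's prefix, es as the field's element-size, and each prefix as an unsigned element count of pl bytes. The names `locate_fields` and `adherence` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def locate_fields(message: bytes, spec: Sequence[Tuple[int, int]], pl: int,
                  byteorder: str = "big") -> Optional[List[Tuple[int, int]]]:
    """Return (prefix_offset, element_count) per field, or None on overrun."""
    position = 0
    found = []
    for skip, es in spec:
        prefix_at = position + skip
        if prefix_at + pl > len(message):
            return None  # the prefix itself does not fit
        count = int.from_bytes(message[prefix_at:prefix_at + pl], byteorder)
        position = prefix_at + pl + count * es
        if position > len(message):
            return None  # the elements overrun the end of the message
        found.append((prefix_at, count))
    return found


def adherence(messages: Sequence[bytes], spec: Sequence[Tuple[int, int]],
              pl: int) -> float:
    """Fraction of messages that the hypothesis parses without overrun."""
    ok = sum(1 for m in messages if locate_fields(m, spec, pl) is not None)
    return ok / len(messages)
```

A hypothesis could then be accepted when the adherence fraction is high enough, which tolerates a small number of erroneous messages; the acceptance threshold is left open here.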
  • the output can be used by system 200 to support analysis of the obtained messages, can be provided to another system as input or can be provided to a user of system 200 or to an external system, external to system 200. It is to be noted that in some cases the obtained messages contain a plurality of zero-valued bytes. This can lead the algorithm to identify multiple solutions where all messages have zero elements of some variable-length fields. Therefore, for each variable number of elements field, it is required that at least three messages where the prefix has a non-zero value exist.
  • some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein. It is to be further noted that some of the blocks are optional (for example, block 470). It should be also noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
  • Fig. 4 is a flowchart illustrating one example of a sequence of operations carried out for detecting absences in the traffic trace, in accordance with the presently disclosed subject matter.
  • system 200 can be configured to perform a traffic trace absences detection process 500, e.g., utilizing the traffic trace absences detection module 260.
  • System 200 can determine, for example as part of a reverse engineering analysis of a communication protocol, which elements of a traffic trace of messages of the communication protocol do not contain enough diversity to produce a complete specification.
  • System 200 can then optionally suggest an estimation of an amount of additional traffic trace to be collected in order to enhance the completeness of the specification with additional unobserved message types.
  • system 200 can be configured to obtain a traffic trace of messages of a communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system (block 510).
  • One or more traffic traces of messages can be obtained by system 200 from a sequence of packets captured from one or more computerized communication networks or communication channels.
  • the sequence of packets is timestamped.
  • the trace can be captured using a sniffer, a receiver, a probe, or similar tools.
  • the obtained messages, or some of them, can be part of a recording of historical messages communicated over the one or more communication networks or communication channels.
  • the obtained messages, or some of them, are part of real-time communication currently transferred over the one or more communication networks or communication channels.
  • the obtained messages can be a combination of a part of a recording of historical messages communicated over the one or more communication networks or communication channels and a part of real-time communication currently transferred over the one or more communication networks or communication channels.
  • the messages can be obtained from a communication trace communicated using a common communication protocol or using a plurality of communication protocols.
  • system 200 can pre-process the obtained messages into groups according to communication protocol and analyze each of the groups.
  • system 200 can be further configured to apply one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace (block 520).
  • the unseen species problem estimators are a set of statistical techniques, originally designed for biodiversity measurements, that are used by system 200 as part of a protocol reverse engineering process, to estimate the degree of the sample (e.g., the traffic trace) representativeness of the communication protocol, and to suggest if efforts should be invested to collect more traffic traces in order to produce a more complete and accurate specification.
  • the unseen species problem estimators are one or more of: Chao1 estimators, Chao2 estimators, Abundance Coverage-based Estimators (ACE), Incidence Coverage-based Estimators (ICE), coverage-duplication estimators, penalized nonparametric maximum likelihood estimators or jackknife estimators, any combinations thereof, or any other unseen species problem estimator.
  • System 200 can be then configured to determine, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages (block 530).
  • a non-limiting example can be of a given binary communication protocol that includes n message types, where n is unknown to system 200, and each message type has a different unknown probability function for being included in at least one message in each traffic trace.
  • the system 200 analyzes, for example, a traffic trace that includes several million messages of the communication protocol.
  • After discovering the location and the representation parameters of the message type field, system 200 observes that in the message trace, each message type $c_i$ has a given prevalence of $n_i$. Given the set of all observed message types and their prevalence $\{(c_1, n_1), (c_2, n_2), \dots\}$, system 200 can estimate the number of unobserved message types, or the number of total message types in the communication protocol. Furthermore, system 200 can estimate how many additional message types would likely be observed if an additional communication trace containing additional messages is collected.
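As an illustration of one of the listed estimators, the sketch below computes the bias-corrected form of the Chao1 estimator from per-type prevalence counts. The function name `chao1_bias_corrected` and the input shape (a mapping from message type to its count in the trace) are assumptions made for the example; the source does not specify which form of Chao1 is used.

```python
from collections import Counter
from typing import Mapping


def chao1_bias_corrected(type_counts: Mapping[object, int]) -> float:
    """Estimate the total number of message types, observed plus unobserved.

    Uses S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)), where f1 and f2 are the
    numbers of types observed exactly once and exactly twice.
    """
    counts = [n for n in type_counts.values() if n > 0]
    s_obs = len(counts)
    f1 = sum(1 for n in counts if n == 1)
    f2 = sum(1 for n in counts if n == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical trace: the value of the message type field of each message.
trace_types = ["a"] * 10 + ["b", "c", "d", "d"]
estimate = chao1_bias_corrected(Counter(trace_types))
unobserved = estimate - len(set(trace_types))
```

Many message types seen only once suggest that more types remain unseen; a trace in which every observed type recurs often yields an estimate equal to the observed count.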
  • System 200 can be similarly used to estimate a number of valid message lengths for a given message type, estimating a number of discrete values for an enumerated (categorical) field type, estimating any other protocol parameter with a discrete and limited set of values, etc.
  • system 200 is optionally further configured to generate a relationship model, modeling a relationship between the given number of additional messages and the unobserved message types number to be observed upon obtaining the additional traffic trace (block 540).
  • the relationship model can be, for example, a graph, such as a discovery curve, with or without confidence intervals.
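The observed portion of such a curve can be computed directly from the trace, as in the minimal sketch below. It covers only what has already been seen; extrapolating the curve to additional messages, and any confidence intervals, would depend on the chosen estimator and are not shown. The name `discovery_curve` is hypothetical.

```python
from typing import Hashable, Iterable, List


def discovery_curve(type_ids: Iterable[Hashable]) -> List[int]:
    """Number of distinct message types seen after each message, in order."""
    seen = set()
    curve = []
    for type_id in type_ids:
        seen.add(type_id)
        curve.append(len(seen))
    return curve
```

A curve that is still rising at the end of the trace hints that further collection is likely to reveal new message types, whereas a long flat tail hints at saturation.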
  • system 200 is optionally further configured to provide the at least one of the first estimation or the second estimation to a user of the system (block 550) or to an external system, external to system 200.
  • the provided information can be based on one or more of the unseen species problem estimators.
  • the provided information can be displayed to a user of system 200.
  • the provided estimators may include meta-estimators, integrating or combining several unseen species problem estimators.
  • estimators and meta-estimators can be evaluated against the available communication traces or against simulated traces, using Monte Carlo or other simulation methods, to predict which estimates are most accurate for the given scenario.
  • System 200 can be further configured to receive a desired unobserved message types number (block 560).
  • the received desired unobserved message types number can be provided by a user of system 200, for example: an analyst reverse engineering the communication protocol.
  • system 200 can be configured to recommend the given number of additional messages to be obtained in the additional traffic trace based on the second estimation (block 570).
  • system 200 can be optionally configured to provide the relationship model to a user of the system 200 (block 580) or to an external system, external to system 200.
  • some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein (for example, blocks 560-570 can be performed before block 550, etc.). It is to be further noted that some of the blocks are optional (for example, blocks 550-580). It should be also noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
  • Fig. 5 is a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of constant length fields within a plurality of messages of a given type, in accordance with the presently disclosed subject matter.
  • system 200 can be configured to perform a constant length fields analysis process 600, e.g., utilizing the constant length fields analysis module 270.
  • System 200 analyzes obtained messages to determine a specification of the communication protocol and specifically to determine the location and parameters of constant length fields, being fields that have a constant length within the obtained messages which are all of the same constant-length message-type or have been preprocessed to detect and remove variable length fields, fields with constant values, enumerated fields (e.g., fields with a limited set of values) and monotonic fields (e.g., message sequence number fields, time tag fields, etc.) thereby leaving only messages with constant length fields representing an integer value (e.g., 16, 32 or 64 bits, signed or unsigned, or any other integer value representation) or a floating-point value (e.g., 32 or 64 bits, or any other floating-point value representation).
  • Integer fields are usually composed of one, two, four or eight bytes.
  • the integer fields values can be either signed or unsigned.
  • Floating-point fields are usually represented according to Institute of Electrical and Electronics Engineers (IEEE) 754, and are composed of a sign bit, signed exponent bits, and unsigned fraction bits.
  • the size (in bits) of the exponent and fraction varies according to the size of the floating-point field.
  • the representations for floating-point fields can be 4 bytes (1 sign bit, 8 signed exponent bits, and 23 fraction bits) and 8 bytes (1 sign bit, 11 signed exponent bits, and 52 fraction bits).
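The two layouts can be checked with a short sketch that splits raw bytes into the three bit groups. The name `split_ieee754` is hypothetical. Note that IEEE 754 stores the exponent as a biased unsigned value (bias 127 for 32 bits, 1023 for 64 bits), which is what the function returns.

```python
import struct
from typing import Tuple

# Field widths in bits, keyed by total size in bytes: (exponent, fraction).
LAYOUTS = {4: (8, 23), 8: (11, 52)}


def split_ieee754(raw: bytes, byteorder: str = "big") -> Tuple[int, int, int]:
    """Return (sign, biased_exponent, fraction) of a 4- or 8-byte float."""
    exp_bits, frac_bits = LAYOUTS[len(raw)]
    bits = int.from_bytes(raw, byteorder)
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction


one_be32 = split_ieee754(struct.pack(">f", 1.0))
minus_two_le64 = split_ieee754(struct.pack("<d", -2.0), byteorder="little")
```

Here `one_be32` is (0, 127, 0) and `minus_two_le64` is (1, 1024, 0), matching the 1 + 8 + 23 and 1 + 11 + 52 bit layouts.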
  • the communication protocol is a binary communication protocol.
  • the binary communication protocol can be a proprietary protocol.
  • the structure of the messages adhering to the binary communication protocol is generally unknown or is unknown to the user of system 200. It is to be noted that in some cases the binary protocols can include other field types (e.g., bit-fields, etc.).
  • system 200 can be configured to obtain the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes (block 610).
  • Each message is a sequence of one or more fields, wherein each field comprises one or more bytes of the bytes forming the message.
  • the messages structure adheres to the communication protocol, so that the communication protocol determines the location and parameters of each field within the message.
  • Each of the obtained messages can have a specific given type, thereby all having the same structure.
  • a message of a given type has a given structure.
  • the type of the message can be encoded in one of the fields of the message. In these cases, system 200 can identify the type of message field and use it to discern between messages of different types.
  • the obtained messages are preprocessed.
  • the messages are received as a stream of bytes that is not divided into individual messages.
  • the preprocessing can include splitting the stream of bytes into messages by using methods and/or techniques known in the art.
  • the preprocessing of the obtained messages includes identifying the types of these messages (noting that in some cases this step is not required as the plurality of messages obtained at block 610 are already of the same message type).
  • the obtained messages can include a plurality of messages of a plurality of message types, and the preprocessing can include removing messages that are of a type other than a given message type.
  • the preprocessing can include in these cases the splitting of the obtained messages into message groups according to type and the execution by system 200 of variable length string analysis process 300 and/or variable number of elements fields analysis process 400 for each message group containing variable number of elements fields, removing these fields thereby remaining with messages with only constant length fields. It is to be noted that constant-length textual string fields would be identified by variable length string analysis process 300 so that constant length fields analysis process 600 is left with identifying integer and/or float fields only.
  • system 200 can be further configured to provide a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores of the respective field-types of constant length field starting at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers (block 620).
  • the classifiers used for scoring can produce scores, each represents the plausibility that a given field-type of a constant length starts at a given offset within the plurality of messages.
  • the score represents that probability.
  • System 200 does not require the scoring classifiers to be calibrated and the score can represent the plausibility that a given field-type of a constant length starts at a given offset within the plurality of messages. It is to be noted that there are a number of ways to calculate the classifiers’ performance score: precision, recall, F1 score, Youden’s J statistic, etc.
  • the classifiers are machine learning classifiers, each machine learning classifier is trained, utilizing training data from a plurality of data sources with known fields locations and fields types, to score the plausibility of a given field-type starting at the given offset.
  • the field-type can be one of: big-endian 16 bits integer, big-endian 32 bits integer, big-endian 64 bits integer, big-endian 32 bits floating point, big-endian 64 bits floating point, little-endian 16 bits integer, little-endian 32 bits integer, little-endian 64 bits integer, little-endian 32 bits floating point, little-endian 64 bits floating point or any other constant field.
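Before any classifier can score an offset, the bytes at that offset have to be interpreted under each candidate field-type. A minimal sketch using Python's `struct` module follows; the dictionary keys and the function name `extract_column` are hypothetical labels, and only signed integer interpretations are listed (the uppercase codes `H`, `I` and `Q` give the unsigned ones).

```python
import struct
from typing import List, Sequence

# struct format per candidate field-type: ">" big-endian, "<" little-endian.
FIELD_TYPES = {
    "be_int16": ">h", "be_int32": ">i", "be_int64": ">q",
    "be_float32": ">f", "be_float64": ">d",
    "le_int16": "<h", "le_int32": "<i", "le_int64": "<q",
    "le_float32": "<f", "le_float64": "<d",
}


def extract_column(messages: Sequence[bytes], offset: int,
                   field_type: str) -> List[float]:
    """Interpret the bytes at `offset` of every message as `field_type`.

    Messages too short to hold the field at that offset are skipped.
    """
    fmt = FIELD_TYPES[field_type]
    size = struct.calcsize(fmt)
    return [struct.unpack_from(fmt, m, offset)[0]
            for m in messages if offset + size <= len(m)]
```

The resulting list of values per (offset, field-type) pair is the kind of input a plausibility classifier would receive.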
  • the constant length fields can have different endianness, and correspondingly different classifiers are used for each endianness.
  • a non-limiting example of a 64-bit integer classifier can be described as follows:
  • R ← set {m most frequent values in S_64}. Measure cardinality in S_64 of each of the values in R.
  • F_n ← list of n-bit floating-point values extracted from the given offset of all input members
  • range90(S_n) ← p95(S_n) - p5(S_n) + 1 // avoiding outliers
  • range90(U_n) ← p95(U_n) - p5(U_n) + 1 // avoiding outliers
  • avg90(S_n) ← avg({v ∈ S_n, p5(S_n) ≤ v ≤ p95(S_n)}) // avoiding outliers
  • avg90(U_n) ← avg({v ∈ U_n, p5(S_n) ≤ v ≤ p95(U_n)}) // avoiding outliers
  • I_y and I_x are consecutive (same message source, short time difference) U_n members
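The outlier-resistant helpers can be sketched as follows. The source does not define how p5 and p95 are computed, so the nearest-rank percentile used here is an assumption, and `avg90` bounds the values by percentiles of the same list it averages.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank p-th percentile (assumed definition), for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceil(p * n / 100)
    return ordered[rank - 1]


def range90(values: Sequence[float]) -> float:
    """Spread between the 5th and 95th percentiles, avoiding outliers."""
    return percentile(values, 95) - percentile(values, 5) + 1


def avg90(values: Sequence[float]) -> float:
    """Mean of the values lying between the 5th and 95th percentiles."""
    low, high = percentile(values, 5), percentile(values, 95)
    kept: List[float] = [v for v in values if low <= v <= high]
    return sum(kept) / len(kept)
```

With 99 small values and a single huge one, `range90` and `avg90` stay close to the bulk of the data, which is the purpose of the "avoiding outliers" remarks.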
  • Classifiers can be any mathematical models that can produce the plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages, for example: a model based on machine-learning, a model based on neural-networks, a model based on decision-tree algorithms, etc.
  • the classifier performance score can be based on heuristically testing against a large set of simulated binary protocols, encompassing tens of thousands of fields, and real data, collected from many sources. These tests show that some field types can be detected with higher precision and recall. The classifiers for these field types will receive a higher performance score than other classifiers, thereby creating a cascading classifier architecture.
  • a 64-bit floating-point classifier has a higher performance score than a 64-bit integer classifier
  • system 200 can be further configured to utilize the classifiers to produce plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages (block 630).
  • a non-limiting example can be utilizing the 1164 classifier described above to score the plausibility of each offset being the start of a 64-bit integer field and the F164 classifier described above to score the plausibility of each offset being the start of a 64-bit floating-point field.
  • system 200 can be further configured to identify the highest score based on (i) each classifier’s performance score of available classifiers, being the classifiers, and (ii) the highest plausibility score produced by each classifier (block 640). Continuing our above non-limiting example, it can be assumed that the highest score for a given offset was produced by the 1164 classifier.
  • system 200 can be further configured to mark a number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score (block 650). In our non-limiting example, system 200 will mark eight bytes within the message, beginning from the given offset, as a constant length field of type 64-bit integer in accordance with the 1164 classifier.
  • System 200 then repeats (d) and (e) while (i) disregarding the scores associated with the marked bytes and (ii) removing from the available classifiers the classifier having the highest performance score of the available classifiers (block 660). It is to be noted that upon one or more of the bytes being marked more than once, each in association with a marking from a respective offset, the repeat is performed for each marking, each time while disregarding the scores associated with the respective subset of the marked bytes starting at the respective offset. In these cases, system 200 can produce more than one possible specification for the constant length fields within the messages. For example: if the given offset got the same highest score from 1164 and from F164.
  • marking is performed only if the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score does not exceed an end of the messages of the given type. For example: if the given offset is only seven bytes from the end of the message, it cannot be marked in accordance with the score produced by 1164 as there are not enough bytes for a 64-bit integer. Marking is performed only if bytes within the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score are not already marked. Continuing the above example: if six bytes from the given offset there is a marked byte, it cannot be marked in accordance with the score produced by 1164 as there are not enough bytes for a 64-bit integer until the marked byte.
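The cascade and the two marking conditions can be sketched as a greedy loop. This is a simplified reading, not the claimed procedure: classifiers are visited in descending performance score, each one marks its offsets in descending plausibility, and the acceptance `threshold` is an added assumption. Ties that would lead to several alternative specifications are not handled. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def mark_fields(scores: Dict[str, Dict[int, float]],
                performance: Dict[str, float],
                sizes: Dict[str, int],
                message_len: int,
                threshold: float = 0.5) -> List[Tuple[int, str]]:
    """Greedily mark non-overlapping fields; returns sorted (offset, name)."""
    marked: List[Optional[str]] = [None] * message_len
    fields = []
    # Higher-performing classifiers get the first pick of the bytes.
    for name in sorted(performance, key=performance.get, reverse=True):
        size = sizes[name]
        ranked = sorted(scores[name].items(), key=lambda kv: kv[1],
                        reverse=True)
        for offset, score in ranked:
            if score < threshold:
                break
            end = offset + size
            if end > message_len:
                continue  # the field would exceed the end of the message
            if any(marked[i] is not None for i in range(offset, end)):
                continue  # some of the bytes are already marked
            for i in range(offset, end):
                marked[i] = name
            fields.append((offset, name))
    return sorted(fields)
```

In a 20-byte message, a 64-bit candidate at offset 13 is rejected because it would overrun the end, and a candidate at offset 4 is rejected once bytes 0 to 7 have been marked by a higher-ranked classifier.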
  • system 200 can be further configured to provide a user of the system or an external system, external to system 200, with an indication of a start position of the constant length fields and the field-type of the constant length fields, based on the markings (block 670). It is to be noted that in some cases the indication of a start position of the constant length fields and the field-type of the constant length fields is provided or communicated to one or more external systems, external to system 200. In some cases, system 200 can provide and/or communicate a specification of the constant length fields of the obtained messages.
  • the specification can be represented as one or more lists of (skip, params) pairs, wherein the respective index byte number is represented by a skip - the number of bytes that are needed to be skipped from the start of the message or from the end of the previous identified constant length field to reach the respective constant length field, and params includes the parameters of the respective constant length field - type of the field, endianness of the field, sign of the field (where applicable), etc. It is to be noted that in some cases, the specification will include a plurality of (skip, params) pairs as two or more markings have the same or nearly the same score. In these cases, the user or an external system is provided with a number of ways to interpret the constant length fields within the messages.
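A (skip, params) list can be turned into absolute byte offsets by walking it from the start of the message. The sketch assumes that params carries the field length under a `size` key, which is a hypothetical layout chosen for the example; the source only lists type, endianness and sign as example parameters.

```python
from typing import Dict, List, Sequence, Tuple


def absolute_offsets(spec: Sequence[Tuple[int, Dict[str, object]]]
                     ) -> List[Tuple[int, Dict[str, object]]]:
    """Convert (skip, params) pairs into (start_offset, params) pairs."""
    position = 0
    resolved = []
    for skip, params in spec:
        start = position + skip
        resolved.append((start, params))
        # The next skip is counted from the end of this field.
        position = start + int(params["size"])
    return resolved


spec = [
    (2, {"type": "int", "size": 4, "endianness": "big", "signed": True}),
    (0, {"type": "float", "size": 8, "endianness": "big"}),
    (3, {"type": "int", "size": 2, "endianness": "little", "signed": False}),
]
starts = [start for start, _ in absolute_offsets(spec)]
```

Here the three fields start at bytes 2, 6 and 17 of the message.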
  • some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein. It is to be further noted that some of the blocks are optional (for example, block 670). It should be also noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
  • the system can be implemented, at least partly, as a suitably programmed computer.
  • the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method.
  • the presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Abstract

A system for determining location and parameters of constant length and variable length string fields within a plurality of messages of a given type, the system comprising a processing circuitry configured to: (a) obtain the plurality of messages of the given type, each of the messages comprised of a sequence of bytes; (b) determine, for each of the messages of the given type, an index byte, being a first byte of a sequence of bytes of each of the respective messages; (c) determine, for the index byte of each of the messages: (A) a message string plausibility score, indicating a plausibility that a part of the respective messages starting at the index byte is a string field, based on analysis of a content of the part of the respective messages, and (B) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value; (d) upon a criterion based on (A) the message string plausibility scores and (B) the string candidate lengths, being met, determine that the index byte is a start of a variable length string, and upon the criterion not being met, repeat (c)-(d) with the index byte being a byte, subsequent to the index byte, if any.

Description

A SYSTEM AND METHOD FOR PRODUCING SPECIFICATIONS FOR FIELDS WITH VARIABLE NUMBER OF ELEMENTS
TECHNICAL FIELD
The invention relates to a system and method for producing specifications for fields with variable number of elements.
BACKGROUND
Current solutions for reverse engineering of proprietary binary communication protocols utilize various mathematical models to analyze a trace of messages of an unknown binary protocol and to deduce a specification for that binary protocol. Part of the unknown binary protocol are variable-length fields within the messages. Current reverse engineering solutions do not solve the problem of deducing a specification for variable number of elements fields, constant length fields, variable length fields and specifically for constant and variable-length string fields.
In current solutions, an analyst cannot be sure that the collected trace is representative of the binary communication protocols, hence, contains the information needed to produce a complete specification thereof.
These current solutions are not capable of detecting absences in the trace, and so cannot support a user considering whether additional messages are required to complete the analysis of the binary protocol and produce a complete and accurate specification. There is thus a need to employ statistical tests, models, and indicators in the reverse engineering process in order to estimate the degree of the sample representativeness.
These current solutions do not exploit the internal logic of the binary protocol when reverse engineering the trace of messages.
There is thus a need in the art for a new method and system for producing specifications for binary protocols based on the internal logic of the binary protocol and specifically for detecting and producing specifications for variable-length string fields, for constant length fields and for variable number of elements fields and for detecting absences in the trace.
References considered to be relevant as background to the presently disclosed subject matter are listed below. Acknowledgement of the references herein is not to be inferred as meaning that these are in any way relevant to the patentability of the presently disclosed subject matter.
US Patent application No. 2009/0006645 (Cui et al.) published on January 1, 2009, discloses a system for automatic inference of message formats from network packets. Each network message from a set of network messages is split into one or more tokens based on the types of bytes in the network messages. The set of network messages can then be classified into clusters based on token patterns. The network messages in each cluster can then be further sub-clustered recursively based on the message formats. Further, the messages with a similar message format across the sub-clusters can be merged into a cluster. The set of formatted clusters thus obtained correspond to a set of message formats that can be used further for protocol reverse engineering.
US Patent application No. 2019/0296935 (HONG et al.) published on September 26, 2019, discloses a device and method for dividing a field boundary of a CAN trace. The method for dividing a field boundary of a CAN trace according to an embodiment of the present disclosure includes: collecting a CAN trace of a CAN bus; dividing the CAN trace into multiple blocks including multiple frames of the CAN trace; performing first static field division to each of the multiple blocks; and performing second static field division based on the result of the first static field division to divide a final field boundary of the CAN trace.
US Patent No. 9,100,326 (Iliofotou et al.) published on August 4, 2015, discloses a method for analyzing an application protocol of a network. The method includes extracting non-alphanumeric tokens from conversations of the network, selecting a frequently occurring non-alphanumeric token as a field delimiter candidate for dividing each conversation into a slice-set, analyzing slice-sets of the conversations to determine a statistical measure of matched slices for each conversation and to determine a field delimiter candidate score by aggregating the statistical measure of matched slices for all conversations, and selecting the non-alphanumeric token as the field delimiter of the protocol based on the field delimiter candidate score associated with the non-alphanumeric token.
US Patent No. 6,931,574 (Coupal et al.) published on August 4, 2015, discloses preferred embodiments of the current invention are directed to a protocol analyzer for interpreting data frames captured on a communications network. The protocol analyzer includes a network interface connection for providing the electrical and physical connection to the communications network and for receiving data frames from the network in a particular physical layer protocol format. The protocol analyzer further includes analysis software for providing an interpretation of received data frames. The interpretation of a frame is based upon a series of definition constructs that are stored in a protocol definition file and a protocol database of the protocol analyzer. The definition constructs collectively define the characteristics of a data frame for a given physical layer protocol. Also, the constructs provide a means for identifying any one of a number of higher-level protocols that may be embedded within the data frame. Also disclosed is a graphical user interface for use as a protocol editor for assembling the necessary definition constructs for inclusion in a protocol definition file. Further, embodiments of a graphical interface for displaying the results of interpreted frames is also disclosed.
US Patent application No. 2015/0363215 (Versteeg et al.) published on December 17, 2015, discloses a method of service emulation in which a plurality of messages communicated between a system under test and a target system for emulation are recorded in a computer-readable memory. Ones of the messages are clustered to define a plurality of message clusters, and respective cluster prototypes are generated for the message clusters. The respective cluster prototypes include a commonality among the ones of the messages of the corresponding message clusters. One of the message clusters is identified as corresponding to a request from the system under test based on a comparison of the request with the respective cluster prototypes, and a response to the request for transmission to the system under test is generated based on the one of the message clusters that was identified. Related computer systems and computer program products are also discussed.
GENERAL DESCRIPTION
In accordance with a first aspect of the presently disclosed subject matter, there is provided a system for determining location and parameters of constant length and variable length string fields within a plurality of messages of a given type, the system comprising a processing circuitry configured to: (a) obtain the plurality of messages of the given type, each of the messages comprised of a sequence of bytes; (b) determine, for each of the messages of the given type, an index byte, being a first byte of a sequence of bytes of each of the respective messages; (c) determine, for the index byte of each of the messages: (A) a message string plausibility score, indicating a plausibility that a part of the respective messages starting at the index byte is a string field, based on analysis of a content of the part of the respective messages, and (B) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value; (d) upon a criterion based on (A) the message string plausibility scores and (B) the string candidate lengths, being met, determine that the index byte is a start of a variable length string, and upon the criterion not being met, repeat (c)-(d) with the index byte being a byte, subsequent to the index byte, if any.
In some cases, the processing circuitry is further configured to: determine for the index byte of each of the messages a full string candidate length indicating a number of bytes of the respective message, starting at the index byte, having a character value or a terminator value; and upon the criterion being met, determine a type of the variable length string, utilizing at least the full string candidate length.
In some cases, the type is one of: (a) a constant length string field not ending with the terminator value; (b) a constant length string field ending with the terminator value; (c) a constant length string field ending with padding values; (d) a constant length string field with a length prefix ending with noise values; (e) a variable length string field with the length prefix and not ending with the terminator value; or (f) a variable length string field ending with the terminator value.
In some cases, upon the criterion being met, the processing circuitry is further configured to: remove the variable length string from each of the messages; and repeat (b)-(d).
In some cases, upon the criterion being met, the processing circuitry is further configured to determine one or more parameters associated with the variable length string.
In some cases, upon the criterion being met, the processing circuitry is further configured to validate, using the parameters, that the variable length string is a valid variable length string, and, upon the validation being unsuccessful, to repeat (c)-(d) with the index byte being a byte, subsequent to the index byte, if any.
In some cases, the messages are obtained from a trace obtained from one or more computerized networks.
In some cases, the trace includes a plurality of additional messages of one or more other types other than the given type.
In some cases, at least two of the messages have different message length.
In some cases, the variable length string is an alphanumeric string.
In some cases, the processing circuitry is further configured to provide a user of the system or an external system with a list of identified variable length strings, being the variable length strings, the respective index byte of each variable length string, and the respective parameters of each variable length string.
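The index-byte scan of the first aspect can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a character value is a printable ASCII byte, that the plausibility score is the share of messages with at least `min_len` character bytes at the index byte, and that the criterion is a threshold on that share. None of these choices is specified in the text above, and the function names are hypothetical.

```python
def string_candidate_length(msg: bytes, index: int) -> int:
    """Count consecutive character bytes from index (assumed: printable ASCII)."""
    n = 0
    while index + n < len(msg) and 0x20 <= msg[index + n] <= 0x7E:
        n += 1
    return n


def find_string_start(messages, min_len=3, threshold=0.9):
    """Return (index byte, candidate lengths) where the assumed criterion is met."""
    shortest = min(len(m) for m in messages)
    for index in range(shortest):
        lengths = [string_candidate_length(m, index) for m in messages]
        # Assumed plausibility score: share of messages with a long enough run.
        score = sum(1 for n in lengths if n >= min_len) / len(messages)
        if score >= threshold:
            return index, lengths
    return None  # criterion never met: no string start found


# Three made-up messages of one type: 2-byte header, length byte, text, terminator.
messages = [
    b"\x01\x02\x05alice\x00\x10",
    b"\x01\x02\x03bob\x00\x11",
    b"\x01\x02\x07charlie\x00\x12",
]
result = find_string_start(messages)
```

Here the scan rejects index bytes 0 to 2 and stops at index byte 3, where the candidate lengths differ between messages, which is what one would expect of a variable length string.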
In accordance with a second aspect of the presently disclosed subject matter, there is provided a system for determining parameters of variable length fields, being fields with a variable number of elements, within a plurality of messages of a given type, the system comprising a processing circuitry configured to: (a) obtain the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) for a given message length, being a length of one or more of the messages, identify a plurality of possible solutions which meet a given message length equation, wherein the given message length equation is: given message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m, wherein: m is a length of non-variable fields within the messages; f is a number of variable length fields within the messages; pl is a length of a prefix of each of the variable length fields; y is a number of distinct element sizes of the elements of the variable length fields; ki is an ith distinct element size, wherein i = 1..y and ki ≥ 1; and ei is a number of distinct elements of ki distinct element size, wherein i = 1..y, each solution includes a possible combination of y ≥ 1, e1..ey, k1..ky, f ≥ 1, pl ≥ 1, and m ≥ 0, all are integers, each of e1..ey is positive and each of k1..ky is positive; (c) for each possible solution of the given message length equation, identify one or more sub-solutions, each sub-solution includes a possible combination of f1..fy, so that a sum of each combination is f, wherein: fi is a number of fields with element size ki and each of f1..fy is a positive integer; (d) select a given message of the messages, having the given message length; (e) for each of the sub-solutions, identify one or more hypotheses defining a list of (skipj, esj) pairs that meet the sub-solution and a message structure of the given message, wherein: j = 1..f; skipj is a number of bytes to skip from a message-start or from an end of a preceding variable-length field, to a jth variable length field; esj is an element size of the jth variable-length field; and given message length = p1.es1 + p2.es2 + ... + pf.esf + f.pl + m, wherein pj is a value of the prefix of the jth variable length field, representing the number of elements in the jth variable length field; (f) identify one or more hypotheses wherein a content of each of the plurality of messages meets a structure corresponding to a respective hypothesis of the hypotheses, giving rise to the parameters.
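Step (b) can be illustrated by a brute-force enumeration. The sketch below is restricted to the simplest case of a single distinct element size (y = 1), so the equation reduces to length = e.k + f.pl + m. The bounds `k_max` and `f_max`, the default prefix lengths and the function name are assumptions made for the example only.

```python
from itertools import product


def solutions_single_size(length, k_max=8, f_max=2, prefix_lengths=(1, 2, 4)):
    """Enumerate (e, k, f, pl, m) with length == e*k + f*pl + m, for y = 1.

    e, k, f and pl are positive integers and m is a non-negative integer.
    """
    found = []
    for k, f, pl in product(range(1, k_max + 1), range(1, f_max + 1), prefix_lengths):
        remaining = length - f * pl      # bytes left after the f prefixes
        e = 1
        while e * k <= remaining:        # keeps m = remaining - e*k >= 0
            found.append((e, k, f, pl, remaining - e * k))
            e += 1
    return found


# A 12-byte message, one variable length field, 2-byte prefix, element size <= 4.
sols = solutions_single_size(12, k_max=4, f_max=1, prefix_lengths=(2,))
```

Even this small case yields many candidate solutions, which is why the later steps narrow them down against the message content.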
In some cases, the given message length is a shortest message length of the messages.
In some cases, identifying the plurality of possible solutions also includes testing the possible solutions against a given number of additional message lengths, being lengths of one or more other messages of the messages other than the given message, and wherein for each of the additional message lengths at least one solution exists to a given additional message length equation, wherein the given additional message length equation is: additional message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m, so that ei ≥ 0.
In some cases, identification of the plurality of possible solutions is optimized by calculating the Frobenius number for k1..ky.
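For reference, the Frobenius number of a set of positive integers with greatest common divisor 1 is the largest total that cannot be written as a non-negative integer combination of them. The following is a minimal sketch of one straightforward way to compute it. The function name is hypothetical, and how the number is used to prune candidate solutions is not detailed in this section.

```python
from functools import reduce
from math import gcd


def frobenius_number(sizes):
    """Largest total not expressible as a non-negative combination of sizes.

    Returns None when gcd(sizes) != 1 (infinitely many unreachable totals)
    and -1 when every non-negative total is reachable (1 is among the sizes).
    """
    if reduce(gcd, sizes) != 1:
        return None
    smallest = min(sizes)
    reachable = [True]            # total 0 is always reachable
    run, last_gap, t = 1, -1, 0   # run = current streak of reachable totals
    # Once `smallest` consecutive totals are reachable, all larger ones are too.
    while run < smallest:
        t += 1
        ok = any(t >= k and reachable[t - k] for k in sizes)
        reachable.append(ok)
        if ok:
            run += 1
        else:
            run, last_gap = 0, t
    return last_gap


limit = frobenius_number([3, 5])
```

With element sizes 3 and 5, any variable part longer than 7 bytes is guaranteed to have a solution with ei ≥ 0, so only shorter variable parts would need an explicit check.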
In some cases, the messages are obtained from a trace obtained from one or more computerized networks.
In some cases, the trace includes a plurality of additional messages of one or more other types other than the given type.
In some cases, at least two of the messages have different message length.
In some cases, the processing circuitry is further configured to provide an output including at least the list of (skipj, esj).
In some cases, ki has a predetermined lower threshold and a predetermined upper threshold.
In some cases, at least one of f and y has a predetermined upper threshold.
In some cases, pl is one of a predetermined set of values: one byte, two bytes or four bytes.
In some cases, pl has one of: a big-endian representation or a little-endian representation.
In some cases, some of the messages are ignored by the system.
In accordance with a third aspect of the presently disclosed subject matter, there is provided a system comprising a processing circuitry configured to: obtain a traffic trace of messages of a communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; apply one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determine, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
In some cases, the processing circuitry is further configured to generate a relationship model, modeling a relationship between the given number of additional messages and the unobserved message types number to be observed upon obtaining the additional traffic trace.
In some cases, the unseen species problem estimators are one or more of: Chao1 estimators, Chao2 estimators, Abundance Coverage-based Estimators (ACE), Incidence Coverage-based Estimators (ICE), coverage-duplication estimators, penalized nonparametric maximum likelihood estimators or jackknife estimators, or any combinations thereof.
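As an illustration of one of the listed estimators, the sketch below computes the commonly used bias-corrected form of the Chao1 estimate, S_obs + f1(f1 - 1) / (2(f2 + 1)), where f1 and f2 are the numbers of message types seen exactly once and exactly twice. It assumes each message in the trace has already been labelled with its message type. The labels and the function name are made up for the example.

```python
from collections import Counter


def chao1(type_labels):
    """Bias-corrected Chao1 estimate of the total number of message types."""
    abundance = Counter(type_labels)              # messages seen per type
    freq_of_freq = Counter(abundance.values())    # types seen exactly i times
    s_obs = len(abundance)                        # observed message types
    f1, f2 = freq_of_freq[1], freq_of_freq[2]     # singletons and doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical trace: one type label per captured message.
trace_types = ["A", "A", "A", "B", "B", "C", "D", "E", "E"]
estimate = chao1(trace_types)
```

Subtracting the number of observed types from the estimate gives an approximation of the number of message types still unobserved in the trace.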
In some cases, the traffic trace is obtained from one or more computerized networks.
In some cases, the processing circuitry is further configured to provide the at least one of the first estimation or the second estimation to a user of the system or an external system.
In some cases, the processing circuitry is further configured to: receive a desired unobserved message types number; and recommend the given number of additional messages to be obtained in the additional traffic trace based on the second estimation.
In some cases, the processing circuitry is further configured to provide the relationship model to a user of the system or to an external system.
In accordance with a fourth aspect of the presently disclosed subject matter, there is provided a system for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length, the system comprising a processing circuitry configured to: (a) obtain the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) provide a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores, each representing a plausibility that a respective field-type of constant length field starts at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers; (c) utilize the classifiers to produce plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages; (d) identify the highest score based on (i) each classifier’s performance score of available classifiers, being the classifiers, and (ii) the highest plausibility score produced by each classifier; (e) mark a number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score; (f) repeat (d) and (e) while (i) disregarding the scores associated with the marked bytes and (ii) removing from the available classifiers the classifier having the highest performance score of the available classifiers; and (g) upon the available classifiers not including any of the classifiers (e.g. 
all available classifiers have been exhausted), provide a user of the system or an external system with an indication of a start position of each of the constant length fields and the field-type of each of the constant length fields, based on the markings.
In some cases, more than one highest-score field type is identified and, upon one or more of the bytes being marked more than once, each in association with a marking from a respective offset, the repeat is performed for each marking, each time while disregarding the scores associated with the respective subset of the marked bytes starting at the respective offset.
In some cases, marking is performed only if the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score does not exceed an end of the messages of the given type.
In some cases, marking is performed only if bytes within the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score are not already marked.
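The marking loop of steps (d) to (g), together with the two marking conditions above, can be sketched as follows. This is a simplified reading, not the claimed procedure: each classifier marks only its single best offset, classifiers are visited in descending performance score, and the plausibility scores are supplied as ready-made dictionaries. The dictionary keys and field-type names are hypothetical.

```python
def mark_fields(message_len, classifiers):
    """Greedy marking sketch; returns sorted (offset, field_type) pairs.

    classifiers: dicts with hypothetical keys 'field_type', 'width',
    'performance' and 'scores' (a mapping of offset to plausibility score).
    """
    marked = [None] * message_len
    fields = []
    # Highest performance score first; each classifier is used once, then dropped.
    for clf in sorted(classifiers, key=lambda c: c["performance"], reverse=True):
        width = clf["width"]
        candidates = [
            (score, offset)
            for offset, score in clf["scores"].items()
            if offset + width <= message_len                           # inside message
            and all(m is None for m in marked[offset:offset + width])  # not yet marked
        ]
        if not candidates:
            continue
        _, offset = max(candidates)   # highest remaining plausibility score
        for i in range(offset, offset + width):
            marked[i] = clf["field_type"]
        fields.append((offset, clf["field_type"]))
    return sorted(fields)


classifiers = [
    {"field_type": "le_int16", "width": 2, "performance": 0.7,
     "scores": {0: 0.6, 4: 0.9, 6: 0.5}},
    {"field_type": "be_float32", "width": 4, "performance": 0.9,
     "scores": {0: 0.8, 2: 0.95, 6: 0.99}},
]
fields = mark_fields(8, classifiers)
```

In this example the float classifier's best score at offset 6 is skipped because the field would run past the end of the 8-byte message, and the integer classifier's best score at offset 4 is skipped because those bytes are already marked.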
In some cases, the messages are obtained from a trace obtained from one or more computerized networks.
In some cases, the trace includes a plurality of additional messages of one or more other types other than the given type.
In some cases, some of the messages are ignored by the system.
In some cases, the constant length fields are one or more of: big-endian 16 bits integer; big-endian 32 bits integer; big-endian 64 bits integer; big-endian 32 bits floating point; big-endian 64 bits floating point; little-endian 16 bits integer; little-endian 32 bits integer; little-endian 64 bits integer; little-endian 32 bits floating point; or little-endian 64 bits floating point.
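To make the listed field types concrete, the sketch below decodes the bytes at a given offset under each of them using Python's standard `struct` module. Treating the integers as unsigned is an assumption, since the text does not state signedness. The mapping and function names are illustrative only.

```python
import struct

# Field-type names from the list above mapped to struct format strings.
# Assumption: integers are unsigned; the text does not specify signedness.
FIELD_FORMATS = {
    "big-endian 16 bits integer": ">H",
    "big-endian 32 bits integer": ">I",
    "big-endian 64 bits integer": ">Q",
    "big-endian 32 bits floating point": ">f",
    "big-endian 64 bits floating point": ">d",
    "little-endian 16 bits integer": "<H",
    "little-endian 32 bits integer": "<I",
    "little-endian 64 bits integer": "<Q",
    "little-endian 32 bits floating point": "<f",
    "little-endian 64 bits floating point": "<d",
}


def decode_candidates(msg: bytes, offset: int) -> dict:
    """Decode msg at offset under every field type that fits in the message."""
    values = {}
    for name, fmt in FIELD_FORMATS.items():
        if offset + struct.calcsize(fmt) <= len(msg):
            values[name] = struct.unpack_from(fmt, msg, offset)[0]
    return values


msg = bytes([0x00, 0x01, 0x00, 0x02, 0x3F, 0x80, 0x00, 0x00])
at_start = decode_candidates(msg, 0)
at_four = decode_candidates(msg, 4)
```

A classifier for a given field type could then judge how plausible the decoded values are across all messages of the type. The scoring itself is outside the scope of this sketch.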
In accordance with a fifth aspect of the presently disclosed subject matter, there is provided a method for determining location and parameters of constant length and variable length strings within a plurality of messages of a given type: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a sequence of bytes; (b) determining, by the processing circuitry, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective messages; (c) determining, by the processing circuitry, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, and (b) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value; (d) upon a criterion based on (a) the message string plausibility scores and (b) the string candidate lengths, being met, determining, by the processing circuitry, that the index byte is a start of a variable length string, and upon the criterion not being met, repeat (c)-(d) with the index byte being a byte, subsequent to the index byte, if any.
In some cases, the method further comprising: determining, by the processing circuitry, for the index byte of each of the messages a full string candidate length indicating a number of bytes of the respective message, starting at the index byte, having a character value or a terminator value; and upon the criterion being met, determining, by the processing circuitry, a type of the variable length string, utilizing at least the full string candidate length.
In some cases, the type is one of: (a) a constant length string field not ending with the terminator value; (b) a constant length string field ending with the terminator value; (c) a constant length string field ending with padding values; (d) a constant length string field with a length prefix ending with noise values; (e) a variable length string field with the length prefix and not ending with the terminator value; or (f) a variable length string field ending with the terminator value.
In some cases, upon the criterion being met, the method further comprising: removing, by the processing circuitry, the variable length string from each of the messages; and repeating, by the processing circuitry, (b)-(d).
In some cases, upon the criterion being met, the method further comprising: determining, by the processing circuitry, one or more parameters associated with the variable length string.
In some cases, upon the criterion being met, the method further comprising: validating, by the processing circuitry, using the parameters, that the variable length string is a valid variable length string, and wherein upon the validation being unsuccessful, repeating, by the processing circuitry, (c)-(d) with the index byte being a byte, subsequent to the index byte, if any.
In some cases, the messages are obtained from a trace obtained from one or more computerized networks.
In some cases, the trace includes a plurality of additional messages of one or more other types other than the given type.
In some cases, at least two of the messages have different message length.
In some cases, the variable length string is an alphanumeric string.
In some cases, the method further comprising: providing, by the processing circuitry, a user of the system or an external system with a list of identified variable length strings, being the variable length strings, the respective index byte of each variable length string, and the respective parameters of each variable length string.
In accordance with a sixth aspect of the presently disclosed subject matter, there is provided a method for determining parameters of variable length fields, being fields with a variable number of elements, within a plurality of messages of a given type, the method comprising: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) for a given message length, being a length of one or more of the messages, identifying, by the processing circuitry, a plurality of possible solutions which meet a given message length equation, wherein the given message length equation is: given message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m, wherein: m is a length of non-variable fields within the messages; f is a number of variable length fields within the messages; pl is a length of a prefix of each of the variable length fields; y is a number of distinct element sizes of the elements of the variable length fields; ki is an ith distinct element size, wherein i = 1..y and ki ≥ 1; and ei is a number of distinct elements of ki distinct element size, wherein i = 1..y, each solution includes a possible combination of y ≥ 1, e1..ey, k1..ky, f ≥ 1, pl ≥ 1, and m ≥ 0, all are integers, each of e1..ey is positive and each of k1..ky is positive; (c) for each possible solution of the given message length equation, identifying, by the processing circuitry, one or more sub-solutions, each sub-solution includes a possible combination of f1..fy, so that a sum of each combination is f, wherein: fi is a number of fields with element size ki and each of f1..fy is a positive integer; (d) selecting, by the processing circuitry, a given message of the messages, having the given message length; (e) for each of the sub-solutions, identifying, by the processing circuitry, one or more hypotheses defining a list of (skipj, esj) pairs that meet the sub-solution and a message structure of the given message, wherein: j = 1..f; skipj is a number of bytes to skip from a message-start or from an end of a preceding variable-length field, to a jth variable length field; esj is an element size of the jth variable-length field; and given message length = p1.es1 + p2.es2 + ... + pf.esf + f.pl + m, wherein pj is a value of the prefix of the jth variable length field, representing the number of elements in the jth variable length field; (f) identifying, by the processing circuitry, one or more hypotheses wherein a content of each of the plurality of messages meets a structure corresponding to the respective hypothesis of the hypotheses, giving rise to the parameters.
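Step (f) amounts to checking every message against a hypothesis. The sketch below walks a message according to a list of (skip, es) pairs, reading each prefix as the element count, and accepts a hypothesis only if all messages parse and leave the same number of trailing non-variable bytes, which follows from m being constant for the message type. The function names, the one-byte big-endian prefix default and the sample messages are assumptions made for the example.

```python
def fits_hypothesis(msg, hypothesis, pl=1, byteorder="big"):
    """Return the number of trailing non-variable bytes, or None if msg does not fit.

    hypothesis: list of (skip, es) pairs, one per variable length field.
    """
    pos = 0
    for skip, es in hypothesis:
        pos += skip                        # non-variable bytes before the field
        if pos + pl > len(msg):
            return None                    # no room left for the prefix
        count = int.from_bytes(msg[pos:pos + pl], byteorder)  # prefix value pj
        pos += pl + count * es             # jump over the prefix and its elements
        if pos > len(msg):
            return None                    # elements would overrun the message
    return len(msg) - pos


def consistent(messages, hypothesis, pl=1):
    """True if every message fits and all leave the same trailing length."""
    trailing = {fits_hypothesis(m, hypothesis, pl) for m in messages}
    return len(trailing) == 1 and None not in trailing


# Made-up messages: 2-byte header, 1-byte count, 2-byte elements, 1-byte trailer.
msgs = [
    b"\xAA\xBB" + b"\x02" + b"\x00\x01\x00\x02" + b"\xFF",
    b"\xAA\xBB" + b"\x03" + b"\x00\x01\x00\x02\x00\x03" + b"\xFF",
]
good = consistent(msgs, [(2, 2)])
```

The hypothesis [(2, 1)] parses both messages but leaves 3 and 4 trailing bytes respectively, so it is rejected, while [(0, 2)] fails outright because the first header byte read as a count overruns the message.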
In some cases, the given message length is a shortest message length of the messages.
In some cases, identifying the plurality of possible solutions also includes testing the possible solutions against a given number of additional message lengths, being lengths of one or more other messages of the messages other than the given message, and wherein for each of the additional message lengths at least one solution exists to a given additional message length equation, wherein the given additional message length equation is: additional message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m, so that ei≥ 0.
In some cases, identification of the plurality of possible solutions is optimized by calculating the Frobenius number for k1..ky.
In some cases, the messages are obtained from a trace obtained from one or more computerized networks.
In some cases, the trace includes a plurality of additional messages of one or more other types other than the given type.
In some cases, at least two of the messages have different message length.
In some cases, the processing circuitry is further configured to provide an output including at least the list of (skipj, esj).
In some cases, ki has a predetermined lower threshold and a predetermined upper threshold.
In some cases, at least one of f and y has a predetermined upper threshold.
In some cases, pl is one of a predetermined set of values: one byte, two bytes or four bytes.
In some cases, pl has one of: a big-endian representation or a little-endian representation.
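The effect of the prefix length and byte order can be shown with a short sketch that lists which readings of a prefix are even possible at a given position, in the sense that the implied elements fit inside the message. The function name and the sample message are hypothetical, and the fit test is only a necessary condition, not the full hypothesis check.

```python
def plausible_prefix_readings(msg, pos, es, prefix_lengths=(1, 2, 4)):
    """List (pl, byte order, element count) readings whose elements fit in msg."""
    readings = []
    for pl in prefix_lengths:
        raw = msg[pos:pos + pl]
        if len(raw) < pl:
            continue                       # prefix itself does not fit
        for order in ("big", "little"):
            count = int.from_bytes(raw, order)
            if pos + pl + count * es <= len(msg):
                readings.append((pl, order, count))
    return readings


# Made-up message: prefix bytes 00 03 followed by three 2-byte elements.
msg = b"\x00\x03" + b"abcdef"
readings = plausible_prefix_readings(msg, 0, es=2)
```

A single message rarely settles the question: here a one-byte prefix with a count of zero is also possible, and only the two-byte big-endian reading accounts for the whole message. Testing the same readings across many messages is what separates them.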
In some cases, some of the messages are ignored by the system.
In accordance with a seventh aspect of the presently disclosed subject matter, there is provided a method comprising: obtaining, by a processing circuitry, a traffic trace of messages of a communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; applying, by the processing circuitry, one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determining, by the processing circuitry, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
In some cases, the method further comprising: generating, by the processing circuitry, a relationship model, modeling a relationship between the given number of additional messages and the unobserved message types number to be observed upon obtaining the additional traffic trace.
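One simple way to relate the number of additional messages to the number of new message types expected among them is the classical Good-Toulmin series, U(t) = t.f1 - t².f2 + t³.f3 - ..., where t is the ratio of additional messages to messages already captured and fi is the number of types seen exactly i times. This estimator is not named in the text above and is used here only as a stand-in for a relationship model. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def expected_new_types(type_labels, additional_messages):
    """Good-Toulmin estimate of new message types in a further capture.

    Stand-in model, not taken from the source text. The plain series is only
    reliable when the additional capture is no larger than the existing one.
    """
    n = len(type_labels)
    t = additional_messages / n
    if t > 1:
        raise ValueError("plain Good-Toulmin series is unreliable for t > 1")
    freq_of_freq = Counter(Counter(type_labels).values())  # types seen i times
    return sum((-1) ** (i + 1) * t ** i * f_i for i, f_i in freq_of_freq.items())


# Hypothetical trace: one type label per captured message.
trace_types = ["A", "A", "A", "B", "B", "C", "D", "E", "E"]
curve = {m: expected_new_types(trace_types, m) for m in (3, 6, 9)}
```

Evaluating the function over a range of additional message counts, as in `curve`, gives a table that a user could read in reverse to pick a capture size for a desired number of newly observed types.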
In some cases, the unseen species problem estimators are one or more of: Chao1 estimators, Chao2 estimators, Abundance Coverage-based Estimators (ACE), Incidence Coverage-based Estimators (ICE), coverage-duplication estimators, penalized nonparametric maximum likelihood estimators or jackknife estimators, or any combinations thereof.
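Incidence-based estimators such as Chao2 work on presence or absence per sampling unit rather than on message counts. The sketch below assumes, for illustration, that the trace is split into capture sessions and uses the commonly cited bias-corrected form S_obs + ((m - 1) / m).Q1(Q1 - 1) / (2(Q2 + 1)), where m is the number of sessions and Q1 and Q2 are the numbers of types seen in exactly one and exactly two sessions. The session split and the function name are assumptions.

```python
from collections import Counter


def chao2(sessions):
    """Bias-corrected Chao2 estimate from per-session sets of observed types."""
    m = len(sessions)                                   # number of sampling units
    incidence = Counter(t for s in sessions for t in set(s))
    q = Counter(incidence.values())                     # types seen in i sessions
    s_obs = len(incidence)
    q1, q2 = q[1], q[2]
    return s_obs + ((m - 1) / m) * q1 * (q1 - 1) / (2 * (q2 + 1))


# Hypothetical split of a trace into three capture sessions.
sessions = [{"A", "B", "C"}, {"A", "B"}, {"A", "D"}]
estimate2 = chao2(sessions)
```

Compared with an abundance-based estimate, this form is insensitive to one chatty message type dominating a session, since each type counts at most once per session.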
In some cases, the traffic trace is obtained from one or more computerized networks.
In some cases, the method further comprising: providing, by the processing circuitry, the at least one of the first estimation or the second estimation to a user of the system or to an external system.
In some cases, the method further comprising: receiving, by the processing circuitry, a desired unobserved message types number; and recommending by the processing circuitry, the given number of additional messages to be obtained in the additional traffic trace based on the second estimation.
In some cases, the method further comprising: providing, by the processing circuitry, the relationship model to a user of the system or to an external system.
In accordance with an eighth aspect of the presently disclosed subject matter, there is provided a method for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length, the method comprising: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) providing, by the processing circuitry, a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores, each representing a plausibility that a respective field-type of constant length field starts at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers; (c) utilizing, by the processing circuitry, the classifiers to produce plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages; (d) identifying, by the processing circuitry, the highest score based on (i) each classifier’s performance score of available classifiers, being the classifiers, and (ii) the highest plausibility score produced by each classifier; (e) marking, by the processing circuitry, a number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score; (f) repeating, by the processing circuitry, (d) and (e) while (i) disregarding the scores associated with the marked bytes and (ii) removing from the available classifiers the classifier having the highest performance score of the available classifiers; and (g) upon the available classifiers not including any of the classifiers (e.g. all available classifiers have been exhausted), providing, by the processing circuitry, a user of the system or an external system with an indication of a start position of each of the constant length fields and the field-type of each of the constant length fields, based on the markings.
In some cases, more than one highest-score field type is identified and, upon one or more of the bytes being marked more than once, each in association with a marking from a respective offset, the repeat is performed for each marking, each time while disregarding the scores associated with the respective subset of the marked bytes starting at the respective offset.
In some cases, marking is performed only if the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score does not exceed an end of the messages of the given type.
In some cases, marking is performed only if bytes within the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type starting at one or more offsets associated with the highest score are not already marked.
In some cases, the messages are obtained from a trace obtained from one or more computerized networks.
In some cases, the trace includes a plurality of additional messages of one or more other types other than the given type.
In some cases, some of the messages are ignored by the system.
In some cases, the constant length fields are one or more of: big-endian 16 bits integer; big-endian 32 bits integer; big-endian 64 bits integer; big-endian 32 bits floating point; big-endian 64 bits floating point; little-endian 16 bits integer; little-endian 32 bits integer; little-endian 64 bits integer; little-endian 32 bits floating point; or little-endian 64 bits floating point.
In accordance with a ninth aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method for determining location and parameters of constant length and variable length strings within a plurality of messages of a given type: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a sequence of bytes; (b) determining, by the processing circuitry, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective messages; (c) determining, by the processing circuitry, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, and (b) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value; (d) upon a criterion based on (a) the message string plausibility scores and (b) the string candidate lengths, being met, determining, by the processing circuitry, that the index byte is a start of a variable length string, and upon the criterion not being met, repeat (c)-(d) with the index byte being a byte, subsequent to the index byte, if any.
In accordance with a tenth aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method for determining parameters of variable length fields, being fields with a variable number of elements, within a plurality of messages of a given type, the method comprising: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) for a given message length, being a length of one or more of the messages, identifying, by the processing circuitry, a plurality of possible solutions which meet a given message length equation, wherein the given message length equation is: given message length = e1·k1 + e2·k2 + ... + ey·ky + f·pl + m, wherein: m is a length of non-variable fields within the messages; f is a number of variable length fields within the messages; pl is a length of a prefix of each of the variable length fields; y is a number of distinct element sizes of the elements of the variable length fields; ki is an ith distinct element size, wherein i = 1..y and ki ≥ 1; and ei is a number of distinct elements of ki distinct element size, wherein i = 1..y, each solution includes a possible combination of y≥1, e1..ey, k1..ky, f≥1, pl≥1, and m>0, all are integers, each of e1..ey is positive and each of k1..ky is positive; (c) for each possible solution of the given message length equation, identifying, by the processing circuitry, one or more sub-solutions, each sub-solution includes a possible combination of f1..fy, so that a sum of each combination is f, wherein: fi is a number of fields with element size ki and each of f1..fy is a positive integer; (d) selecting, by the processing circuitry, a given message of the messages, having the given message length; (e) for each of the sub-solutions, identifying, by the processing circuitry, one or more hypotheses defining a list of (skipj, esj) pairs that meet the sub-solution and a message structure of the given message, wherein: j = 1..f; skipj is a number of bytes to skip from a message-start or from an end of a preceding variable-length field, to a jth variable length field; esj is an element size of the jth variable-length field; and given message length = p1·es1 + p2·es2 + ... + pf·esf + f·pl + m, wherein pj is a value of the prefix of the jth variable length field, representing the number of elements in the jth variable length field; (f) identifying, by the processing circuitry, one or more hypotheses wherein a content of each of the plurality of messages meets a structure corresponding to a respective hypothesis of the hypotheses, giving rise to the parameters.
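By way of non-limiting illustration only, step (b) of the above aspect can be sketched as a bounded brute-force enumeration of integer solutions of the message length equation. The function name `enumerate_solutions` and the search bounds `max_y`, `max_k`, `max_pl` and `max_f` are hypothetical and are introduced only to keep the search finite; they do not appear in the text. The sketch requires m to be positive, as the constraint is written above, and starts f at y because each of f1..fy is a positive integer.

```python
from itertools import combinations, product


def enumerate_solutions(length, max_y=2, max_k=4, max_pl=2, max_f=3):
    """Yield (ks, es, f, pl, m) with length == sum(e_i * k_i) + f * pl + m.

    ks holds the distinct element sizes and es the matching element counts.
    All bounds are hypothetical and exist only to keep the search finite.
    """
    for y in range(1, max_y + 1):
        # combinations() yields strictly increasing, hence distinct, sizes
        for ks in combinations(range(1, max_k + 1), y):
            for f in range(y, max_f + 1):        # f >= y: each f_i is >= 1
                for pl in range(1, max_pl + 1):
                    budget = length - f * pl     # bytes left for elements and m
                    for es in product(range(1, budget + 1), repeat=y):
                        m = budget - sum(e * k for e, k in zip(es, ks))
                        if m > 0:                # m > 0, as written in the text
                            yield ks, es, f, pl, m


solutions = list(enumerate_solutions(6, max_y=1, max_k=2, max_pl=1, max_f=1))
for solution in solutions:
    print(solution)
```

Each yielded tuple is one candidate solution in the sense of step (b); steps (c) to (f) would then narrow these candidates against the actual message contents.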
In accordance with an eleventh aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method comprising: obtaining, by a processing circuitry, a traffic trace of messages of a communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; applying, by the processing circuitry, one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determining, by the processing circuitry, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
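By way of non-limiting illustration only, one well-known estimator for the unseen species problem is the Chao1 estimator, which approximates the number of unobserved types from the number of types observed exactly once (f1) and exactly twice (f2). The choice of Chao1 is an assumption made for this sketch, since the text does not name a specific estimator, and the function name `chao1_total_types` is hypothetical. The input is assumed to be a list holding the message type of every message in the traffic trace.

```python
from collections import Counter


def chao1_total_types(observed_types):
    """Estimate the total number of message types with the Chao1 estimator."""
    counts = Counter(observed_types)
    s_obs = len(counts)                              # types seen at least once
    f1 = sum(1 for c in counts.values() if c == 1)   # types seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)   # types seen exactly twice
    if f2 > 0:
        unseen = f1 * f1 / (2 * f2)
    else:
        unseen = f1 * (f1 - 1) / 2                   # bias-corrected form, f2 = 0
    return s_obs + unseen


trace_types = [1, 1, 1, 2, 2, 3, 4, 5, 5]
print(chao1_total_types(trace_types))
```

In this trace five types are observed, two of them once and two of them twice, so the estimator approximates one unobserved type.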
In accordance with a twelfth aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length, the method comprising: (a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes; (b) providing, by the processing circuitry, a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores, each representing a plausibility that a respective field-type of constant length field starts at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers; (c) utilizing, by the processing circuitry, the classifiers to produce plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages; (d) identifying, by the processing circuitry, the highest score based on (i) each classifier’s performance score of available classifiers, being the classifiers, and (ii) the highest plausibility score produced by each classifier; (e) marking, by the processing circuitry, a number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score; (f) repeating, by the processing circuitry, (d) and (e) while (i) disregarding the scores associated with the marked bytes and (ii) removing from the available classifiers the classifier having the highest performance score of the available classifiers; and (g) upon the available classifiers not including any of the classifiers (e.g. all available classifiers have been exhausted), providing, by the processing circuitry, a user of the system or an external system with an indication of a start position of each of the constant length fields and the field-type of each of the constant length fields, based on the markings.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to understand the presently disclosed subject matter and to see how it may be carried out in practice, the subject matter will now be described, by way of non-limiting examples only, with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram schematically illustrating one example of a system for producing specifications for binary protocols, in accordance with the presently disclosed subject matter;
Fig. 2 is a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of variable length strings within a plurality of messages of a given type, in accordance with the presently disclosed subject matter;
Fig. 3 is a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of variable number of elements fields within a plurality of messages of a given type, in accordance with the presently disclosed subject matter;
Fig. 4 is a flowchart illustrating one example of a sequence of operations carried out for detecting absences in the traffic trace, in accordance with the presently disclosed subject matter; and
Fig. 5 is a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of constant length fields within a plurality of messages of a given type, in accordance with the presently disclosed subject matter.
DETAILED DESCRIPTION
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the presently disclosed subject matter. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the presently disclosed subject matter.
In the drawings and descriptions set forth, identical reference numerals indicate those components that are common to different embodiments or configurations.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "obtaining", "determining", "meeting", "validating", "removing", "providing" or the like, include action and/or processes of a computer that manipulate and/or transform data into other data, said data represented as physical quantities, e.g., such as electronic quantities, and/or said data representing the physical objects. The terms “computer”, “processor”, “processing resource”, "processing circuitry" and “controller” should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g., digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co-residing on a single physical machine, any other electronic computing device, and/or any combination thereof. The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes or by a general-purpose computer specially configured for the desired purpose by a computer program stored in a non-transitory computer readable storage medium. The term "non-transitory" is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
As used herein, the phrase "for example," "such as", "for instance" and variants thereof describe non-limiting embodiments of the presently disclosed subject matter. Reference in the specification to "one case", "some cases", "other cases" or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of the phrase "one case", "some cases", "other cases" or variants thereof does not necessarily refer to the same embodiment(s).
It is appreciated that, unless specifically stated otherwise, certain features of the presently disclosed subject matter, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.
In embodiments of the presently disclosed subject matter, fewer, more and/or different stages than those shown in Figs. 2-5 may be executed. In embodiments of the presently disclosed subject matter one or more stages illustrated in Figs. 2-5 may be executed in a different order and/or one or more groups of stages may be executed simultaneously.
Fig. 1 illustrates a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter. Each module in Fig. 1 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein. The modules in Fig. 1 may be centralized in one location or dispersed over more than one location. In other embodiments of the presently disclosed subject matter, the system may comprise fewer, more, and/or different modules than those shown in Fig. 1.
Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
Bearing this in mind, attention is drawn to Fig. 1, a block diagram schematically illustrating one example of a system for producing specifications for binary protocols, in accordance with the presently disclosed subject matter.
According to certain examples of the presently disclosed subject matter, system 200 can comprise a communications interface 220 enabling connecting the system 200 to a network or a communication channel or a radio receiver and enabling it to receive or to capture data sent thereto through the network or the communication channel, including, in some cases, traces of binary messages sent over one or more networks. In some cases, the communications interface 220 can be connected to a Local Area Network (LAN), to a Wide Area Network (WAN), to a wireless communications channel, to a wireless network, to a communication bus, to a point-to-point communication channel, to a radio link, or to the Internet. In some cases, the communications interface 220 can connect to a wireless network or communication channel. It is to be noted that in some cases the received information, or part thereof, can be collected from one or more networks or communication channels.
System 200 can further comprise or be otherwise associated with a data repository 210 (e.g., a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, a combination of ROM and RAM or any other type of memory, etc.) configured to store data, including, inter alia, binary messages, lists of message string plausibility scores (being an indication of a plausibility that a given candidate string is actually a string field, as further detailed herein with reference to Fig. 2), lists of identified variable length strings within the messages, respective index byte of each identified variable length string (wherein the index byte can point at a given byte of a message, as further detailed herein with reference to Fig. 2), respective parameters of each identified variable length string, score matrices (being matrices used to provide different weights to elements during a score calculation, as further detailed herein with reference to Fig. 2), message lengths, possible solutions to Diophantine equations representing a length of a message as the sum of the multiplication of the amount of fields with a given element size by the respective given element size (as further detailed herein with reference to Fig. 3), approximations of unobserved message types number (as further detailed herein with reference to Fig. 4), training data from a plurality of data sources with known fields locations and fields types used to train a plurality of classifiers, scores given by one or more classifiers to each offset of the message, markings of one or more offsets of the message (as further detailed herein with reference to Fig. 5), etc.
In some cases, data repository 210 can be further configured to enable retrieval and/or update and/or deletion of the data stored thereon. It is to be noted that in some cases, data repository 210 can be distributed. It is to be noted that in some cases, data repository 210 can be stored on cloud-based storage.
System 200 further comprises processing circuitry 230. Processing circuitry 230 can be one or more processing circuitry units (e.g., central processing units), microprocessors, microcontrollers (e.g., microcontroller units (MCUs)) or any other computing devices or modules, including multiple and/or parallel and/or distributed processing circuitry units, which are adapted to independently or cooperatively process data for controlling relevant system 200 resources and for enabling operations related to system 200 resources.
The processing circuitry 230 comprises a variable length string analysis module 240, a variable number of elements fields analysis module 250, a traffic trace absences detection module 260 and a constant length fields analysis module 270. The variable length string analysis module 240 is configured to perform a variable length string analysis process, as further detailed herein, inter alia with reference to Fig. 2.
The variable number of elements fields analysis module 250 is configured to perform a variable number of elements fields analysis process, as further detailed herein, inter alia with reference to Fig. 3.
The traffic trace absences detection module 260 is configured to perform a traffic trace absences detection process, as further detailed herein, inter alia with reference to Fig. 4.
The constant length fields analysis module 270 is configured to perform a constant length fields analysis process, as further detailed herein, inter alia with reference to Fig. 5.
Having described an exemplary system for producing specifications for binary protocols, attention is drawn to Fig. 2, a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of variable length strings within a plurality of messages of a given type, in accordance with the presently disclosed subject matter.
According to certain examples of the presently disclosed subject matter, system 200 can be configured to perform variable length string analysis process 300, e.g., utilizing the variable length string analysis module 240. System 200 analyzes obtained messages to determine a specification of the communication protocol and specifically to determine the location and parameters of variable-length string fields within the obtained messages. In some cases, the communication protocol is a binary communication protocol. The binary communication protocol can be a proprietary protocol. The structure of the messages adhering to the binary communication protocol is generally unknown or is unknown to the user of system 200.
For this purpose, system 200 can be configured to obtain a plurality of messages of a given type, each of the messages comprised of a sequence of bytes (block 310). Each message is a sequence of one or more fields, wherein each field comprises one or more bytes of the bytes forming the message. The message structure adheres to the communication protocol, so that the communication protocol determines the location and parameters of each field within the message. Each of the obtained messages can have a specific given type, thereby all having the same structure even if not always having the same number of bytes due to variable-length fields. A message of a given type has a given structure. In a non-limiting example, the type of the message can be encoded in one of the fields of the message. In these cases, system 200 can identify the message type field and use it to discern between messages of different types.
In some cases, the obtained messages are preprocessed. For example, in some cases, the messages are received as a stream of bytes that is not divided into individual messages. In these cases, the preprocessing can include splitting the stream of bytes into messages by using methods and/or techniques known in the art. As another example, the preprocessing of the obtained messages includes identifying the types of these messages (noting that in some cases this step is not required as the plurality of messages obtained at block 310 are already of the same message type). In such cases, the obtained messages can include a plurality of messages of a plurality of message types, and the preprocessing can include removing messages that are of a type other than a given message type. The preprocessing can include in these cases the splitting of the obtained messages into message groups according to type and the execution by system 200 of variable length string analysis process 300 for each message group. The splitting of the obtained messages into message groups according to message type is achieved using methods known in the art for message type identification, for example: a correlation method correlating between values of candidate message type fields and the message length. In some cases, the given message type is a variable-length message type, wherein the messages of the variable-length message type include one or more variable-length fields. The variable-length fields can be textual string fields containing a representation of a string. The string can be an alphanumeric string, e.g., comprising digits, letters, delimiters, etc.
In some cases, the fields forming the message can include fixed-length fields, having a fixed length, or variable-length fields, wherein the length of such fields (i.e., the number of bytes comprised within the field) in one message is different than the length of such fields in a second message.
The preprocessing can be performed by system 200 (e.g., before or as part of the execution of variable length string analysis process 300) or by an external system. The messages can be obtained by system 200 from a trace of timestamped sequence of packets captured from one or more computerized communication networks or communication channels. The trace can be captured using a sniffer, a receiver, a probe, or similar tools. The obtained messages, or some of them, can be part of a recording of historical messages communicated over the one or more communication networks or communication channels. In some cases, the obtained messages, or some of them, are part of real-time communication currently transferred over the one or more communication networks or communication channels. The obtained messages, or some of them, can be a combination made of a part of a recording of historical messages communicated over the one or more communication networks or communication channels and a part of real-time communication currently transferred over the one or more communication networks or communication channels. The messages can be obtained from a communication trace communicated using a common communication protocol or using a plurality of communication protocols.
The textual string fields can have one of several possible representations. These possible representations can include, among others:
(1) A constant length string field not ending with the terminator value - strings of this type have a constant length and have no terminator character. A terminator character can be, for example, a character with an American Standard Code for Information Interchange (ASCII) code value 0. Strings of this type can be with or without a length prefix. A length prefix is one or more bytes storing the length (in number of bytes) of the string. The length-prefix bytes can be sequenced in big-endian or little-endian endianness. This type of strings can be described as follows: "[length-prefix ||] string", wherein the [] marks an optional element and the || symbol denotes concatenation between two elements.
(2) A constant length string field ending with the terminator value - strings of this type have a constant length and end with a terminator character. Strings of this type can be with or without a length prefix. Optionally there are one or more bytes after the terminator and up to the end of the string. These bytes are noise bytes that can be disregarded. This type of strings can be described as follows: "[length-prefix ||] string || terminator [|| noise]", wherein the [] marks an optional element and the || symbol denotes concatenation between two elements.
(3) A constant length string field ending with padding values - strings of this type have a constant length and end with one or more padding bytes. Padding bytes are used to pad the string to its constant length. In some cases, the padding bytes can be each a constant padding character (for example: ASCII code value 0). Strings of this type can be with or without a length prefix. This type of strings can be described as follows: "[length-prefix ||] string || padding*", wherein the [] marks an optional element, the || symbol denotes concatenation between two elements and the * symbol denotes repetition of one or more padding bytes.
(4) A constant length string field with a length prefix ending with noise values - strings of this type have a length-prefix and optionally end with one or more bytes after the bytes of the string's content and up to the length of the string. These bytes are noise bytes that can be disregarded. The noise bytes are used when it is necessary to complete the length of the string. This type of strings can be described as follows: "length-prefix || string [|| noise]", wherein the [] marks an optional element and the || symbol denotes concatenation between two elements.
(5) A variable length string field with the length prefix and not ending with the terminator value - strings of this type do not have a constant length. Strings of this type have a length-prefix and have no terminator character. This type of strings can be described as follows: "length-prefix || string", wherein the || symbol denotes concatenation between two elements.
(6) A variable length string field ending with the terminator value - strings of this type have a variable length and end with a terminator character. Strings of this type can be with or without a length prefix. This type of strings can be described as follows: "[length-prefix ||] string || terminator", wherein the [] marks an optional element and the || symbol denotes concatenation between two elements.
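By way of non-limiting illustration only, the following Python sketch reads two of the representations listed above from a message: a length-prefixed string without a terminator (representation 5) and a terminated string without a length prefix (one form of representation 6). The function names, the default one-byte prefix size and the default terminator value 0 are assumptions made for this sketch.

```python
def read_length_prefixed(message: bytes, offset: int, prefix_size: int = 1,
                         byteorder: str = "big"):
    """Read "length-prefix || string"; return (string bytes, next offset)."""
    length = int.from_bytes(message[offset:offset + prefix_size], byteorder)
    start = offset + prefix_size
    return message[start:start + length], start + length


def read_terminated(message: bytes, offset: int, terminator: int = 0):
    """Read "string || terminator"; return (string bytes, next offset).

    Raises ValueError if no terminator byte follows the offset.
    """
    end = message.index(terminator, offset)
    return message[offset:end], end + 1


# One-byte prefix with value 3, followed by "ABC" and one further byte.
print(read_length_prefixed(bytes([3, 65, 66, 67, 9]), 0))
# Two-byte little-endian prefix with value 2, followed by "Hi".
print(read_length_prefixed(bytes([2, 0, 72, 105]), 0, 2, "little"))
# "DEF" followed by a terminator and one further byte.
print(read_terminated(bytes([68, 69, 70, 0, 101]), 0))
```

Both functions return the offset of the byte following the string field, so a caller can continue parsing the remaining fields of the message from that position.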
Each representation can be associated with parameters describing the textual string field, as further detailed below.
In a non-limiting example, the obtained messages can include for example the following two messages: M1 {65, 66, 67, 68, 69, 0, 100} and M2 {68, 69, 70, 0, 101}. The messages are represented as a sequence of bytes. In this non-limiting example, the bytes are represented using American Standard Code for Information Interchange (ASCII) 8-bit codes. The messages include a first field which is a variable-length field (with the value of "65, 66, 67, 68, 69, 0" in M1 and the value of "68, 69, 70, 0" in M2) and a second field which is a constant-length field in the length of one byte with the value 100 in M1 and 101 in M2.
After obtaining the plurality of messages, system 200 can be further configured to determine, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective messages (block 315). Continuing the above non-limiting example, the index byte can point at the first byte of M1 ("65") and M2 ("68"). The index byte will be used to iterate over the bytes of each message as detailed below.
After determining the index byte, system 200 can be further configured to determine, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, and (b) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value (block 320).
System 200 calculates a message string plausibility score for each given message of the messages obtained in block 310. The message string plausibility score is calculated for a candidate string, which is the part of the given message starting at the index byte. The string plausibility score of a given candidate string is an indication of the plausibility that the given candidate string is actually a string field. It is to be noted that it is possible for system 200 to execute the variable length string analysis process 300 using an alternate flow in which the message string plausibility score is calculated within each message of the obtained messages for all possible index bytes.
The method for calculating the message string plausibility score can be, for example, based on calculating for each candidate string a matrix and performing for each candidate string an element-wise product between the calculated matrix and a score matrix and summing the results into a string plausibility score.
The candidate string can be the part of the given message starting at the index byte and ending at the first null character (i.e., ASCII 0 value) or at the first out-of-range character (i.e., ASCII values 1 to 31 and 127 to 255) or at the end of the given message. The calculated matrix can be calculated so that the value of each cell is a count of how many instances of the type of character represented by the column of the matrix are followed in the candidate string by an instance of the type of character represented by the row of the matrix. The types of characters can be: digits (i.e., ASCII values 48 to 57), upper-case characters (i.e., ASCII values 65 to 90), lower-case characters (i.e., ASCII values 97 to 122), separators (i.e., ASCII values 32, 45, 46 and 95), and symbols (i.e., all other ASCII values that are not null or out-of-range).
A non-limiting example is the following calculated matrix, calculated for an exemplary string of "ArmyUnit101":
[Figure: calculated matrix for the exemplary string; image not reproduced]
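By way of non-limiting illustration only, the extraction of the candidate string and the construction of the calculated matrix described above can be sketched in Python as follows. The function names and the ordering of the character types along the rows and columns are assumptions made for this sketch; the layout of the matrix in the figure is not reproduced here. Columns represent the type of the preceding character and rows the type of the following character, as described above.

```python
CLASSES = ["digit", "upper", "lower", "separator", "symbol"]


def char_class(byte: int) -> int:
    """Return the index in CLASSES of a printable ASCII byte (32..126)."""
    if 48 <= byte <= 57:
        return 0
    if 65 <= byte <= 90:
        return 1
    if 97 <= byte <= 122:
        return 2
    if byte in (32, 45, 46, 95):
        return 3
    return 4


def candidate_string(message: bytes, index: int) -> bytes:
    """Bytes from index up to the first null or out-of-range byte, or the end."""
    end = index
    while end < len(message) and 32 <= message[end] <= 126:
        end += 1
    return message[index:end]


def calculated_matrix(candidate: bytes):
    """matrix[row][col]: count of a col-type byte followed by a row-type byte."""
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for first, second in zip(candidate, candidate[1:]):
        matrix[char_class(second)][char_class(first)] += 1
    return matrix


matrix = calculated_matrix(candidate_string(b"ArmyUnit101\x00\x07", 0))
for name, row in zip(CLASSES, matrix):
    print(f"{name:>9}: {row}")
```

For a candidate string of n bytes the cells of the matrix sum to n - 1, one count per pair of adjacent bytes.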
The score matrix can be a constant matrix giving different weights to the elements of the calculated matrix in accordance with sequences of bytes that are likely to appear in strings (for example: a digit that follows a digit is a sequence that is more likely to appear in a string) - these will have positive weights, and with sequences that are not likely to appear in strings (for example: a symbol that follows a symbol is a sequence that is less likely to appear in a string) - these will have negative weights.
A non-limiting example of a score matrix is the following score matrix:
[Figure: exemplary score matrix; image not reproduced]
System 200 can further determine for each of the obtained messages a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value, not including null characters (i.e., ASCII 0 value), out-of-range characters (i.e., ASCII values 1 to 31 and 127 to 255) or end of message characters.
In some cases, the string plausibility score which is the element-wise product between the calculated matrix and the score matrix and the summation of the results is further normalized based on the corresponding string candidate length. A non-limiting example of using the corresponding string candidate length to normalize the message string plausibility score is to use as a normalized message string plausibility score the ratio between the message string plausibility score and the corresponding string candidate length minus 1. When the corresponding string candidate length is smaller than two the message string plausibility score is zero.
For the exemplary string of "ArmyUnit101" the normalized message string plausibility score is 1.6, wherein the calculation is (2·2 + 2·2 + 1·1 + 1·(−1) + 4·2) / (11 − 1).
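The normalized score can be sketched in Python as below. The weights in `ASSUMED_WEIGHTS` are hypothetical: the score matrix itself appears only as a figure, so the values were chosen merely to agree with the worked calculation above (in particular, which of the lower-to-upper and lower-to-digit transitions carries 1 and which carries −1 is a guess), and all other transitions are given weight 0. Summing a weight per adjacent pair is equivalent to the element-wise product of the calculated matrix and the score matrix followed by summation.

```python
def char_class(b):
    """Character type of a printable byte, per the ranges given in the text."""
    if 48 <= b <= 57:
        return "digit"
    if 65 <= b <= 90:
        return "upper"
    if 97 <= b <= 122:
        return "lower"
    if b in (32, 45, 46, 95):
        return "separator"
    return "symbol"

# Hypothetical weights keyed by (previous type, next type); NOT the patent's matrix.
ASSUMED_WEIGHTS = {
    ("digit", "digit"): 2,
    ("upper", "lower"): 2,
    ("lower", "lower"): 2,
    ("lower", "upper"): 1,
    ("lower", "digit"): -1,
}

def normalized_score(candidate):
    """Sum of transition weights divided by (length - 1); zero for length < 2."""
    if len(candidate) < 2:
        return 0.0
    total = sum(
        ASSUMED_WEIGHTS.get((char_class(a), char_class(b)), 0)
        for a, b in zip(candidate, candidate[1:])
    )
    return total / (len(candidate) - 1)
```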
After determining the message string plausibility score and the string candidate length for the index byte of each of the messages, system 200 can be further configured to test if a criterion based on (a) the message string plausibility scores and (b) the string candidate lengths is met (block 325). System 200 holds the information of the determined message string plausibility scores and the corresponding string candidate lengths for the index byte for each of the obtained messages. System 200 can now perform a criterion test based on that information. The criterion test can be based on a normalized average of the message string plausibility scores. The normalized average of the message string plausibility scores can be the average of the message string plausibility scores for the obtained messages having corresponding string candidate lengths of more than one. A non-limiting example of such a criterion test can be if (1) the 2nd percentile of the string candidate lengths is larger than or equal to 1.5 and the 80th percentile of the string candidate lengths is larger than or equal to 2 and the normalized average of the message string plausibility scores is larger than or equal to 1.1, or (2) the 2nd percentile of the string candidate lengths is larger than or equal to 1.5 and the 80th percentile of the string candidate lengths is larger than or equal to 3 and the normalized average of the message string plausibility scores is larger than or equal to 0.97, or (3) the 2nd percentile of the string candidate lengths is larger than or equal to 1.5 and the 80th percentile of the string candidate lengths is larger than or equal to 4 and the normalized average of the message string plausibility scores is larger than or equal to 0.85, or (4) the 2nd percentile of the string candidate lengths is larger than or equal to 3.
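The exemplary criterion test can be sketched as follows. The text does not state how percentiles are computed, so the linear-interpolation `percentile` helper is an assumption, as are the function names.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation (an assumed method)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values")
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def criterion_met(scores, lengths):
    """Apply conditions (1) to (4) of the exemplary criterion test."""
    p2 = percentile(lengths, 2)
    p80 = percentile(lengths, 80)
    # Normalized average: only messages whose candidate length exceeds one.
    eligible = [s for s, n in zip(scores, lengths) if n > 1]
    avg = sum(eligible) / len(eligible) if eligible else 0.0
    if p2 >= 3:                      # condition (4)
        return True
    if p2 < 1.5:                     # conditions (1) to (3) all need p2 >= 1.5
        return False
    return ((p80 >= 2 and avg >= 1.1)
            or (p80 >= 3 and avg >= 0.97)
            or (p80 >= 4 and avg >= 0.85))
```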
If the criterion is not met, system 200 can determine that the candidate string is not a string field and thus be further configured to set the index byte to be a byte, subsequent to the index byte, if any, and return to block 320 (block 330).
If the criterion is met, the candidate string may be a string field, and system 200 can be further configured to determine that the index byte is a start of a variable length string (block 335). In some cases, the determination that the index byte is a start of a variable length string occurs only after successful validation performed at block 355 below.
System 200 can be further configured to determine for the index byte of each of the messages a full string candidate length indicating a number of bytes of the respective message, starting at the index byte, having a character value or a terminator value (block 340). In some cases, system 200 can determine the full string candidate lengths for the index byte for each of the obtained messages as part of the determination of the string candidate length in block 320.
Continuing the above non-limiting example, the string candidate length of the string field in message M1 {65, 66, 67, 68, 69, 0, 100} is 5, as it includes the five bytes with character values. The full string candidate length of the string field in message M1 is 6, as it also includes the one byte with the terminator value of ASCII code 0.
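A minimal Python sketch of the two lengths for the M1 example follows. It assumes that the terminator value is the first non-character byte after the string and that a run of repeated terminator bytes is counted in the full length; the function name is hypothetical.

```python
def candidate_lengths(message, index):
    """Return (string candidate length, full string candidate length)."""
    n = len(message)
    end = index
    # Character bytes: printable ASCII values 32..126.
    while end < n and 32 <= message[end] <= 126:
        end += 1
    length = end - index
    full_end = end
    if end < n:
        terminator = message[end]
        # Also count the consecutive bytes holding the terminator value.
        while full_end < n and message[full_end] == terminator:
            full_end += 1
    return length, full_end - index
```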
System 200 can be further configured to determine a type of the variable length string, utilizing at least the full string candidate length (block 345). The type of the variable length string can be one of: constant length string field not ending with the terminator value, constant length string field ending with the terminator value, constant length string field ending with padding values, constant length string field with a length prefix ending with noise values, variable length string field with the length prefix and not ending with the terminator value or variable length string field ending with the terminator value.
System 200 can be further configured to determine one or more parameters associated with the variable length string (block 350). The parameters can be one or more of: (1) field length - the length of the variable length string field, (2) string length - the actual length of the variable length string field within each of the obtained messages, (3) the character used as a terminator, (4) the character used as a padding character, or (5) length-prefix parameters - the existence of a length-prefix, the size of the length-prefix, the endianness of the length-prefix and if the length-prefix includes a terminator.
A non-limiting example of a calculation to determine the type and parameters of the variable length string field type can be:
offset = candidate offset
cpwl = cpwb = cpb = 0
for each obtained message:
    l = length of candidate string that starts at offset
    if (offset - 1) is a valid index
        pb = value of byte at [offset-1]
        if l ∈ [pb - 1, pb + 9] {cpb++; apb = l - pb + 1;}
    if (offset - 2) is a valid index
        pwl = value of little-endian word at [offset-2]
        if l ∈ [pwl - 1, pwl + 9] {cpwl++; apwl = l - pwl + 1;}
        pwb = value of big-endian word at [offset-2]
        if l ∈ [pwb - 1, pwb + 9] {cpwb++; apwb = l - pwb + 1;}
i1 = p1·size(input) // input = obtained messages
i2 = p2·size(input)
i3 = p3·size(input)
wherein i1 < i2 < i3 < size(input)
if (cpwl < i1) and (cpwb < i1) and (cpb ≥ i1)
    lengthPrefixType = 1 byte
    lengthPrefixIncludesTerminator = (p2(apb) == 0)
else if (cpb ≥ i1) and (cpwb ≥ i1)
    // check if there is evidence that the prefix is indeed 2 bytes:
    if p98(pwb) > 255:
        lengthPrefixType = 2 bytes, big-endian
        lengthPrefixIncludesTerminator = (p2(apwb) == 0)
    else
        lengthPrefixType = 1 byte
        lengthPrefixIncludesTerminator = (p2(apb) == 0)
else if (cpwb ≥ i1)
    lengthPrefixType = 2 bytes, big-endian
    lengthPrefixIncludesTerminator = (p2(apwb) == 0)
else if (cpwl ≥ i1)
    lengthPrefixType = 2 bytes, little-endian
    lengthPrefixIncludesTerminator = (p2(apwl) == 0)
else
    lengthPrefixType = none
Wherein cpwl, cpwb and cpb are counters, accumulated over the obtained messages, that represent the existence of a string length field in a two-byte little-endian, a two-byte big-endian or a one-byte representation, respectively.
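The counting step of the calculation above can be sketched in Python as follows. This is a partial illustration only: it accumulates the three counters and leaves out the threshold comparison and the terminator inference. The function name and the `lengths` argument (the candidate string length of each message) are assumptions.

```python
def count_prefix_matches(messages, offset, lengths):
    """Count messages whose bytes just before `offset` look like a length prefix."""
    cpb = cpwl = cpwb = 0
    for msg, l in zip(messages, lengths):
        if offset - 1 >= 0:
            pb = msg[offset - 1]                      # one-byte prefix candidate
            if pb - 1 <= l <= pb + 9:
                cpb += 1
        if offset - 2 >= 0:
            word = msg[offset - 2:offset]
            pwl = int.from_bytes(word, "little")      # two-byte little-endian
            pwb = int.from_bytes(word, "big")         # two-byte big-endian
            if pwl - 1 <= l <= pwl + 9:
                cpwl += 1
            if pwb - 1 <= l <= pwb + 9:
                cpwb += 1
    return cpb, cpwl, cpwb
```

With small string lengths, a two-byte big-endian prefix also matches as a one-byte prefix (its high byte is zero), which is why the calculation above checks whether the 98th percentile of pwb exceeds 255 before settling on two bytes.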
The type can be calculated as follows, wherein (a) l is the string candidate length, indicating the number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value, (b) t is the termination character in the respective message: the first null / out-of-range character, or -1 if end-of-message, and (c) l2 is the full string candidate length, indicating the number of bytes of the respective message, starting at the index byte, having a character value or a terminator value:
if (t is identical for ≥ i2 input members) and (p80(l) - p20(l) > 2) // terminator/padding
    if (l2 is identical for ≥ i2 input members) // padding with >0 instances
        fieldLength = ceil(p7(l2))
        if (fieldLength ≥ 5) or ((fieldLength ≥ 3) and (lengthPrefixType ≠ none))
            stringRepresentationType = 3
    else
        fieldLength = ceil(p98(l)) + 1 // with terminator
        if fieldLength ≥ 3
            if message type has a constant length
                stringRepresentationType = 2
            else
                try to detect the next fields under both assumptions:
                stringRepresentationType = 2 or stringRepresentationType = 6
if lengthPrefixType == none
    shortl2 = avg(l2) of all input members with l ≤ p20(l)
    longl2 = avg(l2) of all input members with p20(l) < l ≤ p80(l)
    if (p80(l) ≥ 5) and (p80(l) - p20(l) ≤ 2)
        stringRepresentationType = 1
    else if (longl2 - shortl2 < 2.1) and (p20(l) - p2(l) < 3.2)
        fieldLength = ceil(p7(l2))
        if fieldLength ≥ 5
            stringRepresentationType = 3 // const pad
        else
            stringRepresentationType = 1
if (stringRepresentationType == 1)
    // determine the field length and the prefix parameters (if any)
    offsetFieldLength = (offset where p2(l) ≤ 1) + 1 - offset
    if offsetFieldLength ≥ 3
        fieldLength = offsetFieldLength
        // check for near-const-prefix == fieldLength
        if (fieldLength < 256)
            v = most frequent value of pb in all input members
            if (v == fieldLength) and (cardinality(v) ≥ i3)
                lengthPrefixType = 1 byte
            v = most frequent value of pwl in all input members
            if (v == fieldLength) and (cardinality(v) ≥ i3)
                lengthPrefixType = 2 bytes, little-endian
        else // fieldLength ≥ 256
            v = most frequent value of pwb in all input members
            if (v == fieldLength) and (cardinality(v) ≥ i3)
                lengthPrefixType = 2 bytes, big-endian
            v = most frequent value of pwl in all input members
            if (v == fieldLength) and (cardinality(v) ≥ i3)
                lengthPrefixType = 2 bytes, little-endian
Infer values of prefix (according to lengthPrefixType)
fieldLength = ceil(p98(prefix values))
if lengthPrefixIncludesTerminator
    fieldLength = fieldLength + 1
if not fieldLength ≥ 3
    fieldLength = 0
if cardinality of most frequent prefix value ≥ i3
    stringRepresentationType = 1 // const-length
else
    // for each input member with at least one pad char (l < fieldLength)
    // check if pad char consistent (offsets [l .. fieldLength))
    p1 = number of input members with consistent pad char
    p2 = number of input members with inconsistent pad char
    map: pad char c → number of input members with consistent pad c
    based on map:
        C = the most prevalent pad char
        n = number of input members with pad char C
    if n / (p1 + p2) ≥ 0.93
        padChar = C
        stringRepresentationType = 3 // const pad-char
    else
        if message type has a constant length
            stringRepresentationType = 4
        else
            try to detect the next fields under both assumptions:
            stringRepresentationType = 4 or stringRepresentationType = 5
System 200 can then be configured to validate, using the parameters, that the variable length string is a valid variable length string (block 355). A non-limiting example of such a validation test can be if (1) the field length parameter is larger than or equal to 5, or (2) the field length parameter is larger than or equal to 3 and the length-prefix parameter is that a length-prefix exists.
Upon a failed validation, the candidate string is determined by system 200 not to be a variable length string field, and system 200 can be further configured to set the index byte to be the byte subsequent to the index byte, if any, and return to block 320 to keep on searching for variable length string fields (block 360).
Upon a successful validation, the candidate string is determined by system 200 to be a string field and system 200 can be further configured to remove the variable length string from each of the messages and keep on analyzing the rest of the message for additional variable length string fields (block 365). The removal of the detected variable length string field from all of the obtained messages can allow system 200 to re-iterate the process for the remaining message-parts of the obtained messages. It is to be noted that in some cases, system 200 can manage a set of pointers - pointing for each of the obtained messages to the location of the byte subsequent to the identified variable length string and then re-iterate the process using the set of pointers to scan the remaining parts of the messages for variable length string fields.
After removing the variable length string from each of the messages, system 200 can be further configured to check if end-of-message has been reached (block 370).
If end-of-message has not been reached, system 200 can be further configured to return to block 315 (block 375).
If end-of-message has been reached, system 200 can be further configured to provide a user of the system or an external system with a list of identified variable length strings, being the variable length strings, the respective index byte of each variable length string, and the respective parameters of each variable length string (block 380). In some cases, system 200 can provide the user of the system or an external system with a specification of the variable length string fields of the obtained messages. The specification can be represented as a list of (skip, params) pairs, wherein the respective index byte number is represented by a skip - the number of bytes that need to be skipped from the start of the message or from the end of the previous identified variable length string field to reach the respective variable length string, and params includes the parameters of the respective constant length and variable length string.
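To illustrate how a (skip, params) specification could be consumed, the sketch below applies such a list to a single message. It covers only strings ending with a terminator value, and the layout of `params` as a dictionary with a `terminator` key is an assumption, not the format used by system 200.

```python
def extract_strings(message, spec):
    """Apply a list of (skip, params) pairs to one message.

    Only terminator-ended strings are handled; params["terminator"] is an
    assumed key holding the terminator byte value.
    """
    pos = 0
    strings = []
    for skip, params in spec:
        pos += skip  # bytes between the previous field's end and this string
        end = message.index(bytes([params["terminator"]]), pos)
        strings.append(message[pos:end].decode("ascii"))
        pos = end + 1  # continue after the terminator byte
    return strings
```

Because each skip is relative to the end of the previous string, the same specification applies to messages whose strings have different lengths.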
It is to be noted that, with reference to Fig. 2, some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein (for example, block 350 can be performed before block 345, etc.). It is to be further noted that some of the blocks are optional (for example, block 380). It should be also noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
Attention is drawn to Fig. 3, a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of variable number of elements fields within a plurality of messages of a given type, in accordance with the presently disclosed subject matter. According to certain examples of the presently disclosed subject matter, system 200 can be configured to perform a variable number of elements fields analysis process 400, e.g., utilizing the variable number of elements fields analysis module 250. System 200 analyzes obtained messages to determine a specification of the communication protocol and specifically to determine the location and parameters of variable number of elements fields, being fields that have a varying number of elements within the obtained messages. For example: a variable number of elements field can be a field that represents an array. The number of elements in the array can vary from message to message. In some cases, the communication protocol is a binary communication protocol. The binary communication protocol can be a proprietary protocol. The structure of the messages adhering to the binary communication is generally unknown or is unknown to the user of system 200.
For this purpose, system 200 can be configured to obtain a plurality of messages of a given type, each of the messages comprised of a respective sequence of bytes (block 410).
Each message is a sequence of one or more fields, wherein each field comprises one or more bytes of the bytes forming the message. The message structure adheres to the communication protocol, so that the communication protocol determines the location and parameters of each field within the message. Each of the obtained messages can have a specific given type, thereby all having the same structure even if not always having the same number of bytes due to variable-length fields. A message of a given type has a given structure. In a non-limiting example, the type of the message can be encoded in one of the fields of the message. In these cases, system 200 can identify the message type field and use it to discern between messages of different types.
In some cases, the obtained messages are preprocessed. For example, in some cases, the messages are received as a stream of bytes that is not divided into individual messages. In these cases, the preprocessing can include splitting the stream of bytes into messages by using methods and/or techniques known in the art. As another example, the preprocessing of the obtained messages includes identifying the types of these messages (noting that in some cases this step is not required as the plurality of messages obtained at block 410 are already of the same message type). In such cases, the obtained messages can include a plurality of messages of a plurality of message types, and the preprocessing can include removing messages that are of a type other than a given message type. The preprocessing can include in these cases the splitting of the obtained messages into message groups according to type and the execution by system 200 of variable number of elements fields analysis process 400 for each message group.
The splitting of the obtained messages into message groups according to message type is achieved using methods known in the art for message type identification, for example: a correlation method correlating between values of candidate message type fields and the message length. In some cases, the given message type is a variable-length message type, wherein the messages of the variable-length message type include one or more variable-length fields. The variable-length fields can be variable number of elements fields containing, for example, a representation of an array. The variable number of elements field has a prefix (which can be an integer) that represents the number of elements in the field. The prefix can be a one-byte, a two-byte, or a four-byte integer. The size of the prefix, pl, has one of a predetermined set of values. The predetermined set of values can be: one byte, two bytes or four bytes. When the prefixes have more than one byte, they can be represented either in little-endian representation or big-endian representation or any other fixed length representation. It is assumed that, in some cases, the prefixes of the fields with variable number of elements have the same length. The value of the prefix can be zero when there are no elements in the variable number of elements field. For example, a sensor may send a periodical message that contains information of multiple readings at a given time. The number of readings can be zero or more.
Each element of the variable number of elements field has an element-size. In some cases, the element-size is not part of the message and is not transmitted. It may be known to the message creators and receivers, as it can be part of the protocol specifications. For each variable number of elements field, the element-size is a constant positive integer (for example: eight bytes representing a geolocation, where each element is composed of a 4-byte latitude and a 4-byte longitude). For the given message type, there may be several such fields with the same element-size. In some cases, the fields forming the message can include fixed-length fields, having a fixed length, variable length strings or variable-length fields, wherein the length of such fields (i.e., the number of bytes comprised within the field) in one message is different than the length of such fields in a second message. In these cases, the preprocessing can include identifying the strings within the messages (e.g., by using the variable length string analysis process 300) and removing the strings from the messages.
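The field layout described above (a count prefix followed by that many fixed-size elements) can be read as in the following Python sketch. The function name, argument names and the sample bytes in the usage note are hypothetical.

```python
def read_elements(message, offset, prefix_len, byteorder, element_size):
    """Read one variable number of elements field starting at `offset`.

    Returns the list of elements and the offset of the byte after the field.
    """
    count = int.from_bytes(message[offset:offset + prefix_len], byteorder)
    start = offset + prefix_len
    end = start + count * element_size
    if end > len(message):
        raise ValueError("field overruns the message")
    elements = [message[i:i + element_size]
                for i in range(start, end, element_size)]
    return elements, end
```

For instance, with a one-byte prefix and two-byte elements, the bytes {2, 1, 1, 2, 2} decode to two elements, and a prefix value of zero yields an empty list.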
The preprocessing can be performed by system 200 (e.g., before or as part of the execution of variable number of elements fields analysis process 400) or by an external system.
The messages can be obtained by system 200 from a trace of timestamped sequence of packets captured from one or more computerized communication networks or communication channels. The trace can be captured using a sniffer, a receiver, a probe, or similar tools. The obtained messages, or some of them, can be part of a recording of historical messages communicated over the one or more communication networks or communication channels. In some cases, the obtained messages, or some of them, are part of real-time communication currently transferred over the one or more communication networks or communication channels. The obtained messages, or some of them, can be a combination made of a part of a recording of historical messages communicated over the one or more communication networks or communication channels and a part of real-time communication currently transferred over the one or more communication networks or communication channels. The messages can be obtained from a communication trace communicated using a common communication protocol or using a plurality of communication protocols.
In a non-limiting example, the obtained messages can include, for example, the following two variable number of elements messages: M1 {3, 1, 1, 2, 2, 3, 3} and M2 {2, 1, 1, 2, 2}. The messages are represented as a sequence of bytes, wherein in this example the first byte of each message is a one-byte prefix representing the number of elements in the field, and wherein each element of the fields has a length of two bytes. In this non-limiting example, the first message M1 is an array with three elements and message M2 has two elements.
After obtaining the plurality of messages, system 200 can be further configured to identify, for a given message length, being a length of one or more of the messages (for example: the given message length can be a length of a shortest message of the messages), a plurality of possible solutions which meet a given message length equation, wherein the given message length equation is:
given message length = e1·k1 + e2·k2 + ... + ey·ky + f·pl + m
wherein: m is a length of non-variable fields within the messages; f is a number of variable length fields within the messages; pl is a length of the prefix of each of the variable length fields; y is a number of distinct element sizes of the elements of the variable length fields; ki is an i-th distinct element size, wherein i = 1..y and ki ≥ 1; and ei is a number of distinct elements of ki distinct element size, wherein i = 1..y. Each solution includes a possible combination of y ≥ 1, e1..ey, k1..ky, f ≥ 1, pl ≥ 1, and m ≥ 0, all are integers, each of e1..ey is positive and each of k1..ky is positive (block 420).
In order to detect the fields with variable number of elements, the variable number of elements fields analysis process 400 determines the value of pl and its representation (endianness), the value of f, and for each variable number of elements field: finds the element-size and finds the location of the variable number of elements field within the message.
It is to be noted that the location may not be fixed, as it may depend on the number of elements in all preceding fields with variable number of elements.
The following notation is used herein to represent the equation - a message type with a variable number of elements has: f1 ≥ 1 fields, each with element-size k1 ≥ 1, f2 ≥ 1 fields, each with element-size k2 ≥ 1, ..., fy ≥ 1 fields, each with element-size ky ≥ 1, where all ki's are distinct, and wherein f = f1 + f2 + ... + fy. For each message of the given type: those f1 fields cumulatively have e1 ≥ 0 elements, those f2 fields cumulatively have e2 ≥ 0 elements, ..., those fy fields cumulatively have ey ≥ 0 elements. The values of e1, e2, ..., ey may be different for each message.
After removing the variable number of elements fields from the obtained messages, each message has constant-length fields wherein the combined length of all fixed-length fields of the given message type is marked as m, where m ≥ 0. Therefore, the equation of the length of the message is, for each message of length l: l = e1·k1 + e2·k2 + ... + f·pl + m. System 200 identifies the plurality of possible solutions to the equation of the length of the message. The identification of the plurality of possible solutions can also include testing the possible solutions against a given number of additional message lengths, being lengths of one or more other messages of the messages other than the given message, and wherein for each of the additional message lengths at least one solution exists to a given additional message length equation, wherein the given additional message length equation is: additional message length = e1·k1 + e2·k2 + ... + ey·ky + f·pl + m, so that ei ≥ 0. It is to be noted that the identification of the plurality of possible solutions can be optimized by calculating the Frobenius number for k1..ky.
A non-limiting example of a calculation to find the possible solutions can be the following algorithm, wherein ki has a predetermined lower threshold and a predetermined upper threshold and at least one of f and y has a predetermined upper threshold. In the following, l1 denotes the shortest message length of the given message type.
The algorithm finds the set of all {y, f, pl, k1, k2, ..., m} combinations that are valid for l1, where l1 = e1·k1 + e2·k2 + ... + f·pl + m, subject to e1 ≥ 0, e2 ≥ 0, ..., k1 ≥ minElementSize, k2 ≥ minElementSize, ..., f ≥ 1, pl ≥ 1, m ≥ 0, and all values are integers.
Specifically, the algorithm limits the values of: the common pl values (for example to one of {1, 2, 4}), the minElementSize value, which is the minimal element-size (for example to 2), the maxElementSize value, which is the maximal element-size (for example to 500), the maxUniqueElementSizes value, which is the maximal number of unique element-sizes (for example to 3) and the maxVarFields value, which is the maximal number of variable-length fields (for example to 7).
The algorithm searches for valid combinations:
- for each y ∈ {1 .. maxUniqueElementSizes}, where maxUniqueElementSizes is predefined:
  - for each f ∈ {1 .. maxVarFields}, where maxVarFields is predefined:
    - for each pl from a predefined set (e.g., {1, 2, 4}):
      - for each valid combination of distinct element-sizes k1...ky, where max(ki) = min(l - f·pl, maxElementSize) and maxElementSize is predefined:
        - for each m ∈ {0 .. l - f·pl}, wherein l - f·pl - m is represented as the sum e1·k1 + e2·k2 + ... + ey·ky: if there is at least one solution {e1 ≥ 0, e2 ≥ 0, ..., ey ≥ 0} to the Diophantine equation l - f·pl - m = e1·k1 + e2·k2 + ... + ey·ky, system 200 saves {y, f, pl, k1...ky, m}.
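The nested search above can be sketched in Python as a brute-force enumeration. The parameter names mirror the limits named in the text; the `representable` helper, which decides by dynamic programming whether the Diophantine equation has a non-negative solution, is an assumed implementation choice rather than the method of the disclosure. The sketch is only practical for short message lengths.

```python
from itertools import combinations

def representable(target, sizes):
    """True if target = sum(e_i * k_i) for some non-negative integers e_i."""
    if target < 0:
        return False
    reachable = [True] + [False] * target   # 0 is reachable with all e_i = 0
    for t in range(1, target + 1):
        reachable[t] = any(t >= k and reachable[t - k] for k in sizes)
    return reachable[target]

def find_combinations(length, prefix_lengths=(1, 2, 4), min_size=2,
                      max_size=500, max_unique_sizes=3, max_var_fields=7):
    """Return every (y, f, pl, (k1..ky), m) that is valid for `length`."""
    found = []
    for y in range(1, max_unique_sizes + 1):
        for f in range(1, max_var_fields + 1):
            for pl in prefix_lengths:
                budget = length - f * pl      # bytes left after the prefixes
                if budget < 0:
                    continue
                top = min(budget, max_size)
                for ks in combinations(range(min_size, top + 1), y):
                    for m in range(budget + 1):
                        if representable(budget - m, ks):
                            found.append((y, f, pl, ks, m))
    return found
```

For a message of length 7 such as M1 in the earlier example, the combination with one field, a one-byte prefix, element-size 2 and m = 0 is among the results.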
It is to be noted that the identification of the existence of such a solution can be optionally optimized by first calculating the Frobenius number, fb, for (k1, k2, ..., ky). If the Frobenius number is positive, the equation has a solution if l > fb, and has no solution if l = fb.
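A minimal sketch of computing the Frobenius number follows. It uses a simple scan, not any particular optimized algorithm: once min(sizes) consecutive integers are representable, every larger integer is representable too, so the last gap seen is the Frobenius number. Returning `None` when the element-sizes share a common divisor greater than 1 (in which case no largest non-representable integer exists) is a convention chosen for this example.

```python
from functools import reduce
from math import gcd

def frobenius_number(sizes):
    """Largest integer that is not a non-negative combination of `sizes`.

    Returns None when gcd(sizes) != 1, and -1 when 1 is among the sizes.
    """
    if reduce(gcd, sizes) != 1:
        return None
    smallest = min(sizes)
    reachable = [True]      # 0 is representable
    last_gap = -1
    run = 1                 # consecutive representable integers ending at t
    t = 0
    while run < smallest:
        t += 1
        ok = any(t >= k and reachable[t - k] for k in sizes)
        reachable.append(ok)
        if ok:
            run += 1
        else:
            last_gap = t
            run = 0
    return last_gap
```

For two coprime sizes a and b the result agrees with the known closed form a·b − a − b.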
The non-limiting exemplary algorithm can now make use of l1 .. l20 to denote the, for example, 20 shortest message lengths of the given message type.
For each of the {y, f, pl, k1...ky, m} combinations found above, the algorithm permits only the combinations that have a solution for each of l1 .. l20. Specifically, the Diophantine equation ln = e1·k1 + e2·k2 + ... + ey·ky + f·pl + m must have at least one {e1 ≥ 0, e2 ≥ 0, ..., ey ≥ 0} solution for each n ∈ [1 .. 20].
The exemplary value 20 is chosen to optimize performance. Other values can be used as well. If there are fewer than 20 message lengths, the algorithm will use all lengths.
For each of the {y, f, pl, k1...ky, m} combinations with at least one solution to the Diophantine equation, the algorithm calculates the set of all solutions, hence, all {y, f, pl, k1...ky, e1...ey, m} solutions are saved, where e1...ey are the number of elements with element-sizes k1...ky respectively for l1.
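The filtering step can be sketched as below. Combinations are represented as (y, f, pl, (k1..ky), m) tuples, which is an assumed representation, and the sketch takes the shortest distinct lengths, which is one possible reading of "shortest message lengths".

```python
def representable(target, sizes):
    """True if target = sum(e_i * k_i) for some non-negative integers e_i."""
    if target < 0:
        return False
    reachable = [True] + [False] * target
    for t in range(1, target + 1):
        reachable[t] = any(t >= k and reachable[t - k] for k in sizes)
    return reachable[target]

def keep_consistent(combos, lengths, limit=20):
    """Keep combinations that have a solution for each of the shortest lengths."""
    shortest = sorted(set(lengths))[:limit]   # all lengths if fewer than limit
    kept = []
    for y, f, pl, ks, m in combos:
        if all(representable(l - f * pl - m, ks) for l in shortest):
            kept.append((y, f, pl, ks, m))
    return kept
```

With the lengths 7 and 5 of M1 and M2 from the earlier example, element-size 2 survives the filter while element-size 3 does not, since 5 − 1 = 4 is not a multiple of 3.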
After identifying the plurality of possible solutions, system 200 can identify, for each of the plurality of possible solutions, one or more sub-solutions, each sub-solution including a possible combination of f1..fy, so that a sum of each combination is f, wherein: fi is a number of fields with element size ki and each of f1..fy is a positive integer (block 430).
After identifying the sub-solutions, system 200 can select a given message of the messages, having the given message length (block 440) and, for each of the sub-solutions, identify one or more hypotheses, each defining a list of (skipj, esj) pairs that meet the sub-solution and a message structure of the given message, wherein: j = 1..f; skipj is a number of bytes to skip from the message-start or from an end of a preceding variable-length field, to the jth variable length field; esj is the element size of the jth variable-length field; and given message length = p1·es1 + p2·es2 + ... + pf·esf + f·pl + m, wherein pj is a value of the prefix of the jth variable length field, representing the number of elements in the jth variable length field (block 450). System 200 now identifies one or more hypotheses wherein a content of each of the plurality of messages meets a structure corresponding to a respective hypothesis of the hypotheses, giving rise to the parameters (block 460).
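Enumerating the sub-solutions amounts to listing every way of writing f as an ordered sum of y positive integers. A minimal recursive sketch, with a hypothetical function name:

```python
def sub_solutions(f, y):
    """Yield every tuple (f1, ..., fy) of positive integers that sums to f."""
    if y == 1:
        if f >= 1:
            yield (f,)
        return
    # Leave at least one field for each of the remaining y - 1 element sizes.
    for first in range(1, f - y + 2):
        for rest in sub_solutions(f - first, y - 1):
            yield (first,) + rest
```

When f is smaller than y, the generator yields nothing, reflecting that each distinct element size needs at least one field.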
Continuing the above exemplary algorithm, for each selected given message I with length l1 do:
- for each {y, f, pl, k1...ky, f1...fy, e1...ey, m} combination that is valid for l1: find all {<skip, es>} lists, where skip is the number of bytes to skip from the message-start / from the end of the preceding variable-length field to reach the prefix of the current variable-length field, and es is the element-size of the current variable-length field, where for each such list the content of the prefixes p1, p2, ... (subject to their length and endianness) in I is such that:
  - l1 = p1·es1 + p2·es2 + ... + pf·esf + f·pl + m, where pn is the value of the prefix of variable-length field n, hence the number of elements in this field, and esn is the element-size of variable-length field n;
  - there are f1...fy fields with element-size k1...ky respectively;
  - there are e1...ey elements with element-size k1...ky respectively.
Finding the valid prefix fields:
- for each candidate offset of the first variable-length field, ofs1 ∈ [0 .. l1 - f·pl - e1·k1 - e2·k2 - ... - ey·ky]:
- assume, in turn, a field with each of the element-sizes k1...ky and then, skipping this field according to the prefix value and the assumed element-size, solve recursively for the remaining bytes of I.
A set of prefixes is valid if the end of the message has not been overrun after finding all the variable-length fields. Once a set of prefixes has been found, the algorithm can construct a {<skip, es>} list, thereby generating a list of (skipj, esj). It is to be noted that this part of the exemplary algorithm can be executed in parallel for different {y, f, pl, k1...ky, f1...fy, e1...ey, m} combinations.
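The recursive search can be sketched for a single message as follows. This is a simplified illustration: it checks that the fields do not overrun the message and that exactly m bytes lie outside them, but it does not separately verify the per-size element counts e1...ey. The function name and the `field_counts` mapping (element-size to number of fields, i.e., the sub-solution) are assumptions.

```python
def hypotheses(message, pl, byteorder, field_counts, m):
    """Return all lists of (skip, es) pairs that fit `message`.

    field_counts maps each element-size k_i to its number of fields f_i.
    """
    results = []

    def place(pos, remaining, fixed_used, acc):
        if not any(remaining.values()):
            # All fields placed: the trailing bytes must complete the m fixed bytes.
            if fixed_used + (len(message) - pos) == m:
                results.append(list(acc))
            return
        for skip in range(m - fixed_used + 1):
            start = pos + skip
            if start + pl > len(message):
                break
            count = int.from_bytes(message[start:start + pl], byteorder)
            for es in sorted(remaining):
                if remaining[es] == 0:
                    continue
                end = start + pl + count * es
                if end > len(message):      # field would overrun the message
                    continue
                remaining[es] -= 1
                acc.append((skip, es))
                place(end, remaining, fixed_used + skip, acc)
                acc.pop()
                remaining[es] += 1

    place(0, dict(field_counts), 0, [])
    return results
```

Each returned list would then still need to be tested against the remaining messages, as described for block 460.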
In some cases, system 200 can ignore some of the messages that do not adhere to the hypothesis. This can allow system 200 to disregard erroneous messages within the obtained messages.
Optionally, system 200 can provide an output including at least the list of (skipj, esj) (block 470). The output can be used by system 200 to support analysis of the obtained messages, can be provided to another system as input or can be provided to a user of system 200 or to an external system, external to system 200. It is to be noted that in some cases the obtained messages contain a plurality of zero-valued bytes. This can lead the algorithm to identify multiple solutions where all messages have zero elements of some variable-length fields. Therefore, for each variable number of elements field, it is required that at least three messages where the prefix has a non-zero value exist.
It is to be noted that, with reference to Fig. 3, some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein. It is to be further noted that some of the blocks are optional (for example, block 470). It should be also noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
Fig. 4 is a flowchart illustrating one example of a sequence of operations carried out for detecting absences in the traffic trace, in accordance with the presently disclosed subject matter.
According to certain examples of the presently disclosed subject matter, system 200 can be configured to perform a traffic trace absences detection process 500, e.g., utilizing the traffic trace absences detection module 260. System 200 can determine, for example as part of a reverse engineering analysis of a communication protocol, which elements of a traffic trace of messages of the communication protocol do not contain enough diversity to produce a complete specification. System 200 can then optionally suggest an estimation of an amount of additional traffic trace to be collected in order to enhance the completeness of the specification with additional unobserved message types.
For this purpose, system 200 can be configured to obtain a traffic trace of messages of a communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system (block 510).
One or more traffic traces of messages can be obtained by system 200 from a sequence of packets captured from one or more computerized communication networks or communication channels. In some cases, the sequence of packets is timestamped. The trace can be captured using a sniffer, a receiver, a probe, or similar tools. The obtained messages, or some of them, can be part of a recording of historical messages communicated over the one or more communication networks or communication channels. In some cases, the obtained messages, or some of them, are part of real-time communication currently transferred over the one or more communication networks or communication channels. The obtained messages, or some of them, can be a combination made of a part of a recording of historical messages communicated over the one or more communication networks or communication channels and a part of real-time communication currently transferred over the one or more communication networks or communication channels. The messages can be obtained from a communication trace communicated using a common communication protocol or using a plurality of communication protocols. In these cases, system 200 can pre-process the obtained messages into groups according to communication protocol and analyze each of the groups.
After obtaining the traffic trace, system 200 can be further configured to apply one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace (block 520). The unseen species problem estimators are a set of statistical techniques, originally designed for biodiversity measurements, that are used by system 200 as part of a protocol reverse engineering process, to estimate the degree to which the sample (e.g., the traffic trace) is representative of the communication protocol, and to suggest whether efforts should be invested to collect more traffic traces in order to produce a more complete and accurate specification. The unseen species problem estimators are one or more of: Chao1 estimators, Chao2 estimators, Abundance Coverage-based Estimators (ACE), Incidence Coverage-based Estimators (ICE), coverage-duplication estimators, penalized nonparametric maximum likelihood estimators or jackknife estimators, any combinations thereof, or any other unseen species problem estimator.
System 200 can be then configured to determine, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages (block 530). A non-limiting example can be of a given binary communication protocol that includes n message types, where n is unknown to system 200, and each message type has a different unknown probability function for being included in at least one message in each traffic trace. The system 200 analyzes, for example, a traffic trace that includes several million messages of the communication protocol. After discovering the location and the representation parameters of the message type field, the system 200 observes that in the message trace, each message type ci has a given prevalence of ni. Given the set of all observed message types and their prevalence {(c1, n1), (c2, n2), ...}, system 200 can estimate the number of unobserved message types, or the number of total message types in the communication protocol. Furthermore, system 200 can estimate how many additional message types would likely be observed if an additional communication trace containing additional messages is collected.
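By way of a non-limiting illustration only, the first estimation can be sketched with the bias-corrected form of the Chao1 estimator, which uses the number of message types observed exactly once (f1) and exactly twice (f2). The function name `chao1` and the sample prevalence values below are hypothetical and are not taken from the disclosure; any other unseen species problem estimator could be substituted.

```python
from collections import Counter


def chao1(type_counts):
    """Bias-corrected Chao1 estimate (illustrative sketch).

    type_counts maps each observed message type c_i to its prevalence n_i.
    Returns (estimated total number of types, estimated unobserved types).
    """
    s_obs = len(type_counts)                  # number of observed message types
    freq = Counter(type_counts.values())      # how many types share each prevalence
    f1, f2 = freq[1], freq[2]                 # types seen exactly once / exactly twice
    unseen = f1 * (f1 - 1) / (2 * (f2 + 1))   # bias-corrected Chao1 term
    return s_obs + unseen, unseen


# Hypothetical prevalence table {(c_i, n_i)} for six observed message types.
observed = {0x01: 5120, 0x02: 733, 0x07: 2, 0x09: 1, 0x0A: 1, 0x10: 1}
total_estimate, unseen_estimate = chao1(observed)
```

In this sketch, a large share of message types observed only once pushes the estimate of unobserved message types upward, which corresponds to a trace that is not yet representative of the communication protocol.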
System 200 can be similarly used to estimate a number of valid message lengths for a given message type, a number of discrete values for an enumerated (categorical) field type, or any other protocol parameter with a discrete and limited set of values.
After determining at least one of the first or second estimations, system 200 is optionally further configured to generate a relationship model, modeling a relationship between the given number of additional messages and the unobserved message types number to be observed upon obtaining the additional traffic trace (block 540). The relationship model can be, for example, a graph, such as a discovery curve, with or without confidence intervals. After determining at least one of the first or second estimations, system 200 is optionally further configured to provide the at least one of the first estimation or the second estimation to a user of the system (block 550) or to an external system, external to system 200. The provided information can be based on one or more of the unseen species problem estimators. The provided information can be displayed to a user of system 200. The provided estimators may include meta-estimators, integrating or combining several unseen species problem estimators. In addition, estimators and meta-estimators can be evaluated against the available communications traces or against simulated traces, using Monte Carlo or other simulation methods, to predict which estimates are most accurate for the given scenario.
System 200 can be further configured to receive a desired unobserved message types number (block 560). The received desired unobserved message types number can be provided by a user of system 200, for example: an analyst reverse engineering the communication protocol.
After receiving the desired unobserved message types number, system 200 can be configured to recommend the given number of additional messages to be obtained in the additional traffic trace based on the second estimation (block 570).
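By way of a non-limiting illustration only, the relationship model (block 540) and the recommendation (block 570) can be sketched as follows. The sketch substitutes the Good-Toulmin estimator, which is not named in the disclosure, as one possible way of predicting how many new message types an additional trace would reveal; it is only used here for additional traces no larger than the observed trace. All function names are hypothetical.

```python
from collections import Counter


def expected_new_types(type_counts, extra_messages):
    """Good-Toulmin estimate of new types seen in extra_messages more messages."""
    n = sum(type_counts.values())             # size of the observed trace
    t = extra_messages / n                    # relative size of the additional trace
    if not 0 <= t <= 1:
        raise ValueError("this sketch only extrapolates up to the observed trace size")
    freq = Counter(type_counts.values())      # f_i: number of types seen exactly i times
    return sum((-1) ** (i + 1) * (t ** i) * f for i, f in freq.items())


def discovery_curve(type_counts, step):
    """Points (additional messages, expected new types) of a discovery curve."""
    n = sum(type_counts.values())
    return [(m, expected_new_types(type_counts, m)) for m in range(0, n + 1, step)]


def recommend_additional_messages(type_counts, desired_new_types, step=1):
    """Smallest sampled trace size expected to reveal the desired number of types."""
    for m, expected in discovery_curve(type_counts, step):
        if expected >= desired_new_types:
            return m
    return None                               # not reachable within the sketch's range
```

A return value of `None` indicates that, under these assumptions, the desired number of newly observed message types is not expected within an additional trace of the same size as the observed one.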
After generating the relationship model, system 200 can be optionally configured to provide the relationship model to a user of the system 200 (block 580) or to an external system, external to system 200.
It is to be noted that, with reference to Fig. 4, some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein (for example, blocks 560-570 can be performed before block 550, etc.). It is to be further noted that some of the blocks are optional (for example, blocks 550-580). It should be also noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
Attention is drawn to Fig. 5, a flowchart illustrating one example of a sequence of operations carried out for determining location and parameters of constant length fields within a plurality of messages of a given type, in accordance with the presently disclosed subject matter.
According to certain examples of the presently disclosed subject matter, system 200 can be configured to perform a constant length fields analysis process 600, e.g., utilizing the constant length fields analysis module 270. System 200 analyzes obtained messages to determine a specification of the communication protocol, and specifically to determine the location and parameters of constant length fields, being fields that have a constant length within the obtained messages. The obtained messages are either all of the same constant-length message-type, or have been preprocessed to detect and remove variable length fields, fields with constant values, enumerated fields (e.g., fields with a limited set of values) and monotonic fields (e.g., message sequence number fields, time tag fields, etc.), thereby leaving only messages with constant length fields representing an integer value (e.g., 16, 32 or 64 bits, signed or unsigned, or any other integer value representation) or a floating-point value (e.g., 32 or 64 bits, or any other floating-point value representation). Integer fields are usually composed of one, two, four or eight bytes. The integer field values can be either signed or unsigned. Floating-point fields are usually represented according to Institute of Electrical and Electronics Engineers (IEEE) 754, and are composed of a sign bit, exponent bits (encoding a signed exponent as a biased unsigned value), and unsigned fraction bits. The size (in bits) of the exponent and fraction varies according to the size of the floating-point field. The representations for floating-point fields can be 4 bytes (1 sign bit, 8 exponent bits, and 23 fraction bits) and 8 bytes (1 sign bit, 11 exponent bits, and 52 fraction bits).
In some cases, the communication protocol is a binary communication protocol. The binary communication protocol can be a proprietary protocol. The structure of the messages adhering to the binary communication protocol is generally unknown or is unknown to the user of system 200. It is to be noted that in some cases the binary protocols can include other field types (e.g., bit-fields, etc.).
For this purpose, system 200 can be configured to obtain the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes (block 610). Each message is a sequence of one or more fields, wherein each field comprises one or more bytes of the bytes forming the message. The message structure adheres to the communication protocol, so that the communication protocol determines the location and parameters of each field within the message. Each of the obtained messages can have a specific given type, thereby all having the same structure. A message of a given type has a given structure. In a non-limiting example, the type of the message can be encoded in one of the fields of the message. In these cases, system 200 can identify the message type field and use it to discern between messages of different types.
In some cases, the obtained messages are preprocessed. For example, in some cases, the messages are received as a stream of bytes that is not divided into individual messages. In these cases, the preprocessing can include splitting the stream of bytes into messages by using methods and/or techniques known in the art. As another example, the preprocessing of the obtained messages includes identifying the types of these messages (noting that in some cases this step is not required as the plurality of messages obtained at block 610 are already of the same message type). In such cases, the obtained messages can include a plurality of messages of a plurality of message types, and the preprocessing can include removing messages that are of a type other than a given message type. The preprocessing can include in these cases the splitting of the obtained messages into message groups according to type and the execution by system 200 of variable length string analysis process 300 and/or variable number of elements fields analysis process 400 for each message group containing variable number of elements fields, removing these fields thereby remaining with messages with only constant length fields. It is to be noted that constant-length textual string fields would be identified by variable length string analysis process 300 so that constant length fields analysis process 600 is left with identifying integer and/or float fields only.
After obtaining the plurality of messages, system 200 can be further configured to provide a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores of the respective field-types of constant length field starting at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers (block 620). In some cases, the classifiers are scoring classifiers. Each scoring classifier can produce scores, each representing the plausibility that a given field-type of a constant length starts at a given offset within the plurality of messages. In some cases, wherein the classifiers are calibrated, the score represents the probability thereof. System 200 does not require the scoring classifiers to be calibrated, and the score can represent the plausibility that a given field-type of a constant length starts at a given offset within the plurality of messages. It is to be noted that there are a number of ways to calculate the classifiers' performance score: precision, recall, F1 score, Youden's J statistic, etc. In some cases, the classifiers are machine learning classifiers, each machine learning classifier being trained, utilizing training data from a plurality of data sources with known field locations and field types, to score the plausibility of a given field-type starting at the given offset. The field-type can be one of: big-endian 16 bits integer, big-endian 32 bits integer, big-endian 64 bits integer, big-endian 32 bits floating point, big-endian 64 bits floating point, little-endian 16 bits integer, little-endian 32 bits integer, little-endian 64 bits integer, little-endian 32 bits floating point, little-endian 64 bits floating point or any other constant field.
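By way of a non-limiting illustration only, the values that a classifier would inspect for a candidate field-type at a given offset can be extracted as sketched below, using the format characters of the Python `struct` module. The dictionary keys and the function name `values_at_offset` are hypothetical labels; the integer entries decode signed values, and the unsigned variants (`H`, `I`, `Q`) could be added in the same manner.

```python
import struct

# Hypothetical labels for the ten field-types listed above, mapped to
# struct format strings ('>' is big-endian, '<' is little-endian).
FIELD_TYPES = {
    "be_int16": ">h", "be_int32": ">i", "be_int64": ">q",
    "be_float32": ">f", "be_float64": ">d",
    "le_int16": "<h", "le_int32": "<i", "le_int64": "<q",
    "le_float32": "<f", "le_float64": "<d",
}


def values_at_offset(messages, offset, field_type):
    """Decode one value per message, assuming field_type starts at offset."""
    fmt = FIELD_TYPES[field_type]
    size = struct.calcsize(fmt)               # 2, 4 or 8 bytes
    return [
        struct.unpack_from(fmt, msg, offset)[0]
        for msg in messages
        if offset + size <= len(msg)          # skip messages that are too short
    ]
```

The same bytes yield different value lists under different field-types and endianness, which is what allows each classifier to score its own hypothesis independently.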
It is to be noted that the constant length fields can have different endianness, and correspondingly different classifiers are used for each endianness. In some cases, it can be assumed that all constant length fields of the given message type have the same endianness, thereby enabling system 200 to optimize the constant length fields analysis process 600 by disqualifying solutions that indicate field-types of the constant length fields with mixed endianness. It is to be noted that in some cases the classifiers for integer fields do not differentiate between signed and unsigned integers, unless there is explicit data hinting at a signed or unsigned integer.
A non-limiting example of a 64-bit integer classifier can be described as follows:
I164
Minimal size(input): 5,000
m = min(0.01·size(S64), 100,000)
R = set {m most frequent values in S64}
Measure cardinality in S64 of each of the values in R.
C = sum(cardinality(r ∈ R))
score I164 = min(1, C / (0.95·size(S64)))
wherein:
pn(X): n-percentile value of list X
Values:
Sn: list of n-bit signed integer values extracted from the given offset of all input members
Un: list of n-bit unsigned integer values extracted from the given offset of all input members
Fn: list of n-bit floating-point values extracted from the given offset of all input members
range90(Sn): p95(Sn) - p5(Sn) + 1 // avoiding outliers
range90(Un): p95(Un) - p5(Un) + 1 // avoiding outliers
avg90(Sn): avg({v ∈ Sn, p5(Sn) ≤ v ≤ p95(Sn)}) // avoiding outliers
avg90(Un): avg({v ∈ Un, p5(Un) ≤ v ≤ p95(Un)}) // avoiding outliers
Value differences:
Dns: list of all abs(Iy - Ix) where Iy and Ix are consecutive (same message source, short time difference) Sn members, Ix ∈ [p5(Sn) .. p95(Sn)] and Iy ∈ [p5(Sn) .. p95(Sn)] // avoiding outliers
Dnu: list of all abs(Iy - Ix) where Iy and Ix are consecutive (same message source, short time difference) Un members, Ix ∈ [p5(Un) .. p95(Un)] and Iy ∈ [p5(Un) .. p95(Un)] // avoiding outliers
range90(Dns): max(Dns)
range90(Dnu): max(Dnu)
avg90(Dns): avg({d ∈ Dns})
avg90(Dnu): avg({d ∈ Dnu})
It is to be noted that when the lists are very large, pn(X) can be approximated, e.g., using known in the art methods.
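By way of a non-limiting illustration only, the 64-bit integer classifier score can be sketched as follows, reading the definition above as: take the m most frequent values of S64, sum their occurrences into C, and divide by 95% of the list size, capped at 1. The function name `score_i1_64` is hypothetical, and returning `None` for inputs below the minimal size is an assumption of this sketch.

```python
from collections import Counter


def score_i1_64(s64, min_size=5_000):
    """Score the plausibility that s64 holds 64-bit integer field values.

    s64 is the list of 64-bit signed integer values extracted from the given
    offset of all input members.
    """
    size = len(s64)
    if size < min_size:
        return None                            # assumption: too few members to score
    m = int(min(0.01 * size, 100_000))         # number of most frequent values to keep
    top = Counter(s64).most_common(m)          # R with the cardinality of each value
    c = sum(count for _, count in top)         # C = sum of cardinalities over R
    return min(1.0, c / (0.95 * size))
```

Under this reading, a list dominated by a small set of recurring values scores close to 1, whereas a list in which nearly every value is distinct scores close to 0.01.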
Another non-limiting example of a 64-bit floating-point classifier can be described as follows:
F164
Minimal size(input): 30,000
score F164 (for detecting 64-bit floating point fields):
For each F64 member: b = value of the unsigned integer in the 11 exponent bits
score F164 = % of members where b ∈ 1023±16
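By way of a non-limiting illustration only, the 64-bit floating-point classifier score can be sketched as follows. Each candidate field is read as a raw 64-bit word, the 11 exponent bits are isolated, and the share of members whose exponent value lies within 1023±16 is returned as a fraction between 0 and 1 rather than as a percentage. The function name `score_f1_64` and the `None` result for undersized inputs are assumptions of this sketch.

```python
import struct


def score_f1_64(messages, offset, big_endian=True, min_size=30_000):
    """Fraction of messages whose 8 bytes at offset look like a 64-bit float."""
    fmt = ">Q" if big_endian else "<Q"         # raw 64-bit word, chosen endianness
    words = [
        struct.unpack_from(fmt, msg, offset)[0]
        for msg in messages
        if offset + 8 <= len(msg)
    ]
    if len(words) < min_size:
        return None                            # assumption: too few members to score
    hits = 0
    for word in words:
        b = (word >> 52) & 0x7FF               # the 11 exponent bits, as unsigned
        if abs(b - 1023) <= 16:                # b within 1023 +/- 16
            hits += 1
    return hits / len(words)
```

An exponent value near the bias of 1023 corresponds to magnitudes between roughly 2^-16 and 2^17, so fields holding moderate real-world quantities score high, while integer fields read as floating-point values usually do not.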
The above classifiers are non-limiting examples. Classifiers can be any mathematical models that can produce the plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages, for example: a model based on machine-learning, a model based on neural-networks, a model based on decision-tree algorithms, etc.
The classifier performance score can be based on heuristically testing against a large set of simulated binary protocols, encompassing tens of thousands of fields, and real data, collected from many sources. These tests show that some field types can be detected with higher precision and recall. The classifiers for these field types will receive a higher performance score than other classifiers, thereby creating a cascading classifier architecture. As a non-limiting example, a 64-bit floating-point classifier has a higher performance score than a 64-bit integer classifier.
After providing the plurality of classifiers, system 200 can be further configured to utilize the classifiers to produce plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages (block 630). A non-limiting example can be utilizing the I164 classifier described above to score the plausibility of each offset being the start of a 64-bit integer field and the F164 classifier described above to score the plausibility of each offset being the start of a 64-bit floating-point field.
After scoring the plausibilities, system 200 can be further configured to identify the highest score based on (i) the classifier performance score of each of the available classifiers, the available classifiers initially being all of the classifiers, and (ii) the highest plausibility score produced by each classifier (block 640). Continuing the above non-limiting example, it can be assumed that the highest score for a given offset was produced by the I164 classifier.
After identifying the highest score, system 200 can be further configured to mark a number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score (block 650). In our non-limiting example, system 200 will mark eight bytes within the message, beginning from the given offset, as a constant length field of type 64-bit integer in accordance with the I164 classifier.
System 200 then repeats the identifying of the highest score (block 640) and the marking (block 650) while (i) disregarding the scores associated with the marked bytes and (ii) removing from the available classifiers the classifier having the highest performance score of the available classifiers (block 660). It is to be noted that upon one or more of the bytes being marked more than once, each in association with a marking from a respective offset, the repeat is performed for each marking, each time while disregarding the scores associated with the respective subset of the marked bytes starting at the respective offset. In these cases, system 200 can produce more than one possible specification for the constant length fields within the messages, for example if the given offset received the same highest score from I164 and from F164.
It is also noted that marking is performed only if the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score does not exceed an end of the messages of the given type. For example, if the given offset is only seven bytes from the end of the message, it cannot be marked according to the score produced by I164, as there are not enough bytes for a 64-bit integer. In addition, marking is performed only if none of the bytes within the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest score is already marked. Continuing the above example, if there is a marked byte six bytes from the given offset, the given offset cannot be marked according to the score produced by I164, as there are not enough bytes for a 64-bit integer before the marked byte.
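By way of a non-limiting illustration only, the cascade of blocks 640 to 660 can be sketched as follows. Each classifier is represented by a hypothetical dictionary holding its name, field size in bytes, performance score and a mapping from offset to plausibility score. The sketch is a simplification: it visits classifiers in descending performance order, marks only the offset or offsets holding the highest usable plausibility score of each classifier, and does not branch into several alternative specifications when markings conflict.

```python
def mark_constant_fields(msg_len, classifiers):
    """Greedy marking sketch; returns a sorted list of (offset, classifier name)."""
    marked = [None] * msg_len                  # classifier name per byte, None if free
    fields = []

    def is_free(offset, size):
        # A field fits only if it ends within the message and hits no marked byte.
        return offset + size <= msg_len and all(
            b is None for b in marked[offset:offset + size]
        )

    # The available classifiers, highest performance score first; each one is
    # removed from consideration after its turn.
    for clf in sorted(classifiers, key=lambda c: c["performance"], reverse=True):
        size = clf["size"]
        usable = {o: s for o, s in clf["scores"].items() if is_free(o, size)}
        if not usable:
            continue
        best = max(usable.values())            # highest plausibility score
        for offset in sorted(o for o, s in usable.items() if s == best):
            if is_free(offset, size):          # re-check: tied offsets may overlap
                marked[offset:offset + size] = [clf["name"]] * size
                fields.append((offset, clf["name"]))
    return sorted(fields)
```

For instance, with a 20-byte message, a floating-point classifier whose best usable score is at offset 4 marks bytes 4 to 11, after which an integer classifier can only mark a field in the bytes that remain free.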
Upon the available classifiers not including any of the classifiers, system 200 can be further configured to provide a user of the system or an external system, external to system 200, with an indication of a start position of the constant length fields and the field-type of the constant length fields, based on the markings (block 670). It is to be noted that in some cases the indication of a start position of the constant length fields and the field-type of the constant length fields is provided or communicated to one or more external systems, external to system 200. In some cases, system 200 can provide and/or communicate a specification of the constant length fields of the obtained messages. The specification can be represented as one or more lists of (skip, params) pairs, wherein the respective index byte number is represented by a skip - the number of bytes that are needed to be skipped from the start of the message or from the end of the previous identified constant length field to reach the respective constant length field, and params includes the parameters of the respective constant length field - type of the field, endianness of the field, sign of the field (where applicable), etc. It is to be noted that in some cases, the specification will include a plurality of (skip, params) pairs as two or more markings have the same or nearly the same score. In these cases, the user or an external system is provided with a number of ways to interpret the constant length fields within the messages.
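By way of a non-limiting illustration only, the conversion of markings into a list of (skip, params) pairs can be sketched as follows. Each marking is given as a hypothetical tuple of (offset, size in bytes, params), and the function name `to_skip_params` is likewise an assumption of this sketch.

```python
def to_skip_params(markings):
    """Convert (offset, size, params) markings into (skip, params) pairs."""
    spec = []
    cursor = 0                                 # end of the previous identified field
    for offset, size, params in sorted(markings, key=lambda mark: mark[0]):
        spec.append((offset - cursor, params)) # bytes to skip to reach this field
        cursor = offset + size
    return spec


# Three hypothetical markings within a message.
example = to_skip_params([
    (12, 8, "big-endian 64 bits integer"),
    (4, 8, "big-endian 64 bits floating point"),
    (22, 2, "big-endian 16 bits integer"),
])
```

Expressing positions as skips relative to the preceding field, rather than as absolute offsets, keeps the specification in the same form as the (skipj, esj) lists produced for the variable length fields.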
It is to be noted that, with reference to Fig. 5, some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein. It is to be further noted that some of the blocks are optional (for example, block 670). It should be also noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
It is to be understood that the presently disclosed subject matter is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The presently disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter.
It will also be understood that the system according to the presently disclosed subject matter can be implemented, at least partly, as a suitably programmed computer. Likewise, the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method. The presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Claims

1. A system for determining parameters of variable length fields, being fields with a variable number of elements, within a plurality of messages of a given type, the system comprising a processing circuitry configured to:
(a) obtain the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes;
(b) for a given message length, being a length of one or more of the messages, identify a plurality of possible solutions which meet a given message length equation, wherein the given message length equation is: given message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m wherein: m is a length of non-variable fields within the messages; f is a number of variable length fields within the messages; pl is a length of a prefix of each of the variable length fields; y is a number of distinct element sizes of the elements of the variable length fields; ki is an ith distinct element size, wherein i = 1..y and ki ≥ 1; and ei is a number of distinct elements of ki distinct element size, wherein i = 1..y;
(c) for each possible solution of the given message length equation, identify one or more sub-solutions, each sub-solution includes a possible combination of f1..fy, so that a sum of each combination is f, wherein: fi is a number of fields with element size ki and each of f1..fy is a positive integer;
(d) select a given message of the messages, having the given message length;
(e) for each of the sub solutions, identify one or more hypotheses defining a list of (skipj, esj) pairs that meet the sub solution and a message structure of the given message, wherein: j = 1..f; skipj is a number of bytes to skip from a message-start or from an end of a preceding variable-length field, to a jth variable length field; esj is an element size of the jth variable-length field; and given message length = p1.es1 + p2.es2 + ... + pf.esf + f.pl + m, wherein pj is a value of the prefix of the jth variable length field, representing the number of elements in the jth variable length field;
(f) identify one or more hypotheses wherein a content of each of the plurality of messages meets a structure corresponding to a respective hypothesis of the hypotheses, giving rise to the parameters.
2. The system of claim 1, wherein the given message length is a shortest message length of the messages.
3. The system of claim 1, wherein identifying the plurality of possible solutions also includes testing the possible solutions against a given number of additional message lengths, being lengths of one or more other messages of the messages other than the given message, and wherein for each of the additional message lengths at least one solution exists to a given additional message length equation, wherein the given additional message length equation is: additional message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m, so that ei ≥ 0.
4. The system of claim 1, wherein identification of the plurality of possible solutions is optimized by calculating the Frobenius number for k1..ky.
5. The system of claim 1, wherein the messages are obtained from a trace obtained from one or more computerized networks.
6. The system of claim 5, wherein the trace includes a plurality of additional messages of one or more other types other than the given type.
7. The system of claim 1, wherein at least two of the messages have different message length.
8. The system of claim 1, wherein the processing circuitry is further configured to provide an output including at least the list of (skipj, esj).
9. The system of claim 1, wherein ki has a predetermined lower threshold and a predetermined upper threshold.
10. The system of claim 1, wherein at least one of f and y has a predetermined upper threshold.
11. The system of claim 1, wherein pl is one of a predetermined set of values: byte, two bytes or four bytes.
12. The system of claim 1, wherein pl has one of: a big-endian representation or a little-endian representation.
13. The system of claim 1, wherein some of the messages are ignored by the system.
14. A method for determining parameters of variable length fields, being fields with a variable number of elements, within a plurality of messages of a given type, the method comprising:
(a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes;
(b) for a given message length, being a length of one or more of the messages, identifying, by the processing circuitry, a plurality of possible solutions which meet a given message length equation, wherein the given message length equation is: given message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m wherein: m is a length of non-variable fields within the messages; f is a number of variable length fields within the messages; pl is a length of a prefix of each of the variable length fields; y is a number of distinct element sizes of the elements of the variable length fields; ki is an ith distinct element size, wherein i = 1..y and ki ≥ 1; and ei is a number of distinct elements of ki distinct element size, wherein i = 1..y;
(c) for each possible solution of the given message length equation, identifying, by the processing circuitry, one or more sub-solutions, each sub-solution includes a possible combination of f1..fy, so that a sum of each combination is f, wherein: fi is a number of fields with element size ki and each of f1..fy is a positive integer;
(d) selecting, by the processing circuitry, a given message of the messages, having the given message length;
(e) for each of the sub solutions, identifying, by the processing circuitry, one or more hypotheses defining a list of (skipj, esj) pairs that meet the sub solution and a message structure of the given message, wherein: j = 1..f; skipj is a number of bytes to skip from a message-start or from an end of a preceding variable-length field, to a jth variable length field; esj is an element size of the jth variable-length field; and given message length = p1.es1 + p2.es2 + ... + pf.esf + f.pl + m, wherein pj is a value of the prefix of the jth variable length field, representing the number of elements in the jth variable length field;
(f) identifying, by the processing circuitry, one or more hypotheses wherein a content of each of the plurality of messages meets a structure corresponding to a respective hypothesis of the hypotheses, giving rise to the parameters.
15. The method of claim 14, wherein the given message length is a shortest message length of the messages.
16. The method of claim 14, wherein identifying the plurality of possible solutions also includes testing the possible solutions against a given number of additional message lengths, being lengths of one or more other messages of the messages other than the given message, and wherein for each of the additional message lengths at least one solution exists to a given additional message length equation, wherein the given additional message length equation is: additional message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m, so that ei ≥ 0.
17. The method of claim 14, wherein identification of the plurality of possible solutions is optimized by calculating the Frobenius number for k1..ky.
18. The method of claim 14, wherein the messages are obtained from a trace obtained from one or more computerized networks.
19. The method of claim 18, wherein the trace includes a plurality of additional messages of one or more other types other than the given type.
20. The method of claim 14, wherein at least two of the messages have different message length.
21. The method of claim 14, wherein the processing circuitry is further configured to provide an output including at least the list of (skipj, esj).
22. The method of claim 14, wherein ki has a predetermined lower threshold and a predetermined upper threshold.
23. The method of claim 14, wherein at least one of f and y has a predetermined upper threshold.
24. The method of claim 14, wherein pl is one of a predetermined set of values: byte, two bytes or four bytes.
25. The method of claim 14, wherein pl has one of: a big-endian representation or a little-endian representation.
26. The method of claim 14, wherein some of the messages are ignored by the system.
27. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method for determining parameters of variable length fields, being fields with a variable number of elements, within a plurality of messages of a given type, the method comprising:
(a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes;
(b) for a given message length, being a length of one or more of the messages, identifying, by the processing circuitry, a plurality of possible solutions which meet a given message length equation, wherein the given message length equation is: given message length = e1.k1 + e2.k2 + ... + ey.ky + f.pl + m, wherein: m is a length of non-variable fields within the messages; f is a number of variable length fields within the messages; pl is a length of a prefix of each of the variable length fields; y is a number of distinct element sizes of the elements of the variable length fields; ki is an ith distinct element size, wherein i = 1..y and ki ≥ 1; and ei is a number of distinct elements of ki distinct element size, wherein i = 1..y;
(c) for each possible solution of the given message length equation, identifying, by the processing circuitry, one or more sub-solutions, each sub-solution includes a possible combination of f1..fy, so that a sum of each combination is f, wherein: fi is a number of fields with element size ki and each of f1..fy is a positive integer;
(d) selecting, by the processing circuitry, a given message of the messages, having the given message length;
(e) for each of the sub solutions, identifying, by the processing circuitry, one or more hypotheses defining a list of (skipj, esj) pairs that meet the sub solution and a message structure of the given message, wherein: j = 1..f; skipj is a number of bytes to skip from a message-start or from an end of a preceding variable-length field, to a jth variable length field; esj is an element size of the jth variable-length field; and given message length = p1.es1 + p2.es2 + ... + pf.esf + f.pl + m, wherein pj is a value of the prefix of the jth variable length field, representing the number of elements in the jth variable length field;
(f) identifying, by the processing circuitry, one or more hypotheses wherein a content of each of the plurality of messages meets a structure corresponding to a respective hypothesis of the hypotheses, giving rise to the parameters.
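Steps (b) and (c) can be pictured as two small enumerations. The Python sketch below is an illustrative sketch only: it assumes the element sizes k1..ky, f, pl and m have already been chosen (the claims leave these as unknowns to be searched, subject to the thresholds of the dependent claims), and the function names are hypothetical.

```python
def element_count_solutions(payload, sizes):
    """Yield every tuple (e1..ey) of non-negative integers with
    e1*k1 + ... + ey*ky == payload, for element sizes k_i in `sizes`."""
    if payload < 0:
        return
    if not sizes:
        if payload == 0:
            yield ()
        return
    k, rest = sizes[0], sizes[1:]
    for e in range(payload // k + 1):
        for tail in element_count_solutions(payload - e * k, rest):
            yield (e,) + tail


def field_count_splits(f, y):
    """Yield every tuple (f1..fy) of positive integers summing to f,
    i.e. how f variable length fields can be split over y element sizes."""
    if y < 1:
        return
    if y == 1:
        if f >= 1:
            yield (f,)
        return
    for first in range(1, f - y + 2):
        for tail in field_count_splits(f - first, y - 1):
            yield (first,) + tail


# Toy numbers: a 20-byte message, f = 2 fields, pl = 1, m = 12 fixed bytes,
# element sizes (2, 3). The payload is 20 - 2*1 - 12 = 6 bytes.
toy_payload = 20 - 2 * 1 - 12
toy_solutions = list(element_count_solutions(toy_payload, (2, 3)))
toy_splits = list(field_count_splits(3, 2))
```

Each solution from the first generator would then be paired with each split from the second to form the sub-solutions that step (e) tests against the selected message.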
28. A system for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length, the system comprising a processing circuitry configured to:
(a) obtain the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes;
(b) provide a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores of one or more of respective field-types of constant length field starting at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers;
(c) utilize the classifiers to produce plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages;
(d) identify the highest plausibility score based on (i) each classifier’s performance score of available classifiers, being the classifiers, and (ii) the highest plausibility score produced by each classifier;
(e) mark a number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest plausibility score;
(f) repeat (d) and (e) while (i) disregarding the plausibility scores associated with the marked bytes and (ii) removing from the available classifiers the classifier having the highest classifier performance score of the available classifiers; and
(g) upon the available classifiers not including any of the classifiers, provide a user of the system or an external system with an indication of a start position of the constant length fields and the field-type of the constant length fields, based on the markings.
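The loop over steps (d) to (g) can be sketched in Python as follows. This is a simplified reading offered as an assumption, not the claimed implementation: each classifier is reduced to a precomputed table of plausibility scores for a single field-type of fixed width, classifiers are visited in descending order of performance score, and the `Classifier` class and `mark_fields` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    field_type: str      # label of the constant length field-type
    width: int           # number of bytes the field-type occupies
    performance: float   # position of the classifier in the hierarchy
    scores: dict         # offset -> plausibility score


def mark_fields(message_length, classifiers):
    """Return {start_offset: field_type} after visiting every classifier."""
    taken = [False] * message_length
    marked = {}
    ordered = sorted(classifiers, key=lambda c: c.performance, reverse=True)
    for clf in ordered:
        # Disregard scores that touch marked bytes or run past the end.
        usable = {
            off: score for off, score in clf.scores.items()
            if off + clf.width <= message_length
            and not any(taken[off:off + clf.width])
        }
        if not usable:
            continue
        best = max(usable.values())
        for off in sorted(usable):
            span = slice(off, off + clf.width)
            if usable[off] == best and not any(taken[span]):
                marked[off] = clf.field_type
                taken[span] = [True] * clf.width
    return marked


toy_classifiers = [
    Classifier("le_u16", 2, 0.5, {2: 0.9, 8: 0.7}),
    Classifier("be_u32", 4, 0.9, {0: 0.8, 2: 0.5, 4: 0.8}),
]
```

In the toy data the higher-ranked classifier marks offsets 0 and 4 first, so the lower-ranked classifier's best score at offset 2 is disregarded and it marks offset 8 instead.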
29. The system of claim 28, wherein upon one or more of the bytes being marked more than once, each in association with a marking from a respective offset, the repeat is performed for each marking, each time while disregarding the plausibility scores associated with the respective subset of the marked bytes starting at the respective offset.
30. The system of claim 28, wherein marking is performed only if the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest plausibility score does not exceed an end of the messages of the given type.
31. The system of claim 28, wherein marking is performed only if bytes within the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest plausibility score are not already marked.
32. The system of claim 28, wherein the messages are obtained from a trace obtained from one or more computerized networks.
33. The system of claim 32, wherein the trace includes a plurality of additional messages of one or more other types other than the given type.
34. The system of claim 28, wherein some of the messages are ignored by the system.
35. The system of claim 28, wherein the constant length fields are one or more of: a. big-endian 16 bits integer; b. big-endian 32 bits integer; c. big-endian 64 bits integer; d. big-endian 32 bits floating point; e. big-endian 64 bits floating point; f. little-endian 16 bits integer; g. little-endian 32 bits integer; h. little-endian 64 bits integer; i. little-endian 32 bits floating point; or j. little-endian 64 bits floating point.
36. A method for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length, the method comprising:
(a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes;
(b) providing, by the processing circuitry, a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores of a respective field-types of constant length field starting at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers;
(c) utilizing, by the processing circuitry, the classifiers to produce the plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages;
(d) identifying, by the processing circuitry, the highest plausibility score based on
(i) each classifier’s performance score of available classifiers, being the classifiers, and
(ii) the highest plausibility score produced by each classifier;
(e) marking, by the processing circuitry, a number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field types and starting at one or more offsets associated with the highest plausibility score;
(f) repeating, by the processing circuitry, (d) and (e) while (i) disregarding the plausibility scores associated with the marked bytes and (ii) removing from the available classifiers the classifier having the highest performance score of the available classifiers; and
(g) upon the available classifiers not including any of the classifiers, providing, by the processing circuitry, a user of the system or an external system with an indication of a start position of the constant length fields and the field-type of the constant length fields, based on the markings.
37. The method of claim 36, wherein upon one or more of the bytes being marked more than once, each in association with a marking from a respective offset, the repeat is performed for each marking, each time while disregarding the plausibility scores associated with the respective subset of the marked bytes starting at the respective offset.
38. The method of claim 36, wherein marking is performed only if the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest plausibility score does not exceed an end of the messages of the given type.
39. The method of claim 36, wherein marking is performed only if bytes within the number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest plausibility score are not already marked.
40. The method of claim 36, wherein the messages are obtained from a trace obtained from one or more computerized networks.
41. The method of claim 40, wherein the trace includes a plurality of additional messages of one or more other types other than the given type.
42. The method of claim 36, wherein some of the messages are ignored by the system.
43. The method of claim 36, wherein the constant length fields are one or more of: k. big-endian 16 bits integer; l. big-endian 32 bits integer; m. big-endian 64 bits integer; n. big-endian 32 bits floating point; o. big-endian 64 bits floating point; p. little-endian 16 bits integer; q. little-endian 32 bits integer; r. little-endian 64 bits integer; s. little-endian 32 bits floating point; or t. little-endian 64 bits floating point.
44. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method for detecting and determining parameters of constant length fields within a plurality of messages of a given type having a constant length, the method comprising:
(a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a respective sequence of bytes;
(b) providing, by the processing circuitry, a plurality of classifiers, each classifier configured to receive the plurality of messages and to produce plausibility scores of the respective field-types of constant length field starting at a given offset within the plurality of messages of the given type, wherein each of the classifiers is associated with a respective classifier performance score, indicative of a position of the respective classifier in a hierarchy of the classifiers;
(c) utilizing, by the processing circuitry, the classifiers to produce plausibility scores of one or more of the respective field-types of constant length field starting at each offset within the plurality of messages;
(d) identifying, by the processing circuitry, the highest plausibility score based on
(i) each classifier’s performance score of available classifiers, being the classifiers, and
(ii) the highest plausibility score produced by each classifier;
(e) marking, by the processing circuitry, a number of bytes defined by the field-type of constant length field associated with the respective classifier and the respective field type and starting at one or more offsets associated with the highest plausibility score;
(f) repeating, by the processing circuitry, (d) and (e) while (i) disregarding the scores associated with the marked bytes and (ii) removing from the available classifiers the classifier having the highest performance score of the available classifiers; and
(g) upon the available classifiers not including any of the classifiers, providing, by the processing circuitry, a user of the system or an external system with an indication of a start position of the constant length fields and the field-type of the constant length fields, based on the markings.
45. A system for determining location and parameters of constant length and variable length strings within a plurality of messages of a given type, the system comprising a processing circuitry configured to:
(a) obtain the plurality of messages of the given type, each of the messages comprised of a sequence of bytes;
(b) determine, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective message;
(c) determine, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, wherein the plausibility score is calculated utilizing a calculated matrix that is determined based on one or more sequences of character types associated with each of one or more sequences of characters represented in the content, and irrespective of other content of other messages of the plurality of messages, and (b) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value;
(d) upon a criterion based on (a) the message string plausibility scores and (b) the string candidate lengths, being met, determine that the index byte is a start of a variable length string, and upon the criterion not being met, repeat (c)-(d) with the index byte being a subsequent byte, subsequent to the index byte, if any.
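The scan over index bytes in steps (b) to (d) can be sketched in Python as follows. Everything specific in the sketch is an assumption made for illustration: the character types, the weights in `TRANSITION_WEIGHTS` (standing in for the calculated matrix), the printable-ASCII definition of a character value, and the criterion (a minimum length and a minimum score in every message) are hypothetical choices, since the claim does not fix them.

```python
# Hypothetical character-type transition weights (stand-in for the matrix).
TRANSITION_WEIGHTS = {
    ("upper", "lower"): 1.0,
    ("lower", "lower"): 1.0,
    ("upper", "upper"): 0.8,
    ("digit", "digit"): 1.0,
    ("lower", "other"): 0.6,
    ("other", "lower"): 0.6,
    ("other", "upper"): 0.6,
}
DEFAULT_WEIGHT = 0.3  # assumed weight for any transition not listed


def char_type(byte):
    """Character type of a byte, or None if it is not printable ASCII."""
    if not 0x20 <= byte <= 0x7E:
        return None
    ch = chr(byte)
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    return "other"


def candidate_length(message, start):
    """Number of consecutive character bytes starting at `start`."""
    length = 0
    while (start + length < len(message)
           and char_type(message[start + length]) is not None):
        length += 1
    return length


def plausibility(message, start):
    """Mean transition weight over the candidate, using this message only."""
    length = candidate_length(message, start)
    if length < 2:
        return 0.0
    types = [char_type(b) for b in message[start:start + length]]
    weights = [TRANSITION_WEIGHTS.get(pair, DEFAULT_WEIGHT)
               for pair in zip(types, types[1:])]
    return sum(weights) / len(weights)


def find_string_start(messages, min_length=4, min_score=0.5):
    """First index at which the assumed criterion holds in every message."""
    for index in range(min(len(m) for m in messages)):
        if all(candidate_length(m, index) >= min_length
               and plausibility(m, index) >= min_score for m in messages):
            return index
    return None


toy_trace = [b"\x01\x02Hello\x00\x07", b"\x09\x03World\x00\x01"]
```

In the toy trace the first two bytes of each message are not character values, so the scan moves on and reports index 2, where both messages hold a five-character candidate.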
46. The system of claim 45, wherein the processing circuitry is further configured to: determine for the index byte of each of the messages a full string candidate length indicating a number of bytes of the respective message, starting at the index byte, having a character value or a terminator value; and upon the criterion being met, determine a type of the variable length string, utilizing at least the full string candidate length.
47. The system of claim 46, wherein the type is one of:
(a) a constant length string field not ending with the terminator value;
(b) a constant length string field ending with the terminator value;
(c) a constant length string field ending with padding values;
(d) a constant length string field with a length prefix ending with noise values;
(e) a variable length string field with the length prefix and not ending with the terminator value; or
(f) a variable length string field ending with the terminator value.
48. The system of claim 46, wherein upon the criterion being met, the processing circuitry is further configured to: remove the variable length string from each of the messages; and repeat (b)-(d).
49. The system of claim 46, wherein upon the criterion being met, the processing circuitry is further configured to determine one or more parameters associated with the variable length string.
50. The system of claim 49, wherein upon the criterion being met, the processing circuitry is further configured to validate, using the parameters, that the variable length string is a valid variable length string, and wherein upon the validation being unsuccessful, repeat (c)-(d) with the index byte being a subsequent byte, subsequent to the index byte, if any.
51. The system of claim 45, wherein the messages are obtained from a trace obtained from one or more computerized networks.
52. The system of claim 51, wherein the trace includes a plurality of additional messages of one or more other types other than the given type.
53. The system of claim 45, wherein at least two of the messages have different message length.
54. The system of claim 45, wherein the variable length string is an alphanumeric string.
55. The system of claim 49, wherein the processing circuitry is further configured to provide a user of the system or an external system with a list of identified variable length strings, being the variable length strings, the respective index byte of each variable length string, and the respective parameters of each variable length string.
56. A method for determining location and parameters of constant length and variable length strings within a plurality of messages of a given type:
(a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a sequence of bytes;
(b) determining, by the processing circuitry, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective messages;
(c) determining, by the processing circuitry, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, wherein the plausibility score is calculated utilizing a calculated matrix that is determined based on one or more sequences of character types associated with each of one or more sequences of characters represented in the content, and irrespective of other content of other messages of the plurality of messages, and (b) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value;
(d) upon a criterion based on (a) the message string plausibility scores and (b) the string candidate lengths, being met, determining, by the processing circuitry, that the index byte is a start of a variable length string, and upon the criterion not being met, repeat (c)-(d) with the index byte being a subsequent byte, subsequent to the index byte, if any.
57. The method of claim 56, further comprising: determining, by the processing circuitry, for the index byte of each of the messages a full string candidate length indicating a number of bytes of the respective message, starting at the index byte, having a character value or a terminator value; and upon the criterion being met, determining, by the processing circuitry, a type of the variable length string, utilizing at least the full string candidate length.
58. The method of claim 57, wherein the type is one of:
(a) a constant length string field not ending with the terminator value;
(b) a constant length string field ending with the terminator value;
(c) a constant length string field ending with padding values;
(d) a constant length string field with a length prefix ending with noise values;
(e) a variable length string field with the length prefix and not ending with the terminator value; or
(f) a variable length string field ending with the terminator value.
59. The method of claim 57, wherein upon the criterion being met, the method further comprising: removing, by the processing circuitry, the variable length string from each of the messages; and repeating, by the processing circuitry, (b)-(d).
60. The method of claim 57, wherein upon the criterion being met, the method further comprising: determining, by the processing circuitry, one or more parameters associated with the variable length string.
61. The method of claim 60, wherein upon the criterion being met, the method further comprising: validating, by the processing circuitry, using the parameters, that the variable length string is a valid variable length string, and wherein upon the validation being unsuccessful, repeating, by the processing circuitry, (c)-(d) with the index byte being a subsequent byte, subsequent to the index byte, if any.
62. The method of claim 56, wherein the messages are obtained from a trace obtained from one or more computerized networks.
63. The method of claim 62, wherein the trace includes a plurality of additional messages of one or more other types other than the given type.
64. The method of claim 56, wherein at least two of the messages have different message length.
65. The method of claim 56, wherein the variable length string is an alphanumeric string.
66. The method of claim 60, further comprising: providing, by the processing circuitry, a user of the system or an external system with a list of identified variable length strings, being the variable length strings, the respective index byte of each variable length string, and the respective parameters of each variable length string.
67. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method for determining location and parameters of constant length and variable length strings within a plurality of messages of a given type:
(a) obtaining, by a processing circuitry, the plurality of messages of the given type, each of the messages comprised of a sequence of bytes;
(b) determining, by the processing circuitry, for each of the messages of the given type, an index byte, being a first byte of the bytes of the respective messages;
(c) determining, by the processing circuitry, for the index byte of each of the messages: (a) a message string plausibility score, indicating a plausibility that a part of the respective message starting at the index byte is a string, based on analysis of a content of the part of the respective message, wherein the plausibility score is calculated utilizing a calculated matrix that is determined based on one or more sequences of character types associated with each of one or more sequences of characters represented in the content, and irrespective of other content of other messages of the plurality of messages, and (b) a string candidate length indicating a number of character bytes of the bytes of the respective message, starting at the index byte, each of the character bytes having a character value;
(d) upon a criterion based on (a) the message string plausibility scores and (b) the string candidate lengths, being met, determining, by the processing circuitry, that the index byte is a start of a variable length string, and upon the criterion not being met, repeat (c)-(d) with the index byte being a byte, subsequent to the index byte, if any.
68. A system comprising a processing circuitry configured to: obtain a traffic trace of messages of a proprietary communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; apply one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determine, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
69. The system of claim 68, wherein the processing circuitry is further configured to generate a relationship model, modeling a relationship between the given number of additional messages and the unobserved message types number to be observed upon obtaining the additional traffic trace.
70. The system of claim 68, wherein the unseen species problem estimators are one or more of: Chao1 estimators, Chao2 estimators, Abundance Coverage-based Estimators (ACE), Incidence Coverage-based Estimators (ICE), coverage-duplication estimators, penalized nonparametric maximum likelihood estimators or jackknife estimators, or any combinations thereof.
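As an illustration of one listed estimator, the Python sketch below applies the bias-corrected form of the Chao1 estimator to a trace reduced to a list of message-type labels. The estimator uses only the number of types seen exactly once (f1) and exactly twice (f2). The function name and the representation of the trace as a list of labels are assumptions made for the example; how the claimed system combines several estimators is not shown.

```python
from collections import Counter


def chao1_estimates(message_types):
    """Bias-corrected Chao1 estimate for a list of message-type labels.

    Returns (estimated_total_types, estimated_unobserved_types), where
    unobserved = f1 * (f1 - 1) / (2 * (f2 + 1)).
    """
    counts = Counter(message_types)
    observed = len(counts)                               # distinct types seen
    f1 = sum(1 for c in counts.values() if c == 1)       # seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)       # seen exactly twice
    unobserved = f1 * (f1 - 1) / (2 * (f2 + 1))
    return observed + unobserved, unobserved


# Toy trace: 13 messages of 7 types; four types appear once, two appear twice.
toy_types = ["A"] * 5 + ["B"] * 2 + ["C"] * 2 + ["D", "E", "F", "G"]
```

For the toy trace, f1 = 4 and f2 = 2, so the sketch estimates 2 unobserved types and 9 types in total.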
71. The system of claim 68, wherein the traffic trace is obtained from one or more computerized networks.
72. The system of claim 68, wherein the processing circuitry is further configured to provide the at least one of the first estimation or the second estimation to a user of the system or to an external system.
73. The system of claim 68, wherein the processing circuitry is further configured to: receive a desired unobserved message types number; and recommend the given number of additional messages to be obtained in the additional traffic trace based on the second estimation.
74. The system of claim 69, wherein the processing circuitry is further configured to provide the relationship model to a user of the system or to an external system.
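A relationship between the number of additional messages and the number of new message types expected among them can be sketched with the Good-Toulmin estimator. This estimator is not named in the claims and is swapped in here purely as an illustration; the function name and the list-of-labels trace are likewise assumptions. With n observed messages, t = (additional messages) / n, and f_i the number of types seen exactly i times, the estimate is the alternating sum of (-1)^(i+1) · t^i · f_i, which is only reliable for t ≤ 1.

```python
from collections import Counter


def expected_new_types(message_types, additional_messages):
    """Good-Toulmin estimate of new types in `additional_messages` more
    messages, given the observed list of message-type labels."""
    n = len(message_types)
    if n == 0:
        raise ValueError("the observed trace is empty")
    t = additional_messages / n
    if not 0 <= t <= 1:
        raise ValueError("estimate is only reliable for 0 <= t <= 1")
    # freq_of_freq[i] = number of types observed exactly i times.
    freq_of_freq = Counter(Counter(message_types).values())
    return sum((-1) ** (i + 1) * (t ** i) * count
               for i, count in freq_of_freq.items())


# Toy trace: 13 messages; f1 = 4, f2 = 2, f5 = 1.
toy_observed = ["A"] * 5 + ["B"] * 2 + ["C"] * 2 + ["D", "E", "F", "G"]
# A simple tabulated model: additional messages -> expected new types.
toy_model = {extra: expected_new_types(toy_observed, extra)
             for extra in range(0, len(toy_observed) + 1)}
```

Tabulating the function over a range of trace sizes, as `toy_model` does, gives one simple form such a model could take; inverting the table would suggest how many additional messages to capture for a desired number of remaining unobserved types.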
75. A method comprising: obtaining, by a processing circuitry, a traffic trace of messages of a proprietary communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; applying, by the processing circuitry, one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determining, by the processing circuitry, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
76. The method of claim 75, further comprising: generating, by the processing circuitry, a relationship model, modeling a relationship between the given number of additional messages and the unobserved message types number to be observed upon obtaining the additional traffic trace.
77. The method of claim 75, wherein the unseen species problem estimators are one or more of: Chao1 estimators, Chao2 estimators, Abundance Coverage-based Estimators (ACE), Incidence Coverage-based Estimators (ICE), coverage-duplication estimators, penalized nonparametric maximum likelihood estimators or jackknife estimators, or any combinations thereof.
78. The method of claim 75, wherein the traffic trace is obtained from one or more computerized networks.
79. The method of claim 75, further comprising: providing, by the processing circuitry, the at least one of the first estimation or the second estimation to a user of the system or to an external system.
80. The method of claim 75, further comprising: receiving, by the processing circuitry, a desired unobserved message types number; and recommending, by the processing circuitry, the given number of additional messages to be obtained in the additional traffic trace based on the second estimation.
81. The method of claim 76, further comprising: providing, by the processing circuitry, the relationship model to a user of the system or to an external system.
82. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processing circuitry of a computer to perform a method comprising: obtaining, by a processing circuitry, a traffic trace of messages of a proprietary communication protocol defining a plurality of message types, wherein a total number of the message types is unknown to the system; applying, by the processing circuitry, one or more unseen species problem estimators to the traffic trace giving rise to approximations of an unobserved message types number that is a number of the message types unobserved in the traffic trace; and determining, by the processing circuitry, based on the approximations, at least one of: (a) a first estimation of the total number of the message types, or (b) a second estimation of unobserved message types number expected to be observed upon obtaining an additional traffic trace comprising a given number of additional messages.
PCT/IL2022/050004 2021-01-26 2022-01-02 A system and method for producing specifications for fields with variable number of elements WO2022162655A1 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
IL280435 2021-01-26
IL280433A IL280433B (en) 2021-01-26 2021-01-26 A system and method for detecting absences in traffic trace
IL280435A IL280435B (en) 2021-01-26 2021-01-26 A system and method for producing specifications for variable length strings
IL280437A IL280437B (en) 2021-01-26 2021-01-26 A system and method for producing specifications for constant length messages
IL280436A IL280436B (en) 2021-01-26 2021-01-26 A system and method for producing specifications for fields with variable number of elements
IL280436 2021-01-26
IL280433 2021-01-26
IL280437 2021-01-26

Publications (1)

Publication Number Publication Date
WO2022162655A1 (en) 2022-08-04

Family

ID=82654233

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2022/050004 WO2022162655A1 (en) 2021-01-26 2022-01-02 A system and method for producing specifications for fields with variable number of elements

Country Status (1)

Country Link
WO (1) WO2022162655A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120210426A1 (en) * 2009-10-30 2012-08-16 Sun Yat-Sen University Analysis system for unknown application layer protocols
US20180054403A1 (en) * 2013-12-10 2018-02-22 International Business Machines Corporation Opaque Message Parsing
CN109040081A (en) * 2018-08-10 2018-12-18 哈尔滨工业大学(威海) A kind of protocol fields conversed analysis system and method based on BWT
KR102069142B1 (en) * 2018-12-11 2020-02-11 국방과학연구소 Apparatus and method for automatic extraction of accurate protocol specifications

Similar Documents

Publication Publication Date Title
CN109587008B (en) Method, device and storage medium for detecting abnormal flow data
CN112765324B (en) Concept drift detection method and device
CN114422267B (en) Flow detection method, device, equipment and medium
CN109685092B (en) Clustering method, equipment, storage medium and device based on big data
CN111523588B (en) Method for classifying APT attack malicious software traffic based on improved LSTM
CN110808738B (en) Data compression method, device, equipment and computer readable storage medium
CN113656254A (en) Abnormity detection method and system based on log information and computer equipment
US10114839B2 (en) Format identification for fragmented image data
CN108234452B (en) System and method for identifying network data packet multilayer protocol
CN112364014A (en) Data query method, device, server and storage medium
CN106446102B (en) Terminal positioning method and device based on map fence
CN112087450B (en) Abnormal IP identification method, system and computer equipment
CN115599830A (en) Method, device, equipment and medium for determining data association relation
WO2022162655A1 (en) A system and method for producing specifications for fields with variable number of elements
IL280433B (en) A system and method for detecting absences in traffic trace
IL280435B (en) A system and method for producing specifications for variable length strings
IL280437B (en) A system and method for producing specifications for constant length messages
IL280436B (en) A system and method for producing specifications for fields with variable number of elements
KR102014234B1 (en) Method and Apparatus for automatic analysis for Wireless protocol
CN115622926A (en) Industrial control protocol reverse analysis method based on network traffic
CN111539576B (en) Risk identification model optimization method and device
CN114064434A (en) Early warning method and device for log abnormity, electronic equipment and storage medium
CN108154179B (en) Data error detection method and system
CN115297189B (en) Method and system for reversely analyzing man-machine cooperation fast industrial control protocol
CN114625786B (en) Dynamic data mining method and system based on wind control technology

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22745500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22745500

Country of ref document: EP

Kind code of ref document: A1