WO2019242443A1 - String-based malware identification method, system and related apparatus - Google Patents

String-based malware identification method, system and related apparatus

Info

Publication number
WO2019242443A1
Authority
WO
WIPO (PCT)
Prior art keywords
importance
character string
file
string
malware
Prior art date
Application number
PCT/CN2019/087563
Other languages
English (en)
French (fr)
Inventor
章明星
位凯志
Original Assignee
深信服科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深信服科技股份有限公司
Publication of WO2019242443A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis

Definitions

  • The present application relates to the field of malware identification, and in particular to a string-based malware identification method, system, and device, and a computer-readable storage medium.
  • The purpose of this application is to provide a string-based malware identification method that uses TFIDF technology to compute the importance of each string extracted from the PE files included in the software under test, selects the high-importance strings with strong classification ability and distinctiveness, and then uses those high-importance strings as malware identification features to complete malware identification.
  • This identification method does not rely on human experience to identify malicious content; because it is automated, it is more efficient, identifies malicious content more accurately, and is less likely to miss it.
  • Another object of the present application is to provide a string-based malware identification system, device, and computer-readable storage medium.
  • The present application provides a string-based malware identification method, which includes:
  • using TFIDF technology to calculate the importance of each string extracted from the PE files included in the software under test to each PE file, to obtain importance evaluation parameters;
  • filtering the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and using each of the high-importance strings as a malware identification feature, wherein the magnitude of the importance evaluation parameter is directly proportional to the importance of the string;
  • performing malware identification on the software under test by using each of the malware identification features.
  • Using TFIDF technology to calculate the importance of each string extracted from the PE files included in the software under test to each PE file, to obtain the importance evaluation parameters, includes calculating the importance with a formula in TF, IDF, and N, wherein:
  • TF is the number of occurrences of the string in each of the PE files;
  • IDF is the number of PE files in the software under test that contain the string;
  • N is the total number of PE files included in the software under test.
  • The method further includes:
  • merging the PE files included in the software under test by malware category to obtain a merged PE file set for each category;
  • using the TFIDF technology to calculate the importance of each string to each PE file in the merged PE file sets, to obtain merged importance parameters;
  • selecting, in descending order of the merged importance parameters, a second preset number of high-importance strings, which are also used as malware identification features, wherein the second preset number is less than the first preset number.
  • Before using the TFIDF technology to calculate the importance of each string extracted from the PE files included in the software under test to each PE file, the method further includes:
  • generating, for each PE file, a corresponding PE document in a preset combination format from the first string set, the second string set, and the third string set extracted from that PE file, so as to reduce the number of invalid strings.
  • Before performing malware identification on the software under test by using each of the malware identification features, the method further includes:
  • converting each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
  • Performing malware identification on the software under test by using each of the malware identification features includes:
  • classifying each of the topic distribution vectors by using a linear classification algorithm.
  • Extracting strings containing a preset function name from the decompiled data includes:
  • using a regular expression to extract the strings containing the preset function name from the decompiled data.
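The regular-expression extraction described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the list `PRESET_FUNCTIONS`, the helper name `extract_function_strings`, and the disassembly-like sample text are all assumptions, since the source does not enumerate the preset function names or fix the decompiler output format.

```python
import re
from typing import List

# Hypothetical preset function names; the source does not list them.
PRESET_FUNCTIONS = ["CreateRemoteThread", "VirtualAllocEx", "WriteProcessMemory"]

# One alternation over the escaped names, anchored on word boundaries so that
# a name embedded in a longer identifier is not matched.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in PRESET_FUNCTIONS) + r")\b"
)


def extract_function_strings(decompiled_text: str) -> List[str]:
    """Return every preset function name found, in order of appearance."""
    return _PATTERN.findall(decompiled_text)


sample = (
    "call ds:VirtualAllocEx\n"
    "mov eax, 1\n"
    "call ds:WriteProcessMemory\n"
    "call MyVirtualAllocExHelper\n"
)
print(extract_function_strings(sample))
```

The word-boundary anchors are a design choice for this sketch; a looser pattern would also pick up wrapper functions whose names merely contain a preset name.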
  • The method further includes:
  • generating a statistical result in the form of a map from each string to the number of PE files containing that string.
  • The present application also provides a string-based malware identification system, which includes:
  • a TFIDF calculation unit, configured to use the TFIDF technology to calculate the importance of each string extracted from the PE files included in the software under test to each PE file, to obtain importance evaluation parameters;
  • a malware identification feature selection unit, configured to filter the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and to use each of the high-importance strings as a malware identification feature, wherein the magnitude of the importance evaluation parameter is directly proportional to the importance of the string;
  • a malware identification unit, configured to perform malware identification on the software under test by using each of the malware identification features.
  • The TFIDF calculation unit includes:
  • a formula calculation subunit, configured to use a formula in TF, IDF, and N to calculate the importance of each string extracted from the PE files included in the software under test to each PE file, to obtain the importance evaluation parameters, wherein TF is the number of occurrences of the string in each of the PE files, IDF is the number of PE files in the software under test that contain the string, and N is the total number of PE files included in the software under test.
  • the system also includes:
  • an ordering unit, configured to arrange the importance evaluation parameters in descending order to obtain an importance evaluation parameter ranking queue;
  • a first identification feature selection subunit, configured to select the first preset number of importance evaluation parameters from the ranking queue and use the corresponding strings as the high-importance strings.
  • the system also includes:
  • a same-category merging subunit, configured to merge the PE files included in the software under test by malware category to obtain a merged PE file set for each category;
  • a post-merge parameter calculation subunit, configured to use the TFIDF technology to calculate the importance of each of the strings to each PE file in the merged PE file sets, to obtain merged importance parameters;
  • a second identification feature selection subunit, configured to select, in descending order of the merged importance parameters, a second preset number of high-importance strings, which also serve as malware identification features;
  • wherein the second preset number is less than the first preset number.
  • the system also includes:
  • a PE file content splitting unit configured to divide each of the PE files into a header metadata section, a code section, and a data section according to a specific file format constituting the PE file;
  • a header content extraction unit, configured to extract the content of the header metadata section by using a metadata parser to obtain a first string set;
  • a code segment content extraction unit configured to decompile the content in the code segment, and extract a string containing a preset function name from the decompiled data to obtain a second string set;
  • a data segment content extraction unit configured to extract a character string including at least one of an IP address, an email address, and a web address from the content contained in the data segment to obtain a third character string set;
  • a PE document generating unit, configured to generate, for each PE file, a corresponding PE document in a preset combination format from the first string set, the second string set, and the third string set extracted from that PE file, so as to reduce the number of invalid strings.
  • the system also includes:
  • a topic analysis classification unit, configured to classify the malware identification features by using a topic analysis technology to obtain topic-classified identification features;
  • a topic distribution vector generating unit, configured to convert each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
  • the malware identification unit includes:
  • a linear classification algorithm classification subunit is configured to classify each of the topic distribution vectors by using a linear classification algorithm.
  • the code segment content extraction unit includes:
  • a regular expression matching extraction subunit is configured to use a regular expression to extract a character string containing the preset function name from the decompiled data.
  • the system also includes:
  • the statistical result generating unit is configured to generate a statistical result in the form of a map between each character string and a corresponding number of PE files containing the character string.
  • the present application also provides a character string-based malware identification device.
  • the device includes:
  • a processor configured to implement the steps of the malware identification method as described above when the computer program is executed.
  • The present application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the steps of the malware identification method described above are implemented.
  • The string-based malware identification method provided by the present application uses TFIDF technology to calculate the importance of each string extracted from the PE files included in the software under test to each PE file, to obtain importance evaluation parameters; filters the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and uses each of the high-importance strings as a malware identification feature, wherein the magnitude of the importance evaluation parameter is directly proportional to the importance of the string; and performs malware identification on the software under test by using each of the malware identification features.
  • The technical solution provided in this application uses TFIDF technology to compute the importance of each string extracted from the PE files included in the software under test, selects the high-importance strings with strong classification ability and distinctiveness, and then uses those high-importance strings as malware identification features to complete malware identification.
  • The identification method does not rely on human experience to identify malicious content; because it is automated, it is more efficient, identifies malicious content more accurately, and is less likely to miss it.
  • This application also provides a string-based malware identification system, device, and computer-readable storage medium, which have the same beneficial effects; these are not repeated here.
  • FIG. 1 is a flowchart of a character string-based malware identification method according to an embodiment of the present application
  • FIG. 2 is a flowchart of extracting a character string from a PE file in a character string-based malware identification method according to an embodiment of the present application
  • FIG. 3 is a flowchart of another string-based malware identification method according to an embodiment of the present application.
  • FIG. 4 is a structural block diagram of a character string-based malware recognition system according to an embodiment of the present application.
  • The core of the application is to provide a string-based malware identification method, system, device, and computer-readable storage medium that use TFIDF technology to compute the importance of each string extracted from the PE files included in the software under test, select the high-importance strings with strong classification ability and distinctiveness, and then use those high-importance strings as malware identification features to complete malware identification.
  • The identification method does not rely on human experience to identify malicious content; because it is automated, it is more efficient, identifies malicious content more accurately, and is less likely to miss it.
  • FIG. 1 is a flowchart of a string-based malware identification method provided by an embodiment of the present application.
  • S101 Use the TFIDF technology to calculate the importance of each string extracted from the PE files included in the software under test to each PE file, to obtain importance evaluation parameters;
  • TFIDF stands for term frequency-inverse document frequency.
  • The main idea of TFIDF technology is that if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it is considered to have good class-discrimination ability and to be suitable for classification.
  • TF refers to how often a given word appears in a document; the raw count is normalized to prevent a bias toward long documents (the same word is likely to occur more often in a long file than in a short one).
  • For a word $t_i$ in a particular document $d_j$, its TF can be expressed as $\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$.
  • The numerator $n_{i,j}$ in the above formula is the number of occurrences of the word in the specific file, and the denominator is the sum of the occurrences of all words in the specific file.
  • The IDF of the word is obtained by dividing the total number of files in the file set by the number of files containing the word and then taking the logarithm of the quotient: $\mathrm{idf}_i = \log \dfrac{|D|}{|\{j : t_i \in d_j\}|}$.
  • TFIDF technology therefore tends to filter out common words and retain the important words that are useful for classification.
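The textbook TF and IDF definitions above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `tf_idf` and the toy string lists are assumptions, and the score is the standard product tf × idf with a natural logarithm, which the passage describes in words but does not tie to any particular implementation.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each document, a map from term to its tf * idf score."""
    n_docs = len(documents)
    # Number of documents that contain each term at least once.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["kernel32.dll", "evil.example", "kernel32.dll"],
    ["kernel32.dll", "user32.dll"],
    ["kernel32.dll"],
]
scores = tf_idf(docs)
```

In this toy input `kernel32.dll` occurs in every document, so its idf is log(3/3) = 0 and it is filtered out regardless of how often it appears, while `evil.example` keeps a positive score.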
  • This application applies the TFIDF technology commonly used in the field of natural language processing to the field of malware identification technology.
  • The purpose is to identify each PE file (Portable Executable) included in the software under test. Common EXE, DLL, OCX, SYS, and COM files are all PE files, which are the program files of the Microsoft Windows operating system.
  • TFIDF technology is used for two reasons. On the one hand, code written in a programming language follows a fixed format, so the strings that occur most often are likely to be format-mandated strings that are useless for malware identification, and TFIDF filters these out well. On the other hand, existing malware is usually made by embedding malicious content in normal software.
  • The embedded malicious content is small and its specific strings occur only a few times, yet it has a strong attack capability and exploits this low profile to evade inspection.
  • The present application addresses this phenomenon by applying the TFIDF technology commonly used in the field of natural language processing to the field of malware identification, in order to find strings with strong classification ability in the software under test.
  • The strings used in this step can be extracted in the simplest way, that is, by converting the entire content of the PE file into strings; however, this yields a large number of strings that are useless for malware identification. Alternatively, each part of the PE file can be extracted separately according to the file format of the PE file, which reduces the meaningless strings obtained and the amount of subsequent TFIDF computation.
  • A specific string extraction method is described in a subsequent embodiment, but extraction is not limited to it; a method that differs slightly from that embodiment while following the same idea also falls within the protection scope of this application.
  • S102 Filter the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and use each high-importance string as a malware identification feature;
  • This step filters out the strings that meet the requirements based on the magnitude of the calculated importance evaluation parameters.
  • Because the magnitude of an importance evaluation parameter is proportional to importance, it suffices to select the high-importance strings with the larger parameters and use them as malware identification features in the subsequent steps.
  • For example, the importance evaluation parameters of all strings can be pooled and divided into tiers by importance level, and the strings in the most important tier selected.
  • Further, malware can be divided into types, such as ransomware, Trojan horses, and worms, or into even finer-grained families. The TFIDF technology is then used to calculate string importance for each type's file set, and the strings with larger importance evaluation parameters within a single type are also retained as malware identification features, which reduces the chance of omission and improves malware identification.
  • Merge the PE files included in the software under test by malware category to obtain a merged PE file set for each category;
  • Use the TFIDF technology to calculate the importance of each string within each merged PE file set, to obtain merged importance parameters;
  • Select, in descending order of the merged importance parameters, a second preset number of high-importance strings and also use them as malware identification features, wherein the second preset number < the first preset number (because each finer-grained file set contains fewer files than the original, uncategorized file set).
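The per-category selection described above can be sketched as follows. The text leaves open how the merged sets are scored, so this sketch makes an explicit assumption: each category's files are merged into one pooled document and the standard tf × idf product is computed across the category documents. The function name `top_strings_per_category` and the category labels in the example are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def top_strings_per_category(
    labelled_docs: List[Tuple[str, List[str]]], second_preset_number: int
) -> Dict[str, List[str]]:
    """Merge documents by category, score strings, keep the top ones per category."""
    # Pool the string counts of all files that share a category label.
    merged: Dict[str, Counter] = {}
    for category, strings in labelled_docs:
        merged.setdefault(category, Counter()).update(strings)

    # Number of categories in which each string occurs.
    n_categories = len(merged)
    category_freq = Counter()
    for counts in merged.values():
        category_freq.update(counts.keys())

    selected = {}
    for category, counts in merged.items():
        total = sum(counts.values())
        scored = {
            s: (c / total) * math.log(n_categories / category_freq[s])
            for s, c in counts.items()
        }
        # Descending score; ties broken alphabetically for determinism.
        ranked = sorted(scored, key=lambda s: (-scored[s], s))
        selected[category] = ranked[:second_preset_number]
    return selected


labelled = [
    ("ransomware", ["encrypt", "kernel32.dll"]),
    ("ransomware", ["encrypt", "pay.example"]),
    ("worm", ["kernel32.dll", "spread"]),
]
print(top_strings_per_category(labelled, 1))
```

A string such as `kernel32.dll` that appears in every category scores zero here, so only category-specific strings survive the selection.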
  • S103 Perform malware identification on the software under test by using each malware identification feature;
  • This step identifies malware in the software under test by using the obtained malware identification features. Specifically, it can be implemented with a classifier built on those features.
  • the classifier can be built based on linear or non-linear classification algorithms, including but not limited to logistic regression, support vector machines, decision trees, and so on.
  • The string-based malware identification method provided by this embodiment uses TFIDF technology to compute the importance of each string extracted from the PE files included in the software under test, selects the high-importance strings with strong classification ability and distinctiveness, and then uses those high-importance strings as malware identification features to complete malware identification.
  • The identification method does not rely on human experience to identify malicious content; because it is automated, it is more efficient, identifies malicious content more accurately, and is less likely to miss it.
  • FIG. 2 is a flowchart of extracting a character string from a PE file in a method for character string-based malware identification provided by an embodiment of the present application.
  • the methods for extracting strings from PE files include, but are not limited to, the following methods:
  • Each PE file is divided into a header metadata section, a code section, and a data section according to a specific file format constituting the PE file;
  • According to the PE file format published by Microsoft, a PE file can be roughly divided into a header metadata section, a code section, and a data section.
  • The data in the header metadata section follows a fairly standardized format and includes descriptive information about the program sections (section name, size, permissions, etc.); the code section mainly contains the function implementations written in a programming language; and the data section mainly contains the visible content presented to the user (similar to the body of a letter).
  • Because the header metadata section follows a fairly standardized format and is very important, all the information in it can be extracted by a dedicated parser, and the extracted content, whether numeric or not, is uniformly treated as strings.
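As a rough illustration of header metadata extraction, the sketch below reads the COFF file header and section table of a PE image with Python's `struct` module and emits every field as a string, numbers included. The field layout follows the published PE/COFF format; the function `header_strings`, the helper `build_toy_pe`, the output string format, and the choice of fields are assumptions made for this example, and a production parser would read many more fields.

```python
import struct
from typing import List


def header_strings(pe_bytes: bytes) -> List[str]:
    """Return selected header fields and section-table entries as strings."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C points to the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = e_lfanew + 4
    # COFF file header: 20 bytes.
    (machine, n_sections, _timestamp, _sym_ptr, _n_syms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", pe_bytes, coff)
    strings = [
        f"machine={machine:#06x}",
        f"sections={n_sections}",
        f"characteristics={characteristics:#06x}",
    ]
    # The section table follows the optional header; each entry is 40 bytes.
    table = coff + 20 + opt_size
    for i in range(n_sections):
        (name, _vsize, _vaddr, raw_size, _raw_ptr, _reloc_ptr, _line_ptr,
         _n_reloc, _n_line, flags) = struct.unpack_from(
            "<8sIIIIIIHHI", pe_bytes, table + 40 * i)
        label = name.rstrip(b"\x00").decode("ascii", "replace")
        strings.append(f"section={label} raw_size={raw_size} flags={flags:#010x}")
    return strings


def build_toy_pe() -> bytes:
    """Hypothetical helper: a header-only image with one .text section."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)
    coff = struct.pack("<HHIIIHH", 0x14C, 1, 0, 0, 0, 0, 0x0102)
    section = struct.pack("<8sIIIIIIHHI", b".text", 0x1000, 0x1000,
                          0x200, 0x200, 0, 0, 0, 0, 0x60000020)
    return bytes(dos) + b"PE\x00\x00" + coff + section


print(header_strings(build_toy_pe()))
```

Rendering numeric fields as `name=value` strings is one way to treat numbers and text uniformly, so that they can be fed to the same string-importance calculation.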
  • S203 Decompiling the content in the code segment, and extracting a string containing a preset function name from the decompiled data to obtain a second string set;
  • S204 Extract a character string including at least one of an IP address, an email address, and a website address from the content contained in the data segment to obtain a third character string set;
  • The code section and the data section contain a large number of meaningless strings; extracting all of them would only interfere with subsequent steps and greatly increase computation time, so for these two sections only strings from a specified region or of a specified type need to be extracted.
  • For the code section, strings in special regions, such as function names, are extracted after decompilation.
  • For the data section, the key content extracted is mainly strings in special formats, such as IP addresses, email addresses, and web addresses.
  • One extraction approach is regular-expression matching; other identical or similar approaches that achieve the same purpose can also be used and are not described further here.
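A minimal sketch of regular-expression matching over raw data-section bytes is shown below. The three patterns are deliberately simple assumptions (IPv4 only, a loose email pattern, `http`/`https` URLs only) and are not taken from the source; the function name `extract_indicators` is likewise hypothetical.

```python
import re
from typing import List

# Simplified patterns over bytes; real-world matching would need to be stricter.
IPV4 = rb"(?<![\d.])(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)(?![\d.])"
EMAIL = rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
URL = rb"https?://[^\s\x00<>]+"


def extract_indicators(data: bytes) -> List[str]:
    """Return the distinct IP addresses, email addresses, and URLs found in data."""
    found = set()
    for pattern in (IPV4, EMAIL, URL):
        for match in re.finditer(pattern, data):
            found.add(match.group().decode("ascii", "replace"))
    return sorted(found)


blob = (b"\x00\x01connect 203.0.113.7\x00mail admin@example.com\x00"
        b"http://example.com/payload.bin\x00junk")
print(extract_indicators(blob))
```

Returning a de-duplicated set matches the idea that each string is recorded once per PE document.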
  • S205 Generate a corresponding PE document according to a preset combination format from the first character string set, the second character string set, and the third character string set extracted from each PE file to reduce the number of invalid character strings.
  • this step aims to generate a corresponding PE document according to a preset combination format from the first character string set, the second character string set, and the third character string set extracted from each PE file.
  • Replacing the original PE file with the PE document can greatly reduce the number of meaningless strings. Because each string appears only once in a PE document, when TFIDF technology is used in this way the corresponding TF can only be 1 or 0, depending on whether the string appears in the file.
  • The statistical result can be stored in a map, which holds a set of one-to-one mappings.
  • A key corresponds to exactly one value, where the key is a specific string and the value is the number of PE files containing that string.
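The string-to-file-count map can be sketched with a `Counter`. The function name `files_containing` and the toy documents are assumptions for illustration; each inner list stands for the strings of one PE document.

```python
from collections import Counter
from typing import Dict, List


def files_containing(pe_documents: List[List[str]]) -> Dict[str, int]:
    """Map each string to the number of PE documents that contain it."""
    counts = Counter()
    for doc in pe_documents:
        # set() ensures a string is counted once per document,
        # even if it was extracted more than once.
        counts.update(set(doc))
    return dict(counts)


documents = [["a.example", "LoadLibraryA", "a.example"], ["LoadLibraryA"], ["LoadLibraryA", "10.0.0.1"]]
print(files_containing(documents))
```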
  • This embodiment provides a method for selectively extracting meaningful strings from a PE file. Because redundant and useless strings are removed, the workload of subsequent processing steps is effectively reduced.
  • FIG. 3 is a flowchart of another method for identifying malware based on a character string provided by an embodiment of the present application.
  • This embodiment aims to provide a specific implementation of the first embodiment, and specifically includes the following steps:
  • TF is the number of occurrences of a particular string in a particular PE file;
  • IDF is the number of PE files in the software under test that contain the string;
  • N is the total number of PE files included in the software under test.
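With TF, IDF, and N defined as above, one plausible way to combine them is shown below. This is an assumption that mirrors the standard TF-IDF product with a natural logarithm; it should not be read as the exact formula of the application.

```latex
\mathrm{importance}(s, f) \;=\; TF(s, f)\cdot \ln\frac{N}{IDF(s)}
```

Under this assumed form, a string that occurs in every PE file (IDF = N) scores 0, whereas a string that occurs TF = 3 times in one file and appears in IDF = 4 of N = 100 files scores 3 · ln 25 ≈ 9.66.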
  • Arrange the importance evaluation parameters in descending order to obtain an importance evaluation parameter ranking queue;
  • S303 Select the first preset number of importance evaluation parameters from the ranking queue, use the corresponding strings as high-importance strings, and use each high-importance string as a malware identification feature;
  • This embodiment builds a queue to select high-importance strings by importance and uses them as malware identification features.
  • The first preset number can be set flexibly according to the actual situation; for example, setting it to 3 or 5 selects the strings ranked in the top 3 or top 5 of the queue.
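The ranking queue and top-N selection can be sketched as a sort followed by a slice. For simplicity this sketch assumes one importance score per string; the function name `select_high_importance` and the sample scores are hypothetical.

```python
from typing import Dict, List


def select_high_importance(scores: Dict[str, float], first_preset_number: int) -> List[str]:
    """Return the strings with the largest scores, best first."""
    # Descending by score; ties broken alphabetically so the result is deterministic.
    queue = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [string for string, _ in queue[:first_preset_number]]


sample_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}
print(select_high_importance(sample_scores, 3))
```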
  • S304 Use topic analysis technology to classify the identification characteristics of each malware, and obtain the identification characteristics after topic classification;
  • In this embodiment, topic analysis technology is additionally used to classify the malware identification features, so that similar features are grouped together, and topic distribution vectors are finally obtained.
  • Put simply, given a large number of documents, topic analysis technology can automatically group the words contained in those documents by topic without any manual input.
  • For example, topic analysis technology can automatically extract K topics and the words related to each topic.
  • For instance, the words related to the topic "sports" may include "football", "basketball", "Manchester United", and "NBA".
  • Topic analysis can also convert a document into a K-dimensional vector representing its distribution over the topics; for example, 30% of "Document No. 1" is about sports, 40% is about entertainment, and 30% is about Europe.
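To make the K-dimensional vector concrete, the sketch below converts a PE document into a topic distribution by counting how many of its features fall into each topic. This is a deliberately simplified stand-in for a real topic model: it assumes a hard feature-to-topic assignment is already available (the hypothetical `feature_topic` map), whereas actual topic analysis infers soft assignments from the corpus.

```python
from typing import Dict, List


def topic_vector(pe_document: List[str], feature_topic: Dict[str, int], k: int) -> List[float]:
    """Return the share of the document's known features that belong to each of k topics."""
    counts = [0] * k
    for string in pe_document:
        topic = feature_topic.get(string)
        if topic is not None:  # strings that are not identification features are ignored
            counts[topic] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Hypothetical assignment of identification features to K = 3 topics.
feature_topic = {"encrypt": 0, "ransom_note": 0, "smtp": 1, "spread": 2}
document = ["encrypt", "ransom_note", "smtp", "spread", "unknown"]
print(topic_vector(document, feature_topic, 3))
```

The resulting vector sums to 1 whenever the document contains at least one known feature, which matches the percentage-style reading in the example above.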
  • After this conversion, each PE file is represented by a feature vector, so it can be classified directly with traditional linear classification algorithms.
  • Specific methods include, but are not limited to, logistic regression, support vector machines, and decision trees.
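As one example of a linear classifier over topic distribution vectors, the sketch below trains a perceptron in pure Python. The perceptron is swapped in here only because it is short and dependency-free; it is not one of the algorithms named in the text. The function names, the label convention (+1 malicious, -1 benign), and the toy vectors are all assumptions.

```python
from typing import List, Tuple


def train_perceptron(
    samples: List[Tuple[List[float], int]], epochs: int = 100, lr: float = 1.0
) -> Tuple[List[float], float]:
    """Learn weights w and bias b so that sign(w . x + b) matches labels in {+1, -1}."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # converged on the training set
            break
    return w, b


def predict(w: List[float], b: float, x: List[float]) -> int:
    """Return +1 (malicious) or -1 (benign) for a topic distribution vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy topic vectors: the first topic dominates the malicious samples.
training = [
    ([0.8, 0.1, 0.1], 1),
    ([0.7, 0.2, 0.1], 1),
    ([0.1, 0.2, 0.7], -1),
    ([0.2, 0.1, 0.7], -1),
]
weights, bias = train_perceptron(training)
print(predict(weights, bias, [0.9, 0.05, 0.05]))
```

In practice a library implementation of logistic regression or a linear support vector machine would be the more usual choice.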
  • The string-based malware identification method provided in this embodiment first performs targeted information extraction on each part of a PE file according to the PE file format, which greatly reduces meaningless strings; then uses TFIDF technology to calculate the importance of each extracted string and selects the high-importance strings with strong classification ability and distinctiveness; then uses topic analysis technology to group the high-importance strings of the same category; and finally completes malware identification based on the resulting feature vectors.
  • This identification method does not rely on human experience to identify malicious content; because it is automated, it is more efficient, identifies malicious content more accurately, and is less likely to miss it.
  • S301 in this embodiment is one specific implementation of S101, which is not limited to it;
  • S302 and S303 are one specific implementation of S102, which is not limited to them;
  • S306 is a specific implementation of S103 that uses a linear classifier to classify the topic distribution vectors obtained from the conversion in S304 and S305, thereby achieving malware identification.
  • Each of these three parts can be applied separately on the basis of the first embodiment to form a corresponding specific embodiment.
  • They can also be flexibly combined with the scheme for extracting meaningful strings to obtain different specific embodiments. This embodiment is merely a preferred embodiment in which the three specific implementations are used together and arranged in execution order.
  • FIG. 4 is a structural block diagram of a character string-based malware recognition system provided by an embodiment of the present application.
  • the malware identification system can include:
  • the TFIDF calculation unit 100 is configured to calculate the importance of each string extracted from the PE file included in the software under test to each PE file by using the TFIDF technology, and obtain importance evaluation parameters;
  • the malware identification feature selection unit 200 is configured to filter the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and to use each high-importance string as a malware identification feature, wherein the magnitude of the importance evaluation parameter is directly proportional to the importance of the string;
  • the malware identification unit 300 is configured to perform malware identification on the software to be tested by using each malware identification feature.
  • the TFIDF calculation unit 100 includes:
  • a formula calculation subunit, configured to use a formula in TF, IDF, and N to calculate the importance of each string extracted from the PE files included in the software under test to each PE file, to obtain the importance evaluation parameters, where TF is the number of occurrences of the string in each PE file, IDF is the number of PE files in the software under test that contain the string, and N is the total number of PE files included in the software under test.
  • the system also includes:
  • an ordering unit, configured to arrange the importance evaluation parameters from top to bottom in descending order to obtain a ranked queue of importance evaluation parameters;
  • a first identification feature selection subunit, configured to select the first preset number of importance evaluation parameters from the top of the ranked queue and use the corresponding strings as the high-importance strings.
  • the system also includes:
  • a same-category merging subunit, configured to merge the PE files contained in the software under test by malware category, to obtain a merged PE file set for each category;
  • a post-merge parameter calculation subunit, configured to use the TFIDF technique to calculate, within each category's merged PE file set, the importance of each string to each PE file in that set, to obtain post-merge importance parameters;
  • a second identification feature selection subunit, configured to select, in descending order of the post-merge importance parameters, a second preset number of high-importance strings, which also serve as malware identification features; the second preset number < the first preset number.
  • the system also includes:
  • a PE file content splitting unit, configured to divide each PE file into a header metadata segment, a code segment, and a data segment according to the specific file format of the PE file;
  • a header content extraction unit, configured to extract the content of the header metadata segment by using a metadata parser to obtain a first string set;
  • a code segment content extraction unit, configured to decompile the content of the code segment and extract strings containing preset function names from the decompiled data to obtain a second string set;
  • a data segment content extraction unit, configured to extract, from the content of the data segment, strings that include at least one of an IP address, an email address, and a web address to obtain a third string set;
  • a PE document generating unit, configured to generate a corresponding PE document, in a preset combination format, from the first string set, the second string set, and the third string set extracted from each PE file, so as to reduce the number of invalid strings.
  • the system also includes:
  • a topic analysis and classification unit, configured to classify the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
  • a topic distribution vector generating unit, configured to convert each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
  • the malware identification unit 300 includes:
  • a linear classification subunit, configured to classify each topic distribution vector by using a linear classification algorithm.
  • the code segment content extraction unit may include:
  • a regular expression matching extraction subunit, configured to extract strings containing the preset function names from the decompiled data by using a regular expression.
  • the system may further include:
  • a statistical result generating unit, configured to generate a statistical result, in the form of a map, from each string and the corresponding number of PE files containing that string.
  • the present application further provides a character string-based malware identification device.
  • the malware identification device may include a memory and a processor, where a computer program is stored in the memory;
  • when the processor invokes the computer program in the memory, the steps provided in the foregoing embodiments can be implemented.
  • the malware identification device may also include various necessary network interfaces, power supplies, and other components.
  • the present application also provides a computer-readable storage medium on which a computer program is stored.
  • the storage medium may include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

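A character string-based malware identification method, system, device, and computer-readable storage medium. The method uses the TFIDF technique to compute importance statistics for each character string extracted from the PE files contained in the software under test, selects from them the high-importance strings that have strong classification ability and discriminative power, and then uses those high-importance strings as malware identification features to identify malware. The method does not require a person to identify malicious content from experience; because it is automated, it is more efficient, identifies malicious content more accurately, and has a lower probability of omission.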

Description

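Character string-based malware identification method, system, and related apparatus
This application claims priority to Chinese patent application No. 201810639502.7, entitled "Character string-based malware identification method, system, and related apparatus", filed with the Chinese Patent Office on June 20, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of malware identification, and in particular to a character string-based malware identification method, system, apparatus, and computer-readable storage medium.
Background
As computer programming techniques develop, software written in various programming languages lets people complete all kinds of tasks on computers more conveniently. Malware carrying malicious content has appeared alongside it, maliciously attacking normal data files or stealing the results of other people's work. Identifying whether software under test is malware is therefore very important.
Existing methods for identifying malware usually rely on a person skilled in the art manually inspecting, on the basis of experience, the huge amount of code or strings contained in the software under test. Some malware is built on top of normal software, so it contains a large amount of irrelevant code or strings, and the very small amount of malicious content is "drowned" in an "ocean" of irrelevant code or strings. Some malware even disguises its malicious content to evade identification. As a result, traditional manual identification is extremely inefficient and very unstable, and experience-based inspection cannot accurately identify new or hard-to-find malicious content, so its practical effect is poor.
Therefore, how to overcome the technical defects of existing manual malware identification and provide a method that needs no manual identification by a skilled person, is more efficient, and identifies malware more accurately is a problem that those skilled in the art urgently need to solve.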
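Summary
An object of this application is to provide a character string-based malware identification method that uses the TFIDF technique to compute importance statistics for each string extracted from the PE files contained in the software under test, selects from them the high-importance strings that have strong classification ability and discriminative power, and then uses those high-importance strings as malware identification features to identify malware. The method does not require a person to identify malicious content from experience; because it is automated, it is more efficient, identifies malicious content more accurately, and has a lower probability of omission.
Another object of this application is to provide a character string-based malware identification system, apparatus, and computer-readable storage medium.
To achieve the above object, this application provides a character string-based malware identification method, the method comprising:
using the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters;
filtering the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and using each high-importance string as a malware identification feature, where the magnitude of the importance evaluation parameter is directly proportional to the importance of the string;
performing malware identification on the software under test by using each malware identification feature.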
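Optionally, using the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters, comprises:
using the formula
Figure PCTCN2019087563-appb-000001
to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain the importance evaluation parameters;
where TF is the number of occurrences of the string in each PE file, IDF is the number of PE files in the software under test that contain the string, and N is the total number of PE files contained in the software under test.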
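Optionally, filtering the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings comprises:
arranging the importance evaluation parameters from top to bottom in descending order to obtain a ranked queue of importance evaluation parameters;
selecting the first preset number of importance evaluation parameters from the top of the ranked queue, and using the corresponding strings as the high-importance strings.
Optionally, the method further comprises:
merging the PE files contained in the software under test by malware category, to obtain a merged PE file set for each category;
using the TFIDF technique to calculate, within each category's merged PE file set, the importance of each string to each PE file in that set, to obtain post-merge importance parameters;
selecting, in descending order of the post-merge importance parameters, a second preset number of high-importance strings, which also serve as malware identification features, where the second preset number < the first preset number.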
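Optionally, before using the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, the method further comprises:
dividing each PE file into a header metadata segment, a code segment, and a data segment according to the specific file format of the PE file;
extracting the content of the header metadata segment by using a metadata parser to obtain a first string set;
decompiling the content of the code segment, and extracting strings containing preset function names from the decompiled data to obtain a second string set;
extracting, from the content of the data segment, strings that include at least one of an IP address, an email address, and a web address to obtain a third string set;
generating a corresponding PE document, in a preset combination format, from the first string set, the second string set, and the third string set extracted from each PE file, so as to reduce the number of invalid strings.
Optionally, before performing malware identification on the software under test by using each malware identification feature, the method further comprises:
classifying the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
converting each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
Optionally, performing malware identification on the software under test by using each malware identification feature comprises:
classifying each topic distribution vector by using a linear classification algorithm.
Optionally, extracting strings containing preset function names from the decompiled data comprises:
extracting the strings containing the preset function names from the decompiled data by using a regular expression.
Optionally, the method further comprises:
generating a statistical result, in the form of a map, from each string and the corresponding number of PE files containing that string.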
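To achieve the above object, this application further provides a character string-based malware identification system, the system comprising:
a TFIDF calculation unit, configured to use the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters;
a malware identification feature selection unit, configured to filter the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and to use each high-importance string as a malware identification feature, where the magnitude of the importance evaluation parameter is directly proportional to the importance of the string;
a malware identification unit, configured to perform malware identification on the software under test by using each malware identification feature.
Optionally, the TFIDF calculation unit comprises:
a formula calculation subunit, configured to use the formula
Figure PCTCN2019087563-appb-000002
to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain the importance evaluation parameters; where TF is the number of occurrences of the string in each PE file, IDF is the number of PE files in the software under test that contain the string, and N is the total number of PE files contained in the software under test.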
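Optionally, the system further comprises:
an ordering unit, configured to arrange the importance evaluation parameters from top to bottom in descending order to obtain a ranked queue of importance evaluation parameters;
a first identification feature selection subunit, configured to select the first preset number of importance evaluation parameters from the top of the ranked queue and use the corresponding strings as the high-importance strings.
Optionally, the system further comprises:
a same-category merging subunit, configured to merge the PE files contained in the software under test by malware category, to obtain a merged PE file set for each category;
a post-merge parameter calculation subunit, configured to use the TFIDF technique to calculate, within each category's merged PE file set, the importance of each string to each PE file in that set, to obtain post-merge importance parameters;
a second identification feature selection subunit, configured to select, in descending order of the post-merge importance parameters, a second preset number of high-importance strings, which also serve as malware identification features, where the second preset number < the first preset number.
Optionally, the system further comprises:
a PE file content splitting unit, configured to divide each PE file into a header metadata segment, a code segment, and a data segment according to the specific file format of the PE file;
a header content extraction unit, configured to extract the content of the header metadata segment by using a metadata parser to obtain a first string set;
a code segment content extraction unit, configured to decompile the content of the code segment and extract strings containing preset function names from the decompiled data to obtain a second string set;
a data segment content extraction unit, configured to extract, from the content of the data segment, strings that include at least one of an IP address, an email address, and a web address to obtain a third string set;
a PE document generating unit, configured to generate a corresponding PE document, in a preset combination format, from the first string set, the second string set, and the third string set extracted from each PE file, so as to reduce the number of invalid strings.
Optionally, the system further comprises:
a topic analysis and classification unit, configured to classify the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
a topic distribution vector generating unit, configured to convert each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
Optionally, the malware identification unit comprises:
a linear classification subunit, configured to classify each topic distribution vector by using a linear classification algorithm.
Optionally, the code segment content extraction unit comprises:
a regular expression matching extraction subunit, configured to extract the strings containing the preset function names from the decompiled data by using a regular expression.
Optionally, the system further comprises:
a statistical result generating unit, configured to generate a statistical result, in the form of a map, from each string and the corresponding number of PE files containing that string.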
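To achieve the above object, this application further provides a character string-based malware identification apparatus, the apparatus comprising:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the malware identification method described above when executing the computer program.
To achieve the above object, this application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the malware identification method described above.
The character string-based malware identification method provided by this application: uses the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters; filters the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and uses each high-importance string as a malware identification feature, where the magnitude of the importance evaluation parameter is directly proportional to the importance of the string; and performs malware identification on the software under test by using each malware identification feature.
The technical solution provided by this application thus uses the TFIDF technique to compute importance statistics for each string extracted from the PE files contained in the software under test, selects from them the high-importance strings that have strong classification ability and discriminative power, and then uses those high-importance strings as malware identification features to identify malware. The method does not require a person to identify malicious content from experience; because it is automated, it is more efficient, identifies malicious content more accurately, and has a lower probability of omission. This application also provides a character string-based malware identification system, apparatus, and computer-readable storage medium, which have the same beneficial effects and are not described again here.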
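Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of this application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a character string-based malware identification method provided by an embodiment of this application;
FIG. 2 is a flowchart of one way of extracting strings from PE files in the character string-based malware identification method provided by an embodiment of this application;
FIG. 3 is a flowchart of another character string-based malware identification method provided by an embodiment of this application;
FIG. 4 is a structural block diagram of a character string-based malware identification system provided by an embodiment of this application.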
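Detailed Description
The core of this application is to provide a character string-based malware identification method, system, apparatus, and computer-readable storage medium that use the TFIDF technique to compute importance statistics for each string extracted from the PE files contained in the software under test, select from them the high-importance strings that have strong classification ability and discriminative power, and then use those high-importance strings as malware identification features to identify malware. The method does not require a person to identify malicious content from experience; because it is automated, it is more efficient, identifies malicious content more accurately, and has a lower probability of omission.
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the protection scope of this application.
Embodiment 1
Refer to FIG. 1, which is a flowchart of a character string-based malware identification method provided by an embodiment of this application.
The method specifically includes the following steps:
S101: Use the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters;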
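TFIDF (term frequency-inverse document frequency) is a statistical method commonly used in natural language processing to evaluate how important a word is to one document in a document set or corpus: the importance of the word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus. TF stands for term frequency, and IDF stands for inverse document frequency.
The main idea of TFIDF is that if a word or phrase appears with high frequency (TF) in one article but rarely in other articles, it is considered to have good class-discriminating ability and to be suitable for classification. Within a given document, TF is the frequency with which a given word appears in that document; it is the occurrence count normalized so that it is not biased toward long documents (the same word may have a higher count in a long document than in a short one).
For a word in a particular document, its TF can be expressed as:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
In this expression, the numerator $n_{i,j}$ is the number of times the word $t_i$ appears in the particular document $d_j$, and the denominator is the sum of the occurrence counts of all words in that document;
The IDF of the word is obtained by dividing the total number of documents in the document set by the number of documents containing the word, and then taking the logarithm of the quotient:
$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
Here $|D|$ in the numerator is the total number of documents in the document set, and $|\{j : t_i \in d_j\}|$ in the denominator is the number of documents containing the word. Because the denominator would be zero in the special case where the word does not occur anywhere in the document set, $1 + |\{j : t_i \in d_j\}|$ is generally used as the denominator.
Finally, the product of TF and IDF is calculated. A high term frequency within a particular document combined with a low document frequency of that term across the whole document set produces a high TFIDF weight. TFIDF therefore tends to filter out common words and keep the important words that discriminate between classes.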
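The TF and IDF definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described in the application: the function name `tf_idf`, the use of a base-10 logarithm, and the sample strings are all hypothetical. No `1 +` smoothing is applied, because only strings that occur in at least one document are scored, so the denominator is never zero.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document; each document is a list of terms."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())  # all term occurrences in this document
        scores.append({
            term: (count / total) * math.log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical strings extracted from three PE files.
docs = [
    ["CreateRemoteThread", "kernel32.dll", "kernel32.dll"],
    ["kernel32.dll", "user32.dll"],
    ["kernel32.dll"],
]
scores = tf_idf(docs)
```

In this sample, `kernel32.dll` occurs in every document and therefore scores zero, while the rarer `CreateRemoteThread` keeps a positive weight, which is the filtering behaviour the text describes.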
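This application applies the TFIDF technique, commonly used in natural language processing, to the field of malware identification. The aim is to determine the importance of the code or strings in each PE file contained in the software under test (PE stands for Portable Executable; the common EXE, DLL, OCX, SYS, and COM files are all PE files, which are the program files of the Microsoft Windows operating system), so that high-importance strings usable as malware identification features can be found from the importance results.
In this way, strings with strong discriminative power can be found among all PE files contained in the software under test. TFIDF is used for two reasons. First, programming languages impose fixed formats, so strings that occur many times are very likely meaningless strings required by the format (useless for malware identification), and TFIDF filters them out well. Second, existing malware is usually made by embedding malicious content into normal software; the malicious content is small in volume and its specific strings occur rarely, yet it has strong attack capability and evades inspection in this way.
In view of this, this application applies the TFIDF technique to malware identification in order to find strings with strong classification ability in the software under test. Because the application field is different, some corresponding changes are needed; these are described in detail in later steps.
Further, the strings processed in this step can be extracted in the simplest way, by converting all content of the PE file into strings. However, that approach yields a large number of meaningless strings (useless for malware identification). The parts of a PE file can instead be extracted selectively according to the specific file format of the PE file, which reduces the number of meaningless strings and the amount of later TFIDF computation. A later embodiment provides one specific string extraction method, but extraction is not limited to that method; methods based on the same idea that differ slightly from that embodiment also fall within the protection scope of this application.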
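S102: Filter the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and use each high-importance string as a malware identification feature;
Building on S101, this step selects the strings that meet the requirements according to the magnitude of the calculated importance evaluation parameters. With the calculation described above, the magnitude of the importance evaluation parameter is directly proportional to importance, so it is enough to select, from all strings, the high-importance strings that have large importance evaluation parameters, and to use them as malware identification features in later steps.
High-importance strings can be selected in many ways. For example, the importance evaluation parameters of all strings can be divided into tiers of importance and the strings in the most important tier selected; a threshold can be set and the strings whose importance evaluation parameters exceed it selected; or a ranked queue can be built from all importance evaluation parameters, sorted from top to bottom in descending order, so that only the strings corresponding to the first N parameters in the queue need to be selected. Whichever way achieves the same purpose and best suits the actual situation can be chosen; no specific limitation is made here.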
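Two of the selection strategies described above, the ranked queue and the threshold, can be sketched as follows. This is an illustrative sketch only; the function names `top_n_strings` and `above_threshold` and the sample scores are assumptions, and ties are broken alphabetically purely to make the result deterministic.

```python
def top_n_strings(importance, n):
    """Ranked-queue strategy: sort by descending score and take the first n strings."""
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [string for string, _ in ranked[:n]]

def above_threshold(importance, threshold):
    """Threshold strategy: keep every string whose score exceeds the threshold."""
    return sorted(s for s, score in importance.items() if score > threshold)

# Hypothetical importance evaluation parameters for three strings.
importance = {"GetProcAddress": 0.1, "vssadmin delete shadows": 0.5, "10.0.0.7": 0.3}
selected_top = top_n_strings(importance, 2)
selected_thr = above_threshold(importance, 0.2)
```

The ranked-queue variant always returns a fixed number of features, whereas the threshold variant returns however many strings clear the bar, which is the practical difference between the two.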
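Further, to avoid overlooking strings that occur rarely but are highly indicative, the malware can be grouped by type, for example into ransomware, trojans, worms, or even finer family types. The TFIDF technique is then applied separately to the file set of each type to calculate string importance, and the strings that have large importance evaluation parameters within a single type are also kept as malware identification features. This lowers the probability of omission and improves identification.
One non-limiting implementation is as follows:
merge the PE files contained in the software under test by malware category, to obtain a merged PE file set for each category;
use the TFIDF technique to calculate, within each category's merged PE file set, the importance of each string to each PE file in that set, to obtain post-merge importance parameters;
select, in descending order of the post-merge importance parameters, a second preset number of high-importance strings, which also serve as malware identification features, where the second preset number < the first preset number (because a file set under a finer classification contains fewer files than the original unclassified file set).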
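The per-category selection above can be sketched as follows. This is a rough illustration under stated assumptions: the category labels are taken as given, a string's score within a category is assumed to be the highest TF-IDF value it reaches in any file of that category (the text does not say how per-file scores are aggregated), and all function names and sample strings are hypothetical.

```python
import math
from collections import Counter, defaultdict

def best_tf_idf(documents):
    """Highest TF-IDF score that each string reaches in any document of the set."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    best = {}
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        for term, count in counts.items():
            score = (count / total) * math.log10(n_docs / doc_freq[term])
            best[term] = max(best.get(term, 0.0), score)
    return best

def per_category_features(labelled_docs, second_n):
    """Group documents by category, then keep the top second_n strings of each group."""
    groups = defaultdict(list)
    for label, doc in labelled_docs:
        groups[label].append(doc)
    features = set()
    for docs in groups.values():
        best = best_tf_idf(docs)
        ranked = sorted(best, key=lambda t: (-best[t], t))
        features.update(ranked[:second_n])
    return features

labelled = [
    ("ransomware", ["encrypt", "pay", "kernel32"]),
    ("ransomware", ["kernel32", "vssadmin"]),
    ("worm", ["spread", "kernel32"]),
    ("worm", ["kernel32", "scan", "scan"]),
]
extra_features = per_category_features(labelled, second_n=1)
```

The resulting set would be unioned with the globally selected features, so that strings that are distinctive only within one malware family are not lost.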
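After the low-importance strings have been filtered out, the high-importance strings of the same type can also be grouped in order to determine the relationships among them. For example, topic analysis can merge high-importance strings that share a topic: words such as football, basketball, and Manchester United can all be placed under the topic "sports". The result is finally converted into a multi-dimensional vector that represents the distribution of topics, which makes it easier to determine the class. Other algorithms, such as clustering with K-means, can be used to the same effect.
S103: Perform malware identification on the software under test by using each malware identification feature;
Building on S102, this step uses the obtained malware identification features to identify malware in the software under test. Specifically, a classifier built on the malware identification features can be used. The classifier can be based on a linear or non-linear classification algorithm, including but not limited to logistic regression, support vector machines, and decision trees.
Based on the above technical solution, the character string-based malware identification method provided by this embodiment uses the TFIDF technique to compute importance statistics for each string extracted from the PE files contained in the software under test, selects from them the high-importance strings that have strong classification ability and discriminative power, and then uses those high-importance strings as malware identification features to identify malware. The method does not require a person to identify malicious content from experience; because it is automated, it is more efficient, identifies malicious content more accurately, and has a lower probability of omission.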
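Embodiment 2
Refer to FIG. 2, which is a flowchart of one way of extracting strings from PE files in the character string-based malware identification method provided by an embodiment of this application.
Ways of extracting strings from PE files include, but are not limited to, the following:
S201: Divide each PE file into a header metadata segment, a code segment, and a data segment according to the specific file format of the PE file;
According to the PE file format published by Microsoft, a PE file can be roughly divided into a header metadata segment, a code segment, and a data segment. The data in the header metadata segment follows a fairly regular format and includes descriptive information about the program sections (section name, size, permissions, and so on); the code segment mainly contains the functional implementation built in a programming language; and the data segment mainly contains the visible content presented to the user (similar to the body of a letter).
Given the content and characteristics of each part, useful strings can be extracted selectively from each part and meaningless strings discarded.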
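S202: Extract the content of the header metadata segment by using a metadata parser to obtain a first string set;
Because the data in the header metadata segment follows a fairly regular format and is very important, a dedicated parser can extract all of the information in it, and the extracted content is treated uniformly as strings regardless of whether it is in a special format such as a number.
S203: Decompile the content of the code segment, and extract strings containing preset function names from the decompiled data to obtain a second string set;
S204: Extract, from the content of the data segment, strings that include at least one of an IP address, an email address, and a web address to obtain a third string set;
In contrast to the header metadata segment, the code segment and the data segment contain a large number of meaningless strings. Extracting all of them would only interfere with later steps and greatly increase computation time, so for these two parts only strings from specified regions or of specified types need to be extracted.
For the code segment, only strings from special regions such as function names are extracted after decompilation; for the data segment, the key content extracted is mainly strings in special formats such as IP addresses, email addresses, and web addresses.
One way to extract this data is to match it with regular expressions. Other identical or similar ways of extracting information for the same purpose can also be used and are not described here.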
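Regular-expression extraction of the kind described above can be sketched with Python's standard `re` module. The patterns below are deliberately simple assumptions chosen for illustration (for example, the IP pattern does not check that each octet is at most 255), and the function names and sample inputs are hypothetical; the application does not specify the expressions it uses.

```python
import re

# Simplified, illustrative patterns; real-world patterns are usually stricter.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_indicators(text):
    """Collect IP addresses, email addresses, and web addresses found in text."""
    found = set()
    for pattern in (IP_RE, EMAIL_RE, URL_RE):
        found.update(pattern.findall(text))
    return found

def extract_function_names(decompiled, preset_names):
    """Collect the preset function names that occur as whole words in decompiled text."""
    if not preset_names:
        return set()
    alternatives = "|".join(re.escape(name) for name in preset_names)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return set(pattern.findall(decompiled))

data_text = "connect 192.168.1.10 then mail admin@example.com via http://example.com/gate.php ok"
indicators = extract_indicators(data_text)
functions = extract_function_names(
    "call CreateRemoteThread\ncall ExitProcess",
    ["CreateRemoteThread", "VirtualAllocEx"],
)
```

Returning sets means each distinct string is kept once per file, which matches the later step in which a PE document contains each string only once.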
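S205: Generate a corresponding PE document, in a preset combination format, from the first string set, the second string set, and the third string set extracted from each PE file, so as to reduce the number of invalid strings.
Following the extraction steps above, this step generates a corresponding PE document, in a preset combination format, from the first, second, and third string sets extracted from each PE file, so that the PE document replaces the original PE file in the later TFIDF calculation, which can greatly reduce the number of meaningless strings. Because each distinct string appears only once in the PE document, when TFIDF is applied in this way the term frequency of a string can only be 1 or 0, depending on whether the string appears.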
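Combining the three string sets into one PE document can be sketched as below. The "preset combination format" is not specified in the text, so this sketch simply assumes concatenation in header, code, data order with duplicates dropped; the function name and sample strings are hypothetical.

```python
def build_pe_document(header_strings, code_strings, data_strings):
    """Concatenate the three string sets in a fixed order, keeping each string once."""
    seen = set()
    document = []
    for string in [*header_strings, *code_strings, *data_strings]:
        if string not in seen:
            seen.add(string)
            document.append(string)
    return document

pe_document = build_pe_document(
    [".text", "4096"],               # from the header metadata segment
    ["CreateFileA", ".text"],        # from the decompiled code segment
    ["10.0.0.1", "CreateFileA"],     # from the data segment
)
```

Because every string occurs at most once in the result, a raw occurrence count over this document is either 1 or 0, as the text notes.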
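Further, after strings have been extracted from each part, the number of PE files containing each particular string must also be counted, and the final statistics are returned as a map (a map stores a set of one-to-one mappings in which a key can correspond to only one value; here the key is a particular string and the value is the number of PE files that contain the key).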
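The map of strings to containing-file counts can be sketched with a Python dictionary. This is a minimal illustration; the function name `count_files_per_string` and the sample data are assumptions.

```python
from collections import Counter

def count_files_per_string(pe_documents):
    """Map each string to the number of PE documents that contain it."""
    counts = Counter()
    for strings in pe_documents:
        # set() ensures a file is counted once even if it repeats a string.
        counts.update(set(strings))
    return dict(counts)

file_counts = count_files_per_string([
    ["a.dll", "b.dll", "a.dll"],
    ["a.dll"],
    ["c.dll", "b.dll"],
])
```

This map is exactly the document-frequency table that the IDF part of the calculation needs, so it can be computed once and reused for every string.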
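Building on Embodiment 1, this embodiment provides a method for selectively extracting meaningful strings from PE files. Because redundant and useless strings are removed, the workload of later processing steps is effectively reduced.
Embodiment 3
Refer to FIG. 3, which is a flowchart of another character string-based malware identification method provided by an embodiment of this application.
This embodiment provides a specific implementation of Embodiment 1 and includes the following steps: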
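S301: Use the formula
Figure PCTCN2019087563-appb-000005
to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters;
where TF is the number of occurrences of a particular string in a particular PE file, IDF is the number of PE files in the software under test that contain the string, and N is the total number of PE files contained in the software under test.
For example, if a document contains 100 words in total and the word "atomic energy" appears 3 times, the TF (term frequency) of "atomic energy" in that document is 3/100 = 0.03. One way to calculate IDF is to divide the total number of documents in the document set by the number of documents in which "atomic energy" appears and take the logarithm. So if "atomic energy" appears in 1,000 documents and the total number of documents is 10,000,000, its IDF is lg(10000000/1000) = 4, and the final TFIDF score of "atomic energy" is 0.03 × 4 = 0.12.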
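The arithmetic of the worked example can be checked with a short Python snippet. The helper name `tfidf_score` and its parameter names are hypothetical; it follows the normalized-TF, base-10-logarithm form used in the example rather than any particular formula image from the application.

```python
import math

def tfidf_score(count, total_terms, n_docs, docs_with_term):
    """TF-IDF with TF normalized by document length and a base-10 logarithm."""
    tf = count / total_terms
    idf = math.log10(n_docs / docs_with_term)
    return tf * idf

# Values from the worked example: 3 occurrences in a 100-word document,
# 1,000 containing documents out of 10,000,000.
example_idf = math.log10(10_000_000 / 1_000)
example_score = tfidf_score(3, 100, 10_000_000, 1_000)
```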
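S302: Arrange the importance evaluation parameters from top to bottom in descending order to obtain a ranked queue of importance evaluation parameters;
S303: Select the first preset number of importance evaluation parameters from the top of the ranked queue, use the corresponding strings as high-importance strings, and use each high-importance string as a malware identification feature;
This embodiment builds a queue to select high-importance strings by their degree of importance and uses them as malware identification features. The first preset number can be set flexibly according to the actual situation, for example to 3 or 5, so that the strings ranked in the top 3 or top 5 by importance are selected from the queue.
S304: Classify the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
S305: Convert each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file;
After the malware identification features have been selected, this embodiment also uses topic analysis to group the malware identification features of the same kind and finally obtains topic distribution vectors.
Put simply, given a large number of documents, topic analysis can automatically group the words in those documents by topic without any manual input. Given a parameter K, topic analysis can automatically extract K topics and the words related to each topic. For example, the words related to the topic "sports" might include "football", "basketball", "Manchester United", and "NBA". On this basis, topic analysis can convert a document into a K-dimensional vector that represents the distribution of topics in it, for example: 30% of "document 1" is about sports, 40% is about entertainment, and 30% is about Europe.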
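Turning a PE document into a K-dimensional topic distribution vector can be sketched as follows. This is a simplification under stated assumptions: the feature-to-topic assignment is taken as already produced by some topic analysis step, each feature belongs to exactly one topic (real topic models typically give soft mixtures), and the names `topic_vector` and `topic_of` are hypothetical.

```python
def topic_vector(features, topic_of, k):
    """Return the share of a document's known features that falls under each of k topics."""
    counts = [0] * k
    for feature in features:
        if feature in topic_of:          # features without a topic are ignored
            counts[topic_of[feature]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * k
    return [count / total for count in counts]

# Hypothetical assignment: topic 0 = sports, 1 = entertainment, 2 = Europe.
topic_of = {"football": 0, "basketball": 0, "film": 1, "Paris": 2}
vector = topic_vector(["football", "film", "film", "Paris", "unknown"], topic_of, 3)
```

Each component of the vector is a proportion, so the components of a non-empty document sum to 1, mirroring the "30% sports, 40% entertainment, 30% Europe" example in the text.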
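S306: Classify each topic distribution vector by using a linear classification algorithm.
After the strings have been extracted, filtered, and grouped as described above, every PE file has been converted into a feature vector, so a conventional linear classification algorithm can be applied directly. Specific methods include, but are not limited to, logistic regression, support vector machines, and decision trees.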
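As one concrete instance of a linear classifier over such feature vectors, a perceptron can be sketched in plain Python. The perceptron is swapped in here only because it is short; the text names logistic regression and support vector machines instead, and the labels (+1 for malware, -1 for benign), function names, and training data below are all hypothetical.

```python
def train_perceptron(vectors, labels, epochs=20, learning_rate=1.0):
    """Learn weights w and bias b so that sign(w.x + b) matches labels in {+1, -1}."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:      # misclassified (or on the boundary): update
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def predict(weights, bias, x):
    """Return +1 or -1 according to the side of the learned hyperplane."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Hypothetical two-topic distribution vectors; +1 = malware, -1 = benign.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_y = [1, 1, -1, -1]
weights, bias = train_perceptron(train_x, train_y)
```

A perceptron only converges when the classes are linearly separable, which is one reason logistic regression or a support vector machine would usually be preferred in practice.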
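Based on the above technical solution, the character string-based malware identification method provided by this embodiment first extracts information selectively from each part of a PE file according to its specific format, greatly reducing the number of meaningless strings. It then uses the TFIDF technique to compute importance statistics for the extracted strings, selects from them the high-importance strings that have strong classification ability and discriminative power, groups high-importance strings of the same kind by topic analysis, and finally identifies malware from the resulting feature vectors. The method does not require a person to identify malicious content from experience; because it is automated, it is more efficient, identifies malicious content more accurately, and has a lower probability of omission.
It should be noted that S301 in this embodiment is one specific, non-limiting implementation of S101; S302 and S303 are one specific, non-limiting implementation of S102; and S306 is a specific implementation of S103 that uses a linear classifier to classify the topic distribution vectors produced by S304 and S305 in order to identify malware. Each of these three parts can be combined on its own with Embodiment 1, which corresponds to independent claim 1, to form a corresponding specific embodiment. Depending on whatever special requirements arise in practice, the three parts can also be flexibly combined with the additional implementation options given in Embodiment 1 and with the scheme of Embodiment 2 for extracting meaningful strings from PE files, to obtain different specific embodiments. This embodiment is only a preferred embodiment that uses all three specific implementations together, arranged in execution order.
Because the situations are complex, they cannot all be enumerated here. Those skilled in the art should recognize that many examples can be derived by combining the basic method principles provided in this application with actual circumstances, and that, where obtained without sufficient creative effort, all of them fall within the protection scope of this application.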
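Refer now to FIG. 4, which is a structural block diagram of a character string-based malware identification system provided by an embodiment of this application.
The malware identification system may include:
a TFIDF calculation unit 100, configured to use the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters;
a malware identification feature selection unit 200, configured to filter the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and to use each high-importance string as a malware identification feature, where the magnitude of the importance evaluation parameter is directly proportional to the importance of the string;
a malware identification unit 300, configured to perform malware identification on the software under test by using each malware identification feature.
The TFIDF calculation unit 100 includes:
a formula calculation subunit, configured to use the formula
Figure PCTCN2019087563-appb-000006
to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain the importance evaluation parameters; where TF is the number of occurrences of the string in each PE file, IDF is the number of PE files in the software under test that contain the string, and N is the total number of PE files contained in the software under test.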
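The system further includes:
an ordering unit, configured to arrange the importance evaluation parameters from top to bottom in descending order to obtain a ranked queue of importance evaluation parameters;
a first identification feature selection subunit, configured to select the first preset number of importance evaluation parameters from the top of the ranked queue and use the corresponding strings as the high-importance strings.
The system further includes:
a same-category merging subunit, configured to merge the PE files contained in the software under test by malware category, to obtain a merged PE file set for each category;
a post-merge parameter calculation subunit, configured to use the TFIDF technique to calculate, within each category's merged PE file set, the importance of each string to each PE file in that set, to obtain post-merge importance parameters;
a second identification feature selection subunit, configured to select, in descending order of the post-merge importance parameters, a second preset number of high-importance strings, which also serve as malware identification features, where the second preset number < the first preset number.
The system further includes:
a PE file content splitting unit, configured to divide each PE file into a header metadata segment, a code segment, and a data segment according to the specific file format of the PE file;
a header content extraction unit, configured to extract the content of the header metadata segment by using a metadata parser to obtain a first string set;
a code segment content extraction unit, configured to decompile the content of the code segment and extract strings containing preset function names from the decompiled data to obtain a second string set;
a data segment content extraction unit, configured to extract, from the content of the data segment, strings that include at least one of an IP address, an email address, and a web address to obtain a third string set;
a PE document generating unit, configured to generate a corresponding PE document, in a preset combination format, from the first string set, the second string set, and the third string set extracted from each PE file, so as to reduce the number of invalid strings.
The system further includes:
a topic analysis and classification unit, configured to classify the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
a topic distribution vector generating unit, configured to convert each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
The malware identification unit 300 includes:
a linear classification subunit, configured to classify each topic distribution vector by using a linear classification algorithm.
The code segment content extraction unit may include:
a regular expression matching extraction subunit, configured to extract strings containing the preset function names from the decompiled data by using a regular expression.
Further, the system may also include:
a statistical result generating unit, configured to generate a statistical result, in the form of a map, from each string and the corresponding number of PE files containing that string.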
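Based on the above embodiments, this application further provides a character string-based malware identification apparatus. The apparatus may include a memory and a processor, where a computer program is stored in the memory, and the steps provided in the above embodiments can be implemented when the processor invokes the computer program in the memory. The apparatus may also include various necessary network interfaces, power supplies, and other components.
This application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by an execution terminal or a processor, the steps provided in the above embodiments can be implemented. The storage medium may include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for related details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
Specific examples are used herein to explain the principles and implementations of this application. The description of the above embodiments is intended only to help in understanding the method of this application and its core idea. It should be noted that a person of ordinary skill in the art can make several improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.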

Claims (20)

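  1. A character string-based malware identification method, comprising:
    using the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters;
    filtering the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and using each high-importance string as a malware identification feature, wherein the magnitude of the importance evaluation parameter is directly proportional to the importance of the string;
    performing malware identification on the software under test by using each malware identification feature.
  2. The method according to claim 1, wherein using the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters, comprises:
    using the formula
    Figure PCTCN2019087563-appb-100001
    to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain the importance evaluation parameters;
    wherein TF is the number of occurrences of the string in each PE file, IDF is the number of PE files in the software under test that contain the string, and N is the total number of PE files contained in the software under test.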
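  3. The method according to claim 1, wherein filtering the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings comprises:
    arranging the importance evaluation parameters from top to bottom in descending order to obtain a ranked queue of importance evaluation parameters;
    selecting the first preset number of importance evaluation parameters from the top of the ranked queue, and using the corresponding strings as the high-importance strings.
  4. The method according to claim 1, further comprising:
    merging the PE files contained in the software under test by malware category, to obtain a merged PE file set for each category;
    using the TFIDF technique to calculate, within each category's merged PE file set, the importance of each string to each PE file in that set, to obtain post-merge importance parameters;
    selecting, in descending order of the post-merge importance parameters, a second preset number of high-importance strings, which also serve as malware identification features, wherein the second preset number < the first preset number.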
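  5. The method according to any one of claims 1 to 4, further comprising, before using the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test:
    dividing each PE file into a header metadata segment, a code segment, and a data segment according to the specific file format of the PE file;
    extracting the content of the header metadata segment by using a metadata parser to obtain a first string set;
    decompiling the content of the code segment, and extracting strings containing preset function names from the decompiled data to obtain a second string set;
    extracting, from the content of the data segment, strings that include at least one of an IP address, an email address, and a web address to obtain a third string set;
    generating a corresponding PE document, in a preset combination format, from the first string set, the second string set, and the third string set extracted from each PE file, so as to reduce the number of invalid strings.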
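  6. The method according to claim 1, further comprising, before performing malware identification on the software under test by using each malware identification feature:
    classifying the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
    converting each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
  7. The method according to claim 6, wherein performing malware identification on the software under test by using each malware identification feature comprises:
    classifying each topic distribution vector by using a linear classification algorithm.
  8. The method according to claim 5, wherein extracting strings containing preset function names from the decompiled data comprises:
    extracting the strings containing the preset function names from the decompiled data by using a regular expression.
  9. The method according to claim 1, further comprising:
    generating a statistical result, in the form of a map, from each string and the corresponding number of PE files containing that string.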
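  10. A character string-based malware identification system, comprising:
    a TFIDF calculation unit, configured to use the TFIDF technique to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain importance evaluation parameters;
    a malware identification feature selection unit, configured to filter the corresponding strings by importance according to the importance evaluation parameters to obtain a first preset number of high-importance strings, and to use each high-importance string as a malware identification feature, wherein the magnitude of the importance evaluation parameter is directly proportional to the importance of the string;
    a malware identification unit, configured to perform malware identification on the software under test by using each malware identification feature.
  11. The system according to claim 10, wherein the TFIDF calculation unit comprises:
    a formula calculation subunit, configured to use the formula
    Figure PCTCN2019087563-appb-100002
    to calculate the importance, to each PE file, of each string extracted from the PE files contained in the software under test, to obtain the importance evaluation parameters; wherein TF is the number of occurrences of the string in each PE file, IDF is the number of PE files in the software under test that contain the string, and N is the total number of PE files contained in the software under test.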
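  12. The system according to claim 10, further comprising:
    an ordering unit, configured to arrange the importance evaluation parameters from top to bottom in descending order to obtain a ranked queue of importance evaluation parameters;
    a first identification feature selection subunit, configured to select the first preset number of importance evaluation parameters from the top of the ranked queue and use the corresponding strings as the high-importance strings.
  13. The system according to claim 10, further comprising:
    a same-category merging subunit, configured to merge the PE files contained in the software under test by malware category, to obtain a merged PE file set for each category;
    a post-merge parameter calculation subunit, configured to use the TFIDF technique to calculate, within each category's merged PE file set, the importance of each string to each PE file in that set, to obtain post-merge importance parameters;
    a second identification feature selection subunit, configured to select, in descending order of the post-merge importance parameters, a second preset number of high-importance strings, which also serve as malware identification features, wherein the second preset number < the first preset number.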
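  14. The system according to any one of claims 10 to 13, further comprising:
    a PE file content splitting unit, configured to divide each PE file into a header metadata segment, a code segment, and a data segment according to the specific file format of the PE file;
    a header content extraction unit, configured to extract the content of the header metadata segment by using a metadata parser to obtain a first string set;
    a code segment content extraction unit, configured to decompile the content of the code segment and extract strings containing preset function names from the decompiled data to obtain a second string set;
    a data segment content extraction unit, configured to extract, from the content of the data segment, strings that include at least one of an IP address, an email address, and a web address to obtain a third string set;
    a PE document generating unit, configured to generate a corresponding PE document, in a preset combination format, from the first string set, the second string set, and the third string set extracted from each PE file, so as to reduce the number of invalid strings.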
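  15. The system according to claim 10, further comprising:
    a topic analysis and classification unit, configured to classify the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
    a topic distribution vector generating unit, configured to convert each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
  16. The system according to claim 15, wherein the malware identification unit comprises:
    a linear classification subunit, configured to classify each topic distribution vector by using a linear classification algorithm.
  17. The system according to claim 14, wherein the code segment content extraction unit comprises:
    a regular expression matching extraction subunit, configured to extract the strings containing the preset function names from the decompiled data by using a regular expression.
  18. The system according to claim 10, further comprising:
    a statistical result generating unit, configured to generate a statistical result, in the form of a map, from each string and the corresponding number of PE files containing that string.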
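  19. A character string-based malware identification apparatus, comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the steps of the malware identification method according to any one of claims 1 to 9 when executing the computer program.
  20. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the malware identification method according to any one of claims 1 to 9.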
PCT/CN2019/087563 2018-06-20 2019-05-20 Character string-based malware identification method, system, and related apparatus WO2019242443A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810639502.7A CN110619212B (zh) 2018-06-20 2018-06-20 Character string-based malware identification method, system, and related apparatus
CN201810639502.7 2018-06-20

Publications (1)

Publication Number Publication Date
WO2019242443A1 true WO2019242443A1 (zh) 2019-12-26

Family

ID=68921020

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087563 WO2019242443A1 (zh) 2018-06-20 2019-05-20 Character string-based malware identification method, system, and related apparatus

Country Status (2)

Country Link
CN (1) CN110619212B (zh)
WO (1) WO2019242443A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115189922A (zh) * 2022-06-17 2022-10-14 阿里云计算有限公司 风险识别方法及装置和电子设备

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324344A (zh) * 2020-02-28 2020-06-23 深圳前海微众银行股份有限公司 代码语句的生成方法、装置、设备及可读存储介质
CN116089912A (zh) * 2022-12-30 2023-05-09 成都鲁易科技有限公司 软件识别信息获取方法及装置、电子设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473506A (zh) * 2013-08-30 2013-12-25 北京奇虎科技有限公司 用于识别恶意apk文件的方法和装置
CN105740707A (zh) * 2016-01-20 2016-07-06 北京京东尚科信息技术有限公司 恶意文件的识别方法和装置
US20160267198A1 (en) * 2015-03-11 2016-09-15 Sap Se Importing data to a semantic graph
CN105956469A (zh) * 2016-04-27 2016-09-21 百度在线网络技术(北京)有限公司 文件安全性识别方法和装置
CN107315955A (zh) * 2016-04-27 2017-11-03 百度在线网络技术(北京)有限公司 文件安全性识别方法和装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473506A (zh) * 2013-08-30 2013-12-25 北京奇虎科技有限公司 用于识别恶意apk文件的方法和装置
US20160267198A1 (en) * 2015-03-11 2016-09-15 Sap Se Importing data to a semantic graph
CN105740707A (zh) * 2016-01-20 2016-07-06 北京京东尚科信息技术有限公司 恶意文件的识别方法和装置
CN105956469A (zh) * 2016-04-27 2016-09-21 百度在线网络技术(北京)有限公司 文件安全性识别方法和装置
CN107315955A (zh) * 2016-04-27 2017-11-03 百度在线网络技术(北京)有限公司 文件安全性识别方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, SHUOYING ET AL.: "Weibo Topic Detection Based on Improved TF-IDF Algorithm", SCIENCE & TECHNOLOGY REVIEW, vol. 34, no. 2, 28 January 2016 (2016-01-28), pages 282 - 286, XP055664837 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115189922A (zh) * 2022-06-17 2022-10-14 阿里云计算有限公司 风险识别方法及装置和电子设备
CN115189922B (zh) * 2022-06-17 2024-04-09 阿里云计算有限公司 风险识别方法及装置和电子设备

Also Published As

Publication number Publication date
CN110619212B (zh) 2022-01-18
CN110619212A (zh) 2019-12-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19822826

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19822826

Country of ref document: EP

Kind code of ref document: A1