CN114254317B - Software processing method and device based on software genes and storage medium

Info

Publication number: CN114254317B
Application number: CN202111432157.8A
Authority: CN
Inventors: 刘旭; 章丽娟; 胡逸漪; 陈鹏; 李朝阳; 王禹翔; 张甜; 陈振兴
Original assignee: Shanghai Roarpanda Network Technology Co ltd
Current assignee: Shanghai Roarpanda Network Technology Co ltd
Priority date: 2021-11-29
Filing date: 2021-11-29
Publication date: 2023-06-16
Anticipated expiration: 2041-11-29
Also published as: CN114254317A

Abstract

The application discloses a software processing method and device based on software genes, and a storage medium. The method comprises: extracting sample software genes contained in sample software of a target software family; determining family software genes of the target software family according to the extracted sample software genes; removing universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family, wherein the universal hereditary software genes are software genes shared by the target software family and other sample software, the unique hereditary software genes are software genes unique to the target software family, and the unique hereditary software genes indicate the family gene characteristics of the target software family; and determining, from the unique hereditary software genes of the target software family, key software genes for identifying the target software family.

Description

Software processing method and device based on software genes and storage medium

Technical Field

The present application relates to the field of software engineering and information security technologies, and in particular, to a software processing method and apparatus based on a software gene, and a storage medium.

Background

With the rapid development of Internet technology, network security problems keep emerging. In particular, organizations that carry out long-term network attacks for political or economic motives have appeared. The malware developed by such organizations is continuously improved and mutated, and its code acquires distinctive hereditary characteristics, so that different malware families (such as APT families, ransomware families, and industrial control malware families) are formed. The attacks of these malware families cause huge economic losses to individuals, enterprises, and even countries. Therefore, quickly and accurately identifying malware and its family information is extremely important for protecting personal and property safety and for building network security and national security. The same problem also arises for non-malware software families, for example in cases of piracy and infringement.

Taking a malware family as an example, a traditional analysis method based on a software gene tag library comprises the following steps: acquiring the software to be analyzed; performing a fragmentation operation on the code of the software to be analyzed to obtain its software genome; performing a normalization operation on each software gene in the software genome to obtain a target software genome; and determining, based on the software gene library, the preset software to which each software gene in the target software genome belongs, and thereby determining the software family to which the software to be analyzed belongs.

However, the analysis method based on a software gene tag library has the following disadvantages: a large number of malware family samples must be analyzed in advance to construct the correspondence data from each gene to the software family samples; for new malware genes, family attribution information cannot be obtained because the tag library contains no corresponding data; and a massive tag library is difficult to construct and hard to iterate on, and the resulting database is too large to be embedded conveniently in products.

For the technical problems in the prior art that analyzing software and its family information involves a heavy workload, a low recognition rate, and a large data volume, and is therefore difficult to carry out, no effective solution has yet been proposed.

Disclosure of Invention

The embodiments of the present application provide a software processing method and device based on software genes, and a storage medium, to at least solve the technical problems in the prior art that analyzing software and its family information involves a heavy workload, a low recognition rate, and a large data volume, and is therefore difficult to carry out.

According to an aspect of the embodiments of the present application, there is provided a software processing method based on software genes, including: extracting sample software genes contained in sample software of a target software family; determining family software genes of the target software family according to the extracted sample software genes, wherein the family software genes are minimum indivisible, consistently executed binary code fragments contained in the sample software; removing universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family, wherein the universal hereditary software genes are software genes shared by the target software family and other sample software, the unique hereditary software genes are software genes unique to the target software family, and the unique hereditary software genes indicate the family gene characteristics of the target software family; and determining, from the unique hereditary software genes of the target software family, key software genes for identifying the target software family.

According to another aspect of the embodiments of the present application, there is also provided a software processing method based on a software gene, including: acquiring software to be identified; extracting a software gene of the software to be identified; and comparing the software genes of the software to be identified with the key software genes of the target software family to determine whether the software to be identified belongs to the target software family, wherein the key software genes are unique hereditary software genes for identifying the target software family.

According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein the method of any one of the above is performed by a processor when the program is run.

According to another aspect of the embodiments of the present application, there is also provided a software processing apparatus based on software genes, including: a first extraction module, configured to extract sample software genes contained in sample software of a target software family; a first determining module, configured to determine family software genes of the target software family according to the extracted sample software genes, wherein the family software genes are minimum indivisible, consistently executed binary code fragments contained in the sample software; a second determining module, configured to remove universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family, wherein the universal hereditary software genes are software genes shared by the target software family and other sample software, the unique hereditary software genes are software genes unique to the target software family, and the unique hereditary software genes indicate the family gene characteristics of the target software family; and a third determining module, configured to determine, from the unique hereditary software genes of the target software family, key software genes for identifying the target software family.

According to another aspect of the embodiments of the present application, there is also provided a software processing apparatus based on a software gene, including: the first acquisition module is used for acquiring the software to be identified; the second extraction module is used for extracting the software genes of the software to be identified; and a fourth determining module for comparing the software genes of the software to be identified with the key software genes of the target software family to determine whether the software to be identified belongs to the target software family, wherein the key software genes are unique hereditary software genes for identifying the target software family.

According to another aspect of the embodiments of the present application, there is also provided a software processing apparatus based on a software gene, including: a first processor; and a first memory, coupled to the first processor, for providing instructions to the first processor to process the steps of: extracting sample software genes contained in sample software of a target software family; determining family software genes of the target software family according to the extracted sample software genes, wherein the family software genes are minimum indivisible and consistently executed binary code fragments contained in the sample software; removing universal hereditary software genes from family software genes of the target software family to obtain unique hereditary software genes of the target software family, wherein the universal hereditary software genes are software genes contained by the target software family and other sample software together, the unique hereditary software genes are software genes unique to the target software family, and the unique hereditary software genes are used for indicating family gene characteristics of the target software family; and determining a key software gene for identifying the target software family from the unique inheritance software genes of the target software family.

According to another aspect of the embodiments of the present application, there is also provided a software processing apparatus based on a software gene, including: a second processor; and a second memory, coupled to the second processor, for providing instructions to the second processor to process the steps of: acquiring software to be identified; extracting a software gene of the software to be identified; and comparing the software genes of the software to be identified with the key software genes of the target software family to determine whether the software to be identified belongs to the target software family, wherein the key software genes are unique hereditary software genes for identifying the target software family.

In the embodiments of the present application, the computing device extracts the key software genes of the target software family in order to identify the family gene characteristics of that family. Because software genes unify materiality and informativeness, they can represent the hereditary properties of the sample software of a software family, so family attribution determined from software genes is more reasonable and more interpretable. In addition, because different software families have different key software genes, using key software genes identifies the family of unknown software more accurately and is less prone to misjudgment. Because the key software genes can uniquely identify the corresponding target software family, the method, unlike the software gene tag library analysis method, needs neither correspondence data from each gene to each software family sample nor a massive tag library. Compared with the prior art, the technical solution of the embodiments of the present application also does not need to build a running environment for each piece of software, perform complex preprocessing on each sample software, or carry out professional manual reverse analysis of the sample software. This solves the technical problems in the prior art that analyzing software and its family information involves a heavy workload, a low recognition rate, and a large data volume, and is therefore difficult to carry out.
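The identification aspect described above (comparing the software genes of the software to be identified with the key software genes of the target software family) can be sketched as follows. This is only a minimal illustration: the function name `belongs_to_family`, the string gene identifiers, and the `min_hits` threshold are assumptions introduced here, since the text above does not state how many key software genes must match.

```python
from typing import Set


def belongs_to_family(sample_genes: Set[str], key_genes: Set[str], min_hits: int = 1) -> bool:
    """Decide family membership by counting shared key software genes.

    min_hits is a hypothetical decision threshold, not taken from the application.
    """
    hits = sample_genes & key_genes  # key genes that also occur in the sample
    return len(hits) >= min_hits


# Hypothetical gene identifiers, for illustration only.
family_key_genes = {"kg1", "kg2", "kg3"}
unknown_software_genes = {"g7", "kg2", "g9"}
print(belongs_to_family(unknown_software_genes, family_key_genes))  # True
```

Only the small set of key software genes is needed at identification time, which is consistent with the stated goal of avoiding a massive tag library.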

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a block diagram of the hardware architecture of a computing device for implementing the method according to embodiment 1 of the present application;

FIG. 2 is a flow chart of a software processing method based on a software gene according to the first aspect of embodiment 1 of the present application;

FIG. 3 is a flow chart of a software processing method based on a software gene according to a second aspect of embodiment 1 of the present application;

FIG. 4 is a schematic diagram of a software processing apparatus based on a software gene according to the first aspect of embodiment 2 of the present application;

FIG. 5 is a schematic diagram of a software processing apparatus based on a software gene according to a second aspect of embodiment 2 of the present application;

FIG. 6 is a schematic diagram of a software processing apparatus based on a software gene according to the first aspect of embodiment 3 of the present application; and

FIG. 7 is a schematic diagram of a software processing apparatus based on a software gene according to the second aspect of embodiment 3 of the present application.

Detailed Description

In order that the technical solutions of the present application may be better understood, the technical solutions of the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:

Universal inheritance: the situation in which code is similar because developers use the same computer architecture, operating system, programming language, common code libraries, and so on. This kind of inheritance is a common characteristic of software and cannot be used to distinguish software attributes or family attribution.

Unique inheritance: the situation in which code is similar because developers within a software family use the same attack methods, private code libraries, hacking tools, programming conventions, development habits, and so on. This kind of inheritance is a characteristic unique to the software family and can be used to determine the family attribution of software.

Example 1

According to the present embodiment, there is provided a method embodiment of a software processing method based on a software gene, it being noted that the steps shown in the flowcharts of the drawings may be executed in a computer system such as a set of computer executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from that herein.

The method embodiments provided by the present embodiment may be performed in a mobile terminal, a computer terminal, a server, or a similar computing device. FIG. 1 shows a block diagram of a hardware architecture of a computing device for implementing a software processing method based on software genes. As shown in FIG. 1, the computing device may include one or more processors (which may include, but are not limited to, processing means such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory for storing data, and a transmission means for communication functions. In addition, the computing device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and does not limit the configuration of the electronic device described above. For example, the computing device may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

It should be noted that the one or more processors and/or other data processing circuits described above may be referred to herein generally as "data processing circuits." The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or may be incorporated in whole or in part into any of the other elements in the computing device. As referred to in the embodiments of the present application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination to interface).

The memory may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the software processing method based on software genes in the embodiments of the present application, and the processor executes the software programs and modules stored in the memory, thereby executing various functional applications and data processing, that is, implementing the software processing method based on software genes of the application program. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to the computing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission means is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communications provider of the computing device. In one example, the transmission means comprises a network adapter (Network Interface Controller, NIC) connectable to other network devices via the base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.

It should be noted here that, in some alternative embodiments, the computing device shown in FIG. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computing device described above.

In the above-described operating environment, according to a first aspect of the present embodiment, there is provided a software processing method based on a software gene. Fig. 2 shows a schematic flow chart of the method, and referring to fig. 2, the method includes:

S202: extracting sample software genes contained in sample software of a target software family;

S204: determining family software genes of the target software family according to the extracted sample software genes, wherein the family software genes are minimum indivisible, consistently executed binary code fragments contained in the sample software;

S206: removing universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family, wherein the universal hereditary software genes are software genes shared by the target software family and other sample software, the unique hereditary software genes are software genes unique to the target software family, and the unique hereditary software genes indicate the family gene characteristics of the target software family; and

S208: determining, from the unique hereditary software genes of the target software family, key software genes for identifying the target software family.

Specifically, the computing device first obtains the sample software of a target software family (e.g., a malware family), and then extracts the sample software genes contained in each sample software. The sample software genes are divided into software genes of binary software and software genes of non-binary software. More specifically, the computing device cuts along the execution units of the assembly code of a binary sample (i.e., sample software), or of the abstract syntax tree (AST) of a non-binary sample (i.e., sample software), to obtain the smallest indivisible, consistently executed binary code fragments, which are called the software genes of the sample. In this way the computing device extracts the contained sample software genes from each sample software (S202). The relationship between a software family and its sample software, and between a sample software and its software genes, can be expressed by the following formulas:

$$fs = \{s_1, s_2, \ldots, s_m\}$$

$$sg = \{g_1, g_2, \ldots, g_n\}$$

where $fs$ denotes the family sample software set and $s_1, \ldots, s_m$ denote the individual family sample software; $sg$ denotes the software genome of a single sample software and $g_1, \ldots, g_n$ denote the individual software genes in the software genome of that sample software.

Further, the computing device merges and deduplicates identical sample software genes among the sample software genes of the sample software, and uses the resulting sample software genes as the family software genes of the malware family (i.e., the target software family). In this way the computing device obtains the family software genes of the malware family. The family software genes are the smallest indivisible, consistently executed binary code fragments contained in the sample software. That is, a software gene (i.e., a binary code fragment) is the smallest functional fragment that cannot be divided further; the code in a software gene is either executed in its entirety or not executed at all (S204).
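The "either executed in its entirety or not executed at all" property resembles that of a basic block in compiler terminology. The sketch below shows basic-block splitting on a toy instruction list as one way such fragments could be obtained; it is not presented as the extraction engine of the application, and the instruction tuple layout, the `BRANCHES` set, and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Toy instruction: (address, mnemonic, branch target or None). Hypothetical layout.
Instr = Tuple[int, str, Optional[int]]
BRANCHES = {"jmp", "je", "jne", "ret"}  # assumed set of control-transfer mnemonics


def split_into_fragments(instrs: List[Instr]) -> List[List[Instr]]:
    """Split a linear instruction list into straight-line fragments (basic blocks)."""
    targets = {t for _, _, t in instrs if t is not None}
    fragments: List[List[Instr]] = []
    current: List[Instr] = []
    for ins in instrs:
        addr, mnemonic, _ = ins
        if addr in targets and current:  # a branch target starts a new fragment
            fragments.append(current)
            current = []
        current.append(ins)
        if mnemonic in BRANCHES:  # a control transfer ends the fragment
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments


code = [(0, "mov", None), (1, "cmp", None), (2, "je", 5), (3, "add", None),
        (4, "jmp", 6), (5, "sub", None), (6, "ret", None)]
print([[addr for addr, _, _ in frag] for frag in split_into_fragments(code)])
# [[0, 1, 2], [3, 4], [5], [6]]
```

Within each resulting fragment there is no way to enter or leave in the middle, so its instructions run together or not at all.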

The relationship between family software genes can be expressed by the following formula:

$$fg = \{g_1, g_2, \ldots, g_t\}$$

where $fg$ denotes the family software genome after merging and deduplication, and $g_1, \ldots, g_t$ denote the individual software genes in the family software genome.

Further, the family software genes of a malware family (i.e., a target software family) include universal hereditary software genes and unique hereditary software genes. The universal hereditary software genes are software genes that the malware family shares with sample software of other software families or with other non-family sample software, i.e., similar code that results from using the same computer architecture, operating system, programming language, common code libraries, and so on. Universal hereditary software genes appear not only in the sample software of the malware family but also in other sample software, and therefore cannot reflect the software gene features unique to the malware family. Unique hereditary software genes are software genes that distinguish the malware family from sample software of other software families and from other non-family sample software, i.e., software genes with unique characteristics that arise because developers within the malware family use the same attack methods, private code libraries, hacking tools, programming conventions, development habits, and so on. The unique hereditary software genes of the malware family are present only in the sample software of the malware family, so they are used to indicate the family gene characteristics of the target software family.

The computing device screens the family software genes of the malware family (i.e., the target software family) against a preset software gene library (e.g., a global software gene library) to separate them into universal hereditary software genes and unique hereditary software genes. The computing device then removes the universal hereditary software genes, leaving the unique hereditary software genes of the malware family (i.e., the target software family) (S206). The set of unique hereditary software genes can be expressed by the following formula:

$$fg' = \{g_1, g_2, \ldots, g_{t'}\}$$

where $fg'$ denotes the unique hereditary software genome remaining after screening, and $g_1, \ldots, g_{t'}$ denote the individual unique hereditary software genes.
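Steps S204 and S206 can also be read as plain set operations. The following restatement is a sketch and not a formula given in the application: the index $i$ on $sg_i$ (the software genome of sample software $s_i$) and the symbol $ug$ for the set of universal hereditary software genes found in $fg$ are notation introduced here for illustration.

```latex
\begin{align}
fg  &= \bigcup_{i=1}^{m} sg_i, \\
fg' &= fg \setminus ug, \qquad ug \subseteq fg, \\
t'  &= |fg'| = t - |ug|.
\end{align}
```

Under this reading, $t'$ can never exceed $t$, and $fg' = fg$ only when none of the family software genes is found to be universal.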

Further, the computing device obtains, for each unique hereditary software gene of the malware family (i.e., the target software family), the number of sample software of the family that the gene covers, selects unique hereditary software genes according to that coverage count, and uses the selected genes as key software genes. The key software genes are the identifying software genes of the malware family (i.e., the target software family) (S208). It should be noted that "cover" in this embodiment means that a sample software contains the unique hereditary software gene. For example, if sample software 1 and sample software 2 both contain unique hereditary software gene 1, then unique hereditary software gene 1 covers sample software 1 and sample software 2. The sample software covered by a software gene is expressed as follows:

$$gs = \{s_1, s_2, \ldots, s_n\}$$

where $gs$ denotes the set of all sample software covered by a given software gene, and $s_1, \ldots, s_n$ denote the individual sample software covered by that software gene. The key software genes of the target software family are expressed as follows:

$$fkg = \{kg_1, kg_2, \ldots, kg_m\}$$

where $fkg$ denotes the key software genome of the software family, and $kg_1, \ldots, kg_m$ denote the individual key software genes of the software family.
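The coverage-based selection of S208 can be sketched as follows. The text above says only that genes are selected "according to" their coverage count, so the top-k rule used here, along with the function names and the sample data, is an assumption for illustration rather than the selection rule of the application.

```python
from typing import Dict, List, Set


def coverage(unique_genes: Set[str], samples: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each unique gene to gs, the set of sample names that contain it."""
    return {g: {name for name, genes in samples.items() if g in genes}
            for g in unique_genes}


def select_key_genes(unique_genes: Set[str], samples: Dict[str, Set[str]],
                     top_k: int = 2) -> List[str]:
    """Pick the top_k genes by coverage count (hypothetical rule); ties break by name."""
    cov = coverage(unique_genes, samples)
    ranked = sorted(unique_genes, key=lambda g: (-len(cov[g]), g))
    return ranked[:top_k]


# Hypothetical family samples and unique hereditary genes.
samples = {"s1": {"g1", "g2", "g5"}, "s2": {"g1", "g3", "g5"}, "s3": {"g1", "g2"}}
unique = {"g1", "g2", "g3"}
print(select_key_genes(unique, samples))  # ['g1', 'g2']
```

Here `g1` covers all three samples and `g2` covers two, so they rank ahead of `g3`, which covers only one.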

As described in the background, the analysis method based on a software gene tag library has the following disadvantages: a large number of malware family samples must be analyzed in advance to construct the correspondence data from each gene to the software family samples; for new malware genes, family attribution information cannot be obtained because the tag library contains no corresponding data; and a massive tag library is difficult to construct and hard to iterate on, and the resulting database is too large to be embedded conveniently in products.

According to the technical solution of this embodiment, the computing device extracts the key software genes of the target software family in order to identify the family gene characteristics of that family. Because software genes unify materiality and informativeness, they can represent the hereditary properties of the sample software of a software family, so family attribution determined from software genes is more reasonable and more interpretable. In addition, because different software families have different key software genes, using key software genes identifies the family of unknown software more accurately and is less prone to misjudgment. Because the key software genes can uniquely identify the corresponding target software family, the method, unlike the software gene tag library analysis method, needs neither correspondence data from each gene to each software family sample nor a massive tag library. Compared with the prior art, the technical solution of the embodiments of the present application also does not need to build a running environment for each piece of software, perform complex preprocessing on each sample software, or carry out professional manual reverse analysis of the sample software. This solves the technical problems in the prior art that analyzing software and its family information involves a heavy workload, a low recognition rate, and a large data volume, and is therefore difficult to carry out.

Optionally, determining the family software genes of the target software family from the extracted sample software genes comprises: the sample software genes of the sample software of the target software family are combined and subjected to deduplication processing to determine family software genes of the target software family.

Specifically, the computing device obtains all sample software $fs$ in the malware family (i.e., the target software family) and uses a preset software gene extraction engine to extract the sample software genes of each sample software (i.e., $s_1, s_2, \ldots, s_m$). The computing device then compares all of the sample software genes (i.e., the software genomes of sample software $s_1, s_2, \ldots, s_m$: $sg_1 = \{g_1, g_2, \ldots, g_n\}$, $sg_2 = \{g_1, g_3, \ldots, g_{n-1}\}$, …, $sg_m = \{g_4, g_5, \ldots, g_{n-2}\}$) to find identical sample software genes (e.g., $g_1$ in $sg_1$ and $g_1$ in $sg_2$). The computing device then merges and deduplicates the identical sample software genes (i.e., $g_1$ in $sg_1$ and $g_1$ in $sg_2$), and takes the resulting sample software gene (i.e., $g_1$) together with the other sample software genes (i.e., the remaining genes $\{g_2, \ldots, g_n\}$ of $sg_1$, $\{g_3, \ldots, g_{n-1}\}$ of $sg_2$, …, and $\{g_4, g_5, \ldots, g_{n-2}\}$ of $sg_m$) as the family software genes of the malware family (i.e., the target software family), that is, $fg = \{g_1, g_2, \ldots, g_t\}$.

Therefore, merging and deduplicating the repeated sample software genes yields a more compact set of sample software genes, which simplifies subsequent operations on them and improves working efficiency.
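The merge-and-deduplicate step can be sketched as a set union. This is a minimal illustration only: it assumes each software gene is represented by a string identifier (the patent describes genes as binary code fragments), and the function name `merge_family_genes` and the sample data are hypothetical.

```python
def merge_family_genes(sample_genes):
    """Merge per-sample gene lists into one deduplicated family gene set.

    sample_genes maps a sample name to the genes extracted from it.
    A Python set keeps one copy of each gene, so the union performs
    the merging and the deduplication in a single pass.
    """
    family_genes = set()
    for genes in sample_genes.values():
        family_genes |= set(genes)
    return family_genes


# Hypothetical samples: g1 appears in both s1 and s2 and is kept once.
samples = {
    "s1": ["g1", "g2", "g5"],
    "s2": ["g1", "g3", "g5"],
    "s3": ["g4", "g5", "g6"],
}
fg = merge_family_genes(samples)
```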

Optionally, removing the universal inheritance software genes from the family software genes of the target software family to obtain unique inheritance software genes of the target software family, comprising: matching family software genes of the target software family with software genes of all software in a preset software gene library, and determining universal hereditary software genes of the target software family; and removing the determined universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family.

Specifically, the computing device matches the family software genes of the malware family (i.e., the target software family) against the software genes of each piece of software in a preset software gene library (e.g., a global software gene library) to determine which of the family software genes are universal hereditary software genes and which are unique hereditary software genes. The global software gene library stores all relevant information about all software genes and is used to query the type or family of a software gene. The computing device then removes the determined universal hereditary software genes from the family software genes of the malware family (i.e., the target software family), yielding the unique hereditary software genes of the malware family (i.e., the target software family). For example, if the universal hereditary software genes among the family software genes fg = {g₁, g₂, …, g_t} of the malware family (i.e., the target software family) are g₅ and g₆, the computing device removes g₅ and g₆, so that the unique hereditary software genes of the malware family (i.e., the target software family) are fg′ = {g₁, g₂, g₃, g₄, g₇, …, g_t′}.

Therefore, this technical solution removes from the family software genes of the target software family the universal hereditary software genes that the family shares with samples of other software families or with non-family samples, thereby obtaining unique hereditary software genes that carry the family's characteristics, which makes the key software genes easier and faster to extract.
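A minimal sketch of this removal step follows, assuming the gene library can be modeled as a mapping from software name to a set of gene identifiers and that the family's own samples are known by name. The function name `split_family_genes`, the library layout, and the data are illustrative assumptions, not the patent's actual data model.

```python
def split_family_genes(family_genes, gene_library, family_samples):
    """Split family genes into (universal, unique) sets.

    A family gene is treated as universal if it also occurs in any
    library entry that is not one of the family's own samples.
    """
    universal = set()
    for software, genes in gene_library.items():
        if software in family_samples:
            continue  # skip the family's own samples
        universal |= family_genes & set(genes)
    unique = family_genes - universal
    return universal, unique


# Hypothetical data mirroring the example above: g5 and g6 are shared.
fg = {f"g{i}" for i in range(1, 8)}
library = {
    "s1": {"g1", "g2", "g5"},        # a sample of the target family
    "other_a": {"g5", "x1"},         # unrelated software
    "other_b": {"g6", "x2"},         # unrelated software
}
universal, unique = split_family_genes(fg, library, family_samples={"s1"})
```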

Optionally, the operation of determining, from the unique hereditary software genes of the target software family, the key software genes for identifying the target software family comprises: determining the number of family sample software covered by each unique hereditary software gene; and determining the key software genes of the target software family from the unique hereditary software genes according to the number of family sample software corresponding to each unique hereditary software gene.

Specifically, the computing device first determines the number of family sample software covered by each unique hereditary software gene of the malware family (i.e., the target software family). For example, if sample software 1 and sample software 2 both contain unique hereditary software gene 1, then unique hereditary software gene 1 covers 2 family sample software (i.e., sample software 1 and sample software 2). The computing device then sorts the unique hereditary software genes in descending order of the number of family sample software corresponding to each unique hereditary software gene. For example, when the top 10 unique hereditary software genes in descending order exactly cover all sample software of the malware family (i.e., the target software family), the computing device determines these 10 unique hereditary software genes as the key software genes of the malware family (i.e., the target software family), which are then used to identify that family.

Therefore, the technical scheme can rapidly determine the key software genes according to the number of family sample software corresponding to each unique hereditary software gene, so that the process of determining the key software genes is more convenient and faster.
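The coverage count described above can be sketched as follows. Genes and samples are represented by string names purely for illustration, and `coverage_counts` is a hypothetical helper rather than part of the described method.

```python
from collections import Counter


def coverage_counts(sample_genes, unique_genes):
    """Count, for each unique gene, how many family samples contain it."""
    counts = Counter()
    for genes in sample_genes.values():
        # set() ensures a gene repeated inside one sample counts once
        for gene in set(genes) & unique_genes:
            counts[gene] += 1
    return counts


# Hypothetical data: gene1 occurs in sample1 and sample2, so it covers 2.
samples = {
    "sample1": ["gene1", "gene2"],
    "sample2": ["gene1", "gene3"],
    "sample3": ["gene3"],
}
counts = coverage_counts(samples, unique_genes={"gene1", "gene2", "gene3"})
ranked = sorted(counts, key=lambda g: (-counts[g], g))  # descending by count
```

Ties are broken alphabetically here only to make the ordering deterministic; the source does not specify a tie-breaking rule.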

Optionally, determining the key software genes of the target software family from the unique inheritance software genes according to the number of family sample software corresponding to each unique inheritance software gene comprises: sequencing the unique hereditary software genes according to the number of family sample software corresponding to each unique hereditary software gene, wherein the number of the sample software is used for reflecting the importance degree of the unique hereditary software genes; sequentially selecting the most important unique hereditary software genes according to the sequence until the selected unique hereditary software genes cover all sample software of the target software family; and determining all the selected unique inheritance software genes as key software genes of the target software family.

Specifically, the computing device uses a preset sorting algorithm to sort the unique hereditary software genes of the malware family (i.e., the target software family) in descending order of the number of family sample software each gene covers. The computing device then selects unique hereditary software genes from the sorted list. For example, suppose the malware family (i.e., the target software family) has 30 sample software in total, s₁, s₂, …, s₃₀. Unique hereditary software gene g₁ covers family sample software s₁, s₂, …, s₁₅, so it covers 15 family sample software; unique hereditary software gene g₂ covers s₁₆, s₁₇, …, s₂₅, so it covers 10; unique hereditary software gene g₃ covers s₂₆, s₂₇, …, s₃₀, so it covers 5; …; and unique hereditary software gene g_t′ covers s₁, so it covers 1. Following the sorted order, the computing device starts with g₁, which covers the most family sample software, and then selects g₂ and g₃, stopping once the selected genes (i.e., g₁, g₂, and g₃) cover all sample software s₁, s₂, …, s₃₀ of the malware family (i.e., the target software family). The computing device then determines the selected unique hereditary software genes (i.e., g₁, g₂, and g₃) as the key software genes of the malware family (i.e., the target software family).

According to this technical solution, the unique hereditary software genes are sorted in descending order of the number of family sample software they cover, and the genes that together just cover all sample software are selected, so that the key software genes are obtained quickly and kept few in number.
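The sort-then-select procedure can be sketched as below, using the 30-sample example from the description. The sketch follows the text literally: genes are taken in descending order of coverage count until every sample is covered, without re-ranking by newly covered samples as a greedy set-cover variant would. The function name `select_key_genes`, the gene label `g9` standing in for g_t′, and the alphabetical tie-break are assumptions made for illustration.

```python
def select_key_genes(coverage, all_samples):
    """Pick genes in descending coverage order until all samples are covered.

    coverage maps each unique gene to the set of family samples containing it.
    If the genes cannot cover every sample, all genes are returned.
    """
    ordered = sorted(coverage, key=lambda g: (-len(coverage[g]), g))
    selected, covered = [], set()
    for gene in ordered:
        if covered >= all_samples:
            break  # every sample is already covered
        selected.append(gene)
        covered |= coverage[gene]
    return selected


# Hypothetical data mirroring the example: 30 samples s1..s30.
all_samples = {f"s{i}" for i in range(1, 31)}
coverage = {
    "g1": {f"s{i}" for i in range(1, 16)},   # covers 15 samples
    "g2": {f"s{i}" for i in range(16, 26)},  # covers 10 samples
    "g3": {f"s{i}" for i in range(26, 31)},  # covers 5 samples
    "g9": {"s1"},                            # covers 1 sample
}
key_genes = select_key_genes(coverage, all_samples)
```

With this data the loop stops after g1, g2, and g3, because together they cover all 30 samples.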

Optionally, the method further comprises: acquiring software to be identified; extracting a software gene of the software to be identified; and comparing the software genes of the software to be identified with the key software genes of the target software family to determine whether the software to be identified belongs to the target software family.

Specifically, when the computing device needs to identify a piece of software and determine whether it belongs to the malware family (i.e., the target software family), it first acquires that software (i.e., the software to be identified), then extracts the software genes of the software to be identified and obtains the key software genes of the malware family (i.e., the target software family). The computing device then compares the software genes of the software to be identified with the key software genes of the malware family (i.e., the target software family). When a software gene of the software to be identified is the same as any one or more of the key software genes, the computing device determines that the software to be identified belongs to the malware family (i.e., the target software family). Otherwise, the software to be identified does not belong to the malware family (i.e., the target software family).

Therefore, by comparing the software genes of the software to be identified with the extracted key software genes of the software family, this technical solution can effectively identify which family the software to be identified belongs to. Because different software families have different key software genes, using the key software genes identifies the family of an unknown sample more accurately and is less prone to misjudgment.
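The membership rule stated above (a match on any one or more key software genes) reduces to a set-intersection test. The sketch below assumes string gene identifiers; `belongs_to_family` and the sample data are hypothetical names introduced for illustration.

```python
def belongs_to_family(software_genes, key_genes):
    """Return True if the software shares at least one key gene with the family."""
    return not set(software_genes).isdisjoint(key_genes)


# Hypothetical key genes of a target family and two unknown programs.
key_genes = {"g1", "g2", "g3"}
suspect = ["x7", "g2", "x9"]   # contains key gene g2
benign = ["x1", "x2"]          # contains no key gene

is_member = belongs_to_family(suspect, key_genes)
is_not_member = belongs_to_family(benign, key_genes)
```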

In addition, although the present embodiment describes a process of software processing based on a software gene by taking a malware family as an example, the same applies to a non-malware family, so that by the method described in the present application, it is possible to determine whether or not the software belongs to a target software family by comparing a software gene of a specified software with a key software gene of the target software family, thereby helping to determine whether or not the software violates copyrights of the owners of the target software family, thereby effectively preventing infringement and software piracy. The specific method is not described here in detail.

Thus, according to the first aspect of the present embodiment, the computing device identifies the family gene characteristics of the target software family by extracting the key software genes of that family. Because software genes combine material and informational properties, they can represent the heritable traits of the sample software of a software family, so attributing software to a family by its software genes is more reasonable and easier to interpret. In addition, because different software families have different key software genes, using key software genes identifies the family of unknown software more accurately and is less prone to misjudgment. Because the key software genes uniquely identify the corresponding target software family, the method, unlike software gene tag library analysis, needs neither mapping data from each gene to each software family sample nor a massive tag library. Compared with the prior art, the technical solution of this embodiment therefore does not need to build a running environment for each piece of software, perform complex preprocessing on each piece of sample software, or carry out professional manual reverse analysis of the sample software. It thereby solves the prior-art technical problems of heavy workload, low recognition rate, and data volumes too large to handle when analyzing software and its family information.

Further, according to a second aspect of the present embodiment, there is provided a software processing method based on a software gene. Fig. 3 shows a schematic flow chart of the method, and referring to fig. 3, the method includes:

S302: acquiring software to be identified;

S304: extracting a software gene of the software to be identified; and

S306: comparing the software genes of the software to be identified with the key software genes of the target software family to determine whether the software to be identified belongs to the target software family, wherein the key software genes are unique hereditary software genes for identifying the target software family.

Specifically, when the computing device needs to identify a piece of software and determine whether it belongs to the malware family (i.e., the target software family), it first acquires that software (i.e., the software to be identified), then extracts the software genes of the software to be identified and obtains the key software genes of the malware family (i.e., the target software family), where the key software genes are the software genes used to identify the target software family. The computing device then compares the software genes of the software to be identified with the key software genes of the malware family (i.e., the target software family). When a software gene of the software to be identified is the same as any one or more of the key software genes, the computing device determines that the software to be identified belongs to the malware family (i.e., the target software family). Otherwise, the software to be identified does not belong to the malware family (i.e., the target software family).

In addition, the computing device may compare the software genes of the software to be identified with key software genes of a plurality of software families, thereby determining which software family the software to be identified belongs to and obtaining family information of the software to be identified.
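Comparison against several families at once can be sketched by repeating the intersection test per family. The mapping layout, the family names, and the function `matching_families` are illustrative assumptions; the source does not say how multiple matches should be resolved, so the sketch simply reports every family with at least one shared key gene.

```python
def matching_families(software_genes, key_genes_by_family):
    """Map each matching family to the key genes found in the software."""
    genes = set(software_genes)
    matches = {}
    for family, key_genes in key_genes_by_family.items():
        shared = genes & set(key_genes)
        if shared:
            matches[family] = shared
    return matches


# Hypothetical key gene sets for two families.
key_genes_by_family = {
    "family_a": {"a1", "a2"},
    "family_b": {"b1", "b2", "b3"},
}
result = matching_families(["x1", "b2", "b3"], key_genes_by_family)
```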

According to the second aspect of the present embodiment, comparing the software genes of the software to be identified with the extracted key software genes of the software family makes it possible to effectively identify which family the software to be identified belongs to. Because different software families have different key software genes, using the key software genes identifies the family of an unknown sample more accurately and is less prone to misjudgment.

Further, referring to fig. 1, according to a third aspect of the present embodiment, there is provided a storage medium. The storage medium includes a stored program, wherein the method of any one of the above is performed by a processor when the program is run.

Thus, according to this embodiment, the computing device identifies the family gene characteristics of the target software family by extracting the key software genes of that family. Because software genes combine material and informational properties, they can represent the heritable traits of the sample software of a software family, so attributing software to a family by its software genes is more reasonable and easier to interpret. In addition, because different software families have different key software genes, using key software genes identifies the family of unknown software more accurately and is less prone to misjudgment. Because the key software genes uniquely identify the corresponding target software family, the method, unlike software gene tag library analysis, needs neither mapping data from each gene to each software family sample nor a massive tag library. Compared with the prior art, the technical solution of this embodiment therefore does not need to build a running environment for each piece of software, perform complex preprocessing on each piece of sample software, or carry out professional manual reverse analysis of the sample software. It thereby solves the prior-art technical problems of heavy workload, low recognition rate, and data volumes too large to handle when analyzing software and its family information.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.

Example 2

Fig. 4 shows a software processing apparatus 400 based on a software gene according to the first aspect of the present embodiment, the apparatus 400 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 4, the apparatus 400 includes: a first extraction module 410, configured to extract a sample software gene included in sample software of a target software family; a first determining module 420, configured to determine a family software gene of the target software family according to the extracted sample software gene, where the family software gene is a minimum indivisible, consistently executed binary code segment included in the sample software; a second determining module 430, configured to remove a generic genetic software gene from family software genes of the target software family to obtain unique genetic software genes of the target software family, where the generic genetic software gene is a software gene that is included in the target software family together with other sample software, the unique genetic software gene is a software gene that is unique to the target software family, and the unique genetic software gene is used to indicate a family gene characteristic of the target software family; and a third determination module 440 for determining key software genes for identifying the target software family from among the unique inheritance software genes of the target software family.

Optionally, the first determining module 420 includes: a first determination submodule for merging and deduplicating the sample software genes of each sample software of the target software family to determine the family software genes of the target software family.

Optionally, the second determining module 430 includes: the second determining submodule is used for matching family software genes of the target software family with software genes of all software in a preset software gene library and determining general hereditary software genes of the target software family; and a third determination submodule for removing the determined universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family.

Optionally, the third determining module 440 includes: a fourth determination submodule for determining the number of family sample software respectively covered by the unique genetic software genes; and a fifth determination submodule for determining key software genes of the target software family from the unique hereditary software genes according to the number of family sample software corresponding to each unique hereditary software gene.

Optionally, the fifth determining submodule includes: the sequencing unit is used for sequencing the unique hereditary software genes according to the number of family sample software corresponding to each unique hereditary software gene, wherein the number of the sample software is used for reflecting the importance degree of the unique hereditary software genes; a gene selection unit for sequentially selecting the most important unique hereditary software genes according to the sequence until the selected unique hereditary software genes cover all sample software of the target software family; and a first determining unit for determining all the selected unique inheritance software genes as key software genes of the target software family.

The apparatus 400 further comprises: the acquisition module is used for acquiring the software to be identified; the extraction module is used for extracting a software gene of the software to be identified; and the determining module is used for comparing the software genes of the software to be identified with the key software genes of the target software family and determining whether the software to be identified belongs to the target software family.

Further, fig. 5 shows a software processing apparatus 500 based on a software gene according to the second aspect of the present embodiment, the apparatus 500 corresponding to the method according to the second aspect of embodiment 1. Referring to fig. 5, the apparatus 500 includes: a first obtaining module 510, configured to obtain software to be identified; a second extraction module 520 for extracting a software gene of the software to be identified; and a fourth determining module 530, configured to compare the software genes of the software to be identified with the key software genes of the target software family, and determine whether the software to be identified belongs to the target software family, where the key software genes are unique inheritance software genes for identifying the target software family.

Example 3

Fig. 6 shows a software processing apparatus 600 based on a software gene according to the first aspect of the present embodiment, the apparatus 600 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 6, the apparatus 600 includes: a first processor 610; and a first memory 620 coupled to the first processor 610 for providing instructions to the first processor 610 for processing the following processing steps: extracting sample software genes contained in sample software of a target software family; determining family software genes of the target software family according to the extracted sample software genes, wherein the family software genes are minimum indivisible and consistently executed binary code fragments contained in the sample software; removing universal hereditary software genes from family software genes of the target software family to obtain unique hereditary software genes of the target software family, wherein the universal hereditary software genes are software genes contained by the target software family and other sample software together, the unique hereditary software genes are software genes unique to the target software family, and the unique hereditary software genes are used for indicating family gene characteristics of the target software family; and determining a key software gene for identifying the target software family from the unique inheritance software genes of the target software family.

Optionally, the first memory 620 is further configured to provide the first processor 610 with instructions for processing the following processing steps: acquiring software to be identified; extracting a software gene of the software to be identified; and comparing the software genes of the software to be identified with the key software genes of the target software family to determine whether the software to be identified belongs to the target software family.

Further, fig. 7 shows a software processing apparatus 700 based on a software gene according to the second aspect of the present embodiment, the apparatus 700 corresponding to the method according to the second aspect of embodiment 1. Referring to fig. 7, the apparatus 700 includes: a second processor 710; and a second memory 720, coupled to the second processor 710, for providing instructions to the second processor 710 for processing the following processing steps: acquiring software to be identified; extracting a software gene of the software to be identified; and comparing the software genes of the software to be identified with the key software genes of the target software family to determine whether the software to be identified belongs to the target software family, wherein the key software genes are unique hereditary software genes for identifying the target software family.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The above-described apparatus embodiments are merely exemplary. For example, the division into units is merely a division by logical function and may be implemented differently in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or various other media capable of storing program code.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the present invention.

Claims

1. A software processing method based on a software gene, comprising:

extracting sample software genes contained in sample software of a target software family;

determining family software genes of the target software family according to the extracted sample software genes, wherein the family software genes are minimum indivisible and consistently executed binary code fragments contained in the sample software;

removing universal hereditary software genes from family software genes of the target software family to obtain unique hereditary software genes of the target software family, wherein the universal hereditary software genes are software genes contained in the target software family together with other sample software, the unique hereditary software genes are software genes unique to the target software family, and the unique hereditary software genes are used for indicating family gene characteristics of the target software family;

wherein removing the universal hereditary software genes from the family software genes of the target software family to obtain the unique hereditary software genes of the target software family comprises:

matching the family software genes of the target software family with software genes of all software in a preset software gene library, and determining universal hereditary software genes of the target software family; and

removing the determined universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family; and

determining a key software gene for identifying the target software family from the unique inheritance software genes of the target software family;

wherein the operation of determining, from the unique hereditary software genes of the target software family, the key software genes for identifying the target software family comprises:

determining the number of family sample software respectively covered by the unique genetic software genes; and

determining key software genes of the target software family from the unique hereditary software genes according to the number of family sample software corresponding to each unique hereditary software gene;

wherein the operation of determining the key software genes of the target software family from the unique hereditary software genes according to the number of family sample software corresponding to each unique hereditary software gene comprises:

sorting the unique hereditary software genes according to the number of family sample software corresponding to each unique hereditary software gene, wherein the number of the sample software is used for reflecting the importance degree of the unique hereditary software genes;

sequentially selecting the most important unique hereditary software genes according to the sequence until the selected unique hereditary software genes cover all sample software of the target software family; and

determining all the selected unique hereditary software genes as the key software genes of the target software family.

2. The method of claim 1, wherein determining the family software genes of the target software family from the extracted sample software genes comprises:

and combining and deduplicating sample software genes of each sample software of the target software family to determine family software genes of the target software family.

3. The method as recited in claim 1, further comprising:

acquiring software to be identified;

extracting a software gene of the software to be identified;

comparing the software genes of the software to be identified with the key software genes of the target software family to determine whether the software to be identified belongs to the target software family.
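The comparison step can be sketched as a set intersection between the genes of the software to be identified and the key software genes. The claim does not state how the comparison result is turned into a decision, so the rule used here (at least `min_hits` shared key genes) is purely an assumption, as are the function and parameter names.

```python
from typing import Iterable


def belongs_to_family(candidate_genes: Iterable[str],
                      key_genes: Iterable[str],
                      min_hits: int = 1) -> bool:
    """Decide family membership from shared key genes.

    min_hits is a hypothetical threshold; the source text does not
    specify the decision rule applied after the comparison.
    """
    # Key genes that also appear in the software to be identified.
    hits = set(candidate_genes) & set(key_genes)
    return len(hits) >= min_hits


if __name__ == "__main__":
    print(belongs_to_family(["x", "g1", "y"], ["g1", "g3"]))
```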

4. A software processing method based on a software gene, comprising:

acquiring software to be identified;

extracting a software gene of the software to be identified; and

comparing the software genes of the software to be identified with key software genes of a target software family to determine whether the software to be identified belongs to the target software family, wherein the key software genes are unique hereditary software genes for identifying the target software family;

extracting sample software genes contained in sample software of the target software family;

removing the determined universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family;

and

5. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 4 is performed by a processor when the program is run.

6. A software processing apparatus based on a software gene, comprising:

a first extraction module for extracting a sample software gene contained in sample software of a target software family;

a first determining module for determining family software genes of the target software family from the extracted sample software genes, wherein a software gene is a smallest indivisible, consistently executed binary code segment contained in the sample software;

a second determining module for removing universal hereditary software genes from the family software genes of the target software family to obtain unique hereditary software genes of the target software family, wherein the universal hereditary software genes are software genes that the target software family shares with other sample software, the unique hereditary software genes are software genes unique to the target software family, and the unique hereditary software genes indicate family gene characteristics of the target software family; wherein the operation of removing the universal hereditary software genes from the family software genes of the target software family to obtain the unique hereditary software genes of the target software family comprises:

a third determining module for determining, from among the unique hereditary software genes of the target software family, key software genes for identifying the target software family;

7. A software processing apparatus based on a software gene, comprising:

a first acquisition module for acquiring software to be identified;

a second extraction module for extracting software genes of the software to be identified; and

a fourth determining module for comparing the software genes of the software to be identified with key software genes of a target software family to determine whether the software to be identified belongs to the target software family, wherein the key software genes are unique hereditary software genes for identifying the target software family; and

and