CN113656763B - Method and device for determining feature vector of applet and electronic equipment - Google Patents

Method and device for determining feature vector of applet and electronic equipment

Info

Publication number
CN113656763B
Authority
CN
China
Prior art keywords
applet
feature
vector
strings
string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110926708.XA
Other languages
Chinese (zh)
Other versions
CN113656763A (en)
Inventor
郑黄成
欧阳瑜
李佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AlipayCom Co ltd
Original Assignee
AlipayCom Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AlipayCom Co ltd filed Critical AlipayCom Co ltd
Priority to CN202110926708.XA priority Critical patent/CN113656763B/en
Publication of CN113656763A publication Critical patent/CN113656763A/en
Application granted granted Critical
Publication of CN113656763B publication Critical patent/CN113656763B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/12 Protecting executable software
    • G06F21/121 Restricting unauthorised execution of programs
    • G06F21/125 Restricting unauthorised execution of programs by manipulating the program code, e.g. source code, compiled code, interpreted code, machine code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis

Abstract

The embodiments of the present application provide a method, an apparatus, and an electronic device for determining an applet feature vector, which can generate a machine-recognizable vector that accurately expresses the features of an applet. The method for determining the feature vector of the applet comprises: sequentially extracting a plurality of feature strings from program data of an applet, wherein the program data comprises at least one of the following categories: the package file structure of the applet, the static code files of the applet, and the dynamic running data of the applet; generating a feature string sequence of the applet from the plurality of feature strings; converting the feature string sequence of the applet into a feature string vector; and inputting the feature string vector into a trained deep learning model to generate the feature vector of the applet.

Description

Method and device for determining feature vector of applet and electronic equipment
[ field of technology ]
The embodiment of the application relates to the technical field of applets, in particular to a method, a device and electronic equipment for determining feature vectors of applets.
[ background Art ]
An applet is an application that can be used without being downloaded and installed, and it usually depends on an applet platform (another application). After downloading and installing an application that can serve as an applet platform, a user can enter the applet through an applet entry provided in that application (such as an applet icon or an option in the applet search results) and use the functions the applet provides.
[ invention ]
The embodiment of the application provides a method, a device and electronic equipment for determining an applet feature vector, so as to generate a vector which can be identified by a machine to accurately express the applet feature.
In a first aspect, embodiments of the present application provide a method for determining an applet feature vector, the method comprising: sequentially extracting a plurality of feature strings from program data of an applet, wherein the program data comprises at least one of the following categories: the package file structure of the applet, the static code files of the applet, and the dynamic running data of the applet; generating a feature string sequence of the applet from the plurality of feature strings; converting the feature string sequence of the applet into a feature string vector; and inputting the feature string vector into a trained deep learning model to generate the feature vector of the applet.
In one possible implementation manner, converting the feature string sequence of the applet into a feature string vector includes: and replacing each characteristic character string in the characteristic character string sequence with a corresponding digital index code according to the mapping relation between the character string and the digital index code in the preset index mapping table so as to obtain a characteristic character string vector.
In one possible implementation manner, in a case where the program data includes a plurality of categories, generating a feature string sequence of the applet according to the plurality of feature strings includes: extracting no more than a preset number of characteristic character strings from the characteristic character strings corresponding to each kind of program data respectively; combining the extracted characteristic strings to obtain a characteristic string sequence.
In one possible implementation manner, the program data includes a package file structure of an applet, and sequentially extracting a plurality of feature strings from the program data of the applet includes: and extracting the file name and the file type suffix of each file according to the structural sequence of the package file structure to obtain file name characteristic character strings of each file, wherein each file name characteristic character string comprises the file name and the file type suffix of the corresponding file.
In one possible implementation manner, generating the feature string sequence of the applet according to the plurality of feature strings includes: extracting, from the file name feature strings obtained from the package file structure, the strings whose file type suffix is a target file type, to obtain the feature strings corresponding to the package file structure; and generating the feature string sequence from the extracted strings.
In one possible implementation, the program data includes static code files of the applet, and sequentially extracting a plurality of feature strings from the program data of the applet includes: selecting a plurality of target code files from the static code files of the applet; matching a preset regular expression in each target code file, wherein the preset regular expression comprises one or more target strings and a matching rule for each target string; and splitting each hit code segment into a plurality of strings to obtain the plurality of feature strings.
In one possible implementation manner, the program data includes dynamic running data of the applet, and sequentially extracting a plurality of feature strings from the program data of the applet includes: running the applet; capturing the requests generated while the applet runs; matching preset strings in the requests, wherein each preset string represents the name of one type of information carried in a request; and splitting each hit request to obtain the plurality of feature strings.
In one possible implementation manner, before each feature string in the feature string sequence is replaced by the numeric index code of the corresponding string according to the mapping relationship between the string and the numeric index code in the preset index mapping table, the method further includes: determining unrepeated character strings which appear in a plurality of characteristic character strings and do not appear in a preset index mapping table, so as to obtain unknown character strings; assigning a non-duplicate numerical index code to each unknown string; and storing the mapping relation between the unknown character strings and the corresponding digital index codes in a preset index mapping table so as to update the preset index mapping table.
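The table update and the index replacement described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the example table, and the choice to start index codes at 1 are all assumptions made for the example.

```python
from typing import Dict, List


def update_index_table(table: Dict[str, int], feature_strings: List[str]) -> Dict[str, int]:
    """Assign a new, non-repeating index code to every string missing from the table."""
    # Assumption: codes are positive integers and new codes continue after the largest one.
    next_code = max(table.values(), default=0) + 1
    for s in feature_strings:
        if s not in table:  # an "unknown string": seen in the applet, absent from the table
            table[s] = next_code
            next_code += 1
    return table


def to_index_vector(table: Dict[str, int], sequence: List[str]) -> List[int]:
    """Replace each feature string in the sequence with its numeric index code."""
    return [table[s] for s in sequence]


# Hypothetical preset table and feature string sequence.
index_table = {"index.js": 1, "webview.axml": 2}
feature_sequence = ["index.js", "body.jpg", "webview.axml", "body.jpg"]

update_index_table(index_table, feature_sequence)
index_vector = to_index_vector(index_table, feature_sequence)
print(index_vector)  # [1, 3, 2, 3]
```

Repeated strings map to the same code, so the resulting vector preserves both the order of extraction and the identity of each string.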
In one possible implementation manner, generating the feature string sequence of the applet according to the plurality of feature strings includes: calculating the term frequency-inverse document frequency (TF-IDF) for each string in the updated preset index mapping table to obtain a score for each string in the table; and generating the feature string sequence of the applet from those feature strings, among the plurality of feature strings, whose scores exceed a preset score.
In one possible implementation, before inputting the feature string vector into the trained deep learning model to generate the feature vector of the applet, the method further includes: training an encoding-decoding model with a plurality of training vectors, wherein each training vector is the feature string vector of one applet, the encoding-decoding model comprises an encoding model and a decoding model, the encoding model encodes a training vector to obtain an encoded output vector, the decoding model decodes the output vector of the encoding model to obtain a decoded output vector, and the optimization goal of the training is to reduce a loss value calculated from the decoded output vector and the training vector; and, upon determining that the training convergence condition is reached, obtaining the trained encoding model.
In one possible implementation manner, after the feature string vector is input into the trained deep learning model to generate the feature vector of the applet, the method further includes: and determining the similarity of the applet and other applets according to the feature vector of the applet and the feature vector of other applets.
In one possible implementation manner, after determining the similarity between the applet and other applets, the method further includes: acquiring the preset labels of the other applets; and determining the label of the applet according to the preset labels of the other applets.
In one possible implementation manner, the preset tag is used for marking whether the corresponding applet is a malicious applet.
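As an illustration of the similarity and labelling steps above, the sketch below compares feature vectors with cosine similarity and copies the label of the most similar labelled applet. Both choices are assumptions: the text does not name a similarity measure or a labelling rule, and the applet names, vectors, and labels are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_from_most_similar(
    vector: List[float], labelled: Dict[str, Tuple[List[float], str]]
) -> Tuple[str, str, float]:
    """Return (name, label, similarity) of the labelled applet closest to `vector`."""
    best_name = max(labelled, key=lambda name: cosine_similarity(vector, labelled[name][0]))
    best_vector, best_label = labelled[best_name]
    return best_name, best_label, cosine_similarity(vector, best_vector)


# Hypothetical feature vectors of applets that already carry a preset label.
known_applets = {
    "applet_a": ([1.0, 0.0], "malicious"),
    "applet_b": ([0.0, 1.0], "benign"),
}
name, label, similarity = label_from_most_similar([0.9, 0.1], known_applets)
print(name, label, round(similarity, 3))
```

A real system might instead take a vote among several nearest neighbours or require the similarity to exceed a threshold before transferring a label.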
In the above method, a plurality of feature strings are sequentially extracted from one or more categories of program data of the applet, such as the package file structure of the applet, the static code files of the applet, and the dynamic running data of the applet; a feature string sequence of the applet is generated from the plurality of feature strings; and the feature string sequence is converted into a feature string vector and then input into a trained encoding model to generate the feature vector of the applet. A machine-recognizable vector that accurately expresses the features of the applet can thus be generated, which solves the technical problem that the features of an applet cannot be expressed.
In a second aspect, embodiments of the present application provide an apparatus for determining an applet feature vector, comprising: an extracting unit for sequentially extracting a plurality of character strings from program data of an applet, wherein the program data includes program data of at least one kind of: the package file structure of the applet, the static code file of the applet and the dynamic operation data of the applet; a first generation unit for generating a feature string sequence of the applet from the plurality of feature strings; the conversion unit is used for converting the characteristic character string sequence of the applet into a characteristic character string vector; and the second generation unit is used for inputting the characteristic character string vector into the trained deep learning model to generate the characteristic vector of the applet.
In one possible implementation manner, the conversion unit is further configured to replace each feature string in the feature string sequence with a corresponding digital index code according to a mapping relationship between the string and the digital index code in the preset index mapping table, so as to obtain a feature string vector.
In one possible implementation manner, in a case where the program data includes a plurality of categories, the first generating unit includes: the first extraction module is used for extracting the characteristic character strings which are not more than the preset number from the characteristic character strings corresponding to each kind of program data respectively; and the combination module is used for combining the extracted characteristic strings to obtain a characteristic string sequence.
In one possible implementation, the program data includes a package file structure of an applet, and the extracting unit includes: and the second extraction module is used for extracting the file name and the file type suffix of each file according to the structure sequence of the package file structure so as to obtain file name characteristic character strings of each file, wherein each file name characteristic character string comprises the file name and the file type suffix of the corresponding file.
In one possible implementation manner, the first generating unit includes: the third extraction module is used for extracting the character string of the suffix of the type of the target file from the file name characteristic character string obtained according to the package file structure so as to obtain the characteristic character string corresponding to the package file structure; the first generation module is used for generating a characteristic character string sequence according to the extracted character string.
In one possible implementation, the program data includes static code files of the applet, and the extracting unit includes: the selecting module is used for selecting a plurality of target code files from static code files of the applet; the matching module is used for matching a preset regular expression in each target code file, wherein the preset regular expression comprises one or more target character strings and a matching rule of each target character string; and the splitting module is used for splitting each hit code segment into a plurality of character strings to obtain a plurality of characteristic character strings.
In one possible implementation, the program data includes dynamic running data of the applet, and the extracting unit includes: the operation module is used for operating the applet; the grabbing module is used for grabbing requests generated in the running process of the applet; the matching module is used for matching preset character strings in the request, wherein each preset character string is used for representing the name of one type of information carried in the request; and the splitting module is used for splitting the hit request to obtain a plurality of characteristic character strings.
In one possible implementation manner, the apparatus further includes: the first determining unit is used for determining unrepeated character strings which appear in a plurality of characteristic character strings and do not appear in a preset index mapping table before the characteristic character string vector is obtained by the converting unit, so as to obtain an unknown character string; the distribution unit is used for distributing non-repeated digital index codes for each unknown character string; and the storage unit is used for storing the mapping relation between the unknown character strings and the corresponding digital index codes in the preset index mapping table so as to update the preset index mapping table.
In one possible implementation manner, the first generating unit includes: the computing module is used for computing word frequency-inverse text frequency index TF-IDF aiming at each character string in the updated preset index mapping table so as to obtain the score of each character string in the preset index mapping table; and the second generation module is used for generating a feature string sequence of the applet according to the feature strings with the scores exceeding the preset scores in the plurality of feature strings.
In one possible implementation manner, the apparatus further includes: a training unit, configured to train an encoding-decoding model with a plurality of training vectors before the second generation unit generates the feature vector of the applet, wherein each training vector is the feature string vector of one applet, the encoding-decoding model comprises an encoding model and a decoding model, the encoding model encodes a training vector to obtain an encoded output vector, the decoding model decodes the output vector of the encoding model to obtain a decoded output vector, and the optimization goal of the training is to reduce a loss value calculated from the decoded output vector and the training vector; and a second determining unit, configured to determine that the training convergence condition is reached so as to obtain the trained encoding model.
In one possible implementation manner, the apparatus further includes: and the third determining unit is used for determining the similarity between the applet and other applets according to the feature vector of the applet and the feature vector of other applets after the second generating unit generates the feature vector of the applet.
In one possible implementation manner, the apparatus further includes: the acquisition unit is used for acquiring preset labels of other applets after the similarity between the applets and the other applets is determined by the third determination unit; and a fourth determining unit for determining the label of the applet according to the preset labels of other applets.
In one possible implementation manner, the preset tag is used for marking whether the corresponding applet is a malicious applet.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions capable of performing the method provided in the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method provided in the first aspect.
It should be understood that, the second to fourth aspects of the embodiments of the present application are consistent with the technical solutions of the first aspect of the embodiments of the present application, and the beneficial effects obtained by each aspect and the corresponding possible implementation manner are similar, and are not repeated.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of one embodiment of a method of determining applet feature vectors in an embodiment of the present application;
FIG. 2 is a flow chart of another embodiment of a method of determining applet feature vectors in an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating one embodiment of an apparatus for determining applet feature vectors according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an electronic device according to the present application.
[ detailed description of the invention ]
For a better understanding of the technical solutions of the embodiments of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the embodiments of the present application, are within the scope of the embodiments of the present application.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Before an applet is released on the applet platform (also referred to as putting the applet on the shelf), it must be manually reviewed for legality and compliance, and it is put on the shelf only after it passes the review, so the review cost is high. Moreover, after the applet is put on the shelf, its developer may change the content (such as a website) stored on the server that the applet calls, so that the applet becomes non-compliant after release. Because applets depend on the applet platform, the templates used to develop them are similar, and non-compliant or malicious applets share similar features. The problem to be solved is therefore how to generate a representation that expresses the features of an applet more accurately.
The embodiments of the present application provide a method for determining an applet feature vector, which may be applied to an electronic device with computing and storage capabilities, such as a server, a workstation, a notebook computer, or a mobile communication terminal. The software module that implements the method may be offered to users through an entry in the form of a program, a client, a software platform, or the like. For example, when the entry is provided as a client, the user may download the client and upload applet data through it to a remote server (such as a cloud server), and the remote server performs the method for determining the applet feature vector provided in the embodiments of the present application. Other forms of entry are not described here; those skilled in the art can, based on the above example, provide the functions of the method to users through other types of entry.
FIG. 1 is a flowchart of one embodiment of a method for determining an applet feature vector according to an embodiment of the present application, as shown in FIG. 1, the method for determining an applet feature vector may include:
step 101, extracting a plurality of characteristic strings from the program data of the applet in sequence.
An applet is an application that can be used without being downloaded and installed; it runs on a designated platform (application). An applet may be developed by an applet developer for one application or for several compatible applications and, once developed, submitted to the administrator of the application for review. If it passes the review, it can be put on the shelf in the application, so that users of the application can enter the applet through an entry (e.g., an applet icon or an applet search result option) and use the functions it provides.
The program data of the applet is data related to the applet's program, and may include the file contents of the applet and the data generated when the applet runs. The file contents may be the file names and file type suffixes in the package file structure of the applet, the content of the code, and so on. The data generated at run time may be any data produced while the applet runs, such as the underlying methods of the terminal device system that it invokes and the requests it issues.
The program data of the applet may include at least one of the following categories: (1) the package file structure of the applet; (2) the static code files of the applet; (3) the dynamic running data of the applet. For example, the program data may include only the package file structure of the applet; or the static code files and the dynamic running data of the applet; or all three categories. The remaining combinations are not enumerated here.
In the case where the applet is developed using the Java language, the applet is used in the form of a Jar package, and the Jar package file of the applet has a certain structure, which is referred to as the package file structure in the embodiments of the present application; specifically, it may be a tree structure. The file names and suffixes in the package file structure carry certain information, and the package structure itself also carries certain information. Specifically, in the case where the program data includes the package file structure of the applet, when step 101 is performed, the file name and file type suffix of each file may be extracted in the structural order of the package file structure to obtain a file name feature string for each file, where each file name feature string includes the file name and the file type suffix of the corresponding file. The extraction in structural order may be based on depth-first search (DFS) or breadth-first search (BFS). For example, the file name feature strings may be index.js, webview.axml, body.jpg, title.png, etc., where the part before the "." is the file name and the part after the "." is the file type suffix.
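The depth-first extraction of file name feature strings can be sketched as below. The nested-dictionary representation of the package tree and the directory names are assumptions made for the example; a real package would be read from the file system or an archive.

```python
from typing import Dict, List, Optional

# Assumption: a directory is a dict of children, and a file maps to None.
Tree = Dict[str, Optional[dict]]


def filename_feature_strings(tree: Tree) -> List[str]:
    """Depth-first traversal that collects "name.suffix" for every file, in structure order."""
    result: List[str] = []
    for name, child in tree.items():  # dicts keep insertion order
        if child is None:
            result.append(name)  # a file: keep its name together with its suffix
        else:
            result.extend(filename_feature_strings(child))  # a directory: descend first
    return result


# Hypothetical package layout.
package = {
    "index.js": None,
    "pages": {"webview.axml": None},
    "images": {"body.jpg": None, "title.png": None},
}
file_features = filename_feature_strings(package)
print(file_features)  # ['index.js', 'webview.axml', 'body.jpg', 'title.png']
```

A breadth-first variant would visit all entries of one level before descending, which changes the order of the strings but not their content.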
The program data may include static code files of the applet, and correspondingly, the step 101 of sequentially extracting the plurality of feature strings from the program data of the applet may include the steps of:
step 111, selecting a plurality of target code files from the static code files of the applet.
One optional way to select the target code files is to preset several keywords and filter the file names of all static code files with them; the files that match are taken as the target code files. The keywords may be chosen so as to pick out the core code files among the static code files. This is because plain code text can be quite noisy, so features may be extracted only from the core code files, which may be the main entry configuration file, the home page presentation code file, and so on. Alternatively, the target code files may be selected by file size, taking a preset number of the largest files as the target code files.
Step 112, matching a preset regular expression in each target code file.
After the target code files are selected, a preset regular expression is matched in each target code file. The preset regular expression comprises one or more target strings and a matching rule for each target string. A regular expression describes a string-matching pattern that can be used to select, from all the strings in a target code file, those that meet predefined criteria. For example, since applets are closed in nature, the dynamic control logic typically uses the httprequest class, and the strings in a code segment that invokes the httprequest class can be matched by presetting a corresponding regular expression as follows:
"httprequest->url->success->if->setData->display0->else->setData->display1".
And step 113, splitting each hit code segment into a plurality of character strings to obtain a plurality of characteristic character strings.
After determining the hit code segment, the code segment can be split into a plurality of character strings according to space characters, carriage return characters and the like between every two adjacent words in the code to obtain a plurality of code characteristic character strings.
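Steps 111 to 113 can be sketched as follows. The keyword list, the file contents, and the regular expression (which simply captures any line mentioning httprequest) are hypothetical stand-ins; the actual preset expression and its matching rules are not given in this text.

```python
import re
from typing import Dict, List

# Assumption: a hit is any single line that mentions "httprequest".
PATTERN = re.compile(r"[^\n]*httprequest[^\n]*", re.IGNORECASE)


def select_target_files(files: Dict[str, str], keywords: List[str]) -> Dict[str, str]:
    """Step 111: keep only files whose name contains one of the preset keywords."""
    return {name: text for name, text in files.items() if any(k in name for k in keywords)}


def code_feature_strings(files: Dict[str, str], pattern: "re.Pattern[str]" = PATTERN) -> List[str]:
    """Steps 112 and 113: match the pattern, then split each hit on whitespace."""
    features: List[str] = []
    for text in files.values():
        for hit in pattern.findall(text):
            features.extend(hit.split())  # split on spaces, tabs, and line breaks
    return features


# Hypothetical static code files of an applet.
static_files = {
    "app.js": "var x = 1\nhttprequest url success\n",
    "util.js": "httprequest ignored",
}
targets = select_target_files(static_files, keywords=["app"])
code_features = code_feature_strings(targets)
print(code_features)  # ['httprequest', 'url', 'success']
```

Because util.js is filtered out in step 111, its contents never reach the matcher, which illustrates how restricting extraction to core files reduces noise.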
The program data may include dynamic running data of the applet, and correspondingly, the step 101 of sequentially extracting the plurality of feature strings from the program data of the applet may include the steps of:
step 121, running an applet.
Specifically, the applet may be run in a simulated running environment.
Step 122, the request generated during the applet running process is grabbed.
Step 123, matching preset character strings in the request, wherein each preset character string is used for representing the name of one information carried in the request;
step 124, splitting the hit request to obtain a plurality of feature strings.
A generated request may carry many kinds of information. To avoid interference from excessive information, only some types of information may be extracted; for example, only the header and response information may be obtained. Specifically, preset strings may be matched in the request, where a preset string is a string that may appear in the required information, and the request content in which a preset string appears is taken as a request feature string. For example, the request feature strings may be: head: www.zzryy.cn, response: status, etc.
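Steps 123 and 124 can be sketched as below, under the assumption that each captured request is available as text made of "name: value" lines. That textual format, the preset name list, and the cookie field are hypothetical; only the head and response examples come from the description.

```python
from typing import List

# Assumption: the preset strings are the names of the information types to keep.
PRESET_NAMES = ["head", "response"]


def request_feature_strings(requests: List[str], preset_names: List[str] = PRESET_NAMES) -> List[str]:
    """Keep, from every hit request, the "name: value" lines whose name is preset."""
    features: List[str] = []
    for request in requests:
        if not any(name in request for name in preset_names):
            continue  # the request does not hit any preset string
        for line in request.splitlines():  # split the hit request into its parts
            field_name = line.split(":", 1)[0].strip()
            if field_name in preset_names:
                features.append(line.strip())
    return features


# Hypothetical requests captured while the applet runs in a simulated environment.
captured = [
    "head: www.zzryy.cn\ncookie: abc\nresponse: status",
    "cookie: xyz",
]
request_features = request_feature_strings(captured)
print(request_features)  # ['head: www.zzryy.cn', 'response: status']
```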
Step 102, generating a feature string sequence of the applet according to the plurality of feature strings.
After the plurality of feature strings are extracted from the program data, they may be combined in the order of extraction to generate the feature string sequence. For example, if the sequentially extracted feature strings include index.js, webview.axml, body.jpg, title.png, the feature string sequence is {index.js, webview.axml, body.jpg, title.png}.
In an alternative embodiment, if the program data includes a plurality of categories, the feature strings of the respective categories may be concatenated in a preset order to generate the feature string sequence. For example, the sequentially extracted file name feature strings include index.js, webview.axml, body.jpg, title.png, and the sequentially extracted code feature strings include httprequest, url, success, if, setData, display0, else, setData, display1; with the file name feature strings placed first, the generated feature string sequence is: {index.js, webview.axml, body.jpg, title.png, httprequest, url, success, if, setData, display0, else, setData, display1}.
In an alternative embodiment, so that the feature string sequences of different applets are aligned, a preset number of feature strings may be selected to form the feature string sequence; if fewer than the preset number are available, a preset padding string (for example 0 or non, given here by way of example and not limitation) is used to fill the remaining positions. If the program data includes multiple kinds of program data, a preset number may be set for each kind. For example, 100 character strings are selected in order from each of the file name strings, the code strings, and the request strings, and any kind with fewer than 100 character strings is padded with non.
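The alignment described above (truncate to a preset number, otherwise pad) can be sketched as follows. The helper names `align` and `build_sequence` are hypothetical, and a preset number of 4 is used instead of 100 only to keep the example short.

```python
PAD = "non"  # padding string named in the text


def align(strings, size, pad=PAD):
    """Truncate to `size` items, or pad with `pad` up to `size` items."""
    kept = list(strings[:size])
    return kept + [pad] * (size - len(kept))


def build_sequence(categories, size):
    """Align each category separately, then concatenate in the given order."""
    sequence = []
    for strings in categories:
        sequence.extend(align(strings, size))
    return sequence


file_names = ["index.js", "webview.axml"]                           # fewer than 4
code_strings = ["httprequest", "url", "success", "if", "setData"]   # more than 4
sequence = build_sequence([file_names, code_strings], size=4)
```

Because every category always contributes exactly `size` positions, the same position in the sequences of two different applets always belongs to the same category.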
In an alternative embodiment, an index such as the term frequency-inverse document frequency (TF-IDF) may be computed in advance for each character string and used as the score of that string. Strings with higher scores are then selected from the plurality of feature strings according to the score, and the selected strings are combined to obtain the feature string sequence. Taking three kinds of program data ((1) the package file structure of the applet, (2) the static code files of the applet, (3) the dynamic operation data of the applet) as an example, TF-IDF is calculated as follows:
TF = (number of times the target string is hit in one code file or one request) / (number of times the target string appears in all strings);
IDF = log[(total number of files + total number of requests + 1) / (number of files or requests containing the target string)] + 1;
TF-IDF = TF × IDF.
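A literal reading of the three formulas above can be sketched as follows. Several points are assumptions made for the example: each code file or request is represented as a list of strings, the corpus is the list of all such files and requests, the logarithm is the natural logarithm, and a string that never occurs scores 0. The function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(target, document, corpus):
    """Score `target` within one document (a file or a request)."""
    count_in_doc = document.count(target)
    total_count = sum(doc.count(target) for doc in corpus)
    if total_count == 0:
        return 0.0  # assumption: an unseen string gets the lowest score
    tf = count_in_doc / total_count
    containing = sum(1 for doc in corpus if target in doc)
    # len(corpus) plays the role of "total files + total requests".
    idf = math.log((len(corpus) + 1) / containing) + 1
    return tf * idf


corpus = [["url", "if", "url"], ["if"], ["setData", "if"]]
score_url = tf_idf("url", corpus[0], corpus)
score_if = tf_idf("if", corpus[1], corpus)
```

In this toy corpus "url" occurs in only one document and therefore scores higher than "if", which occurs in every document.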
In an alternative embodiment, the program data includes the package file structure of the applet, and the feature strings extracted from the package file structure are file name feature strings, each comprising a file name and a file type suffix. When step 102 is executed, only the feature strings of a target file type of interest may be extracted. For example, in one application scenario, if an applet auditor finds that some applets contain illegal pictures, the auditor may be interested in the features of picture files, and the file name feature strings with the jpg and png file types need to be extracted from the file name feature strings. Specifically, in this alternative embodiment, step 102 may include the following steps:
Step 201, extracting, from the file name feature strings obtained according to the package file structure, the strings having the target file type suffix, to obtain the feature strings corresponding to the package file structure;
Step 202, generating a characteristic character string sequence according to the extracted character string.
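Steps 201 and 202 can be illustrated with a short sketch. The set `TARGET_SUFFIXES` and the function name `filter_by_suffix` are hypothetical; the jpg and png suffixes come from the picture-audit example above.

```python
from pathlib import PurePosixPath

TARGET_SUFFIXES = {".jpg", ".png"}  # picture types from the audit example


def filter_by_suffix(file_name_strings, suffixes=TARGET_SUFFIXES):
    """Keep, in their original order, the names with a target type suffix."""
    return [
        name
        for name in file_name_strings
        if PurePosixPath(name).suffix.lower() in suffixes
    ]


names = ["index.js", "webview.axml", "body.jpg", "title.png"]
picture_sequence = filter_by_suffix(names)
```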
Alternatively, step 102 may be implemented according to any one of the optional embodiments described above or according to a combination of several of them. For example, after feature strings are extracted in order from each of the three kinds of program data, the TF-IDF score of each feature string is determined; then, among the feature strings corresponding to each kind of program data, the feature strings whose scores rank in the top 20 are retained and the feature strings ranked below the top 20 are deleted, and at most 100 feature strings of each kind are retained.
For example, the feature strings extracted from the program data of category (1) are F, Y, D, I, N, C, I, A, T, ……, the feature strings extracted from the program data of category (2) are W, D, P, Q, B, X, D, ……, and the feature strings extracted from the program data of category (3) are R, U, S, F, A, T, D, Z, …….
Suppose the feature strings whose scores rank in the top 20 are A to T. The feature strings ranked below the top 20 are then deleted from the feature strings corresponding to each category of program data. After deletion, fewer than 100 feature strings remain in category (1), so the missing positions are padded with non; more than 100 feature strings remain in categories (2) and (3), so the feature strings after the 100th are deleted.
The remaining character strings are combined in the original order to obtain character string sequences { F, D, I, N, C, I, A, T, … …, non, non, … …, D, P, Q, B, D, … …, R, S, F, A, T, D, … … }.
Step 103, according to the mapping relation between character strings and digital index codes in a preset index mapping table, replacing each feature string in the feature string sequence with the digital index code of that string to obtain the feature string vector of the applet.
It should be noted that the feature string vector is a numeric vector that represents the feature string sequence: each feature string in the sequence is identified by a corresponding number, so that strings that may include letters, symbols, and the like are converted into a purely numeric sequence that the deep learning model can process.
An alternative implementation manner is that a plurality of mapping relations are stored through a preset index mapping table, each mapping relation is a corresponding relation between one character string and one digital index code, the character strings in different mapping relations are not repeated, and the digital index codes in different mapping relations are also not repeated. The number index code is a number. For example, the preset index mapping table may include the following mapping relationship:
{1:“index”,2:“webview”,3:“title”,……}
Here, the number before each colon is the digital index code, and the word after it is the corresponding feature string.
Then, according to the mapping relation between character strings and digital index codes in the preset index mapping table, each feature string in the feature string sequence is replaced with its corresponding digital index code to obtain the feature string vector.
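The replacement of feature strings by digital index codes can be sketched as follows. For lookup convenience the table is stored here in the inverted direction (string to code), and mapping the padding string non to the code 0 is an assumption made for the example; `to_vector` is a hypothetical name.

```python
# The example table, inverted so that a string can be looked up directly.
index_table = {"index": 1, "webview": 2, "title": 3}
PAD_CODE = 0  # assumption: the padding string is encoded as 0


def to_vector(sequence, table, pad="non"):
    """Replace each feature string with its digital index code."""
    return [PAD_CODE if s == pad else table[s] for s in sequence]


vector = to_vector(["webview", "index", "title", "non"], index_table)
```

A string missing from the table raises `KeyError` in this sketch, which is why the table would need to be updated with any unknown strings before the conversion is run.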
In an alternative embodiment, the preset index map is obtained before step 103 is performed.
Specifically, the step of obtaining the preset index mapping table may include:
Step 301, determining the non-repeated character strings that occur in the plurality of feature strings and do not occur in the preset index mapping table, to obtain unknown character strings.
The feature strings of the applet extracted in step 101 are de-duplicated, and the strings that already exist in the preset index mapping table are removed. One or more mapping relations may already be stored in the preset index mapping table; these may have been stored when the feature vectors of other applets were generated using the method provided by the embodiments of the present application. The strings that remain after de-duplication and do not exist in the preset index mapping table are the unknown character strings.
Step 302, assigning a non-duplicate numerical index code to each unknown string.
It should be noted that, the non-repeated digital index code refers to different digital index codes of different unknown character strings, and is different from the existing digital index codes in the preset index mapping table.
Step 303, storing the mapping relation between the unknown character string and the corresponding digital index code in the preset index mapping table to update the preset index mapping table.
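Steps 301 to 303 can be sketched as follows. The table is kept as a string-to-code dictionary, and assigning codes by counting up from the current maximum is one possible way (an assumption, not stated in the text) to guarantee that new codes differ from each other and from the existing ones; `update_index_table` is a hypothetical name.

```python
def update_index_table(feature_strings, table):
    """Add every unknown string to `table` with a fresh, unused code."""
    next_code = max(table.values(), default=0) + 1
    # dict.fromkeys de-duplicates while preserving extraction order (step 301).
    for s in dict.fromkeys(feature_strings):
        if s not in table:
            table[s] = next_code  # steps 302 and 303
            next_code += 1
    return table


table = {"index": 1, "webview": 2, "title": 3}
update_index_table(["title", "body", "url", "body"], table)
```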
Step 104, inputting the feature string vector into the trained deep learning model to generate the feature vector of the applet.
The feature vector may be regarded as a "fingerprint" of the applet, and the feature vectors of different applets are different. Optionally, the deep learning model may be the encoding model of an existing encoder-decoder model based on the Seq2Seq (sequence-to-sequence) framework or the Seq2Seq+Attention (sequence-to-sequence with attention) framework, or may be an existing neural network model based on the Transformer framework. The embodiments of the present application do not specifically limit this, and the model may be chosen according to the specific situation.
Taking the coding model as an example, the coding (encoder) model is the part of a coding-decoding (encoder-decoder) model that performs encoding; the coding-decoding model further includes a decoding (decoder) model. The coding model outputs a vector from an input vector (the elements of the input vector are fed in one by one, in order), and this output vector serves as the feature vector expressing the features of the applet. The feature vector output by the coding model is then input into the decoding model to obtain the output vector of the decoding model. The optimization objective when training the coding-decoding model is to reduce the loss value calculated from the output vector of the decoding model and the training vector input into the coding model.
The trained coding model is obtained by pre-training: at least before step 104 is executed, the coding-decoding model is trained using a plurality of training vectors. Specifically, each training vector is used in turn to train the coding-decoding model, and after each training pass the parameters of the coding-decoding model are adjusted according to the loss value between the output result (i.e., the vector output by the decoding model) and a target vector. The target vector may be the training vector input into the coding model, or a vector determined from that training vector, for example its reverse-order vector. When one of the training vectors is used to train the coding-decoding model, an alternative embodiment may include the following steps:
Step 401, selecting a training vector from the plurality of training vectors, and inputting the training vector into the current coding model to obtain an output feature vector, wherein each training vector is the feature string vector of one applet;
Step 402, inputting the feature vector output by the coding model into the decoding model;
Step 403, obtaining the vector output by the decoding model;
Step 404, determining the reverse-order vector of the training vector; for example, if the training vector is {12,31,56}, the reverse-order vector is {56,31,12};
Step 405, adjusting the weight parameters in the coding model and the decoding model according to the loss value between the vector output by the decoding model and the reverse-order vector of the training vector;
Step 406, determining whether a training convergence condition is reached; for example, the training convergence condition may be that a specified number of iterations has been completed, or that the loss value between the output result and the expected result is less than a preset threshold.
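The control flow of steps 401 to 406 can be sketched as a framework-independent loop. This is only a skeleton under stated assumptions: `encode`, `decode`, `loss_fn`, and `update` are hypothetical callables standing in for the model's forward passes, its loss, and its parameter adjustment, and the round-robin selection of training vectors is one possible selection policy.

```python
def train(training_vectors, encode, decode, loss_fn, update,
          max_iterations=100, loss_threshold=1e-3):
    """Repeat steps 401-405 until a convergence condition (step 406) holds."""
    loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        # Step 401: pick a training vector (round-robin) and encode it.
        vector = training_vectors[(iteration - 1) % len(training_vectors)]
        feature = encode(vector)
        # Steps 402-403: decode the feature vector.
        output = decode(feature)
        # Step 404: the target is the reverse-order training vector.
        target = vector[::-1]
        # Step 405: compute the loss and adjust the parameters.
        loss = loss_fn(output, target)
        update(loss)
        # Step 406: stop once the loss is below the threshold.
        if loss < loss_threshold:
            return iteration, loss
    return max_iterations, loss


def mean_squared_error(output, target):
    return sum((a - b) ** 2 for a, b in zip(output, target)) / len(target)
```

With a real model, `update` would typically run backpropagation and an optimizer step; here it is left abstract so that the loop itself stays independent of any particular deep learning library.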
Specifically, the loss value sequence_loss(S_n, S′_n) between the reverse-order vector S_n of the training vector and the output vector S′_n output by the decoding model may be calculated by the following formula:
[formula not reproduced in the source text]
The coding model and the decoding model may each include one or more neural network units. For example, the neural network unit may be a recurrent neural network (RNN) unit; or a long short-term memory (LSTM) unit, the LSTM being a kind of recurrent neural network; or a bidirectional recurrent neural network unit, such as a bidirectional long short-term memory (BiLSTM) unit or a bidirectional gated recurrent unit (BiGRU).
Optionally, an attention mechanism may be added to the above coding-decoding model, i.e., a coding-decoding model based on the Seq2Seq+Attention framework is adopted. The training process is similar to steps 401 to 406 above, except that in step 403, when the coding-decoding model outputs each element of the vector, a state vector corresponding to that element is introduced as one of the inputs. For the specific structure of a coding-decoding model with an attention mechanism, reference may be made to the existing related art, which is not described here.
The feature vector of the applet obtained after step 104 may be used to calculate the similarity with other applets. Specifically, the similarity of two applets may be the cosine of the angle between their feature vectors (the vector cosine value).
To identify applets that are similar to a known malicious applet, the vector cosine value between the feature vector of an unknown applet and the feature vector of the known malicious applet may be calculated. If the vector cosine value is close to 1 (i.e., the difference between the vector cosine value and 1 is less than a pre-specified threshold), the unknown applet is considered similar to the malicious applet and may itself be malicious. Alternatively, in another embodiment, the vector cosine value between the feature vector of the unknown applet and the feature vector of each applet in a library of applets whose malicious status is known is calculated, the n applets most similar to the unknown applet are determined by sorting the vector cosine values, and the number of malicious applets among these n applets is counted; if this number exceeds a preset number, the unknown applet is determined to be a malicious applet.
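Both similarity checks described above can be sketched with plain lists of floats. The threshold value 0.05, the representation of the library as (feature vector, is-malicious) pairs, and the function names are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Vector cosine value of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_similar(u, v, threshold=0.05):
    """True if the cosine value is within `threshold` of 1."""
    return 1.0 - cosine(u, v) < threshold


def knn_malicious(unknown, library, n, preset_number):
    """library: list of (feature_vector, is_malicious) pairs."""
    ranked = sorted(library, key=lambda item: cosine(unknown, item[0]),
                    reverse=True)
    hits = sum(1 for _, malicious in ranked[:n] if malicious)
    return hits > preset_number


library = [([1.0, 0.0], True), ([0.9, 0.1], True), ([0.0, 1.0], False)]
verdict = knn_malicious([1.0, 0.05], library, n=2, preset_number=1)
```

The second check is a nearest-neighbour vote: it tolerates a single accidental match better than comparing against one malicious applet alone.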
In the above method, a plurality of feature strings are sequentially extracted from one or more kinds of program data of the applet, such as the package file structure, the static code files, and the dynamic operation data of the applet; a feature string sequence of the applet is generated from the plurality of feature strings; and the feature string sequence is converted into a feature string vector and input into the trained coding model to generate the feature vector of the applet. A machine-recognizable vector that accurately expresses the features of the applet can thus be generated, which solves the technical problem that the features of an applet cannot be expressed.
Further, the present embodiments also provide an alternative embodiment of a method for determining an applet feature vector, as shown in fig. 2.
As shown in fig. 2, first, character strings are extracted for three kinds of program data (package file structure, static code file, dynamic running data) of an applet, respectively.
For the package file structure, a depth-first or breadth-first traversal is performed on the tree structure of the applet package file structure, and file name feature strings are obtained in traversal order, each comprising a file name and a file type suffix. Combining the file name feature strings yields the first-class sequence shown in fig. 2:
{′index.js′,′webview.js′,...,′index.axml′,′webview.axml′,...,′title.jpg′,′body.jpg′,...,′title.png′,′body.png′}
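The depth-first variant of the traversal can be sketched as follows. Representing the package as a nested dictionary in which `None` marks a file is an assumption made for the example, as are the directory names and the function name `collect_file_names`.

```python
def collect_file_names(tree):
    """Depth-first traversal; returns file names in visiting order."""
    names = []
    for name, child in tree.items():
        if child is None:          # a file: record name plus type suffix
            names.append(name)
        else:                      # a directory: descend before moving on
            names.extend(collect_file_names(child))
    return names


# Hypothetical package layout.
package = {
    "pages": {
        "index": {"index.js": None, "index.axml": None},
        "webview": {"webview.js": None, "webview.axml": None},
    },
    "images": {"title.jpg": None, "body.png": None},
}
file_name_strings = collect_file_names(package)
```

Whichever traversal is chosen, using the same one for every applet keeps the resulting sequences comparable.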
Alternatively, the file name feature string may be stored in a classified manner according to the file type suffix, for example, the file name feature string may be classified into a directory sequence, a js file sequence, an axml file sequence, and a picture file sequence (the file type suffix is a picture file of jpg or png, etc.), that is, in the first type sequence, the file name feature string is split into a plurality of sub-type sequences according to the file type suffix.
For the static code files, the core code files, including the main entry configuration file and the first-page display code file, are first determined. Code fragments matching a preset regular expression (for example, one containing httprequest) are then extracted from the core code files and split into a plurality of feature strings, which are combined to obtain the second-class sequence shown in fig. 2:
{′httprequest′,′url′,′success′,′if′,′setdata′,′display0′,′else′,′setdata′,′display1′,...}
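The matching and splitting of code fragments can be sketched with Python's `re` module. The pattern below is purely hypothetical (a call whose name contains httprequest, up to the first closing parenthesis), as is the choice to split fragments into lower-cased identifiers; the actual preset regular expression is not given in the text.

```python
import re

# Hypothetical preset regular expression.
PATTERN = re.compile(r"\w*httprequest\w*\s*\([^)]*\)", re.IGNORECASE)
# Hypothetical splitting rule: keep identifiers only.
TOKEN = re.compile(r"[A-Za-z_]\w*")


def extract_code_features(source):
    """Return the identifiers found inside every matched code fragment."""
    features = []
    for fragment in PATTERN.findall(source):
        features.extend(token.lower() for token in TOKEN.findall(fragment))
    return features


code = "my.httpRequest({url: api, success: done}); var x = 1;"
code_features = extract_code_features(code)
```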
The dynamic operation data comprises the requests generated while the applet is running. The requests are matched against target character strings (such as header, response, etc.) to obtain a plurality of request feature strings, which are combined to obtain the third-class sequence shown in fig. 2:
{′header:www.zzgryy.cn′,′header:47.91.249.40′,...,′response:status′,′response:show′,′response:font′,...}
After the first-class, second-class, and third-class sequences of the applet are obtained, a subset of the character strings is selected from each class of sequence, as shown in fig. 2.
For each class of sequence, the TF-IDF score of each character string is determined, the strings whose scores are lower than a preset score are deleted, and the strings whose scores exceed the preset score are retained. If fewer than 100 strings are retained for a class of sequence, the missing positions are padded with non; if more than 100 are retained, only 100 are kept.
After a subset of strings has been selected from each class of sequence, the selected strings are combined to obtain the feature string sequence, as shown in fig. 2.
A preset digital index code mapping table is obtained, which may be, for example:
{1:′index′,2:′webview′,3:′title′,4:′body′,5:′httprequest′,6:′url′,7:′success′,8:′if′,9:′setdata′,10:′display0′,....}
As shown in fig. 2, after the digital index code corresponding to each character string in the feature string sequence is determined according to the preset digital index code mapping table, the feature string vector is obtained, for example:
{2,5,6,8,......0,0,0,s_100,152,24,......0,0,0,s_200,255,826,145,......0,0,0}
As shown in fig. 2, after the feature string vector is obtained, it is input into the coding model, and the output is the feature vector of the applet.
The steps above, shown in fig. 2, are the process of obtaining the feature vector of the applet. Optionally, when the coding model is being trained, after the feature vector of an applet is obtained, the method further includes the following steps shown in fig. 2:
As shown in fig. 2, the feature vector is input into the decoding model, whose output is an output vector, and the loss value is calculated from the output vector and the feature string vector (i.e., the training vector) input into the coding model.
After each round of training, the parameters in the coding model and the decoding model may be adjusted according to the loss value, and the coding-decoding model with the adjusted parameters is used when training on the feature string vector of the next applet.
The foregoing has described certain embodiments of this application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Fig. 3 is a schematic structural diagram of an embodiment of an apparatus for determining an applet feature vector according to the present application. As shown in fig. 3, the apparatus for determining an applet feature vector may include: an extracting unit 10, a first generating unit 20, a conversion unit 30, and a second generating unit 40.
Wherein, the extracting unit 10 is configured to sequentially extract a plurality of feature strings from the program data of the applet, wherein the program data includes at least one kind of program data of the following categories: the package file structure of the applet, the static code file of the applet and the dynamic operation data of the applet; a first generation unit 20 for generating a feature string sequence of the applet from the plurality of feature strings; a conversion unit 30 for converting the feature string sequence of the applet into a feature string vector; the second generating unit 40 is configured to input the feature string vector into the trained encoding model to generate a feature vector of the applet.
Optionally, the conversion unit is further configured to replace each feature string in the feature string sequence with a corresponding digital index code according to a mapping relationship between the string and the digital index code in the preset index mapping table, so as to obtain a feature string vector.
Alternatively, in the case where the program data includes a plurality of categories, the first generating unit 20 includes: the first extraction module is used for extracting the characteristic character strings which are not more than the preset number from the characteristic character strings corresponding to each kind of program data respectively; and the combination module is used for combining the extracted characteristic strings to obtain a characteristic string sequence.
Optionally, the program data includes a package file structure of the applet, and the extracting unit 10 includes: and the second extraction module is used for extracting the file name and the file type suffix of each file according to the structure sequence of the package file structure so as to obtain file name characteristic character strings of each file, wherein each file name characteristic character string comprises the file name and the file type suffix of the corresponding file.
Optionally, the first generating unit 20 includes: the third extraction module is used for extracting the character string of the suffix of the type of the target file from the file name characteristic character string obtained according to the package file structure so as to obtain the characteristic character string corresponding to the package file structure; the first generation module is used for generating a characteristic character string sequence according to the extracted character string.
Optionally, the program data comprises static code files of the applet, and the extraction unit 10 comprises: the selecting module is used for selecting a plurality of target code files from static code files of the applet; the matching module is used for matching a preset regular expression in each target code file, wherein the preset regular expression comprises one or more target character strings and a matching rule of each target character string; and the splitting module is used for splitting each hit code segment into a plurality of character strings to obtain a plurality of characteristic character strings.
Optionally, the program data includes dynamic running data of the applet, and the extraction unit 10 includes: the operation module is used for operating the applet; the grabbing module is used for grabbing requests generated in the running process of the applet; the matching module is used for matching preset character strings in the request, wherein each preset character string is used for representing the name of one type of information carried in the request; and the splitting module is used for splitting the hit request to obtain a plurality of characteristic character strings.
Optionally, the apparatus further comprises: a first determining unit, configured to determine, before the converting unit 30 obtains the feature string vector, unrepeated strings that occur in the plurality of feature strings and that do not occur in the preset index mapping table, to obtain unknown strings; the distribution unit is used for distributing non-repeated digital index codes for each unknown character string; and the storage unit is used for storing the mapping relation between the unknown character strings and the corresponding digital index codes in the preset index mapping table so as to update the preset index mapping table.
Optionally, the first generating unit 20 includes: the computing module is used for computing word frequency-inverse text frequency index TF-IDF aiming at each character string in the updated preset index mapping table so as to obtain the score of each character string in the preset index mapping table; and the second generation module is used for generating a feature string sequence of the applet according to the feature strings with the scores exceeding the preset scores in the plurality of feature strings.
Optionally, the apparatus further comprises: a training unit, configured to train a coding-decoding model using a plurality of training vectors before the second generating unit 40 generates the feature vector of the applet, where each training vector is the feature string vector of one applet, the coding-decoding model includes a coding model for encoding the training vector to obtain a feature vector and a decoding model for decoding the feature vector output by the coding model to obtain an output vector, and the optimization objective of training the coding-decoding model is to reduce the loss value calculated from the output vector and the training vector; and a second determining unit, configured to determine that the training convergence condition is reached, so as to obtain the trained coding model.
Optionally, the apparatus further comprises: and a third determining unit for determining the similarity of the applet to other applets based on the feature vector of the applet and the feature vector of other applets after the second generating unit 40 generates the feature vector of the applet.
Optionally, the apparatus further comprises: the acquisition unit is used for acquiring preset labels of other applets after the similarity between the applets and the other applets is determined by the third determination unit; and a fourth determining unit for determining the label of the applet according to the preset labels of other applets.
Optionally, the preset tag is used for marking whether the corresponding applet is a malicious applet.
The apparatus for determining feature vectors of an applet provided by the embodiment shown in fig. 3 may be used to implement the technical solutions of the method embodiments shown in fig. 1 or 2 of the embodiments of the present application, and the principle and technical effects thereof may be further described with reference to the related descriptions in the method embodiments.
FIG. 4 is a schematic structural diagram of an embodiment of an electronic device according to the present application, where the electronic device may include at least one processor as shown in FIG. 4; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, which invokes the program instructions to perform the method for determining the applet feature vector provided in the embodiments of fig. 1-2 of the present application.
Fig. 4 shows a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application. It should be noted that the electronic device shown in fig. 4 is only an example, and should not impose any limitation on the functions and application scope of the embodiments of the present application.
As shown in fig. 4, the electronic device is in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: one or more processors 410, a memory 430, and a communication bus 440 that connects the different system components (including the memory 430 and the processor 410).
The communication bus 440 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic devices typically include a variety of computer system readable media. Such media can be any available media that can be accessed by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 430 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) and/or cache memory. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. Memory 430 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the present application.
A program/utility having a set (at least one) of program modules may be stored in the memory 430, such program modules including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules generally perform the functions and/or methods in the embodiments described herein.
Processor 410 executes programs stored in memory 430 to perform various functional applications and data processing, such as implementing the method of determining applet feature vectors provided by the embodiments of fig. 1-2 of the present application.
Embodiments of the present application provide a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the method for determining applet feature vectors provided by the embodiments shown in fig. 1-2 of the embodiments of the present application.
The non-transitory computer readable storage media described above may employ any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory; EPROM) or flash Memory, an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The foregoing has described certain embodiments of this application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the description of embodiments of the present application, reference to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In the embodiments of the present application, such uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in the embodiments of the present application, and the features of those embodiments or examples, may be combined by persons skilled in the art provided they do not contradict one another.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the embodiments of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred implementation of the embodiments of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to detection". Similarly, depending on the context, the phrase "if determined" or "if (a stated condition or event) is detected" may be interpreted as "when determined" or "in response to determination" or "when (the stated condition or event) is detected" or "in response to detection of (the stated condition or event)".
It should be noted that the terminal according to the embodiments of the present application may include, but is not limited to, a Personal Computer (PC), a Personal Digital Assistant (PDA), a wireless handheld device, a tablet computer, a mobile phone, an MP3 player, an MP4 player, and the like.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The foregoing description of the preferred embodiments is merely exemplary in nature and is not intended to limit the embodiments of the present application, so that any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the embodiments of the present application are intended to be included within the scope of the embodiments of the present application.

Claims (16)

1. A method of determining applet feature vectors, wherein the method comprises:
sequentially extracting a plurality of feature strings from program data of an applet, wherein the program data comprises a package file structure of the applet, a static code file of the applet and dynamic run data of the applet;
generating a feature string sequence of the applet according to the plurality of feature strings;
converting the feature string sequence of the applet into a feature string vector;
inputting the feature string vector into a trained deep learning model to generate the feature vector of the applet, wherein the deep learning model is a coding model used for coding in a coding and decoding model;
before the feature string vector is input into a trained deep learning model to generate the feature vector of the applet, the encoding and decoding model is obtained by using the following training method:
training a coding and decoding model by using a plurality of training vectors, wherein each training vector is the feature string vector of one applet, the coding and decoding model comprises a coding model and a decoding model, the coding model is used for coding the feature string vector to obtain the feature vector, the decoding model is used for decoding the feature vector to obtain an output vector, and the optimization goal of training the coding and decoding model is to reduce a loss value calculated according to the output vector and the training vector;
determining that a training convergence condition is reached to obtain a trained coding model;
after inputting the feature string vector into the trained deep learning model to generate the feature vector of the applet, the method further comprises:
calculating the cosine of the angle between the feature vector of the applet and the feature vector of a malicious applet, and, if the difference between the cosine value and 1 is smaller than a pre-specified threshold, determining that the applet is a malicious applet.
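The final step of claim 1 can be illustrated with a minimal sketch. The function names and the threshold value `0.05` are assumptions made for illustration only; the claim states just that the difference between the cosine value and 1 is compared with a pre-specified threshold.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)

def is_malicious(applet_vec, malicious_vec, threshold=0.05):
    """Flag the applet when |1 - cosine| is below the threshold (value is hypothetical)."""
    return abs(1.0 - cosine(applet_vec, malicious_vec)) < threshold
```

Vectors pointing in nearly the same direction give a cosine close to 1, so a small threshold selects applets whose feature vectors are nearly parallel to that of a known malicious applet.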
2. The method of claim 1, wherein the converting the sequence of feature strings of the applet into a feature string vector comprises:
replacing each feature string in the feature string sequence with its corresponding numeric index code, according to the mapping between strings and numeric index codes in a preset index mapping table, so as to obtain the feature string vector.
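As a rough sketch of the conversion in claim 2, the index mapping table can be modelled as a dictionary. The function name, the sample table, and the fallback code `0` for strings missing from the table are assumptions, not details given in the claim.

```python
def to_index_vector(sequence, index_table, unknown_code=0):
    """Replace each feature string with its numeric index code.

    `unknown_code` is a hypothetical fallback for strings absent from the table.
    """
    return [index_table.get(s, unknown_code) for s in sequence]

# Hypothetical mapping table and feature string sequence.
table = {"app.js": 1, "app.json": 2}
vector = to_index_vector(["app.json", "app.js", "other"], table)
```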
3. The method of claim 1, wherein, in a case where the program data includes a plurality of categories, the generating the feature string sequence of the applet from the plurality of feature strings includes:
extracting no more than a preset number of feature strings from the feature strings corresponding to each category of program data;
and combining the extracted feature strings to obtain the feature string sequence.
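The per-category cap in claim 3 might look like the sketch below. The category keys `"package"`, `"static"` and `"dynamic"` and the fixed concatenation order are assumptions chosen for illustration.

```python
def build_sequence(strings_by_category, max_per_category):
    """Take at most `max_per_category` strings per category, then concatenate."""
    sequence = []
    # Hypothetical category keys, visited in a fixed order.
    for category in ("package", "static", "dynamic"):
        sequence.extend(strings_by_category.get(category, [])[:max_per_category])
    return sequence

sample = {
    "package": ["app.js", "app.json", "index.axml"],
    "static": ["my", "request"],
    "dynamic": ["api.example.com"],
}
sequence = build_sequence(sample, 2)
```

Capping each category keeps one very large source, such as a long static code file, from crowding the other categories out of the sequence.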
4. A method according to any one of claims 1-3, wherein the program data comprises a package file structure of the applet, the extracting a plurality of feature strings in sequence in the applet program data comprising:
extracting the file name and the file type suffix of each file according to the structural order of the package file structure to obtain a file name feature string for each file, wherein each file name feature string comprises the file name and the file type suffix of the corresponding file.
5. The method of claim 4, wherein the generating the feature string sequence of the applet from the plurality of feature strings comprises:
extracting the strings having a target file type suffix from the file name feature strings obtained according to the package file structure, to obtain the feature strings corresponding to the package file structure;
and generating the feature string sequence according to the extracted strings.
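Claims 4 and 5 can be sketched as follows, assuming the package file structure is available as a list of relative paths already in structural order. The sample paths and the set of target suffixes are hypothetical.

```python
from pathlib import PurePosixPath

def file_name_features(paths):
    """Return 'name.suffix' for each file, preserving the given order."""
    return [PurePosixPath(p).name for p in paths]

def filter_by_suffix(features, target_suffixes):
    """Keep only the feature strings whose file type suffix is a target suffix."""
    return [f for f in features if PurePosixPath(f).suffix in target_suffixes]

# Hypothetical package listing.
paths = ["app.js", "app.json", "pages/index/index.axml", "assets/logo.png"]
features = file_name_features(paths)
kept = filter_by_suffix(features, {".js", ".json", ".axml"})
```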
6. A method according to any one of claims 1-3, wherein the program data comprises a static code file of the applet, the extracting a plurality of feature strings in sequence in the applet program data comprising:
selecting a plurality of target code files from the static code files of the applet;
matching a preset regular expression in each target code file, wherein the preset regular expression comprises one or more target character strings and a matching rule of each target character string;
and splitting each hit code segment into a plurality of strings to obtain a plurality of feature strings.
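A minimal sketch of the matching and splitting in claim 6 is given below. The regular expression, which looks for calls on an assumed global object named `my`, and the rule of splitting on non-identifier characters are both illustrative assumptions; the claim does not specify either.

```python
import re

# Hypothetical preset regular expression: calls such as my.request(...).
PATTERN = re.compile(r"\bmy\.\w+\([^)]*\)")
# Hypothetical splitting rule: break on anything that is not an identifier character.
SPLIT = re.compile(r"[^A-Za-z0-9_]+")

def static_code_features(source):
    """Split every code segment hit by PATTERN into feature strings."""
    features = []
    for match in PATTERN.finditer(source):
        features.extend(tok for tok in SPLIT.split(match.group(0)) if tok)
    return features

tokens = static_code_features("my.getLocation(); foo(); my.request(opts);")
```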
7. A method according to any one of claims 1-3, wherein the program data comprises dynamic run data of the applet, the extracting a plurality of feature strings in sequence in the applet program data comprising:
running the applet;
grabbing a request generated in the running process of the applet;
matching preset character strings in the request, wherein each preset character string is used for representing the name of one type of information carried in the request;
splitting the hit request to obtain the plurality of characteristic strings.
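The matching and splitting steps of claim 7 can be sketched on requests that have already been captured. Running the applet and capturing its traffic is outside this sketch. Treating the preset strings as query parameter names, and splitting a hit request into host, path segments and parameter names, are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

PRESET_NAMES = ("phone", "token")  # hypothetical names of carried information

def dynamic_features(request_urls, preset_names=PRESET_NAMES):
    """Split every request that carries a preset parameter name into feature strings."""
    features = []
    for url in request_urls:
        parts = urlsplit(url)
        params = parse_qsl(parts.query, keep_blank_values=True)
        if not any(name in preset_names for name, _ in params):
            continue  # the request does not hit any preset string
        features.append(parts.netloc)
        features.extend(seg for seg in parts.path.split("/") if seg)
        features.extend(name for name, _ in params)
    return features

captured = [
    "https://api.example.com/user/info?phone=123&lang=en",
    "https://cdn.example.com/logo.png",
]
hits = dynamic_features(captured)
```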
8. A method according to claim 2 or 3, wherein before replacing each feature string in the feature string sequence with its corresponding numeric index code according to the mapping between strings and numeric index codes in the preset index mapping table, the method further comprises:
determining the distinct strings that appear in the plurality of feature strings but do not appear in the preset index mapping table, so as to obtain unknown strings;
assigning a unique numeric index code to each unknown string;
and storing the mapping between the unknown strings and their corresponding numeric index codes in the preset index mapping table so as to update the preset index mapping table.
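The table update in claim 8 might be sketched as below. Allocating new codes by counting upward from the largest existing code is an assumption; the claim requires only that each unknown string receives a code that is not already in use.

```python
def update_index_table(index_table, feature_strings):
    """Add each string missing from the table under a fresh numeric index code."""
    next_code = max(index_table.values(), default=0) + 1
    for s in dict.fromkeys(feature_strings):  # de-duplicate while keeping order
        if s not in index_table:
            index_table[s] = next_code
            next_code += 1
    return index_table

table = update_index_table({"app.js": 1}, ["app.js", "my", "request", "my"])
```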
9. The method of claim 8, wherein the generating the feature string sequence of the applet from the plurality of feature strings comprises:
calculating a term frequency-inverse document frequency (TF-IDF) value for each string in the updated preset index mapping table to obtain a score for each string in the preset index mapping table;
and generating the feature string sequence of the applet from those feature strings, among the plurality of feature strings, whose scores exceed a preset score.
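The scoring in claim 9 can be illustrated with a small TF-IDF sketch. The smoothed inverse document frequency `ln((1 + N) / (1 + df)) + 1` is one common variant and is an assumption here, as are the sample corpus and the preset score `0.4`; the claim does not fix a particular formula.

```python
import math
from collections import Counter

def tfidf_scores(doc, corpus):
    """TF-IDF score of each distinct string in `doc`, relative to `corpus`."""
    n_docs = len(corpus)
    doc_freq = Counter(s for other in corpus for s in set(other))
    counts = Counter(doc)
    return {
        s: (c / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[s])) + 1.0)
        for s, c in counts.items()
    }

def select_features(doc, corpus, min_score):
    """Keep, in order, the feature strings whose score exceeds `min_score`."""
    scores = tfidf_scores(doc, corpus)
    return [s for s in doc if scores[s] > min_score]

# Hypothetical corpus: one list of feature strings per applet.
corpus = [["app.js", "my", "pay"], ["app.js", "my"], ["app.js", "login"]]
scores = tfidf_scores(corpus[0], corpus)
selected = select_features(corpus[0], corpus, 0.4)
```

Strings that occur in every applet, such as a common entry file name, score lowest and are filtered out first.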
10. A method according to any of claims 1-3, wherein after inputting the feature string vector into a trained deep learning model to generate a feature vector for the applet, the method further comprises:
determining the similarity between the applet and other applets according to the feature vector of the applet and the feature vectors of the other applets.
11. The method of claim 10, wherein after determining the similarity of the applet to other applets, the method further comprises:
acquiring preset labels of the other applets;
and determining the label of the applet according to the preset labels of the other applets.
12. The method of claim 11, wherein the preset tag is used to mark whether the corresponding applet is a malicious applet.
13. A method according to any of claims 1-3, wherein after inputting the feature string vector into a trained deep learning model to generate a feature vector for the applet, the method further comprises:
determining a plurality of similarities between the applet and a plurality of other applets according to the feature vector of the applet and the feature vectors of the other applets;
ranking the plurality of other applets based on the plurality of similarities;
and determining whether the applet is a malicious applet according to the number of malicious applets among a preset number of applets in the ranking.
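The ranking in claim 13 resembles a nearest-neighbour vote and might be sketched as below. Using cosine similarity, taking the most similar `k` applets as the preset number, and the vote threshold `min_malicious` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_is_malicious(applet_vec, labeled_vecs, k=3, min_malicious=2):
    """Rank labeled applets by similarity and count malicious ones among the first k.

    `labeled_vecs` is a list of (feature_vector, is_malicious) pairs;
    `k` and `min_malicious` are hypothetical preset values.
    """
    ranked = sorted(labeled_vecs,
                    key=lambda item: cosine(applet_vec, item[0]),
                    reverse=True)
    return sum(1 for _, bad in ranked[:k] if bad) >= min_malicious

labeled = [
    ([1.0, 0.0], True),
    ([0.9, 0.1], True),
    ([0.0, 1.0], False),
    ([0.1, 0.9], False),
]
```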
14. An apparatus for determining an applet feature vector, wherein the apparatus comprises:
an extraction unit configured to sequentially extract a plurality of feature strings from program data of an applet, wherein the program data comprises a package file structure of the applet, a static code file of the applet and dynamic run data of the applet;
a first generation unit configured to generate a feature string sequence of the applet from the plurality of feature strings;
a conversion unit for converting the feature string sequence of the applet into a feature string vector;
a second generation unit configured to input the feature string vector into a trained deep learning model to generate the feature vector of the applet, wherein the deep learning model is a coding model used for coding in a coding and decoding model;
a training unit configured to train a coding and decoding model by using a plurality of training vectors before the second generation unit generates the feature vector of the applet, wherein each training vector is the feature string vector of one applet, the coding and decoding model comprises a coding model and a decoding model, the coding model is used for coding the training vectors to obtain output vectors, the decoding model is used for decoding the output vectors of the coding model to obtain output vectors, and the optimization goal of training the coding and decoding model is to reduce a loss value calculated according to the output vectors and the training vectors; and a second determining unit configured to determine that a training convergence condition is reached, so as to obtain a trained coding model;
wherein the second generation unit is further configured to calculate the cosine of the angle between the feature vector of the applet and the feature vector of a malicious applet, and to determine that the applet is a malicious applet if the difference between the cosine value and 1 is smaller than a pre-specified threshold.
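The training described for the training unit can be illustrated with a deliberately small stand-in: a linear coding model and a linear decoding model trained by full-batch gradient descent on the squared reconstruction error. The linear layers, the learning rate, the hidden size, and the fixed epoch count used as the convergence condition are all assumptions; the claims do not specify the model architecture or the loss function.

```python
import random

def encode(enc, x):
    """Coding model: project the input vector onto the hidden dimensions."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in enc]

def train_autoencoder(vectors, hidden=2, lr=0.1, epochs=300, seed=0):
    """Train a linear coder/decoder pair; return the coder and the loss history."""
    rng = random.Random(seed)
    dim, n = len(vectors[0]), len(vectors)
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
    losses = []
    for _ in range(epochs):  # assumption: a fixed epoch count stands in for convergence
        g_enc = [[0.0] * dim for _ in range(hidden)]
        g_dec = [[0.0] * hidden for _ in range(dim)]
        loss = 0.0
        for x in vectors:
            h = encode(enc, x)  # feature vector
            y = [sum(dec[i][j] * h[j] for j in range(hidden)) for i in range(dim)]
            err = [y[i] - x[i] for i in range(dim)]
            loss += sum(e * e for e in err)
            # Gradient of the squared error with respect to h, then both weight matrices.
            back = [sum(2 * err[i] * dec[i][j] for i in range(dim)) for j in range(hidden)]
            for i in range(dim):
                for j in range(hidden):
                    g_dec[i][j] += 2 * err[i] * h[j]
                    g_enc[j][i] += back[j] * x[i]
        for i in range(dim):
            for j in range(hidden):
                dec[i][j] -= lr * g_dec[i][j] / n
                enc[j][i] -= lr * g_enc[j][i] / n
        losses.append(loss / n)  # mean loss measured before this epoch's update
    return enc, losses

# Hypothetical training vectors, scaled to [0, 1].
vectors = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1],
           [0.5, 0.5, 0.5, 0.5], [0.1, 0.2, 0.3, 0.4]]
enc, losses = train_autoencoder(vectors)
feature_vector = encode(enc, vectors[0])
```

After training, only the coding model is kept: `encode` maps a feature string vector to a lower-dimensional feature vector, which is the quantity compared by cosine value in the claims.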
15. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-13.
16. A non-transitory computer readable storage medium, wherein the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform the method of any one of claims 1 to 13.
CN202110926708.XA 2020-04-24 2020-04-24 Method and device for determining feature vector of applet and electronic equipment Active CN113656763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110926708.XA CN113656763B (en) 2020-04-24 2020-04-24 Method and device for determining feature vector of applet and electronic equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010334290.9A CN111241496B (en) 2020-04-24 2020-04-24 Method and device for determining small program feature vector and electronic equipment
CN202110926708.XA CN113656763B (en) 2020-04-24 2020-04-24 Method and device for determining feature vector of applet and electronic equipment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202010334290.9A Division CN111241496B (en) 2020-04-24 2020-04-24 Method and device for determining small program feature vector and electronic equipment

Publications (2)

Publication Number Publication Date
CN113656763A CN113656763A (en) 2021-11-16
CN113656763B true CN113656763B (en) 2024-01-09

Family

ID=70867606

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010334290.9A Active CN111241496B (en) 2020-04-24 2020-04-24 Method and device for determining small program feature vector and electronic equipment
CN202110926708.XA Active CN113656763B (en) 2020-04-24 2020-04-24 Method and device for determining feature vector of applet and electronic equipment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010334290.9A Active CN111241496B (en) 2020-04-24 2020-04-24 Method and device for determining small program feature vector and electronic equipment

Country Status (1)

Country Link
CN (2) CN111241496B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783095A (en) * 2020-07-28 2020-10-16 支付宝(杭州)信息技术有限公司 Method and device for identifying malicious code of applet and electronic equipment
CN113064627B (en) * 2021-03-23 2023-04-07 支付宝(杭州)信息技术有限公司 Service access data processing method, platform, terminal, equipment and system
CN114860673B (en) * 2022-07-06 2022-09-30 南京聚铭网络科技有限公司 Log feature identification method and device based on dynamic and static combination

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299273A (en) * 2018-11-02 2019-02-01 广州语义科技有限公司 Based on the multi-source multi-tag file classification method and its system for improving seq2seq model
CN110489555A (en) * 2019-08-21 2019-11-22 创新工场(广州)人工智能研究有限公司 A kind of language model pre-training method of combination class word information

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975857A (en) * 2015-11-17 2016-09-28 武汉安天信息技术有限责任公司 Method and system for deducing malicious code rules based on in-depth learning method
CN107885995A (en) * 2017-10-09 2018-04-06 阿里巴巴集团控股有限公司 The security sweep method, apparatus and electronic equipment of small routine
CN108959924A (en) * 2018-06-12 2018-12-07 浙江工业大学 A kind of Android malicious code detecting method of word-based vector sum deep neural network
CN110858288A (en) * 2018-08-24 2020-03-03 中国移动通信集团浙江有限公司 Abnormal behavior identification method and device
CN110059468B (en) * 2019-04-02 2023-09-26 创新先进技术有限公司 Applet risk identification method and device
CN110119621B (en) * 2019-05-05 2020-08-21 网御安全技术(深圳)有限公司 Attack defense method, system and defense device for abnormal system call
CN110414238A (en) * 2019-06-18 2019-11-05 中国科学院信息工程研究所 The search method and device of homologous binary code
CN110348214B (en) * 2019-07-16 2021-06-08 电子科技大学 Method and system for detecting malicious codes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
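Research on host malicious code detection technology based on machine learning algorithms; Zhang Dong et al.; Chinese Journal of Network and Information Security; Vol. 3 (No. 7); pp. 25-32 *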

Also Published As

Publication number Publication date
CN111241496B (en) 2021-06-29
CN113656763A (en) 2021-11-16
CN111241496A (en) 2020-06-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230105

Address after: 200120 Floor 15, No. 447, Nanquan North Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Alipay.com Co.,Ltd.

Address before: 310000 801-11 section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province

Applicant before: Alipay (Hangzhou) Information Technology Co.,Ltd.

GR01 Patent grant