CN113378167A

CN113378167A - Malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit

Info

Publication number: CN113378167A
Application number: CN202110737511.1A
Authority: CN
Inventors: 杨明极; 赵艺博
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-09-10

Abstract

A malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit (GRU), belonging to the field of software detection. Traditional Android defense mechanisms struggle to cope with the rapid growth in the number and variety of malware. In the method, apktool is used to decompile the files of the software sample set under test to obtain the decompiled resource files of each application; feature sets are extracted from the decompiled resource files; the extracted feature sets are sorted by usage count from low to high, the high-frequency features are selected and merged into one feature set, and that feature set is quantized; and a gated recurrent unit processes the features that change over time in order to detect dynamic features. The invention can effectively detect malware that uses obfuscation techniques and improves detection accuracy.

Description

Malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit

Technical Field

The invention relates to a malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit (GRU).

Background

In recent years, the rapid development of the mobile internet has made smartphones a basic information device for the general public. Smartphones are now used for photography, map navigation, instant messaging, online payment, online shopping, entertainment, online learning, and more. As a result, they store a great deal of personal information: private data such as personal photos and call records; accounts such as online banking and social accounts; and device information such as location and phone number. Because a smartphone is connected to the network in real time, a user's personal information is easily leaked and exploited by malicious applications, which makes the smartphone a security risk.

Smartphone operating systems are dominated by Android and iOS. Traditional Android defense mechanisms struggle to cope with the rapid growth in the number and variety of malware, and the Android platform is vulnerable to new and unknown malware. Android smartphones are widely adopted, but because the underlying system source code is fully open, the security review mechanisms of third-party app stores are not standardized, and users' awareness of information security is weak, malware is rampant on the Android platform and the number of infected users grows daily. This poses a serious threat to users' personal privacy and property; with the rise of mobile payment in recent years, a secure mobile payment environment has become especially important. Because Android is open source, Android malware detection has long been a hotspot of network security research. Obfuscation techniques are now widely used in Android malware, and detection of obfuscated Android malware is poor. To protect the Android operating system, an effective Android malware detection system is needed to improve the increasingly severe security situation facing the Android system and its users.

Disclosure of Invention

Because Android is open source, Android malware detection has long been a hotspot of network security research. Obfuscation techniques are now widely used in Android malware, and detection of obfuscated Android malware is poor. To address these problems, the invention provides a malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit (GRU).

The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit is realized by the following steps:

Step one: decompile the files of the software sample set under test with apktool to obtain the decompiled resource files of each application, which include the AndroidManifest.xml manifest file and the smali bytecode files;

Step two: extract feature sets from the decompiled resource files, namely a Permission set, an Intent set, and a sensitive API set; sort the extracted feature sets by usage count from low to high, select the high-frequency features, merge them into one feature set, and quantize that feature set; wherein:

the Permission set and the Intent set are obtained from the AndroidManifest.xml file;

the sensitive API set is obtained from the smali files; the classes.dex file is decompiled and analyzed with the baksmali tool to obtain the API interfaces that are called, where chmod is a sensitive API for changing user permissions; the sensitive API features are obtained by analyzing class.

Step three: process the features that change over time with a gated recurrent unit in order to detect dynamic features.
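
The manifest part of step two can be illustrated with a minimal Python sketch. It assumes apktool has already decoded the APK so that AndroidManifest.xml is plain XML; the function name `extract_manifest_features` and the sample manifest are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# Namespace that Android uses for attributes such as android:name.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_manifest_features(manifest_xml: str):
    """Return (permissions, intent actions) found in a decoded manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = ANDROID_NS + "name"
    # <uses-permission android:name="..."> gives the Permission set.
    permissions = sorted({
        el.get(name_attr) for el in root.iter("uses-permission") if el.get(name_attr)
    })
    # <action android:name="..."> inside intent filters gives the Intent set.
    intents = sorted({
        el.get(name_attr) for el in root.iter("action") if el.get(name_attr)
    })
    return permissions, intents

# Hypothetical sample input, not taken from any real application.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.demo">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.READ_CONTACTS"/>
  <application>
    <receiver android:name=".BootReceiver">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>"""

permissions, intents = extract_manifest_features(SAMPLE_MANIFEST)
```

Each sample can then be quantized as a binary vector over the selected high-frequency permissions and intents.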

Preferably, extracting the feature set from the decompiled resource files in step two specifically comprises: processing the static features with association rule mining combined with the TF-IDF algorithm, removing strongly correlated features, extracting the static features, and weighting the naive Bayes algorithm, thereby reducing the feature dimension and removing redundant features; the static features include requested permissions, components, intents, and sensitive APIs.

Preferably, processing the time-varying features with the gated recurrent unit in step three to detect dynamic features specifically comprises: installing the Xposed framework on an emulator or a mobile device to obtain root privileges, and extracting the dynamic features with the automated testing tool MonkeyRunner together with the dynamic analysis tool Inspeckage;

the dynamic features are the behavioral features of the Android application at run time, including file read/write operations, call requests, SMS requests, data encryption and decryption operations, network data input and output, and reading of private information; these behaviors express the behavioral intent of the application.

Preferably, processing the static features with association rule mining and the TF-IDF algorithm comprises the following specific steps:

(1) constructing a feature weighting algorithm based on the TF-IDF algorithm;

the goal is to find text items that discriminate well between categories: if an item occurs frequently in one document and rarely in the other documents of the document set, it is more important, i.e. it discriminates well between categories. The TF-IDF weight is calculated as:

$$\mathrm{Weight} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \tag{1-1}$$

that is, the TF-IDF value of the i-th text item is the product of its TF value and its IDF value. The term frequency TF is given by formula (1-2), where $N_{i,j}$ is the number of occurrences of the i-th text item in the j-th document and $\sum_k N_{k,j}$ is the total number of occurrences of all items in the j-th document:

$$\mathrm{TF}_{i,j} = \frac{N_{i,j}}{\sum_k N_{k,j}} \tag{1-2}$$

the inverse document frequency IDF is given by formula (1-3), where $D$ is the total number of documents in the document set and $D_i$ is the number of documents in which the i-th text item appears; 1 is added to $D_i$ so that the denominator cannot be zero:

$$\mathrm{IDF}_i = \log\frac{D}{1 + D_i} \tag{1-3}$$

each sensitive API is treated as a text item, the set of all sensitive APIs called by one sample as a document, and the whole software sample library as the document set, and the weight of each sensitive API feature is calculated; the higher the weight, the more commonly the sensitive API is called in malware, which yields a weighted call feature vector of sensitive APIs; sensitive APIs with larger weights are kept and those with smaller weights are removed, thereby reducing the feature dimension of permissions, sensitive APIs, and so on; the result is a feature vector carrying weight information, and this weight information is an important basis for the subsequent improvement of the Bayesian algorithm;
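
The weighting in step (1) can be sketched in Python as follows. This is a minimal illustration, assuming the natural logarithm and the add-one denominator described for the inverse document frequency; the function name `tfidf_weights` and the API names in the sample data are illustrative, not taken from the source.

```python
import math
from collections import Counter

def tfidf_weights(samples):
    """samples: one list of sensitive API calls per APK (one 'document' each).

    Returns one dict per sample mapping API name -> TF * IDF.
    """
    n_docs = len(samples)
    # D_i: number of samples that call API i at least once.
    doc_freq = Counter(api for calls in samples for api in set(calls))
    weights = []
    for calls in samples:
        counts = Counter(calls)
        total = sum(counts.values())  # all API calls in this sample
        weights.append({
            api: (n / total) * math.log(n_docs / (1 + doc_freq[api]))
            for api, n in counts.items()
        })
    return weights

# Illustrative data: four samples and the sensitive APIs each one calls.
samples = [
    ["sendTextMessage", "sendTextMessage", "chmod"],
    ["chmod"],
    ["getDeviceId"],
    ["chmod"],
]
weights = tfidf_weights(samples)
```

In this data `chmod` appears in three of four samples, so its weight in the first sample is zero, while `sendTextMessage` is frequent in one sample and rare elsewhere and receives a positive weight. With the add-one denominator, an API present in every sample gets a slightly negative weight, which a thresholding step would discard.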

(2) feature selection based on association rules:

a layer-by-layer iterative search is used, i.e. the (k+1)-permission sets are explored from the k-permission sets;

first, the permission test set is scanned, the number of occurrences of each permission is counted, and the permissions that meet the minimum support are collected to form the set of frequent 1-permission sets, denoted Q1;

then Q1 is used to find the set Q2 of frequent 2-permission sets, and so on, until no frequent k-permission set can be found;

association rules are obtained with the Apriori algorithm, and the strength of a rule is measured by its support and confidence, which are defined as:

support:

$$\mathrm{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}$$

confidence:

$$\mathrm{confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$$

where X and Y are subsets of the permission set, $\sigma(X)$ is the number of APKs that contain the permission subset X, and N is the total number of APKs in the data set; support measures the proportion of the whole data set that contains the subset, and confidence measures the proportion of the samples containing X that also contain Y; if the rules meeting the set support and confidence thresholds include both X → Y and Y → X, only one of the two permission subsets is kept, which greatly weakens the correlation among the permissions;
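
A minimal Python sketch of this level-wise search and of the mutual-rule pruning follows. It is a simplified illustration rather than a full Apriori implementation: candidates are generated by joining frequent sets and then counted directly, and the choice of which permission to drop is an arbitrary alphabetical tie-break. The function names, thresholds, and sample data are assumptions.

```python
from itertools import combinations

def frequent_sets(apks, min_support):
    """Level-wise search: frequent k-sets generate candidate (k+1)-sets."""
    n = len(apks)

    def support(items):
        # Fraction of APKs whose permission set contains all of `items`.
        return sum(1 for perms in apks if items <= perms) / n

    level = {frozenset([p]) for perms in apks for p in perms}
    level = {s for s in level if support(s) >= min_support}
    frequent = {}
    while level:
        frequent.update({s: support(s) for s in level})
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent

def redundant_permissions(apks, min_support, min_conf):
    """For each pair with X -> Y and Y -> X both strong, mark one side to drop."""
    freq = frequent_sets(apks, min_support)
    drop = set()
    for pair, sup in freq.items():
        if len(pair) != 2:
            continue
        x, y = sorted(pair)
        conf_xy = sup / freq[frozenset([x])]  # confidence(X -> Y)
        conf_yx = sup / freq[frozenset([y])]  # confidence(Y -> X)
        if conf_xy >= min_conf and conf_yx >= min_conf:
            drop.add(y)  # keep the alphabetically first permission
    return drop

# Illustrative permission sets for four APKs.
apks = [
    {"READ_CONTACTS", "WRITE_CONTACTS", "INTERNET"},
    {"READ_CONTACTS", "WRITE_CONTACTS"},
    {"INTERNET"},
    {"READ_CONTACTS", "WRITE_CONTACTS", "INTERNET"},
]
dropped = redundant_permissions(apks, min_support=0.5, min_conf=0.9)
```

Here READ_CONTACTS and WRITE_CONTACTS always occur together, so both rules have confidence 1 and one of the two is marked redundant.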

(3) traditional naive Bayes algorithm

Definition: suppose a sample has n attribute features, and the vector X denotes the attribute set formed by these n attributes, so the attribute information of the sample is represented by X = (X1, X2, X3, …, Xn); the sample space is divided into the classes C1, C2, …, Cm, where each Ci is one class; the conditional probability of each attribute given a class Ci, namely P(X1|Ci), P(X2|Ci), …, P(Xn|Ci), is computed on the training sample set; finally, when a sample X is classified, its posterior probability for each class, namely P(C1|X), P(C2|X), …, P(Cm|X), is computed, and the class Ci with the highest posterior probability is taken as the class of the sample;

by the naive Bayes principle given in the definition, the posterior probability is:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \tag{1-6}$$

where P(X) is the same for all classes, so maximizing the posterior probability only requires maximizing P(Ci)P(X|Ci); P(Ci) is the prior probability of class Ci in the training samples, and P(X|Ci) is the conditional probability of the attributes given class Ci; since the naive Bayes algorithm assumes that the attributes are mutually independent:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(X_k \mid C_i)$$

in summary, a sample X is assigned to the class whose posterior probability is largest, defined as:

$$C_{MAP} = \arg\max_{C_i} P(C_i) \prod_{k=1}^{n} P(X_k \mid C_i)$$

where $C_{MAP}$ is the final class decided by the naive Bayes algorithm according to the maximum posterior probability;

(4) improved naive Bayes based on weighting

In the feature weighting module, the TF-IDF algorithm is used to compute a weight $W(X_k)$ for each feature attribute; these weights are used both to remove redundancy among the permission features and to improve the naive Bayes algorithm; based on the TF-IDF weights computed above, the weight $W(X_k)$ is substituted into the naive Bayes posterior probability formula (1-6) to obtain the new posterior probability:
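
The weighted formula is not reproduced in this text. One common form of weighted naive Bayes, assumed here rather than taken from the source, raises each conditional probability to the power of its weight, so the score of class $C_i$ is $\log P(C_i) + \sum_k W(X_k) \log P(X_k \mid C_i)$. The Python sketch below uses that assumed form with a Bernoulli (present or absent) feature model; all names and numbers are illustrative.

```python
import math

def weighted_nb_predict(sample, priors, cond_probs, weights):
    """Return the class with the largest weighted log-posterior score.

    sample     : set of features present in the APK
    priors     : {class: P(C)}
    cond_probs : {class: {feature: P(feature present | C)}}, values in (0, 1)
    weights    : {feature: W(X_k)}; a weight of 1.0 for every feature
                 reduces this to ordinary naive Bayes
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for feature, p in cond_probs[c].items():
            # Probability of what was observed: present -> p, absent -> 1 - p.
            p_obs = p if feature in sample else 1.0 - p
            score += weights.get(feature, 1.0) * math.log(p_obs)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Illustrative parameters, not measured values.
priors = {"malware": 0.5, "benign": 0.5}
cond_probs = {
    "malware": {"SEND_SMS": 0.8, "INTERNET": 0.9},
    "benign": {"SEND_SMS": 0.1, "INTERNET": 0.9},
}
weights = {"SEND_SMS": 2.0, "INTERNET": 0.5}

label_a = weighted_nb_predict({"SEND_SMS", "INTERNET"}, priors, cond_probs, weights)
label_b = weighted_nb_predict({"INTERNET"}, priors, cond_probs, weights)
```

Working in log space avoids underflow when many features are multiplied, and a larger weight makes a discriminative feature count more heavily in the decision.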

Preferably, the dynamic features extracted by the automated testing tool MonkeyRunner together with the dynamic analysis tool Inspeckage are processed as follows:

the dynamic feature vector is processed with a GRU model:

$X^t$ is the input of the current unit, $Y^t$ is the output of the current unit, and $h^t$ is the hidden state of the current unit; $h^{t-1}$ is the hidden state output by the previous unit and passed to this unit, and it carries the relevant information of the previous units;

the internal computation of the GRU is as follows:

⊙ is the Hadamard product, i.e. element-wise multiplication of matrices; + is matrix addition, i.e. element-wise addition; there are two gates, the reset gate r and the update gate z; $W^r$, $W^z$ and $W$ are all weight matrices;

activation function (σ is the sigmoid function):

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

step 1: compute the two gate states r and z;

$$r = \sigma(W^r X^t + W^r h^{t-1}) \tag{1-12}$$

$$z = \sigma(W^z X^t + W^z h^{t-1}) \tag{1-13}$$

step 2: reset the data with the reset gate r and compute h';

$${h^{t-1}}' = h^{t-1} \odot r \tag{1-14}$$

$$h' = \tanh(W X^t + W {h^{t-1}}') \tag{1-15}$$

step 3: update the memory; the closer the gate signal is to 1, the more data is remembered, and the closer it is to 0, the more is forgotten;

$$h^t = z \odot h^{t-1} + (1 - z) \odot h' \tag{1-16}$$

as described above, from $X^t$ and $h^{t-1}$ the GRU obtains the output $Y^t$ of the current hidden unit and passes it to the next unit as the hidden state $h^t$, where $Y^t$ and $h^t$ are numerically identical.
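
The three steps can be traced with a small pure-Python sketch. It follows formulas (1-12) to (1-16) literally, which means one matrix multiplies both the input and the hidden state, so the sketch assumes both vectors have the same dimension; typical GRU implementations use separate input and recurrent weights plus bias terms. The helper names are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w_r, w_z, w):
    """One GRU step following formulas (1-12) to (1-16) as written."""
    # (1-12), (1-13): reset gate r and update gate z.
    r = [sigmoid(a + b) for a, b in zip(matvec(w_r, x), matvec(w_r, h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(w_z, x), matvec(w_z, h_prev))]
    # (1-14): reset the previous hidden state element-wise.
    h_reset = [h * g for h, g in zip(h_prev, r)]
    # (1-15): candidate state h'.
    h_cand = [math.tanh(a + b) for a, b in zip(matvec(w, x), matvec(w, h_reset))]
    # (1-16): blend the old state and the candidate with the update gate.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def run_gru(sequence, h0, w_r, w_z, w):
    """Feed a time-ordered sequence of feature vectors through the cell."""
    h = h0
    for x in sequence:
        h = gru_step(x, h, w_r, w_z, w)
    return h  # final hidden state, usable as input to a classifier layer

# With all-zero weights both gates equal 0.5 and h' is 0,
# so each step simply halves the previous hidden state.
zero = [[0.0, 0.0], [0.0, 0.0]]
h1 = gru_step([1.0, -2.0], [0.4, 0.8], zero, zero, zero)
h2 = run_gru([[1.0, -2.0], [0.5, 0.5]], [0.4, 0.8], zero, zero, zero)
```

The zero-weight case is only a sanity check of the update rule; trained weights would let z decide per element how much of the earlier behavior sequence is kept.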

The invention has the beneficial effects that:

Static analysis performs poorly on Android malware that uses obfuscation, and more and more Android malware now disguises itself this way. The invention therefore analyzes several kinds of static features, divided into resource features and syntax features. Resource features are extracted from the resource files stored in the APK and include certificate-related features, package-name-related features, features related to hiding malicious code, and so on. Syntax features are extracted from the code in the APK and include permission-related features, intent-related features, and features related to sensitive API calls.

The invention uses association rule mining (ARM) and the TF-IDF algorithm for feature processing: strongly correlated features are removed, effective features are extracted, and the naive Bayes algorithm is weighted, which effectively reduces the feature dimension, removes redundant features, and improves computational efficiency. In dynamic detection, the dynamic features are correlated in time and form a time series after entity-embedding encoding; a recurrent neural network (RNN) is well suited to sequences and time-correlated features and handles time-varying features well. The GRU is a widely used special kind of recurrent neural network that addresses the vanishing-gradient and exploding-gradient problems in long-sequence training. Compared with an ordinary RNN it performs better on longer time series, and compared with other RNN models it has fewer parameters, is easier to train, and can greatly improve training efficiency. A gated recurrent unit (GRU) is therefore used for dynamic detection.

The method provided by the invention was verified on the Android Malware Dataset (AMD); it shows good classification performance, effectively detects malware that uses obfuscation techniques, and improves detection accuracy.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic illustration of static feature detection according to the present invention;

FIG. 3 is a schematic diagram of the internal structure of the GRU according to the present invention.

Detailed Description

The first embodiment is as follows:

In this embodiment, as shown in FIG. 1, a malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit is implemented by the following steps:

the Permission set and the Intent set are obtained from the AndroidManifest.xml file;

the sensitive API set is obtained from the smali files;

the classes.dex file is decompiled and analyzed with the baksmali tool to obtain the API interfaces that are called, where chmod is a sensitive API for changing user permissions; the sensitive API features are obtained by analyzing class.

The second embodiment is as follows:

This embodiment differs from the first embodiment in that extracting the feature set from the decompiled resource files in step two specifically comprises processing the static features with association rule mining (ARM) combined with the TF-IDF algorithm: strongly correlated features are removed, the static features are extracted, and the naive Bayes algorithm is weighted, which effectively reduces the feature dimension, removes redundant features, and improves computational efficiency; the static features include requested permissions, components, intents, and sensitive APIs;

the naive Bayes algorithm generally classifies well, but its performance degrades when there are many attributes or when the attributes are strongly correlated, and the traditional algorithm assumes by default that every feature attribute contributes equally to the classification decision; the invention addresses these problems by performing static Android malware detection with an improved naive Bayes algorithm.

The third embodiment is as follows:

This embodiment differs from the first or second embodiment in that processing the time-varying features with the gated recurrent unit to detect dynamic features specifically comprises installing the Xposed framework on an emulator or a mobile device to obtain root privileges and extracting the dynamic features with the automated testing tool MonkeyRunner together with the dynamic analysis tool Inspeckage;

the dynamic features are the behavioral features of the Android application at run time, including file read/write operations, call requests, SMS requests, data encryption and decryption operations, network data input and output, and reading of private information; these behaviors express the behavioral intent of the application;

the dynamic feature action_sendnet indicates data transfer over the network, the dynamic feature actions_telephones indicates use of the call service, and the dynamic feature action_sendmss indicates sending an SMS. Dynamic features are extracted mainly by monitoring the relevant API calls, and each dynamic feature corresponds to a group of several API calls.

Dynamic behavior can be monitored with existing tools. In previous studies, researchers recorded dynamic features with DroidBox, a dynamic analysis tool for Android applications based on TaintDroid, which monitors system APIs through hooking and records the dynamic behavior of an application over a period of time after it starts. Although DroidBox is fully featured, it cannot run the detection automatically and cannot exercise all functions of an Android application, so its monitoring results may be incomplete. The automated testing tool MonkeyRunner and the dynamic analysis tool Inspeckage are therefore chosen to extract the dynamic features.
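
The idea that each dynamic feature corresponds to a group of API calls can be sketched in Python as below. The grouping in `FEATURE_APIS` is a hypothetical example chosen for illustration, not the mapping used by the invention, and the call log stands in for whatever the monitoring tool records.

```python
# Hypothetical grouping of monitored API calls into dynamic features.
FEATURE_APIS = {
    "action_sendnet": {"java.net.URL.openConnection"},
    "actions_telephones": {"android.telecom.TelecomManager.placeCall"},
    "action_sendmss": {"android.telephony.SmsManager.sendTextMessage"},
}

def encode_call_log(call_log, feature_apis=FEATURE_APIS):
    """Turn a time-ordered list of API names into a sequence of feature ids.

    Calls that belong to no dynamic feature are skipped; the order of the
    remaining calls is preserved so the result is a time series.
    """
    api_to_feature = {api: name
                      for name, apis in feature_apis.items()
                      for api in apis}
    # Stable integer id per feature, assigned in alphabetical order.
    feature_ids = {name: i for i, name in enumerate(sorted(feature_apis))}
    return [feature_ids[api_to_feature[api]]
            for api in call_log if api in api_to_feature]

# Illustrative log of calls observed while the application runs.
call_log = [
    "java.net.URL.openConnection",
    "android.util.Log.d",
    "android.telephony.SmsManager.sendTextMessage",
    "android.telecom.TelecomManager.placeCall",
]
sequence = encode_call_log(call_log)
```

The resulting integer sequence is the kind of time-ordered input that an embedding layer followed by a GRU can consume.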

The fourth embodiment is as follows:

This embodiment differs from the third embodiment in that the static features are processed with association rule mining (ARM) combined with the TF-IDF algorithm: strongly correlated features are removed, the feature dimension is reduced, and the naive Bayes algorithm is weighted; the weighted naive Bayes algorithm improves the accuracy of identifying malicious Android applications. The specific steps are as follows:

(1) constructing a feature weighting algorithm based on the TF-IDF algorithm;

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used feature weighting algorithm. It is typically used to evaluate the importance of a text item within a document set and is widely applied in text retrieval and classification. It consists of two parts, the term frequency and the inverse document frequency: the term frequency is how often a text item occurs in a document, and the inverse document frequency reflects how many documents in the document set contain that item. The main idea is to find text items that discriminate well between categories: if an item occurs frequently in one document and rarely in the other documents of the set, it is more important, i.e. it discriminates well between categories. The TF-IDF weight is calculated as:

$$\mathrm{Weight} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \tag{1-1}$$

the inverse document frequency IDF is given by formula (1-3), where $D$ is the total number of documents in the document set and $D_i$ is the number of documents in which the i-th text item appears; 1 is added to $D_i$ to prevent a zero denominator when $D_i$ is 0:

$$\mathrm{IDF}_i = \log\frac{D}{1 + D_i} \tag{1-3}$$

For the sensitive API feature vector, following the idea of the TF-IDF algorithm, each sensitive API is treated as a text item, the set of all sensitive APIs called by one sample as a document, and the whole software sample library as the document set, and the weight of each sensitive API feature is calculated; the higher the weight, the more commonly the sensitive API is called in malware, which yields a weighted call feature vector of sensitive APIs; sensitive APIs with larger weights are kept and those with smaller weights, which are not discriminative, are removed, thereby reducing the feature dimension of permissions, sensitive APIs, and so on; the result is a feature vector carrying weight information, and this weight information is an important basis for the subsequent improvement of the Bayesian algorithm;

(2) feature selection based on association rules;

there are often significant correlations between permission features. For example, READ_CONTACTS and WRITE_CONTACTS, the read and write permissions for contact information, are highly correlated and almost always appear together or are absent together in a sample. If the permission features contain many strongly correlated features, the classification overhead increases and classification accuracy suffers. The method uses association rule learning to remove redundancy among strongly correlated permission features.

The Apriori algorithm is one of the classical association rule learning algorithms in data mining. It is mainly used to find association rules among transaction-related records in a database, such as item lists and access lists. The algorithm uses a layer-by-layer iterative search, i.e. the (k+1)-permission sets are explored from the k-permission sets;

association rules are obtained with the Apriori algorithm, and the strength of a rule is measured by its support and confidence, which are defined as:

support:

$$\mathrm{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}$$

confidence:

$$\mathrm{confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$$

(3) traditional naive Bayes algorithm:

the principle of the naive Bayes algorithm is as follows:

suppose a sample has n attribute features, and the vector X denotes the attribute set formed by these n attributes, so the attribute information of the sample is represented by X = (X1, X2, X3, …, Xn); the sample space is divided into the classes C1, C2, …, Cm, where each Ci is one class; the conditional probability of each attribute given a class Ci, namely P(X1|Ci), P(X2|Ci), …, P(Xn|Ci), is computed on the training sample set; finally, when a sample X is classified, its posterior probability for each class, namely P(C1|X), P(C2|X), …, P(Cm|X), is computed, and the class Ci with the highest posterior probability is taken as the class of the sample;

in summary, a sample X is assigned to the class whose posterior probability is largest; the maximum a posteriori (MAP) decision is defined as:

$$C_{MAP} = \arg\max_{C_i} P(C_i) \prod_{k=1}^{n} P(X_k \mid C_i)$$

(4) improved naive Bayes based on weighting

The naive Bayes algorithm performs well overall but has shortcomings: by default it treats all features as equally weighted and mutually independent. A feature weighting module based on the TF-IDF algorithm is therefore used to improve the accuracy of the naive Bayes classifier.

The naive Bayes classifier does not take correlation between features into account. To reduce the influence of strongly correlated features on classification, the invention proposes feature selection based on association rule mining, which discovers strong association rules among permissions; only one permission in each frequent permission set is kept as a detection feature, which further reduces the number of permissions;

in the feature weighting module, the TF-IDF algorithm is used to compute a weight $W(X_k)$ for each feature attribute; these weights are used both to remove redundancy among the permission features and to improve the naive Bayes algorithm; because the traditional naive Bayes classifier does not consider that different features influence the malware classification result to different degrees, the weight $W(X_k)$, based on the TF-IDF weights computed above, is substituted into the naive Bayes posterior probability formula (1-6) to obtain the new posterior probability:

the overall static detection flow is shown as static feature detection in FIG. 2.

The fifth embodiment is as follows:

This embodiment differs from the above embodiment in that the dynamic features extracted by the automated testing tool MonkeyRunner together with the dynamic analysis tool Inspeckage are processed as follows:

the GRU (Gated Recurrent Unit) is a special kind of recurrent neural network (RNN). It is mainly intended to address vanishing and exploding gradients in long-sequence training, and it performs better than an ordinary RNN on longer sequences. The transmitted state is controlled by gate states: information that needs to be kept for a long time is remembered and unimportant information is forgotten. Unlike an ordinary RNN, which can only accumulate memory, the GRU performs better on longer time series; compared with other RNN models it has fewer parameters, trains faster and more easily, and needs less data to generalize. The GRU model is therefore well suited to processing dynamic feature vectors with few dimensions;

the dynamic feature vector is processed with a GRU model:

the internal structure of the GRU is shown in FIG. 3; $X^t$ is the input of the current unit, $Y^t$ is the output of the current unit, and $h^t$ is the hidden state of the current unit; $h^{t-1}$ is the hidden state output by the previous unit and passed to this unit, and it carries the relevant information of the previous units;

the internal computation of the GRU is as follows:

activation function (σ is the sigmoid function):

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

step 1: compute the two gate states r and z;

$$r = \sigma(W^r X^t + W^r h^{t-1}) \tag{1-12}$$

$$z = \sigma(W^z X^t + W^z h^{t-1}) \tag{1-13}$$

step 2: reset the data with the reset gate r and compute h';

$${h^{t-1}}' = h^{t-1} \odot r \tag{1-14}$$

$$h' = \tanh(W X^t + W {h^{t-1}}') \tag{1-15}$$

$$h^t = z \odot h^{t-1} + (1 - z) \odot h' \tag{1-16}$$

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit, characterized in that the method is realized by the following steps:

the Permission set and the Intent set are obtained from the AndroidManifest.xml file;

2. The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit as claimed in claim 1, wherein: extracting the feature set from the decompiled resource files in step two specifically comprises processing the static features with association rule mining combined with the TF-IDF algorithm, removing strongly correlated features, extracting the static features, and weighting the naive Bayes algorithm, thereby reducing the feature dimension and removing redundant features; the static features include requested permissions, components, intents, and sensitive APIs.

3. The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit as claimed in claim 1, wherein: processing the time-varying features with the gated recurrent unit to detect dynamic features specifically comprises installing the Xposed framework on an emulator or a mobile device to obtain root privileges and extracting the dynamic features with the automated testing tool MonkeyRunner together with the dynamic analysis tool Inspeckage;

4. The malware detection method based on a hybrid of an improved naive Bayes algorithm and a gated recurrent unit as claimed in claim 2, wherein:

processing the static features with association rule mining and the TF-IDF algorithm comprises the following specific steps:

(1) constructing a feature weighting algorithm based on the TF-IDF algorithm;

$$\mathrm{Weight} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i \tag{1-1}$$