CN111860662A - Training method and device, application method and device of similarity detection model - Google Patents

Training method and device, application method and device of similarity detection model

Info

Publication number
CN111860662A
CN111860662A (application CN202010723891.9A)
Authority
CN
China
Prior art keywords
application program
similarity
information
detection model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010723891.9A
Other languages
Chinese (zh)
Other versions
CN111860662B (en)
Inventor
许静
高红灿
过辰楷
黄登蓉
吴彦峰
何振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN202010723891.9A
Publication of CN111860662A
Application granted
Publication of CN111860662B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a training method and device, and an application method and device, for a similarity detection model. The training method comprises the following steps: performing feature extraction on attribute information of a first application program and a second application program to obtain attribute features; obtaining the similarity between the first application program and the second application program according to the attribute features; and training a similarity detection model according to the difference between the similarity and a similarity label, wherein the similarity label marks the similarity between the first application program and the second application program. The similarity label can serve as a training label to establish a supervised similarity detection model and improve its performance, thereby improving the detection efficiency and accuracy of application-program similarity.

Description

Training method and device, application method and device of similarity detection model
Technical Field
The invention relates to the technical field of deep learning, in particular to a training method and device, and an application method and device, for a similarity detection model.
Background
Application (APP) similarity detection is an important component of software engineering and is widely applied in fields such as malware detection, APP recommendation, and software requirement discovery.
At present, methods for detecting APP similarity mainly comprise watermarking methods and feature extraction methods. In a watermarking method, specific data (such as characters, string keys, and the like) are added to an APP as a watermark; during detection, the watermark is extracted from the APP using a corresponding algorithm, and the similarity of APPs is then judged according to the extracted watermark. In a feature extraction method, the attributes of an APP are analyzed to generate feature vectors, and the similarity between feature vectors is obtained through distance calculation or classification.
However, existing similarity detection methods suffer from low detection efficiency and low accuracy.
Disclosure of Invention
In view of this, embodiments of the present invention provide a training method and apparatus, and an application method and apparatus, for a similarity detection model, which can improve the detection efficiency and accuracy of application-program similarity.
According to a first aspect of the embodiments of the present invention, there is provided a training method for a similarity detection model, including: extracting the characteristics of the attribute information of the first application program and the second application program to obtain attribute characteristics; according to the attribute characteristics, obtaining the similarity between the first application program and the second application program; and training a similarity detection model according to the difference between the similarity and the similarity label, wherein the similarity label is used for marking the similarity between the first application program and the second application program.
In some embodiments of the invention, the similarity label is established based on coarse-grained category information and/or fine-grained category information of the first application and the second application.
In some embodiments of the present invention, the similarity label includes a first similarity label, a second similarity label, and/or a third similarity label, wherein the first similarity label is used for marking that the coarse-grained category information of the first application and the second application is different; the second similarity label is used for marking that the coarse-grained category information of the first application and the second application is the same while the fine-grained category information is different; and the third similarity label is used for marking that the fine-grained category information of the first application and the second application is the same.
In some embodiments of the invention, the similarity detection model is an FM model, a DNN model, or a DeepFM model.
In some embodiments of the present invention, the training method of the similarity detection model further includes: performing word embedding processing on the attribute information of the first application program and the second application program, wherein performing feature extraction on the attribute information of the first application program and the second application program comprises: performing feature extraction on the word-embedded attribute information of the first application program and the second application program.
In some embodiments of the present invention, the attribute information includes title information, description information, and privacy policy information of the application programs, and the training method of the similarity detection model further includes: pre-training, through a long short-term memory network, the word-embedded description information and privacy policy information of the first application program and the second application program, wherein performing feature extraction on the word-embedded attribute information comprises: performing feature extraction on the word-embedded title information and on the word-embedded and pre-trained description information and privacy policy information of the first application program and the second application program.
In some embodiments of the invention, the long short term memory network is a unidirectional long short term memory network, a bidirectional long short term memory network, a unidirectional long short term memory network based on attention mechanism, or a bidirectional long short term memory network based on attention mechanism.
According to a second aspect of the embodiments of the present invention, there is provided an application method of a similarity detection model, including: inputting attribute information of a first application program and a second application program to be detected into a similarity detection model, wherein the similarity detection model is obtained by training through any one of the methods; and carrying out similarity detection on the first application program and the second application program by utilizing a similarity detection model.
According to a third aspect of the embodiments of the present invention, there is provided a training apparatus for a similarity detection model, including: the characteristic extraction module is used for extracting the characteristics of the attribute information of the first application program and the second application program to obtain attribute characteristics; the similarity module is used for obtaining the similarity between the first application program and the second application program according to the attribute characteristics; and the training module is used for training the similarity detection model according to the difference between the similarity and the similarity label, wherein the similarity label is used for marking the similarity between the first application program and the second application program.
According to a fourth aspect of the embodiments of the present invention, there is provided an application apparatus of a similarity detection model, including: the input module is used for inputting the attribute information of the first application program and the second application program to be detected into the similarity detection model, wherein the similarity detection model is obtained by training through any one of the methods; and the detection module is used for carrying out similarity detection on the first application program and the second application program by utilizing the similarity detection model.
According to a fifth aspect of embodiments of the present invention, there is provided a computer-readable storage medium, characterized in that the storage medium stores a computer program for executing any one of the methods described above.
According to a sixth aspect of the embodiments of the present invention, there is provided an electronic apparatus, including: a processor; a memory for storing processor-executable instructions; a processor configured to perform any of the methods described above.
According to the technical scheme provided by the embodiment of the invention, attribute features are obtained by performing feature extraction on the attribute information of the first application program and the second application program; the similarity between the first application program and the second application program is obtained according to the attribute features; and the similarity detection model is trained according to the difference between the similarity and the similarity label. The similarity label can serve as a training label to establish a supervised similarity detection model and improve its performance, thereby improving the detection efficiency and accuracy of application-program similarity.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic flow chart illustrating a training method of a similarity detection model according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating a training method of a similarity detection model according to another embodiment of the present invention.
Fig. 3 is a flowchart illustrating a training method of a similarity detection model according to another embodiment of the present invention.
Fig. 4 is a flowchart illustrating a training method of a similarity detection model according to another embodiment of the present invention.
Fig. 5 is a flowchart illustrating a training method of a similarity detection model according to another embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a similarity detection model according to an embodiment of the present invention.
Fig. 7 is a flowchart illustrating an application method of the similarity detection model according to an embodiment of the present invention.
Fig. 8 is a block diagram of a training apparatus for a similarity detection model according to an embodiment of the present invention.
Fig. 9 is a block diagram illustrating an apparatus for applying a similarity detection model according to an embodiment of the present invention.
Fig. 10 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart illustrating a training method of a similarity detection model according to an embodiment of the present invention. The method may be performed by a computer device (e.g., a server). As shown in fig. 1, the method includes the following.
S110: Perform feature extraction on the attribute information of the first application program and the second application program to obtain attribute features.
Specifically, the attribute information of the first application program and that of the second application program may be concatenated and input into the similarity detection model, which performs feature extraction on the attribute information to obtain the attribute features. The attribute features may include low-order features and/or high-order features of the attribute information, which is not limited by the present invention.
The attribute information may include the APP's user interface design pages, virtual machine (Dalvik) bytecode, APP metadata (e.g., text description information, title information, category information, privacy policy information, permission information, and the like, provided by the developer), Application Programming Interfaces (APIs), and/or multimedia files (audio, video), etc., which is not limited by the present invention.
The description information in the metadata is text provided by the developer to introduce the APP's functions, and the privacy policy information describes what private data the APP collects; these two attributes are usually presented as long text. The title information is the name of the APP, and the category information is the classification of the APP; these are usually presented as short text.
It should be understood that the attribute information in the embodiment of the present invention may be one or more of the above information, and the present invention is not limited to this.
For example, the attribute information may be acquired from the Google Play store using crawler technology.
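For illustration only, a single APP's metadata record gathered this way could be organized as follows; the field names in this sketch are hypothetical and not mandated by the scheme:

```python
# A minimal sketch of one APP's metadata record; field names and values
# are illustrative assumptions, not part of the claimed method.
app_metadata = {
    "title": "PhotoShare",                     # short text: the APP name
    "category_coarse": "Photography",          # coarse-grained category
    "category_fine": "Photo Editor",           # fine-grained category, if provided
    "description": "PhotoShare lets you edit and share photos with friends.",
    "privacy_policy": "We collect your email address and device identifier.",
}
```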
S120: Obtain the similarity between the first application program and the second application program according to the attribute features.
Specifically, the similarity between the first application program and the second application program can be obtained by performing dimension reduction on the attribute features through a fully connected layer and then applying a sigmoid activation function.
S130: Train the similarity detection model according to the difference between the similarity and a similarity label, where the similarity label marks the similarity between the first application program and the second application program.
Specifically, a loss function value is calculated from the similarity and the similarity label using a log-likelihood loss function; the loss value is back-propagated, and the parameters of the similarity detection model are adjusted, thereby obtaining a trained similarity detection model. It should be understood that the log-likelihood loss function is only an example; the present invention does not limit the type of loss function.
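For intuition, one training step matching this description might look as follows. This is a minimal PyTorch sketch in which the model architecture, feature dimension, and batch are placeholders, and binary cross-entropy stands in for the log-likelihood loss:

```python
import torch
import torch.nn as nn

# Placeholder model: maps concatenated attribute features to a similarity in (0, 1).
model = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()  # log-likelihood (binary cross-entropy) loss

features = torch.randn(32, 256)       # dummy attribute features for a batch of pairs
labels = torch.full((32, 1), 0.5)     # similarity labels, e.g. 0.25 / 0.5 / 0.75

similarity = model(features)          # predicted similarity
loss = loss_fn(similarity, labels)    # difference between similarity and label
optimizer.zero_grad()
loss.backward()                       # back-propagate the loss value
optimizer.step()                      # adjust the model parameters
```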
Illustratively, the similarity detection model may be an FM model, a DNN model, a DeepFM model, or the like, which is not specifically limited in the present invention.
According to the technical scheme provided by the embodiment of the invention, attribute features are obtained by performing feature extraction on the attribute information of the first application program and the second application program; the similarity between the first application program and the second application program is obtained according to the attribute features; and the similarity detection model is trained according to the difference between the similarity and the similarity label. The similarity label can serve as a training label to establish a supervised similarity detection model and improve its performance, thereby improving the detection efficiency and accuracy of application-program similarity.
In some embodiments of the invention, the similarity label is established based on coarse-grained category information and/or fine-grained category information of the first application and the second application.
For example, a "Category" provided by google shop for describing APP type may be denoted as coarse-grained Category; in addition, some representative APPs in Google stores perform further detailed category division on the APPs on the basis of category classification, and can be recorded as fine-grained categories.
The APPs in different coarse-grained categories have smaller similarity, the APPs in the same coarse-grained category have certain degree of similarity, and the APPs in the same fine-grained category have greater similarity. Further, the similarity labels may include a first similarity label, a second similarity label and/or a third similarity label, and the similarity detection model is trained by using the three similarity labels.
For example, the similarity between APPs of different coarse-grained categories is lowest, which may be defined as 0.25 (i.e., the first similarity label); the similarity between APPs in the same coarse-grained category and different fine-grained categories is high, and can be defined as 0.5 (i.e., a second similarity label); the similarity between APPs in the same fine-grained category is the highest and can be defined as 0.75 (i.e., the third similarity label).
It should be understood that the above is only an exemplary description; the invention does not limit the specific values of the similarity labels, the criteria by which they are defined, or the number of category levels.
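Under the exemplary values above, label construction reduces to a simple rule. A sketch, assuming each APP carries coarse- and fine-grained category strings:

```python
def similarity_label(coarse1: str, fine1: str, coarse2: str, fine2: str) -> float:
    """Build the supervision label from category information, using the
    exemplary 0.25 / 0.5 / 0.75 values described above."""
    if coarse1 != coarse2:
        return 0.25  # first label: different coarse-grained categories
    if fine1 != fine2:
        return 0.5   # second label: same coarse-grained, different fine-grained
    return 0.75      # third label: same fine-grained category
```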
Fig. 2 is a schematic flow chart illustrating a training method of a similarity detection model according to another embodiment of the present invention. The embodiment shown in fig. 2 of the present invention is extended on the basis of the embodiment shown in fig. 1 of the present invention, and the differences between the embodiment shown in fig. 2 and the embodiment shown in fig. 1 will be emphasized below, and the descriptions of the same parts will not be repeated.
As shown in fig. 2, the training method of the similarity detection model according to the embodiment of the present invention further includes the following steps.
S140: Preprocess the attribute information of the first application program and the second application program using natural language processing techniques.
Step S110 then includes: performing feature extraction on the preprocessed attribute information of the first application program and the second application program to obtain attribute features.
The preprocessing may include text filtering, word filtering, word modification, and/or word stemming.
Text filtering selects the information a user requires from the text, or removes information that is not required. For example, the text filtering may be non-English text filtering. APPs on the Google Play store are published in different countries, textual descriptions are written in different languages, and the same APP often contains paragraphs in multiple languages. Since most of the information is English text, some embodiments of the present invention focus on English text only: specifically, language identification is performed with a language-identification package (e.g., langid), all non-English text is deleted, and only English text is retained for analysis.
Word filtering means filtering out common English stop words (e.g., "the", "it", "on") and meaningless symbols (e.g., "@", "$", "#", and meaningless punctuation marks). Because these tokens appear frequently in English text and are unhelpful for APP similarity detection, filtering them out improves the validity and accuracy of detection.
Word modification refers to modifying English words that cannot be recognized (e.g., acronyms and misspelled words) into recognizable words. Specifically, the PyEnchant library can be used to check and correct misspelled English words, and abbreviations can be replaced based on a collected abbreviation list, thereby improving the effectiveness of APP similarity detection.
Word stemming refers to reducing words to their root forms. In English text the same word has many variants; for example, "eating", "eaten", and "ate" are all variants of "eat". Specifically, the NLTK package can be used to perform the stemming task and restore each word to its root. It should be noted that a stemmed word may not be a complete word, and only part of the word may remain, such as "leav", "reviv", or "airlin".
According to the technical scheme provided by this embodiment, performing preprocessing operations such as text filtering, word filtering, word modification, and/or word stemming on the attribute information of the first application program and the second application program can improve the effectiveness of APP similarity detection.
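Taken together, the preprocessing could be sketched as below. The langdetect, NLTK, and PyEnchant calls are one plausible realization of the language filtering, stop-word removal, spelling correction, and stemming steps described above; the exact packages (and the required NLTK corpus download) are assumptions, not specified by the scheme:

```python
import string
from langdetect import detect        # language identification (assumed stand-in)
from nltk.corpus import stopwords    # requires nltk.download('stopwords') once
from nltk.stem import PorterStemmer
import enchant                       # PyEnchant spell checker

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
speller = enchant.Dict("en_US")

def preprocess(text: str) -> list:
    """Text filtering, word filtering, word modification, and word stemming."""
    if detect(text) != "en":                    # keep English text only
        return []
    tokens = []
    for word in text.lower().split():
        word = word.strip(string.punctuation)   # drop meaningless symbols
        if not word or word in stop_words:      # filter out stop words
            continue
        if not speller.check(word):             # modify unrecognized words
            suggestions = speller.suggest(word)
            if suggestions:
                word = suggestions[0].lower()
        tokens.append(stemmer.stem(word))       # reduce to the root form
    return tokens
```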
Fig. 3 is a flowchart illustrating a training method of a similarity detection model according to another embodiment of the present invention. The embodiment shown in fig. 3 of the present invention is extended on the basis of the embodiment shown in fig. 1 of the present invention, and the differences between the embodiment shown in fig. 3 and the embodiment shown in fig. 1 will be emphasized below, and the descriptions of the same parts will not be repeated.
As shown in fig. 3, the training method of the similarity detection model according to the embodiment of the present invention further includes the following steps.
S150: Perform word embedding processing on the attribute information of the first application program and the second application program.
Step S110 then includes: performing feature extraction on the word-embedded attribute information of the first application program and the second application program to obtain attribute features.
Illustratively, Word2Vec can be adopted for word embedding to realize vectorized conversion of the text. Word2Vec mainly includes the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model. With CBOW, the input is the context word vectors corresponding to a specific word, and the output is the word vector of that word. With Skip-Gram, the input is the word vector of a specific word, and the output is the context word vectors corresponding to that word. The present invention does not limit the specific manner of word embedding.
Text attribute information (e.g., title information, description information, and privacy policy information) can be converted by word embedding into low-dimensional, dense continuous vectors, on which subsequent computations are performed.
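With gensim, for example, a Skip-Gram Word2Vec embedding of the preprocessed tokens can be trained as follows; the corpus and dimensions here are placeholder choices:

```python
from gensim.models import Word2Vec

# Tokenized attribute texts of the training APPs (placeholder corpus).
corpus = [["photo", "edit", "share"], ["track", "run", "fit"]]

# sg=1 selects the Skip-Gram model; sg=0 would select CBOW instead.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["photo"]  # 100-dimensional word vector for "photo"
```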
Fig. 4 is a flowchart illustrating a training method of a similarity detection model according to another embodiment of the present invention. The embodiment shown in fig. 4 of the present invention is extended on the basis of the embodiment shown in fig. 3 of the present invention, and the differences between the embodiment shown in fig. 4 and the embodiment shown in fig. 3 will be emphasized below, and the descriptions of the same parts will not be repeated.
As shown in fig. 4, in the training method of the similarity detection model provided in the embodiment of the present invention, the attribute information includes title information, description information, and privacy policy information of the application program, and the method further includes the following steps.
S160: Pre-train, through a long short-term memory network, the word-embedded description information and privacy policy information of the first application program and the second application program.
Step S110 then includes: performing feature extraction on the word-embedded title information and on the word-embedded and pre-trained description information and privacy policy information of the first application program and the second application program to obtain attribute features.
In the embodiment of the invention, APP similarity detection is performed based on three kinds of metadata (title information, description information, and privacy policy information); the multi-dimensional characteristics of APP metadata are comprehensively considered, the contributions of different features to APP similarity are fully taken into account, and the accuracy of similar-APP detection is improved.
In the embodiment of the present invention, the title information (of the first and second application programs) remains unchanged after word embedding. The description information and the privacy policy information (of the first and second application programs) are long texts relative to the title information. Because long text contains richer information, in order to better represent the continuous long text of the description and privacy policy, their features can be further extracted by a long short-term memory network after Word2Vec processing.
It should be understood that the long short-term memory network may be a unidirectional long short-term memory network, a bidirectional long short-term memory network, an attention-based unidirectional long short-term memory network, or an attention-based bidirectional long short-term memory network, which is not limited by the present invention.
A more specific example is given below in conjunction with fig. 5 and 6. It should be understood that the process flow of fig. 5 is based on the model structure provided in fig. 6, which corresponds to each other, and the respective steps in fig. 5 are described in detail below in conjunction with fig. 6.
S210: Preprocess the title information, description information, and privacy policy information of the first application program and the second application program, where the preprocessing includes non-English text filtering, word modification, and word stemming.
Specifically, the metadata (title 1, description 1, privacy policy 1, and category 1) of the first application program and the metadata (title 2, description 2, privacy policy 2, and category 2) of the second application program may be acquired by downloading the APK installation packages and using crawler technology. A similarity label between the first application program and the second application program is established based on the category information (category 1 and category 2). The title information (title 1 and title 2), description information (description 1 and description 2), and privacy policy information (privacy policy 1 and privacy policy 2) are preprocessed and then input into the similarity detection model.
S220: Perform word embedding on the preprocessed title information, description information, and privacy policy information using the Skip-Gram model of Word2Vec.
Specifically, the output of word embedding is a set of continuous distributed representations in a dense space, $W_i = (w_{i1}, w_{i2}, \ldots, w_{in})$, where $W_i \in \mathbb{R}^{n \times d_w}$, $n$ is the number of words in the $i$-th sequence, and $d_w$ is the dimension of each word vector.
This scheme uses the Skip-Gram neural network model of Word2Vec for word embedding, which retains semantic and order information while preserving the original information.
S230: Pre-train the word-embedded description information and privacy policy information using an attention-based bidirectional long short-term memory network.
The description information and privacy policy information are long texts relative to the title information. Long text contains rich information; to represent it better, after Word2Vec processing, an attention-based bidirectional Long Short-Term Memory (BiLSTM) network is used to further extract feature information from the description information and privacy policy information.
A BiLSTM network is formed by combining a forward LSTM network and a backward LSTM network. As a temporally recurrent neural network, the LSTM effectively retains historical information and learns long-term dependencies in text. The LSTM updates and retains historical information through three gates (an input gate, a forgetting gate, and an output gate) and one memory cell.
The LSTM update proceeds as follows. At time t, the input gate takes the LSTM unit output $h_{t-1}$ of the previous time and the input $x_t$ of the current time as inputs, and decides by calculation whether to update the current information into the LSTM cell:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

The forgetting gate takes the hidden-layer output $h_{t-1}$ of the previous time and the current input $x_t$ as inputs, determines which information needs to be retained and which discarded, and stores the historical information:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

The current candidate cell value $cin_t$ is determined by the current input $x_t$ and the previous LSTM hidden-layer output $h_{t-1}$:

$cin_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$

The state value $c_t$ of the memory cell at the current time is determined not only by the candidate value $cin_t$ and its own previous state $c_{t-1}$; these two factors are also adjusted through the input gate and the forgetting gate:

$c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_c x_t + U_c h_{t-1} + b_c)$

where "$\cdot$" denotes the point-wise product.

The output gate $o_t$, which controls the output of the memory cell state value, is calculated as:

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

The final LSTM unit output is $h_t$:

$h_t = o_t \cdot \tanh(c_t)$

In the above formulas, $\sigma$ is the sigmoid function and $b$ is a bias vector; $i$, $f$, and $o$ are the input gate, forgetting gate, and output gate respectively, and $h$ is the hidden vector; $W$ is an input weight matrix, $U$ is a hidden-state weight matrix, and $c$ is the cell activation vector.
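The gate equations above translate directly into code. A minimal NumPy sketch of one LSTM time step, with randomly initialized parameters standing in for learned weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following the gate equations above.
    W, U, b hold the parameters for gates i, f, o and the cell c."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forgetting gate
    c_in = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell value
    c_t = f_t * c_prev + i_t * c_in                          # memory cell update
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    h_t = o_t * np.tanh(c_t)                                 # unit output
    return h_t, c_t

d_in, d_h = 100, 64                                          # placeholder dimensions
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_in)) * 0.1 for k in "ifoc"}
U = {k: rng.normal(size=(d_h, d_h)) * 0.1 for k in "ifoc"}
b = {k: np.zeros(d_h) for k in "ifoc"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```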
In order to obtain the context information of the description information and privacy policy information, a BiLSTM network is adopted to explore the bidirectional hidden state at time t, and the two directional states are integrated into a final state so that the bidirectional characteristics of the text can be learned:

$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the forward and backward hidden states, and concatenation is the standard way of integrating the two directional states.
In addition, each word in a text contributes differently to the text's category. In order to extract the features of important words and mine deeper information in the text, the embodiment of the invention introduces an attention mechanism to capture the influence of different words on the text description, i.e., establishes an attention-based BiLSTM network. With the attention mechanism, a weight is computed for each time step and normalized through softmax, and the vectors of all time steps are weighted and summed to form the feature vector:

$u_{ti} = \tanh(W h_{ti} + b)$

$a_{ti} = \dfrac{\exp(u_{ti}^{\top} u_w)}{\sum_{i} \exp(u_{ti}^{\top} u_w)}$

$s_t = \sum_{i} a_{ti} h_{ti}$

where $t$ denotes the $t$-th description text and $i$ denotes the $i$-th word in the description; $u_{ti}$ is a hidden representation of $h_{ti}$, and $b$ is a bias vector. The similarity between $u_{ti}$ and a context vector $u_w$ is used to measure the importance of different words. The normalized weight $a_{ti}$ is obtained through the softmax function, and the sentence vector $s_t$ is represented as a weighted sum of the word annotations.
By adopting the attention-based BiLSTM network, not only can the features of words and sentences in the text be extracted, but the importance of different words can also be captured.
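A compact PyTorch realization of the attention-based BiLSTM encoder described here could look as follows; the hidden sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionBiLSTM(nn.Module):
    """BiLSTM with word-level attention, following u = tanh(Wh + b),
    a = softmax(u . u_w), s = sum(a * h) as described above."""
    def __init__(self, embed_dim: int = 100, hidden_dim: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.context = nn.Parameter(torch.randn(2 * hidden_dim))  # context vector u_w

    def forward(self, x):                     # x: (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)                 # (batch, seq_len, 2*hidden_dim)
        u = torch.tanh(self.proj(h))          # u_ti = tanh(W h_ti + b)
        scores = u @ self.context             # similarity with context vector u_w
        a = torch.softmax(scores, dim=1)      # normalized weights a_ti
        s = (a.unsqueeze(-1) * h).sum(dim=1)  # weighted sum of word annotations
        return s                              # sentence vector s_t

encoder = AttentionBiLSTM()
sentence_vec = encoder(torch.randn(8, 50, 100))  # 8 texts, 50 words, 100-dim vectors
```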
S240: Use a DeepFM network to perform feature extraction on the word-embedded title information and on the word-embedded and pre-trained description information and privacy policy information, obtaining low-order features and high-order features respectively.
Through the processing of steps S210-S230, the title information (short text) retains its feature representation after word embedding, and the description information and privacy policy information (long text) retain their feature representations after word embedding and pre-training. To capture the interactions between features, the DeepFM model is used to extract and fuse the features of the long and short texts.
The DeepFM model includes two parts: a Factorization Machine (FM) part and a Deep Neural Network (DNN) part, which are respectively responsible for extracting the low-order and high-order features of the title information, description information, and privacy policy information. The FM part extracts first-order features and the second-order features formed by their pairwise combinations; the DNN part extracts high-order features formed by applying fully connected operations to the input first-order features. The two parts share the same input (i.e., the word-embedded title information, and the word-embedded and pre-trained description information and privacy policy information).
The output of the FM part is:

$y_{FM} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle V_i, V_j \rangle x_i x_j$

where $w_0$ is the global bias, $w_i$ is the weight of the $i$-th feature variable $x_i$, and $n$ is the number of features of the sample. $V_i$ and $V_j$ are hidden (latent) vectors, and $\langle V_i, V_j \rangle$ captures the interaction of second-order features; the interaction terms range over $\{(i, j) \mid 0 < i \le n,\ j > i,\ x_i \ne 0,\ x_j \ne 0\}$.
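The first- and second-order FM terms can be computed efficiently with the standard identity that the pairwise-interaction sum equals half of (sum of embeddings) squared minus the sum of squared embeddings. A PyTorch sketch with randomly initialized parameters:

```python
import torch

def fm_output(x, w0, w, V):
    """FM part: global bias + first-order term + pairwise interactions.
    x: (batch, n) features, w: (n,) weights, V: (n, k) hidden vectors."""
    first_order = w0 + x @ w                          # w0 + sum_i w_i x_i
    xv = x @ V                                        # sum_i x_i V_i, shape (batch, k)
    x2v2 = (x ** 2) @ (V ** 2)                        # sum_i x_i^2 V_i^2
    second_order = 0.5 * (xv ** 2 - x2v2).sum(dim=1)  # sum_{i<j} <V_i,V_j> x_i x_j
    return first_order + second_order

n, k = 16, 8                                          # placeholder sizes
x = torch.randn(4, n)
y_fm = fm_output(x, torch.zeros(()), torch.randn(n), torch.randn(n, k))
```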
Through the FM mechanism, low-order cross features can be learned. Further, the high-order relationships between the embedded features are trained through the Deep Neural Network (DNN) part, whose output can be calculated as:

$y_{DNN} = \sigma(W^{|H|+1} \cdot a^{|H|} + b^{|H|+1})$

where $W$ denotes a weight parameter, $b$ denotes a bias parameter, $|H|$ is the number of hidden layers, and $\sigma$ is the sigmoid activation function. For each hidden layer of the DNN, the output vector $L$ is calculated by the ReLU activation function, i.e., $L = \mathrm{ReLU}(W_z z + b_z)$, where $W_z$ is a weight matrix and $b_z$ is a bias; for the input layer, $z$ is the feature vector input to the model, and for a hidden layer, $z$ is the output vector of the previous layer.
S250: Fuse the low-order features and high-order features, perform dimension reduction, and apply a sigmoid activation function to obtain a predicted similarity value between the first application program and the second application program.
Specifically, the similarity prediction of the DeepFM model can be expressed as:

$\hat{y} = \mathrm{sigmoid}(y_{FM} + y_{DNN})$
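Combining the two parts gives the final prediction as shown above. A minimal sketch building on the fm_output function from the FM example, with placeholder DNN layer sizes:

```python
import torch
import torch.nn as nn

# Assumes fm_output(...) from the FM sketch above; the DNN layer sizes
# are placeholder choices, not values fixed by the scheme.
dnn = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # hidden layers use the ReLU activation
    nn.Linear(32, 1),
)

def deepfm_similarity(x, w0, w, V):
    y_fm = fm_output(x, w0, w, V)       # low-order features
    y_dnn = dnn(x).squeeze(-1)          # high-order features (shared input)
    return torch.sigmoid(y_fm + y_dnn)  # similarity prediction in (0, 1)

x = torch.randn(4, 16)
sim = deepfm_similarity(x, torch.zeros(()), torch.randn(16), torch.randn(16, 8))
```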
s260: and training a similarity detection model according to the difference between the similarity prediction value and the similarity label.
According to the technical scheme provided by the embodiment of the invention, a supervised similarity detection model can be established by building a similarity label between the first application program and the second application program and using it as the training label. The multi-dimensional characteristics of APP metadata, including the description, title, category, and privacy policy, are comprehensively considered, and the contributions of different features to APP similarity are fully taken into account, which improves the accuracy of similar-APP detection. Vectorized representation of text is realized through natural language processing and word embedding, converting the text into compact continuous vectors; context information of long text is learned through the BiLSTM; and multi-feature interaction is realized through the DeepFM mechanism. By effectively combining these techniques, a deep-learning-based similar-APP detection model is constructed, the performance of the similarity detection model is improved, and the detection efficiency and accuracy of application-program similarity are improved.
Fig. 7 is a flowchart illustrating an application method of the similarity detection model according to an embodiment of the present invention. The method may be performed by a computer device (e.g., a server). As shown in fig. 7, the method includes the following.
S310: Input the attribute information of the first application program and the second application program to be detected into the similarity detection model, where the similarity detection model is obtained by training through any one of the methods described above.
S320: Perform similarity detection on the first application program and the second application program using the similarity detection model.
Specifically, the attribute information of the first application program and the second application program to be detected is input into the similarity detection model, and the similarity detection model can output the similarity between the first application program and the second application program.
According to the technical scheme provided by the embodiment of the invention, by inputting the attribute information of the first application program and the second application program to be detected into the similarity detection model, and performing similarity detection on the two programs with the model, the detection efficiency and accuracy of application-program similarity can be improved.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Fig. 8 is a block diagram of a training apparatus for a similarity detection model according to an embodiment of the present invention. As shown in fig. 8, the training apparatus 800 includes:
the feature extraction module 810 is configured to perform feature extraction on the attribute information of the first application program and the second application program to obtain attribute features.
A similarity module 820, configured to obtain the similarity between the first application program and the second application program according to the attribute features.
A training module 830 for training a similarity detection model according to the difference between the similarity and a similarity label, wherein the similarity label is used for labeling the similarity between the first application and the second application.
According to the technical scheme provided by the embodiment of the invention, attribute features are obtained by performing feature extraction on the attribute information of the first application program and the second application program; the similarity between the first application program and the second application program is obtained according to the attribute features; and the similarity detection model is trained according to the difference between the similarity and the similarity label. The similarity label can serve as a training label to establish a supervised similarity detection model and improve its performance, thereby improving the detection efficiency and accuracy of application-program similarity.
In some embodiments of the invention, the similarity label is established based on coarse-grained category information and/or fine-grained category information of the first application and the second application.
In some embodiments of the present invention, the similarity label includes a first similarity label, a second similarity label, and/or a third similarity label, wherein the first similarity label is used for marking that the coarse-grained category information of the first application and the second application is different; the second similarity label is used for marking that the coarse-grained category information of the first application and the second application is the same while the fine-grained category information is different; and the third similarity label is used for marking that the fine-grained category information of the first application and the second application is the same.
In some embodiments of the invention, the similarity detection model is an FM model, a DNN model, or a DeepFM model.
In some embodiments of the present invention, the training apparatus for the similarity detection model further includes a word embedding module 840, configured to perform word embedding processing on the attribute information of the first application program and the second application program, where the feature extraction module 810 is further configured to perform feature extraction on the word-embedded attribute information of the first application program and the second application program.
In some embodiments of the present invention, the attribute information includes title information, description information, and privacy policy information of the application programs, and the training apparatus further includes a pre-training module 850, configured to pre-train, through a long short-term memory network, the word-embedded description information and privacy policy information of the first application program and the second application program, where the feature extraction module 810 is further configured to perform feature extraction on the word-embedded title information and on the word-embedded and pre-trained description information and privacy policy information of the first application program and the second application program.
In some embodiments of the invention, the long short term memory network is a unidirectional long short term memory network, a bidirectional long short term memory network, a unidirectional long short term memory network based on attention mechanism, or a bidirectional long short term memory network based on attention mechanism.
Fig. 9 is a block diagram illustrating an apparatus for applying a similarity detection model according to an embodiment of the present invention. As shown in fig. 9, the application apparatus 900 includes:
an input module 910, configured to input attribute information of a first application program and a second application program to be detected into a similarity detection model, where the similarity detection model is obtained by training according to any one of the above methods;
the detecting module 920 is configured to perform similarity detection on the first application and the second application by using a similarity detection model.
According to the technical scheme provided by the embodiment of the invention, by inputting the attribute information of the first application program and the second application program to be detected into the similarity detection model, and performing similarity detection on the two programs with the model, the detection efficiency and accuracy of application-program similarity can be improved.
The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
Fig. 10 is a block diagram of an electronic device according to an embodiment of the invention.
Referring to fig. 10, the electronic device 1000 includes a processing component 1010, which further includes one or more processors, and memory resources, represented by a memory 1020, for storing instructions (e.g., application programs) executable by the processing component 1010. The application programs stored in the memory 1020 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1010 is configured to execute the instructions to perform the above-described training and application methods of the similarity detection model.
The electronic device 1000 may also include a power supply component configured to perform power management of the electronic device 1000, a wired or wireless network interface configured to connect the electronic device 1000 to a network, and an input/output (I/O) interface. The electronic device 1000 may operate based on an operating system stored in the memory 1020, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
There is also provided a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of the electronic device 1000, the electronic device 1000 is enabled to perform the training and application methods of the similarity detection model.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, or the part thereof that essentially contributes to the prior art, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that the combination of the features in the present application is not limited to the combination described in the claims or the combination described in the embodiments, and all the features described in the present application may be freely combined or combined in any manner unless contradictory to each other.
It should be noted that the above-mentioned embodiments are only specific examples of the present invention, and obviously, the present invention is not limited to the above-mentioned embodiments, and many similar variations exist. All modifications which would occur to one skilled in the art and which are, therefore, directly derived or suggested from the disclosure herein are deemed to be within the scope of the present invention.
It should be understood that the terms such as first, second, etc. used in the embodiments of the present invention are only used for clearly describing the technical solutions of the embodiments of the present invention, and are not used to limit the protection scope of the present invention.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A training method of a similarity detection model is characterized by comprising the following steps:
extracting the characteristics of the attribute information of the first application program and the second application program to obtain attribute characteristics;
according to the attribute characteristics, obtaining the similarity between the first application program and the second application program;
and training the similarity detection model according to the difference between the similarity and a similarity label, wherein the similarity label is used for marking the similarity between the first application program and the second application program.
2. The method of claim 1, wherein the similarity label is established based on coarse-grained category information and/or fine-grained category information of the first application program and the second application program.
3. The method of claim 2, wherein the similarity label comprises a first similarity label, a second similarity label, and/or a third similarity label,
wherein the first similarity label is used for marking that the coarse-grained category information of the first application program and the second application program is different;
the second similarity label is used for marking that the coarse-grained category information of the first application program and the second application program is the same while their fine-grained category information is different;
and the third similarity label is used for marking that the fine-grained category information of the first application program and the second application program is the same.
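For example (a minimal sketch; the three numeric label values are an assumption, as the claims do not fix any particular encoding), the labeling rule of claims 2 and 3 could be written as:

def similarity_label(coarse_a: str, fine_a: str, coarse_b: str, fine_b: str) -> int:
    # Hypothetical three-level similarity label from coarse-/fine-grained category information.
    if coarse_a != coarse_b:
        return 0  # first similarity label: coarse-grained categories differ
    if fine_a != fine_b:
        return 1  # second similarity label: same coarse-grained, different fine-grained category
    return 2      # third similarity label: fine-grained categories are the same

# e.g. two application programs in the same "Games" coarse-grained category
# but in different fine-grained categories receive the second label:
assert similarity_label("Games", "Puzzle", "Games", "Racing") == 1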
4. The method of claim 1, wherein the similarity detection model is a factorization machine (FM) model, a deep neural network (DNN) model, or a deep factorization machine (DeepFM) model.
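As a hedged illustration of the factorization machine option in claim 4: the standard FM second-order interaction term admits the well-known O(kn) reformulation 0.5 · Σ_f [(Σ_i v_if x_i)² − Σ_i v_if² x_i²], sketched below with placeholder dimensions; nothing here is specific to the patented model.

import torch

def fm_second_order(x: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # FM pairwise-interaction term via the square-of-sum minus sum-of-squares identity.
    xv = x @ v                  # (batch, k): sum_i v_if * x_i
    x2v2 = (x ** 2) @ (v ** 2)  # (batch, k): sum_i v_if^2 * x_i^2
    return 0.5 * (xv ** 2 - x2v2).sum(dim=1, keepdim=True)

x = torch.randn(8, 128)   # e.g. the concatenated attribute features of a pair
v = torch.randn(128, 16)  # latent factor matrix, k = 16 factors (an assumption)
print(fm_second_order(x, v).shape)  # torch.Size([8, 1])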
5. The method of any one of claims 1 to 4, further comprising:
performing word embedding processing on the attribute information of the first application program and the second application program,
wherein the performing feature extraction on the attribute information of the first application program and the second application program comprises:
performing feature extraction on the attribute information of the first application program and the second application program after the word embedding processing.
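A minimal sketch of the word embedding processing of claim 5; the vocabulary size, the embedding dimension, and the tokenization that produces the token ids are all assumptions:

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=30000, embedding_dim=100)  # hypothetical 30k-token vocabulary

# Token ids for one piece of attribute information (e.g. a title), from some tokenizer:
token_ids = torch.tensor([[12, 845, 3, 991]])
embedded = embedding(token_ids)  # (1, 4, 100): the word-embedded input for feature extraction
print(embedded.shape)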
6. The method of claim 5, wherein the attribute information comprises title information, description information, and privacy policy information of the application programs, the method further comprising:
pre-training, through a long short-term memory network, the word-embedded description information and privacy policy information of the first application program and the second application program,
wherein the performing feature extraction on the attribute information of the first application program and the second application program after the word embedding processing comprises:
performing feature extraction on the word-embedded title information of the first application program and the second application program, and on the description information and privacy policy information after the word embedding processing and the pre-training.
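One way to read the pre-training step of claim 6 is to pass the word-embedded description and privacy policy text through a long short-term memory encoder before feature extraction; the sketch below shows only such a forward pass, with every size assumed:

import torch
import torch.nn as nn

embed = nn.Embedding(30000, 100)
lstm = nn.LSTM(input_size=100, hidden_size=64, batch_first=True)

description_ids = torch.randint(0, 30000, (1, 50))  # 50 tokens of description (or privacy policy) text
out, (h_n, _) = lstm(embed(description_ids))        # run the embedded text through the LSTM
description_feature = h_n[-1]                       # (1, 64): representation handed on to feature extraction
print(description_feature.shape)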
7. The method of claim 6, wherein the long short-term memory network is a unidirectional long short-term memory network, a bidirectional long short-term memory network, an attention-based unidirectional long short-term memory network, or an attention-based bidirectional long short-term memory network.
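For the attention-based bidirectional variant of claim 7, one common generic formulation weights the BiLSTM hidden states with a learned attention vector; the following is a sketch of that generic pattern, not of the patented architecture itself:

import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    # Generic attention-weighted bidirectional LSTM encoder (all sizes assumed).
    def __init__(self, emb_dim: int = 100, hidden: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, embedded: torch.Tensor) -> torch.Tensor:
        states, _ = self.bilstm(embedded)                  # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(states), dim=1)  # attention weights over the time steps
        return (weights * states).sum(dim=1)               # attention-pooled sequence representation

encoder = AttnBiLSTM()
print(encoder(torch.randn(2, 50, 100)).shape)  # torch.Size([2, 128])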
8. An application method of a similarity detection model, characterized by comprising the following steps:
inputting attribute information of a first application program and a second application program to be detected into a similarity detection model, wherein the similarity detection model is trained by the method of any one of claims 1 to 7;
and performing similarity detection on the first application program and the second application program by using the similarity detection model.
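Continuing the hypothetical SimilarityModel from the sketch after claim 1 (and assuming it has been trained), the application method of claim 8 amounts to a forward pass in evaluation mode:

model.eval()  # the trained similarity detection model from the earlier sketch
with torch.no_grad():
    # attribute information of the first and second application programs to be detected:
    similarity = model(torch.randn(1, 128), torch.randn(1, 128))
print(float(similarity))  # the detected similarity between the two application programs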
9. A training device for a similarity detection model, characterized by comprising:
a feature extraction module, configured to perform feature extraction on attribute information of a first application program and a second application program to obtain attribute features;
a similarity module, configured to obtain a similarity between the first application program and the second application program according to the attribute features;
and a training module, configured to train the similarity detection model according to a difference between the similarity and a similarity label, wherein the similarity label is used for marking the similarity between the first application program and the second application program.
10. An application device of a similarity detection model, characterized by comprising:
an input module, configured to input attribute information of a first application program and a second application program to be detected into a similarity detection model, wherein the similarity detection model is trained by the method of any one of claims 1 to 7;
and a detection module, configured to perform similarity detection on the first application program and the second application program by using the similarity detection model.
11. A computer-readable storage medium, characterized in that the storage medium stores a computer program for performing the method of any one of claims 1 to 7.
12. An electronic device, characterized by comprising:
a processor; and
a memory for storing instructions executable by the processor,
wherein the processor is configured to perform the method of any one of claims 1 to 7.
CN202010723891.9A 2020-07-24 2020-07-24 Training method and device, application method and device of similarity detection model Active CN111860662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010723891.9A CN111860662B (en) 2020-07-24 2020-07-24 Training method and device, application method and device of similarity detection model


Publications (2)

Publication Number Publication Date
CN111860662A true CN111860662A (en) 2020-10-30
CN111860662B CN111860662B (en) 2023-03-24

Family

ID=72950573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010723891.9A Active CN111860662B (en) 2020-07-24 2020-07-24 Training method and device, application method and device of similarity detection model

Country Status (1)

Country Link
CN (1) CN111860662B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107015961A (en) * 2016-01-27 2017-08-04 中文在线数字出版集团股份有限公司 Text similarity comparison method
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 Text duplicate checking method and system based on similarity
CN109446905A (en) * 2018-09-26 2019-03-08 深圳壹账通智能科技有限公司 Electronic signature verification method and device, computer equipment and storage medium
CN110211674A (en) * 2019-04-23 2019-09-06 平安科技(深圳)有限公司 Bone age test method and relevant device based on machine learning model
CN110705299A (en) * 2019-09-26 2020-01-17 北京明略软件系统有限公司 Entity and relation joint extraction method, model, electronic equipment and storage medium
CN111143203A (en) * 2019-12-13 2020-05-12 支付宝(杭州)信息技术有限公司 Machine learning method, privacy code determination method, device and electronic equipment
CN111353033A (en) * 2020-02-27 2020-06-30 支付宝(杭州)信息技术有限公司 Method and system for training text similarity model


Also Published As

Publication number Publication date
CN111860662B (en) 2023-03-24

Similar Documents

Publication Title
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
CN110633577B (en) Text desensitization method and device
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
CN110188195B (en) Text intention recognition method, device and equipment based on deep learning
CN111475622A (en) Text classification method, device, terminal and storage medium
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN114547315A (en) Case classification prediction method and device, computer equipment and storage medium
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN115718792A (en) Sensitive information extraction method based on natural semantic processing and deep learning
CN112560504A (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN117668292A (en) Cross-modal sensitive information identification method
CN113918936A (en) SQL injection attack detection method and device
CN116521899A (en) Improved graph neural network-based document-level relation extraction algorithm and system
CN116257601A (en) Illegal word stock construction method and system based on deep learning
CN111860662B (en) Training method and device, application method and device of similarity detection model
CN113259369B (en) Data set authentication method and system based on machine learning member inference attack
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN112132269B (en) Model processing method, device, equipment and storage medium
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN113836297A (en) Training method and device for text emotion analysis model
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN110674497B (en) Malicious program similarity calculation method and device
CN113537372B (en) Address recognition method, device, equipment and storage medium
CN117354067B (en) Malicious code detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant