CN111626318A - Multi-language harmful information feature intelligent mining method based on deep learning - Google Patents

Multi-language harmful information feature intelligent mining method based on deep learning

Info

Publication number
CN111626318A
CN111626318A
Authority
CN
China
Prior art keywords
language
word
words
category
harmful
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911063979.6A
Other languages
Chinese (zh)
Inventor
赵全军
吴敬征
段旭
陈宏江
伊克拉木·伊力哈木
刘立力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sinosoft Co ltd
Original Assignee
Sinosoft Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sinosoft Co ltd filed Critical Sinosoft Co ltd
Priority to CN201911063979.6A priority Critical patent/CN111626318A/en
Publication of CN111626318A publication Critical patent/CN111626318A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2111 Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Physiology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep-learning-based method for intelligently mining the features of multilingual harmful information. Harmful-information and harmless-information texts of each category in each language are labeled; candidate words are selected from the words of each category of each language with an RNSW method and assigned unique one-hot codes; the sample data are fed into a CNN neural network model for training, and each word's score for the harmful category of that language is obtained and used as its weight; a genetic algorithm then screens the harmful-information features selected by machine learning to form the final harmful-information features and weights. The invention provides a language-independent RNSW method for reducing the dimensionality of the text, which effectively reduces the number of model-training parameters, accelerates training, and improves recognition accuracy; furthermore, intelligent mining of harmful-information features is achieved with a deep-learning method, and the features are screened by a genetic algorithm, making the identification of harmful information more interpretable.

Description

Multi-language harmful information feature intelligent mining method based on deep learning
Technical field:
The invention relates to text-analysis technology in the Internet field, and in particular to a method for recognizing harmful text: a deep-learning-based method for intelligently mining the features of multilingual harmful information.
Background art:
Two methods are commonly used to identify harmful information: one based on keyword and rule matching, the other based on machine learning. The keyword-and-rule approach requires a manually curated lexicon of harmful words; the rules must sometimes be quite complex to achieve a good effect, new harmful words appear on the network endlessly with short update-iteration cycles, and maintaining the lexicon and designing new rules consumes considerable cost. The machine-learning approach, gradually adopted in recent years, has the advantage that technicians need neither deep domain knowledge of harmful information nor a large manually built lexicon of harmful words; instead, harmful words in network text are extracted automatically by optimizing a machine-learning algorithm, which improves the accuracy of harmful-information identification.
In the patent "A web-page harmful-information identification method based on machine learning" (application number 201811302974.X), Zhang Jialiang et al. proposed classifying crawled web pages through machine learning, model training, and text-classification techniques; from the category assigned to a page, the method decides whether the page, and by extension the website, contains harmful information. The model is trained on collected corpus data and can only identify text in the corresponding language; moreover, the method cannot extract harmful-information features, so its classification results are poorly interpretable.
The patent "A harmful-information identification and web-page classification method based on multi-instance learning" (application number CN201410609728) by Kuweiming et al. proposes a web-page classification method based on multi-instance learning, which treats the images and associated texts contained in a page as the instances of the page's bag, so that the algorithm better matches the actual distribution of page content; it deeply mines the complementarity of image and text information and ultimately performs better than classification using single-modality information alone. However, the method needs the images in a page as auxiliary identification information, while most web texts contain no pictures and some pictures may be unrelated to the text, so the method is inconvenient on large-scale text data.
To address these problems, this patent proposes a deep-learning-based method for intelligently mining the features of multilingual harmful information. The samples of a labeled training data set are tokenized; a candidate word list for each category's samples is built with the RNSW (Remove Negative Sample Words) method for language-independent dimensionality reduction of text features; and each word is paired with a unique one-hot encoding, forming word-code data pairs. These pairs map the training data set into a vector space; a CNN model is then trained, the candidate words are fed into the best trained model to obtain each word's weight, the initial range of harmful-information features is determined from those weights, and a genetic algorithm selects the final number of harmful-information feature words. The method can mine harmful-text features for every category of every language. Moreover, because the RNSW dimension-reduction method is used and candidate words are restricted to the training samples of that category in that language, the number of word vectors, and hence of model parameters, is greatly reduced and training is fast; the adaptivity of the genetic algorithm automatically screens out the optimal number of feature words representing harmful information, so feature selection is accurate and the method is suited to processing large-scale text data.
Summary of the invention:
The invention aims to provide a deep-learning-based method for intelligently mining the features of multilingual harmful information: a general method for mining harmful-information features, independent of any specific language, that can mine the features of harmful information in different languages and use them to identify harmful text data.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A multilingual harmful-information-feature intelligent mining method based on deep learning, as shown in FIG. 1, comprises the following steps:
1) Collect harmful-information texts and harmless-information texts of each category in each language, and establish a data-labeling set <S>. Label positive and negative sample data for the harmful-information text data of each category of each language: the positive samples are the harmful-information texts of that category in that language, their number being N_positive; the negative samples are the harmless-information texts of that category in that language, their number being N_negative.
2) Tokenize the text of the positive and negative samples of each category of each language in the data-labeling set <S> of step 1).
3) Select n candidate words from the words of each category of each language in step 2) with the RNSW method, and establish the category's word-ID pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn>, where n is the number of word pairs in the category's set, Wx is a word (or a token after word segmentation) of the language, and IDx is the word's ID, represented by a one-hot encoding of an integer value unique within the set.
4) Convert each data sample of each category of each language, according to the category's word-pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> of step 3), into a data vector of the corresponding IDs, X: {Vec1, Vec2, ..., Vecm}.
5) Take the word count M_max of the largest sample in each category of each language of step 4) as the word count of that category; for data vectors X: {Vec1, Vec2, ..., Vecm} whose dimension is less than M_max, fill with 0 at the front end. Set the class vector of the corresponding data according to whether each sample is positive or negative, y: {y1, y2, ..., ym}: for a positive sample of the class, yx is [1, 0]; for a negative sample, yx is [0, 1].
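The ID mapping, front-padding, and class vectors of steps 4) and 5) can be sketched in a few lines of Python; the word-pair set and the samples below are hypothetical stand-ins for the RNSW output of step 3):

```python
# Sketch of steps 4)-5): map tokens to IDs, front-pad to M_max, build class
# vectors. The word-pair set and the two samples are hypothetical.
word_to_id = {"w_a": 1, "w_b": 2, "w_c": 3}   # <Wx, IDx> pairs; 0 reserved for padding

def vectorize(tokens):
    """Convert a token list into the data vector of corresponding IDs
    (words outside the candidate set are dropped)."""
    return [word_to_id[t] for t in tokens if t in word_to_id]

samples = [["w_a", "w_b", "w_c"], ["w_b"]]    # one positive, one negative sample
vectors = [vectorize(s) for s in samples]

m_max = max(len(v) for v in vectors)          # word count of the largest sample
padded = [[0] * (m_max - len(v)) + v for v in vectors]   # fill 0 at the front end

labels = [[1, 0], [0, 1]]                     # positive -> [1, 0], negative -> [0, 1]
print(padded)                                 # [[1, 2, 3], [0, 0, 2]]
```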
6) Divide the data vectors X and y of each category of each language in step 5) into training sets train_x, train_y and test sets dev_x, dev_y in a certain proportion.
7) Feed train_x and train_y of step 6) in batches of a certain batch size into the CNN neural network model shown in FIG. 4, train with the Adam optimizer and a cross-entropy loss function, and finally apply softmax for normalized classification to obtain the final classification result.
8) Input each word Wx of the word-pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> into the final optimized model of step 7) to obtain each word's score Mx for the harmful category of that language. Take Mx as the word's weight and sort the words by weight in descending order to obtain the set of p words {W1, W2, ..., Wp}; this word set is the harmful-information feature of that category of that language selected by machine learning.
9) Use a genetic algorithm to perform feature selection on the machine-learned harmful-information features {W1, W2, ..., Wp} of step 8), selecting an optimal number of harmful-information feature words to form the final harmful-information features {W1, W2, ..., Wq} and weights {M1, M2, ..., Mq}.
10) Use the harmful-information features {W1, W2, ..., Wq} of step 9) and the corresponding weights {M1, M2, ..., Mq} to judge whether a text is harmful information.
Step 1) above describes the process of collecting and labeling harmful and harmless information and of establishing the data-labeling set <S>, which is the basis for harmful-information feature mining. The languages of the harmful information include but are not limited to: Chinese, English, Uyghur, Korean, Japanese, Arabic, German, French.
Step 2) above describes the process of word-segmenting or tokenizing the harmful-information and harmless-information texts of each language in the data-labeling set <S>. Different languages require different processing, as shown in FIG. 2, which proceeds as follows:
2a) Judge the language of the text: if it is Chinese, Korean, Japanese, or a similar language, go to step 2b); if it is a Latin-script language such as English or French, go to step 2c); if it is Uyghur, Arabic, or a similar language, go to step 2d).
2b) Word-segment Chinese, Korean, Japanese, and similar languages, i.e., cut the character sequence into a word sequence, then remove stop words and punctuation marks;
2c) Tokenize Latin-script languages such as English and French: decompose the words contained in a sentence according to the language's rules, splitting mainly on spaces and punctuation, and convert all capital letters to lowercase;
2d) Tokenize Uyghur, Arabic, and similar languages: decompose the words contained in a sentence according to the language's rules, splitting mainly on spaces and on the punctuation marks of Uyghur, Arabic, etc., and segment connected words. Characters of the language that have different spelling forms are converted into a single form; for example, Latin-script Uyghur, Cyrillic-script Uyghur, New-Script Uyghur, and non-standard Latin spellings are converted into the current Uyghur spelling form.
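The language routing of steps 2a)-2d) can be sketched as below. This is only an illustration of the branching under simplifying assumptions: the regex rules stand in for real tokenizers, and production Chinese/Korean/Japanese segmentation would use a dedicated segmenter (the embodiment later names Jieba for Chinese):

```python
import re

def tokenize(text, lang):
    """Illustrative sketch of steps 2a)-2d): route by language family,
    then tokenize. The regex rules are stand-ins, not production
    segmentation; stop-word removal is omitted."""
    if lang in ("zh", "ko", "ja"):
        # 2b) segment the character sequence (naively: one token per
        # word character)
        return [ch for ch in text if re.match(r"\w", ch)]
    if lang in ("en", "fr", "de"):
        # 2c) Latin-script languages: split on spaces/punctuation, lowercase
        return re.findall(r"[^\W\d_]+", text.lower())
    # 2d) Uyghur, Arabic, etc.: split on spaces and punctuation marks
    return re.findall(r"\w+", text)

print(tokenize("Hello, World!", "en"))   # ['hello', 'world']
print(tokenize("你好世界", "zh"))         # ['你', '好', '世', '界']
```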
Step 3) above describes the language-independent dimension-reduced representation of text: the RNSW method selects candidate words from the texts of each category of each language, builds a word-pair set, and uses it to give the original text a reduced-dimension feature representation. The word-pair set is built only within that category of that language, and the number of words in the set is determined by the concrete needs of feature dimension reduction, including but not limited to 10000, 100000, and 1000000. The specific candidate-word selection process, shown in FIG. 3, proceeds as follows:
3a) From the negative samples of that category of that language, build a dictionary of the negative-sample words, dict_negative = {W1:V1, ..., Wn:Vn}, where Wi is a word, Vi is the number of times the word appears in the negative samples, and n is the total number of distinct words in the negative samples after step 2).
3b) Sort by the values of Vi in descending order and take the k largest words, obtaining dict_negative_max = {W1:V1, ..., Wk:Vk}, where the value of k includes but is not limited to 100, 1000, and 10000. Taking W1 through Wk yields the set:
set(W_negative_max) = {W1, W2, ..., Wk} (1)
3c) From the positive samples of that category of that language, build a dictionary of the positive-sample words, dict_positive = {W1:V1, ..., Wm:Vm}, where Wx is a word, Vx is the number of times the word appears in the positive samples, and m is the total number of distinct words in the positive samples after step 2). Taking W1 through Wm yields the set:
set(W_positive) = {W1, W2, ..., Wm} (2)
3d) Take the set difference of formula (2) and formula (1) to obtain the candidate word set set(W_candidates), where s is the number of words in the candidate set:
set(W_candidates) = set(W_positive) − set(W_negative_max) = {W1, W2, ..., Ws} (3)
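Assuming the samples are already tokenized by step 2), the RNSW selection of steps 3a)-3d) can be sketched directly from formulas (1)-(3); the sample tokens are hypothetical:

```python
from collections import Counter

def rnsw_candidates(pos_samples, neg_samples, k):
    """RNSW sketch (steps 3a)-3d)): remove the k most frequent
    negative-sample words from the positive-sample vocabulary.
    Each sample is a list of tokens from step 2)."""
    neg_counts = Counter(w for s in neg_samples for w in s)   # dict_negative
    neg_max = {w for w, _ in neg_counts.most_common(k)}       # set(W_negative_max), formula (1)
    pos_vocab = {w for s in pos_samples for w in s}           # set(W_positive), formula (2)
    return pos_vocab - neg_max                                # set(W_candidates), formula (3)

# Toy, hypothetical tokens:
pos = [["attack", "the", "plan"], ["attack", "weapon"]]
neg = [["the", "weather", "the", "weather"], ["plan", "the"]]
print(sorted(rnsw_candidates(pos, neg, k=2)))   # ['attack', 'plan', 'weapon']
```

Note that frequent neutral words ("the") are removed even though they also occur in positive samples, which is the dimension-reduction effect the patent attributes to RNSW.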
Step 4) above describes the process of vectorizing the sample data of each category of each language: the positive and negative samples of that category of that language are vectorized using only the word-pair set of that category of that language.
Step 5) above describes the vector-dimension expansion of the data vectors X of each category of each language: the dimension is expanded only according to the word count M_max of the largest sample of that category of that language, taken as the word count of the category.
Step 6) above describes the division into training and test sets; the split ratio is set according to the effect in the concrete implementation, and can for example be set to, but is not limited to, 10:1, with 10 parts for the training set and 1 part for the test set.
Step 7) above describes the model-training process, where the batch size can be, but is not limited to, 16, 32, or 64; the CNN neural network model shown in FIG. 4 is that of a specific embodiment, and in a concrete implementation its input parameters are adjusted according to the word count M_max of each category of each language. Step 8) above describes obtaining the harmful-information features of that category of that language from the trained model; these features are used only to identify harmful information of that category of that language, and the value of p is set manually according to the number of words Wx that fit the category.
Step 9) above describes screening the harmful-information features with the genetic algorithm; the features selected by the genetic algorithm are used only for feature selection of harmful information of that category of that language. Feature selection by the genetic algorithm proceeds as follows:
9a) Take the population of the p words {W1, W2, ..., Wp} generated in step 8) as the total population, and generate t individuals for the initial population by a random method.
9b) Using the data-labeling set <S> of step 1) or another, larger external data corpus, compute word vectors for the set {W1, W2, ..., Wp} with the word2vec algorithm, and build each word's similar-word set according to a set similarity threshold.
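The similar-word sets of step 9b) can be sketched as below. Plain cosine similarity over tiny hand-made vectors stands in here for real word2vec embeddings, and the threshold value is illustrative:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similar_word_sets(vectors, threshold):
    """For each word, the set of other words whose cosine similarity to it
    reaches the threshold (step 9b); `vectors` maps word -> embedding."""
    return {w: {o for o in vectors
                if o != w and cosine(vectors[w], vectors[o]) >= threshold}
            for w in vectors}

# Tiny hand-made 2-d "embeddings" standing in for word2vec output:
vecs = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0]}
sim_sets = similar_word_sets(vecs, threshold=0.95)
print(sim_sets)   # w1 and w2 are mutually similar; w3 has no similar word
```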
9c) Compute the fitness of each feature word of the population using the samples of the data-labeling set <S> of step 1); the fitness is computed with the following formula:
f(x) = α × (N_identified_positive − N_misidentified_negative) / (N_positive + N_negative) − β × t
where N_identified_positive is the number of samples in the positive-sample set whose word set contains the feature word or one of its similar words, N_misidentified_negative is the number of samples in the negative-sample set whose word set contains the feature word or one of its similar words, α is a parameter controlling the contribution of improved classification accuracy to the evaluation function, β is a parameter controlling the contribution of a reduced feature count to the evaluation function, t is the number of individuals of step 9a), N_positive is the number of harmful-information text samples of step 1), and N_negative is the number of harmless-information text samples of step 1).
9d) Compute selection probabilities proportional to fitness, determine the male and female parents, and generate each offspring sample by crossover and mutation of the parents.
9e) Compute the difference of the fitness functions of new and old individuals, accept new individuals probabilistically according to the Boltzmann rule, and determine the new population.
9f) Algorithm termination condition: terminate when the limited maximum number of generations is reached, or when the fitness of individuals changes by less than a set threshold over successive generations. While the termination condition is not met, repeat the iteration of steps 9c) to 9e).
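The fitness formula of step 9c), together with the matching rule behind its N terms, can be sketched as follows; the data, parameter values, and variable names are all illustrative:

```python
def fitness(feature_words, similar, pos_samples, neg_samples, alpha, beta, t):
    """Fitness of step 9c): f(x) = alpha * (N_identified_positive −
    N_misidentified_negative) / (N_positive + N_negative) − beta * t.
    A sample counts as matched if its word set contains a feature word
    or one of that word's similar words."""
    def matched(sample):
        tokens = set(sample)
        return any(w in tokens or tokens & similar.get(w, set())
                   for w in feature_words)

    n_identified_pos = sum(matched(s) for s in pos_samples)   # N_identified_positive
    n_misident_neg = sum(matched(s) for s in neg_samples)     # N_misidentified_negative
    n_total = len(pos_samples) + len(neg_samples)             # N_positive + N_negative
    return alpha * (n_identified_pos - n_misident_neg) / n_total - beta * t

# Toy, hypothetical data:
pos = [["bomb", "plan"], ["attack"]]
neg = [["weather", "plan"]]
sim = {"bomb": {"explosive"}}
print(fitness({"bomb", "attack"}, sim, pos, neg, alpha=1.0, beta=0.01, t=5))
```

Here both positive samples match and no negative sample does, so the first term is 2/3, from which the size penalty β·t = 0.05 is subtracted.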
Step 10) above describes the process of identifying harmful information with the harmful-information features; the features and the corresponding weights are used only for identifying harmful information of that category of that language.
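As an illustration of step 10), one simple judgment rule is to sum the weights of the matched feature words and compare against a decision threshold. The patent states only that the features and weights are used for the judgment, so the thresholding rule below is an assumption, and the features and weights are hypothetical:

```python
def is_harmful(tokens, features, weights, threshold):
    """Sketch of step 10): sum the weights of the harmful-information
    feature words the text contains and compare with a decision threshold.
    The thresholding rule is an assumption made for illustration."""
    score = sum(weights[w] for w in set(tokens) if w in features)
    return score >= threshold

feats = {"w1", "w2"}              # final features {W1, ..., Wq} (hypothetical)
wts = {"w1": 0.8, "w2": 0.5}      # corresponding weights {M1, ..., Mq}
print(is_harmful(["w1", "w2", "x"], feats, wts, threshold=1.0))   # True
print(is_harmful(["w2", "x"], feats, wts, threshold=1.0))         # False
```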
The invention provides a language-independent RNSW method for reducing the dimensionality of text, which effectively reduces the number of model-training parameters, accelerates training, and improves recognition accuracy; furthermore, intelligent mining of harmful-information features is achieved with a deep-learning method, and the features are screened with a genetic algorithm, making harmful-information category identification more accurate and its results more interpretable.
Drawings
FIG. 1 is a flow diagram of the deep-learning-based intelligent mining method for harmful-information features
FIG. 2 is a schematic diagram of the per-language word-segmentation or tokenization process
FIG. 3 is a schematic diagram of candidate-word selection with the RNSW method
FIG. 4 is a schematic diagram of the CNN network structure used in the invention
Specific embodiments:
The invention is now further illustrated by way of example with reference to the accompanying drawings, without limiting its scope in any way.
The general flow of the deep-learning-based intelligent mining method for multilingual harmful-information features is shown in FIG. 1. Taking intelligent mining of the features of Chinese violent-terrorist harmful information as an example, the method comprises the following steps:
1) Collect violent-terrorist harmful-information texts and harmless-information texts in various languages, including Chinese, and establish a data-labeling set <S>. Label positive and negative sample data of the harmful-information text data: the positive samples are the harmful-information texts of that category in that language, their number being N_positive; the negative samples are the harmless-information texts of that category in that language, their number being N_negative.
2) Following the processing flow of FIG. 2, when the language is judged to be Chinese, execute the Chinese word-segmentation flow and segment the harmful-information and harmless-information texts in the data-labeling set <S> using, but not limited to, the Jieba word-segmentation tool.
3) Following the processing flow of FIG. 3, build a dictionary of the negative-sample words from the negative samples of the Chinese violent-terrorist category, dict_negative = {W1:V1, ..., Wn:Vn}, where Wi is a word, Vi is the number of times the word appears in the negative samples, and n is the total number of distinct words in the negative samples after step 2).
4) Sort by the values of Vi in descending order and take the k largest words, obtaining dict_negative_max = {W1:V1, ..., Wk:Vk}, where the value of k includes but is not limited to 100, 1000, and 10000. Taking W1 through Wk yields the set:
set(W_negative_max) = {W1, W2, ..., Wk} (1)
5) Build a dictionary of the positive-sample words from the positive samples of the category, dict_positive = {W1:V1, ..., Wm:Vm}, where Wx is a word, Vx is the number of times the word appears in the positive samples, and m is the total number of distinct words in the positive samples after step 2). Taking W1 through Wm yields the set:
set(W_positive) = {W1, W2, ..., Wm} (2)
6) Take the set difference of formula (2) and formula (1) to obtain the candidate word set set(W_candidates), where s is the number of words in the candidate set:
set(W_candidates) = set(W_positive) − set(W_negative_max) = {W1, W2, ..., Ws} (3)
7) From the candidate word set set(W_candidates) generated in step 6), establish the word-ID pair set of the Chinese violent-terrorist category, <W1,ID1>, <W2,ID2>, ..., <Wn,IDn>, where n is the number of word pairs in the category's set, Wx is a candidate word, and IDx is the word's ID, represented by a one-hot encoding of an integer value unique within the set.
8) Convert each data sample of the Chinese violent-terrorist category, according to the word pairs <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> of step 7), into a data vector of the corresponding IDs, X: {Vec1, Vec2, ..., Vecm}.
9) Take the word count M_max of the largest sample of step 8) as the word count of the Chinese violent-terrorist category; for data vectors X: {Vec1, Vec2, ..., Vecm} whose dimension is less than M_max, fill with 0 at the front end. Set the class vector of the corresponding data according to whether each sample is positive or negative, y: {y1, y2, ..., ym}: for a positive sample of the Chinese violent-terrorist category, yx is [1, 0]; for a negative sample, yx is [0, 1].
10) Divide the data vectors X and y of step 9) into training sets train_x, train_y and test sets dev_x, dev_y in a ratio of 10:1.
11) Feed train_x and train_y of step 10) in batches of batch size 32 into the CNN neural network model shown in FIG. 4, train with the Adam optimizer and a cross-entropy loss function, and finally apply softmax for normalized classification to obtain the final classification result.
12) Input each word Wx of the word-pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> into the final optimized model of step 11) to obtain each word's score Mx for the Chinese violent-terrorist category. Take Mx as the word's weight and sort the words by weight in descending order to obtain the set of p words {W1, W2, ..., Wp}; this word set is the harmful-information feature of the Chinese violent-terrorist category selected by machine learning.
13) Take the population of the p words {W1, W2, ..., Wp} generated in step 12) as the total population, and generate t individuals for the initial population by a random method.
14) Using the data-labeling set <S> of step 1) or another, larger external data corpus, compute word vectors for the set {W1, W2, ..., Wp} with the word2vec algorithm, and build each word's similar-word set according to a set similarity threshold.
15) Compute the fitness of each feature word of the population using the samples of the data-labeling set <S> of step 1); the fitness is computed with the following formula:
f(x) = α × (N_identified_positive − N_misidentified_negative) / (N_positive + N_negative) − β × t
where N_identified_positive is the number of samples in the positive-sample set whose word set contains the feature word or one of its similar words, N_misidentified_negative is the number of samples in the negative-sample set whose word set contains the feature word or one of its similar words, α is a parameter controlling the contribution of improved classification accuracy to the evaluation function, and β is a parameter controlling the contribution of a reduced feature count to the evaluation function.
16) Compute selection probabilities proportional to fitness, determine the male and female parents, and generate each offspring sample by crossover and mutation of the parents.
17) Compute the difference of the fitness functions of new and old individuals, accept new individuals probabilistically according to the Boltzmann rule, and determine the new population.
18) Algorithm termination condition: terminate when the limited maximum number of generations is reached, or when the fitness of individuals changes by less than a set threshold over successive generations. While the termination condition is not met, repeat the iteration of steps 15) to 17).
19) After the algorithm ends, the feature words in the last generation's population are the selected optimal harmful-information feature words, forming the final harmful-information features {W1, W2, ..., Wq} and weights {M1, M2, ..., Mq}.
20) Use the harmful-information features {W1, W2, ..., Wq} of step 19) and the corresponding weights {M1, M2, ..., Mq} to judge whether a text is harmful information of the Chinese violent-terrorist category.

Claims (11)

1. A multilingual harmful-information-feature intelligent mining method based on deep learning, comprising the following steps:
1) collecting various language harmful information texts and noneHarmful information text, establishing data label set<S>Labeling positive and negative sample data of harmful information text data of each language and each category, wherein the positive sample is the harmful information text of the language of the category, and the number of the samples is NPositive sampleThe negative sample is the harmless information text of the language of the category, and the number of samples is NNegative sample
2) Tokenizing each language harmful information text and harmless information text in the data tagging set < S > in the step 1), and then removing stop words and punctuation marks.
3) Selecting n candidate words from the words of each category of each language in the step 2) by using an RNSW (remove Negative Sample words) method, and establishing a word pair set of the words of the category, namely the ID (identity)<W1,ID1>,<W2,ID2>,......,<Wn,IDn>N is the number of word pairs in the set of word pairs of the category, WxRepresenting words of various languages or words after word segmentation, IDxThe ID representing the word is represented by a unique One-Hot Encoding of an integer value in the set.
4) converting each sample datum of each category of each language in step 3) into a data vector X: {Vec1, Vec2, ..., Vecm} of the corresponding IDs, according to the category's word-pair set <W1, ID1>, <W2, ID2>, ..., <Wn, IDn>;
5) taking the word count M_max of the largest sample in each category of each language of step 4) as the word count of that category; vectors in the data vector X: {Vec1, Vec2, ..., Vecm} whose dimension is less than M_max are padded with 0 at the front end; setting a class vector y: {y1, y2, ..., ym} for the corresponding data according to whether each sample is a positive or negative sample, where yx is [1, 0] for a positive sample of the category and [0, 1] for a negative sample;
6) dividing the data vectors X and y of each category of each language of step 5) into a training set (train_x, train_y) and a test set (dev_x, dev_y) in a certain proportion;
7) inputting train_x and train_y of step 6) in batches, according to the batch size, into the CNN neural network model shown in FIG. 4 for training and learning, using the Adam optimizer and a cross-entropy loss function, and finally applying softmax for normalized classification to obtain the final classification result;
8) inputting each word Wx of the word-pair set <W1, ID1>, <W2, ID2>, ..., <Wn, IDn> into the final model trained and optimized in step 7) to obtain each word's score Mx for the harmful category of the language; taking Mx as the word's weight and sorting the words by weight from large to small to obtain the top-p word set {W1, W2, ..., Wp}, which is the machine-learning-selected harmful-information feature set for that category of that language;
9) performing feature selection on the machine-learning-selected harmful-information features {W1, W2, ..., Wp} of step 8) using a genetic algorithm, selecting an optimal number of harmful-information feature words to form the final harmful-information features {W1, W2, ..., Wq} and weights {M1, M2, ..., Mq};
10) using the harmful-information features {W1, W2, ..., Wq} and corresponding weights {M1, M2, ..., Mq} of step 9) to judge whether a text is harmful information.
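Steps 4) and 5) of claim 1 can be sketched as follows; the word-to-ID mapping and front zero-padding follow the claim, while the handling of out-of-vocabulary words (dropping them) is an assumption not stated there:

```python
def vectorize(samples, word_to_id, m_max):
    """Map each tokenized sample to its ID vector (step 4), zero-padded
    at the front to length m_max (step 5). Out-of-vocabulary words are
    dropped -- an assumption, not stated in the claim."""
    X = []
    for tokens in samples:
        ids = [word_to_id[w] for w in tokens if w in word_to_id][:m_max]
        X.append([0] * (m_max - len(ids)) + ids)
    return X

# word-pair set <W_x, ID_x> with illustrative words and integer IDs
word_to_id = {"a": 1, "b": 2, "c": 3}
X = vectorize([["a", "c"], ["b", "a", "c"]], word_to_id, m_max=4)
# class vector y (step 5): [1, 0] for a positive sample, [0, 1] for a negative
y = [[1, 0], [0, 1]]
print(X)  # [[0, 0, 1, 3], [0, 2, 1, 3]]
```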
2. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein the harmful- and harmless-information collection and labeling process and the establishment of the data annotation set <S> in step 1) are the basis for harmful-information feature mining, and the languages of the harmful information include but are not limited to: Chinese, English, Vietnamese, Korean, Japanese, Arabic, German, and French.
3. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 2) is the process of segmenting or tokenizing each language's harmful-information texts and harmless-information texts in the data annotation set <S>; different languages require different processing, which can proceed according to the following steps:
2a) judging the language of the text: if the language is Chinese, Korean, Japanese, or the like, go to step 2b); if it is a Latin-script language such as English or French, go to step 2c); if it is Uyghur, Arabic, or the like, go to step 2d);
2b) performing word segmentation on Chinese, Korean, Japanese, and similar languages, i.e., segmenting the character sequence into a word sequence, then removing stop words and punctuation marks;
2c) tokenizing Latin-script languages such as English and French, decomposing the words contained in a sentence according to the language's rules, mainly splitting on spaces and punctuation, and converting all uppercase letters to lowercase;
2d) tokenizing Uyghur, Arabic, and similar languages, decomposing the words contained in a sentence according to the language's rules, mainly splitting on spaces and on Uyghur or Arabic punctuation marks, with run-together words also segmented; and normalizing the script, converting the different written forms of the language into a single spelling form: for Uyghur, the current Uyghur script, Latin Uyghur, Cyrillic Uyghur, new-script Uyghur, and non-standard Latin Uyghur are all converted into the current Uyghur spelling form.
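A minimal sketch of the language dispatch in steps 2a)–2d); the language codes and the placeholder tokenizers (a character-level split standing in for a real CJK segmenter, a regex word split for the other scripts) are illustrative assumptions, since the claim names no concrete segmenter or normalizer:

```python
import re

# step 2b): languages needing word segmentation (placeholder: char split)
CJK = {"zh", "ko", "ja"}
# step 2c): Latin-script languages split on spaces/punctuation, lowercased
SPACE_SEGMENTED = {"en", "fr", "de"}
# step 2d): Uyghur, Arabic, etc.; real script normalization is omitted here
ARABIC_SCRIPT = {"ug", "ar"}

def tokenize(text, lang):
    if lang in CJK:
        # placeholder: per-character split stands in for a real segmenter
        return [ch for ch in text if not ch.isspace()]
    if lang in SPACE_SEGMENTED:
        return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    if lang in ARABIC_SCRIPT:
        # placeholder for spelling-form normalization + space/punct split
        return re.findall(r"\S+", text)
    raise ValueError(f"unsupported language: {lang}")

print(tokenize("Hello, World", "en"))  # ['hello', 'world']
```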
4. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 3) describes a language-independent method for a dimensionality-reduced representation of text, the RNSW (Remove Negative Sample Words) method, which selects candidate words from the texts of each category of each language, establishes a word-pair set, and uses the word-pair set for a reduced-dimension feature representation of the original text; the word-pair set is established only within the category of the language, and the number of words in the set is determined by the specific feature-reduction situation, including but not limited to 10000, 100000, and 1000000; the specific candidate-word selection process, shown in FIG. 3, can proceed according to the following steps:
3a) establishing a dictionary of negative-sample words dict_neg = {W1:V1, ..., Wn:Vn} from the negative samples of the category of the language, where Wi is a word, Vi is the number of times the word appears in the negative samples, and n is the maximum number of negative-sample words after step 2);
3b) sorting by the values of Vi from large to small and taking the largest k words to obtain dict_neg_MAX = {W1:V1, ..., Wk:Vk}, where the value of k includes but is not limited to 100, 1000, and 10000; taking W1 to Wk gives the set:
set(W_neg_MAX) = {W1, W2, ..., Wk} (1)
3c) establishing a dictionary of positive-sample words dict_pos = {W1:V1, ..., Wm:Vm} from the positive samples of the category of the language, where Wx is a word, Vx is the number of times the word appears in the positive samples, and m is the maximum number of positive-sample words after step 2); taking W1 to Wm gives the set:
set(W_pos) = {W1, W2, ..., Wm} (2)
3d) performing a set-difference operation between formula (2) and formula (1) to obtain the candidate word set set(W_candidates), where s is the number of words in the candidate set:
set(W_candidates) = set(W_pos) - set(W_neg_MAX) = {W1, W2, ..., Ws} (3).
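Formulas (1)–(3) amount to a frequency count followed by a set difference; a sketch, where the sample texts and the small value of k are illustrative only:

```python
from collections import Counter

def rnsw_candidates(pos_samples, neg_samples, k):
    """RNSW sketch (formulas (1)-(3)): remove the k most frequent
    negative-sample words from the positive-sample vocabulary."""
    neg_counts = Counter(w for s in neg_samples for w in s)   # dict_neg, step 3a)
    neg_max = {w for w, _ in neg_counts.most_common(k)}       # set (1), step 3b)
    pos_words = {w for s in pos_samples for w in s}           # set (2), step 3c)
    return pos_words - neg_max                                # set (3), step 3d)

# illustrative tokenized samples; in the patent k is e.g. 100, 1000, or 10000
pos = [["bomb", "attack", "the"], ["attack", "plan"]]
neg = [["the", "plan", "weather"], ["the", "plan", "news"]]
print(sorted(rnsw_candidates(pos, neg, k=2)))  # ['attack', 'bomb']
```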
5. The method as claimed in claim 1, wherein step 4) is the process of vectorizing the sample data of each category of each language, and the vectorization of the positive and negative samples of a category uses only that category's word-pair set for that language.
6. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 5) includes a vector-dimension-expansion process for the data vector X of each category of each language, expanding dimensions only according to the word count M_max of the largest sample of that category of that language, which is taken as the category's word count.
7. The method as claimed in claim 1, wherein step 6) is the process of dividing the training set and the test set, the division ratio being set according to the effect in the specific implementation; for example, the ratio can be, but is not limited to, 10:1, i.e., 10 parts training set to 1 part test set.
8. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein the model-training process of step 7) can use a batch size including but not limited to 16, 32, and 64; the CNN neural network model shown in FIG. 4 is one embodiment of a CNN model, and during implementation the input parameters of the CNN neural network are adjusted according to the word count M_max of each category of each language.
9. The method as claimed in claim 1, wherein step 8) is the process of obtaining the harmful-information features of the category of the language from the trained model, wherein the harmful-information features are used only to identify harmful information of that category of that language.
10. The method as claimed in claim 1, wherein step 9) is the process of filtering harmful-information features with a genetic algorithm, wherein the selected harmful-information features are used only for feature selection on the harmful information of that category of that language; the genetic-algorithm feature-selection process proceeds according to the following steps:
9a) taking the p-word set {W1, W2, ..., Wp} generated in step 8) as the total population, the initial population being t individuals generated by a random method;
9b) using the data annotation set <S> of step 1), or another larger external corpus, computing word similarities over the set {W1, W2, ..., Wp} with the word2vec algorithm, and establishing a similar-word set for each word according to a set similarity threshold;
9c) calculating the fitness of each feature word of the population using the samples of the data annotation set <S> of step 1); the fitness is calculated by the following formula:
f(x) = α × (N_identified_pos - N_misidentified_neg) / (N_pos + N_neg) - β × t
where N_identified_pos is the number of positive samples whose word sets contain the feature word or one of its similar words; N_misidentified_neg is the number of negative samples whose word sets contain the feature word or one of its similar words; α is a parameter controlling the contribution of improved classification accuracy to the evaluation function; β is a parameter controlling the contribution of a reduced feature count to the evaluation function; t is the number of individuals of step 9a); N_pos is the number of harmful-information text samples of step 1); and N_neg is the number of harmless-information text samples of step 1);
9d) calculating selection probabilities proportional to fitness, determining the male and female parents, and generating each offspring sample by crossover and mutation of the parents;
9e) calculating the difference in the fitness function between new and old individuals, probabilistically accepting new individuals according to the Boltzmann rule, and determining the new population;
9f) algorithm termination condition: the algorithm terminates when the limited maximum number of generations is reached, or when the change in individual fitness over successive generations is small, below a set threshold; while the termination condition is not met, repeat the iteration of steps 9c) to 9e).
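The fitness of step 9c) can be sketched as below. The similar-word mapping is an illustrative stand-in for the word2vec neighbour sets of step 9b), and t is taken here as the number of selected feature words — an assumption, since the claim reads "individuals of step 9a)" while describing β as rewarding a reduced feature count:

```python
def fitness(feature_words, pos_samples, neg_samples,
            alpha=1.0, beta=0.01, similar=None):
    """Fitness of step 9c):
        f(x) = alpha*(N_id_pos - N_misid_neg)/(N_pos + N_neg) - beta*t
    `similar` stands in for the word2vec similar-word sets of step 9b);
    t is taken as the number of selected feature words (an assumption)."""
    similar = similar or {}
    # expand the feature words with their similar words
    expanded = set(feature_words)
    for w in feature_words:
        expanded.update(similar.get(w, ()))
    # a sample counts as identified if its word set intersects the features
    n_id_pos = sum(1 for s in pos_samples if expanded & set(s))
    n_misid_neg = sum(1 for s in neg_samples if expanded & set(s))
    t = len(feature_words)
    total = len(pos_samples) + len(neg_samples)
    return alpha * (n_id_pos - n_misid_neg) / total - beta * t

# illustrative tokenized samples
pos = [["bomb", "attack", "plan"], ["attack", "scheme"]]
neg = [["plan", "weather"], ["news"]]
print(fitness(["attack"], pos, neg))  # 0.49
```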
11. The method as claimed in claim 1, wherein step 10) is the process of identifying harmful information using the harmful-information features, wherein the harmful-information features and corresponding weights are used only to identify harmful information of that category of that language.
CN201911063979.6A 2019-11-04 2019-11-04 Multi-language harmful information feature intelligent mining method based on deep learning Pending CN111626318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911063979.6A CN111626318A (en) 2019-11-04 2019-11-04 Multi-language harmful information feature intelligent mining method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911063979.6A CN111626318A (en) 2019-11-04 2019-11-04 Multi-language harmful information feature intelligent mining method based on deep learning

Publications (1)

Publication Number Publication Date
CN111626318A true CN111626318A (en) 2020-09-04

Family

ID=72258790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911063979.6A Pending CN111626318A (en) 2019-11-04 2019-11-04 Multi-language harmful information feature intelligent mining method based on deep learning

Country Status (1)

Country Link
CN (1) CN111626318A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112837677A (en) * 2020-10-13 2021-05-25 讯飞智元信息科技有限公司 Harmful audio detection method and device


Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN108090070B (en) Chinese entity attribute extraction method
CN107273358B (en) End-to-end English chapter structure automatic analysis method based on pipeline mode
CN109670041A (en) A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods
CN112818694A (en) Named entity recognition method based on rules and improved pre-training model
CN105022725A (en) Text emotional tendency analysis method applied to field of financial Web
Huang et al. Rethinking chinese word segmentation: tokenization, character classification, or wordbreak identification
Layton et al. Recentred local profiles for authorship attribution
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN103020167B (en) A kind of computer Chinese file classification method
CN114153971B (en) Error correction recognition and classification equipment for Chinese text containing errors
CN111476036A (en) Word embedding learning method based on Chinese word feature substrings
CN116361472B (en) Method for analyzing public opinion big data of social network comment hot event
CN112069307B (en) Legal provision quotation information extraction system
CN114912453A (en) Chinese legal document named entity identification method based on enhanced sequence features
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN113420548A (en) Entity extraction sampling method based on knowledge distillation and PU learning
CN111444720A (en) Named entity recognition method for English text
CN114969294A (en) Expansion method of sound-proximity sensitive words
CN111626318A (en) Multi-language harmful information feature intelligent mining method based on deep learning
Premaratne et al. Lexicon and hidden Markov model-based optimisation of the recognised Sinhala script
CN109284392B (en) Text classification method, device, terminal and storage medium
CN112990388B (en) Text clustering method based on concept words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination