CN111626318A - Multi-language harmful information feature intelligent mining method based on deep learning - Google Patents
- Publication number
- CN111626318A (application CN201911063979.6A)
- Authority
- CN
- China
- Prior art keywords
- language
- word
- words
- category
- harmful
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2111—Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
Abstract
The invention discloses a deep-learning-based method for intelligently mining the features of multilingual harmful information. Harmful and harmless information texts of each category of each language are labeled; candidate words are selected from the words of each category of each language by the RNSW method and given unique one-hot codes; the sample data are input into a CNN neural network model for training, and each word's score for the harmful category of the language is obtained as its weight; finally, the harmful-information features selected by machine learning are screened with a genetic algorithm to form the final harmful-information features and weights. The invention provides a language-independent RNSW method for reducing the dimensionality of the text, which effectively reduces the number of parameters in model training, accelerates training, and improves the accuracy of model recognition; in addition, deep learning is used to mine harmful-information features intelligently, and the features are screened by a genetic algorithm, giving better interpretability of harmful-information recognition.
Description
The technical field is as follows:
The invention relates to text-analysis technology in the Internet field, in particular to a harmful-text recognition method: a deep-learning-based intelligent mining method for multilingual harmful-information features.
Background art:
Two methods are commonly used for identifying harmful information: keyword-and-rule matching, and machine learning. With keyword-and-rule matching, a lexicon of harmful words must be edited manually, and the rules sometimes have to be quite complex to achieve a good effect; meanwhile, new harmful terms appear on the network endlessly and the update cycle is short, so maintaining the lexicon and designing new rules consumes a great deal of cost. The machine-learning approach has been gradually adopted in recent years; its advantage is that technicians need neither deep domain knowledge of harmful information nor a large manually built lexicon of harmful words. Instead, harmful words in network texts are extracted automatically by optimizing a machine-learning algorithm, which improves the accuracy of harmful-information identification.
In the patent "A web page harmful information identification method based on machine learning" (application number 201811302974.X), Zhang Jialiang et al. proposed classifying and recognizing crawled web pages through machine learning, model training, and text-classification technology, using the category of the recognition result to decide whether a page, and further a site, contains harmful information. That method is trained on the collected corpus data, so the trained model can only identify text data of the corresponding language; moreover, it cannot extract the features of harmful information, and the interpretability of its classification results is poor.
In the patent "A harmful information identification and webpage classification method based on multi-instance learning" (application number CN201410609728), Ku Weiming et al. proposed a webpage classification method based on multi-instance learning, which treats the images and related texts contained in a web page as the instances of the page's bag, making the algorithm more consistent with the actual distribution of web-page content; it can exploit the page's effective information to deeply mine the complementarity of image and text information, finally achieving better results than classification with single-mode information alone. However, that method needs the images in the web page as auxiliary identification information, while most web texts contain no pictures, and some pictures may be unrelated to the text, so the method is inconvenient on large-scale text data.
Aiming at these problems, this patent proposes a deep-learning-based intelligent mining method for multilingual harmful-information features. Samples of the labeled training data set are tokenized; a candidate word list of the category's samples is built by the RNSW (Remove Negative Sample Words) method, a language-independent dimensionality reduction of text features; and a unique one-hot encoding is established for each word, forming word-code data pairs. The data pairs are used to map the training data set into a vector space; a CNN model is then trained; the candidate words are fed into the trained optimal model to obtain each word's weight; the initial range of harmful-information features is determined from those weights; and a genetic algorithm selects the final number of harmful-information feature words. The method can mine the harmful-text-information features of each category of each language. Moreover, because the RNSW dimensionality-reduction method is used and the candidate words are limited to the training sample set of the category of the language, the number of word vectors is greatly reduced, the number of model parameters drops sharply, and training is fast; the adaptivity of the genetic algorithm automatically screens out the optimal number of feature words representing harmful information, the accuracy of feature selection is high, and the method is suitable for processing large-scale text data.
The invention content is as follows:
The invention aims to provide a deep-learning-based intelligent mining method for multilingual harmful-information features: a general method, independent of any specific language, for mining harmful-information features in different languages and using them to identify harmful text data.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A deep-learning-based intelligent mining method for multilingual harmful-information features, as shown in figure 1, comprises the following steps:
1) Collect harmful information texts and harmless information texts of various languages and various categories, and establish a data labeling set <S>. Label positive and negative sample data for the harmful-information text data of each category of each language, where the positive samples are the harmful-information texts of that category of that language, numbering N_positive, and the negative samples are the harmless-information texts of that category of that language, numbering N_negative.
2) Tokenizing the text of the positive and negative samples of each language and each category in the data tagging set < S > in the step 1).
3) Select n candidate words from the words of each category of each language in step 2) by the RNSW method, and establish the word-ID pair set of the category: <W1,ID1>, <W2,ID2>, ..., <Wn,IDn>, where n is the number of word pairs in the category's set, W_x is a word (or segmented word) of the language, and ID_x is the word's ID, a unique integer value in the set serving as its one-hot encoding.
4) Convert each sample datum of each category of each language into a data vector of the corresponding IDs, X: {Vec_1, Vec_2, ..., Vec_m}, according to the word pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> of that category of that language from step 3).
5) Take the number of words M_max of the largest sample in each category of each language from step 4) as the word count of the category; any data vector X: {Vec_1, Vec_2, ..., Vec_m} whose dimension is less than M_max is padded with 0 at the front end. Set the class vector of the corresponding data according to whether each sample is positive or negative, y: {y_1, y_2, ..., y_m}, where y_x is [1, 0] for a positive sample of the category and [0, 1] for a negative sample.
6) Divide the data vectors X and y of each category of each language in step 5) into training sets train_x and train_y and test sets dev_x and dev_y in a certain proportion.
7) Input train_x and train_y from step 6) into the CNN neural network model shown in figure 4 in batches of a chosen batch size, train with the Adam optimizer and a cross-entropy loss function, and finally apply softmax for normalized classification to obtain the final classification result.
8) Input each word W_x of the word pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> into the final optimized model from step 7) to obtain each word's score M_x for the harmful category of the language. Taking M_x as the word's weight and sorting the words by weight from large to small yields the p-word set {W_1, W_2, ..., W_p}; this word set is the harmful-information features of the category of the language selected by machine learning.
9) Perform feature selection with a genetic algorithm on the machine-learning-selected harmful-information features {W_1, W_2, ..., W_p} of step 8), choosing an optimal number of harmful-information feature words and forming the final harmful-information features {W_1, W_2, ..., W_q} and weights {M_1, M_2, ..., M_q}.
10) Use the harmful-information features {W_1, W_2, ..., W_q} of step 9) and the corresponding weights {M_1, M_2, ..., M_q} to judge whether a text is harmful information.
The step 1) above describes the process of collecting and labeling harmful information and harmless information and the process of establishing a data label set < S >, which is the basis for performing harmful information feature mining, wherein harmful information in various languages includes but is not limited to the following languages: chinese, English, Vietnamese, Korean, Japanese, Arabic, German, French.
Step 2) above describes the process of segmenting or tokenizing the harmful- and harmless-information texts of each language in the data labeling set <S>. Different languages require different processing, as shown in fig. 2, and the following steps may be used:
2a) Judge the language of the text. If it is Chinese, Korean, Japanese, etc., go to step 2b); if it is a Latin-script language such as English or French, go to step 2c); if it is Uyghur, Arabic, etc., go to step 2d).
2b) Segment Chinese, Korean, Japanese and similar languages, i.e. split the character sequence into a word sequence, then remove stop words and punctuation marks;
2c) Tokenize Latin-script languages such as English and French: decompose the words contained in a sentence according to the language's rules, splitting mainly on spaces and punctuation, and convert all capital letters to lowercase;
2d) Tokenize Uyghur, Arabic and similar languages: decompose the words contained in a sentence according to the language's rules, splitting mainly on spaces and on the punctuation marks of Uyghur, Arabic, etc. Different spelling forms of the language's characters are converted into one common form; for example, current Uyghur script, Latin Uyghur, Cyrillic Uyghur, the new Uyghur script and nonstandard Latin forms are all converted into the current Uyghur spelling.
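The branching in steps 2a) to 2d) can be sketched as follows (a minimal illustration: the language codes, the toy stop-word list, and the `segment_cjk` stand-in for a real segmenter such as Jieba are all assumptions, not part of the patent):

```python
import re
import string

STOP_WORDS = {"的", "了"}  # toy stop-word list; a real system would load a full one


def segment_cjk(text: str) -> list[str]:
    # Naive character split; stand-in for a real segmenter such as Jieba.
    return list(text)


def tokenize(text: str, lang: str) -> list[str]:
    """Dispatch tokenization by language family, per steps 2a)-2d)."""
    if lang in ("zh", "ko", "ja"):
        # Step 2b): segment, then drop stop words and punctuation.
        words = segment_cjk(text)
        return [w for w in words
                if w not in STOP_WORDS and w not in string.punctuation]
    elif lang in ("en", "fr"):
        # Step 2c): split on spaces/punctuation and lowercase everything.
        return re.findall(r"[a-z]+", text.lower())
    else:
        # Step 2d): Uyghur/Arabic-style splitting on spaces and punctuation.
        parts = re.split(r"[\s" + re.escape(string.punctuation) + r"]+", text.strip())
        return [t for t in parts if t]
```

Orthography normalization for step 2d) would be an additional character-mapping table applied before splitting.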
Step 3) above describes the language-independent method of dimensionality-reduced representation of text: the RNSW method selects candidate words from the texts of each category of each language, builds the word pair set, and uses it to represent the original text with reduced feature dimension. The word pair set is built only within the category of the language, and the number of words in the set is determined by the needs of feature dimensionality reduction, including but not limited to 10000, 100000 and 1000000. The specific candidate-word selection process is shown in fig. 3 and may proceed as follows:
3a) From the negative samples of the category of the language, build the dictionary of negative-sample words dict_negative = {W_1:V_1, ..., W_n:V_n}, where W_i is a word, V_i is the number of times the word appears in the negative samples, and n is the total number of distinct words in the negative samples after step 2).
3b) Sort by the values of V_i from large to small and take the largest k words, obtaining dict_negative_MAX = {W_1:V_1, ..., W_k:V_k}, where the value of k includes but is not limited to 100, 1000 and 10000. Taking W_1 to W_k gives the set:

set(W_negative_MAX) = {W_1, W_2, ..., W_k} (1)
3c) From the positive samples of the category of the language, build the dictionary of positive-sample words dict_positive = {W_1:V_1, ..., W_m:V_m}, where W_x is a word, V_x is the number of times the word appears in the positive samples, and m is the total number of distinct words in the positive samples after step 2). Taking W_1 to W_m gives the set:

set(W_positive) = {W_1, W_2, ..., W_m} (2)
3d) Perform the set-difference operation between formula (2) and formula (1) to obtain the candidate word set set(W_candidates), where s is the number of words in the candidate set:

set(W_candidates) = set(W_positive) - set(W_negative_MAX) = {W_1, W_2, ..., W_s} (3)
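Formulas (1) to (3) amount to a frequency count over the negative samples followed by a set difference; a minimal sketch (function and variable names are illustrative, not from the patent):

```python
from collections import Counter


def rnsw_candidates(pos_samples, neg_samples, k):
    """RNSW (Remove Negative Sample Words) candidate selection.

    pos_samples / neg_samples: lists of tokenized documents (lists of words).
    Returns the candidate word set: the positive-sample vocabulary minus the
    k most frequent negative-sample words, per formulas (1)-(3).
    """
    neg_counts = Counter(w for doc in neg_samples for w in doc)   # dict_negative
    neg_max = {w for w, _ in neg_counts.most_common(k)}           # set(W_negative_MAX), (1)
    pos_vocab = {w for doc in pos_samples for w in doc}           # set(W_positive), (2)
    return pos_vocab - neg_max                                    # set(W_candidates), (3)
```

Because only set membership matters in formula (3), high-frequency function words common to both sample sets are removed while words distinctive of the positive (harmful) samples survive.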
Step 4) above describes the vectorization of the sample data of each category of each language: the positive and negative samples of a category of a language are vectorized only with the word pair set of that category of that language.
Step 5) above describes the dimension expansion of the data vector X for each category of each language: the dimension is expanded only to the number of words M_max of the largest sample of that category of that language, which serves as the word count of the category.
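Steps 4) and 5), mapping each sample to a vector of word IDs and front-padding to M_max, can be sketched as follows (a minimal illustration; the function names and the ID-assignment order are assumptions):

```python
def build_word_ids(candidates):
    """Assign each candidate word a unique integer ID (its one-hot index).

    ID 0 is reserved for front-padding, so real words start at 1.
    """
    return {w: i + 1 for i, w in enumerate(sorted(candidates))}


def vectorize(samples, word_ids):
    """Convert tokenized samples to ID vectors, front-padded with 0 to M_max."""
    vecs = [[word_ids[w] for w in doc if w in word_ids] for doc in samples]
    m_max = max(len(v) for v in vecs)            # word count of the largest sample
    padded = [[0] * (m_max - len(v)) + v for v in vecs]  # pad at the front end
    return padded, m_max
```

Words outside the candidate set are simply dropped, which is what limits the vocabulary (and hence the embedding parameters) to the category's word pair set.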
Step 6) above describes the division into training and test sets; the division ratio is set according to the effect in the concrete implementation, for example including but not limited to 10:1, i.e. 10 parts training set to 1 part test set.
Step 7) above describes the model-training process, where the batch size can be, but is not limited to, 16, 32 or 64; the CNN neural network model shown in fig. 4 is that of a specific embodiment, and in a concrete implementation its input parameters are adjusted according to the word count M_max of each category of each language. Step 8) above describes obtaining the harmful-information features of the category of the language from the trained model; these features are used only to identify harmful information of that category of that language, and the value of p is determined manually according to the words W_x that fit the category of the language.
Step 9) above describes screening the harmful-information features with a genetic algorithm; the features it selects are used only for the harmful information of the category of the language. The genetic algorithm's feature selection proceeds as follows:
9a) Take the p-word set {W_1, W_2, ..., W_p} generated in step 8) as the total population, and generate t individuals for the initial population by a random method.
9b) Using the data labeling set <S> of step 1), or a larger external data corpus, compute word vectors for the set {W_1, W_2, ..., W_p} with the word2vec algorithm, and establish the similar-word set of each word according to a set similarity threshold.
9c) Calculating the fitness of each characteristic word of the population by using the sample of the data label set < S > in the step 1); the fitness is calculated by adopting the following formula:
f(x) = α × (N_identified_positive - N_false_identified_negative) / (N_positive + N_negative) - β × t
where N_identified_positive is the number of samples in the positive sample set whose word set contains a feature word or one of its similar words; N_false_identified_negative is the number of samples in the negative sample set whose word set contains a feature word or one of its similar words; α is a parameter controlling the contribution of improved classification accuracy to the evaluation function; β is a parameter controlling the contribution of reduced feature count to the evaluation function; t is the number of individuals of step 9a); N_positive is the number of harmful-information text samples of step 1); and N_negative is the number of harmless-information text samples of step 1).
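A minimal sketch of the fitness evaluation in 9c), assuming samples are represented as token lists and the similar-word map comes from step 9b); the formula is implemented exactly as stated, with t passed in as the individual count of step 9a):

```python
def fitness(feature_words, similar, pos_samples, neg_samples, alpha, beta, t):
    """f(x) = alpha*(N_hit_pos - N_hit_neg)/(N_pos + N_neg) - beta*t

    A sample counts as a hit if its word set contains a feature word
    or one of that word's similar words (from word2vec, step 9b)).
    """
    expanded = set(feature_words)
    for w in feature_words:
        expanded |= set(similar.get(w, ()))            # add similar words
    n_hit_pos = sum(1 for doc in pos_samples if expanded & set(doc))
    n_hit_neg = sum(1 for doc in neg_samples if expanded & set(doc))
    return (alpha * (n_hit_pos - n_hit_neg) / (len(pos_samples) + len(neg_samples))
            - beta * t)
```

The α term rewards covering positive samples while penalizing false matches on negative samples; the β term is the size penalty as defined in the formula above.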
9d) Compute selection probabilities in direct proportion to fitness, determine the male and female parents, and generate each offspring sample by crossover and mutation of the parents.
9e) Compute the difference of the fitness function between new and old individuals, accept new individuals probabilistically according to the Boltzmann rule, and determine the new population.
9f) Algorithm termination condition: the algorithm terminates when the limited maximum number of generations is reached, or when the fitness of individuals changes only slightly over successive generations, below a set threshold. While the termination condition is not met, repeat the iteration of steps 9c) to 9e).
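Steps 9a) and 9d) to 9f) can be sketched as a standard bitmask genetic algorithm (a simplified illustration: the Boltzmann acceptance rule of step 9e) is omitted, the threshold-based early stop is replaced by a fixed generation cap, and all names and hyperparameters are assumptions):

```python
import random


def genetic_select(words, fit_fn, t=20, max_gen=50, p_mut=0.02, seed=0):
    """Evolve bitmask individuals over the p candidate words.

    fit_fn maps a selected word subset to a fitness value (e.g. the
    function of step 9c)); returns the best subset found.
    """
    rng = random.Random(seed)
    p = len(words)
    # 9a) random initial population of t individuals (bitmasks over p words)
    pop = [[rng.randint(0, 1) for _ in range(p)] for _ in range(t)]

    def subset(ind):
        return [w for w, bit in zip(words, ind) if bit]

    for _ in range(max_gen):                               # 9f) generation cap
        fits = [fit_fn(subset(ind)) for ind in pop]        # 9c) fitness
        lo = min(fits)
        weights = [f - lo + 1e-9 for f in fits]            # 9d) fitness-proportional
        new_pop = []
        for _ in range(t):
            pa, pb = rng.choices(pop, weights=weights, k=2)  # choose parents
            cut = rng.randrange(1, p) if p > 1 else 0
            child = pa[:cut] + pb[cut:]                      # single-point crossover
            child = [b ^ (rng.random() < p_mut) for b in child]  # bit-flip mutation
            new_pop.append(child)
        pop = new_pop
    best = max(pop, key=lambda ind: fit_fn(subset(ind)))
    return subset(best)
```

A production version would add the Boltzmann acceptance step and the convergence test on successive-generation fitness described in 9e) and 9f).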
The above step 10) describes a process of performing harmful information identification using the harmful information feature, wherein the harmful information feature and the corresponding weight are used only for identification of harmful information of the category of the language.
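The patent does not fix how the features and weights are applied in step 10); one plausible realization (entirely an assumption) sums the weights of matched feature words and compares against a threshold:

```python
def is_harmful(tokens, features, weights, threshold):
    """Score a tokenized text by the total weight of its matched feature
    words (one plausible way to apply step 10)); flag it when the score
    reaches the threshold."""
    token_set = set(tokens)
    score = sum(w for f, w in zip(features, weights) if f in token_set)
    return score >= threshold
```

The threshold would be tuned on the held-out test set (dev_x, dev_y) of step 6); because the decision is a weighted sum over explicit feature words, each positive verdict can be explained by listing the matched words, which is the interpretability benefit the patent claims.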
The invention provides a language-independent RNSW method for reducing the dimensionality of the text, which effectively reduces the number of parameters in model training, accelerates training, and improves the accuracy of model recognition; in addition, intelligent mining of harmful-information features is achieved with deep learning, and the features are screened by a genetic algorithm, making harmful-information category identification more accurate and harmful-information recognition more interpretable.
Drawings
FIG. 1 is a flow diagram of a harmful information feature intelligent mining method based on deep learning
FIG. 2 is a schematic diagram of the processing procedure of segmenting or instantiating words according to language samples
FIG. 3 is a process diagram of candidate word selection using RNSW method
FIG. 4 is a schematic diagram of a CNN network structure used in the present invention
The specific implementation mode is as follows:
the invention will now be further illustrated by way of example, without in any way limiting its scope, with reference to the accompanying drawings.
The overall flow of the deep-learning-based intelligent mining method for multilingual harmful-information features is shown in figure 1. Taking intelligent mining of Chinese violent-terrorist harmful-information features as an example, the method comprises the following steps:
1) Collect violent-terrorist harmful-information texts and harmless-information texts of various languages including Chinese, and establish a data labeling set <S>. Label positive and negative sample data for the harmful-information text data, where the positive samples are the harmful-information texts of that category of that language, numbering N_positive, and the negative samples are the harmless-information texts of that category of that language, numbering N_negative.
2) According to the processing flow shown in fig. 2, when the language is judged to be Chinese, a Chinese word segmentation processing flow is executed, and the harmful information text and the harmless information text in the data label set < S > are segmented by using but not limited to a Jieba word segmentation tool.
3) Following the process flow of fig. 3, use the negative samples of the Chinese violent-terrorist category to build the dictionary of negative-sample words dict_negative = {W_1:V_1, ..., W_n:V_n}, where W_i is a word, V_i is the number of times the word appears in the negative samples, and n is the total number of distinct words in the negative samples after step 2).
4) Sort by the values of V_i from large to small and take the largest k words, obtaining dict_negative_MAX = {W_1:V_1, ..., W_k:V_k}, where the value of k includes but is not limited to 100, 1000 and 10000. Taking W_1 to W_k gives the set:

set(W_negative_MAX) = {W_1, W_2, ..., W_k} (1)
5) From the positive samples of the category of the language, build the dictionary of positive-sample words dict_positive = {W_1:V_1, ..., W_m:V_m}, where W_x is a word, V_x is the number of times the word appears in the positive samples, and m is the total number of distinct words in the positive samples after step 2). Taking W_1 to W_m gives the set:

set(W_positive) = {W_1, W_2, ..., W_m} (2)
6) Perform the set-difference operation between formula (2) and formula (1) to obtain the candidate word set set(W_candidates), where s is the number of words in the candidate set:

set(W_candidates) = set(W_positive) - set(W_negative_MAX) = {W_1, W_2, ..., W_s} (3)
7) From the candidate word set set(W_candidates) generated in step 6), establish the word-ID pair set of the Chinese violent-terrorist category: <W1,ID1>, <W2,ID2>, ..., <Wn,IDn>, where n is the number of word pairs in the category's set, W_x is a candidate word, and ID_x is the word's ID, a unique integer value in the set serving as its one-hot encoding.
8) Convert each sample datum of the Chinese violent-terrorist category into a data vector of the corresponding IDs, X: {Vec_1, Vec_2, ..., Vec_m}, according to the word pairs <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> of step 7).
9) Take the number of words M_max of the largest sample of step 8) as the word count of the Chinese violent-terrorist category; any data vector X: {Vec_1, Vec_2, ..., Vec_m} whose dimension is less than M_max is padded with 0 at the front end. Set the Chinese violent-terrorist class vector of the corresponding data according to whether each sample is positive or negative, y: {y_1, y_2, ..., y_m}, where y_x is [1, 0] for a positive sample of the category and [0, 1] for a negative sample.
10) Divide the data vectors X and y of step 9) into training sets train_x and train_y and test sets dev_x and dev_y at a ratio of 10:1.
11) Input train_x and train_y of step 10) into the CNN neural network model shown in figure 4 in batches of batch size 32, train with the Adam optimizer and a cross-entropy loss function, and finally apply softmax for normalized classification to obtain the final classification result.
12) Input each word W_x of the word pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> into the final optimized model from step 11) to obtain each word's score M_x for the Chinese violent-terrorist category. Taking M_x as the word's weight and sorting the words by weight from large to small yields the p-word set {W_1, W_2, ..., W_p}; this word set is the harmful-information features of the Chinese violent-terrorist category selected by machine learning.
13) Take the p-word set {W_1, W_2, ..., W_p} generated in step 12) as the total population, and generate t individuals for the initial population by a random method.
14) Using the data labeling set <S> of step 1), or a larger external data corpus, compute word vectors for the set {W_1, W_2, ..., W_p} with the word2vec algorithm, and establish the similar-word set of each word according to a set similarity threshold.
15) Calculating the fitness of each characteristic word of the population by using the sample of the data label set < S > in the step 1); the fitness is calculated by adopting the following formula:
f(x) = α × (N_identified_positive - N_false_identified_negative) / (N_positive + N_negative) - β × t
where N_identified_positive is the number of samples in the positive sample set whose word set contains a feature word or one of its similar words; N_false_identified_negative is the number of samples in the negative sample set whose word set contains a feature word or one of its similar words; α is a parameter controlling the contribution of improved classification accuracy to the evaluation function; and β is a parameter controlling the contribution of reduced feature count to the evaluation function.
16) Compute selection probabilities in direct proportion to fitness, determine the male and female parents, and generate each offspring sample by crossover and mutation of the parents.
17) Compute the difference of the fitness function between new and old individuals, accept new individuals probabilistically according to the Boltzmann rule, and determine the new population.
18) Algorithm termination condition: the algorithm terminates when the limited maximum number of generations is reached, or when the fitness of individuals changes only slightly over successive generations, below a set threshold. While the termination condition is not met, repeat the iteration of steps 15) to 17).
19) After the algorithm ends, the feature words in the last generation's population are the selected optimal harmful-information feature words, forming the final harmful-information features {W_1, W_2, ..., W_q} and weights {M_1, M_2, ..., M_q}.
20) Use the harmful-information features {W_1, W_2, ..., W_q} of step 19) and the corresponding weights {M_1, M_2, ..., M_q} to judge whether a text is Chinese violent-terrorist harmful information.
Claims (11)
1. A multilingual harmful information feature intelligent mining method based on deep learning comprises the following steps:
1) Collect harmful information texts and harmless information texts of various languages, and establish a data labeling set <S>. Label positive and negative sample data for the harmful-information text data of each category of each language, where the positive samples are the harmful-information texts of that category of that language, numbering N_positive, and the negative samples are the harmless-information texts of that category of that language, numbering N_negative.
2) Tokenizing each language harmful information text and harmless information text in the data tagging set < S > in the step 1), and then removing stop words and punctuation marks.
3) Select n candidate words from the words of each category of each language in step 2) by the RNSW (Remove Negative Sample Words) method, and establish the word-ID pair set of the category: <W1,ID1>, <W2,ID2>, ..., <Wn,IDn>, where n is the number of word pairs in the category's set, W_x is a word (or segmented word) of the language, and ID_x is the word's ID, a unique integer value in the set serving as its one-hot encoding.
4) Convert each sample datum of each category of each language into a data vector of the corresponding IDs, X: {Vec_1, Vec_2, ..., Vec_m}, according to the word pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> of that category of that language from step 3).
5) Take the number of words M_max of the largest sample in each category of each language from step 4) as the word count of the category; any data vector X: {Vec_1, Vec_2, ..., Vec_m} whose dimension is less than M_max is padded with 0 at the front end. Set the class vector of the corresponding data according to whether each sample is positive or negative, y: {y_1, y_2, ..., y_m}, where y_x is [1, 0] for a positive sample of the category and [0, 1] for a negative sample.
6) Dividing the data vectors X and y of each category of each language in step 5) into a training set (train_x, train_y) and a test set (dev_x, dev_y) in a certain proportion.
7) Feeding train_x and train_y of step 6) into the CNN neural network model shown in FIG. 4 in batches of the chosen batch size for training and learning, using the Adam optimizer and a cross-entropy loss function, and finally applying softmax for normalized classification to obtain the final classification result.
8) Inputting each word W_x of the word-pair set <W_1,ID_1>, <W_2,ID_2>, ..., <W_n,ID_n> into the final model trained and optimized in step 7) to obtain, for each word W_x, a score M_x for the harmful category of that language; taking M_x as the weight of the word and sorting the words by weight in descending order yields the p-word set {W_1, W_2, ..., W_p}, which constitutes the machine-learning-selected harmful-information features of that category of that language.
9) Performing feature selection with a genetic algorithm on the machine-learning-selected harmful-information features {W_1, W_2, ..., W_p} of step 8), selecting an optimal number of harmful-information feature words to form the final harmful-information features {W_1, W_2, ..., W_q} and weights {M_1, M_2, ..., M_q}.
10) Using the harmful-information features {W_1, W_2, ..., W_q} of step 9) and the corresponding weights {M_1, M_2, ..., M_q} to judge whether a text is harmful information.
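The vectorization, padding, and splitting of steps 4) to 6) can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the function names, the fixed 10:1 split, and the word-pair mapping as a plain dict are all assumptions for the sketch.

```python
def vectorize(tokens, word_ids):
    """Step 4): map each word of a sample that appears in the word-pair
    set onto its integer ID."""
    return [word_ids[w] for w in tokens if w in word_ids]

def pad_front(vec, m_max):
    """Step 5): vectors shorter than M_max are padded with 0 at the front end."""
    return [0] * (m_max - len(vec)) + vec

def build_dataset(samples, word_ids):
    """samples: list of (token_list, is_positive) pairs for one category
    of one language; word_ids: the category's word-pair set as a dict."""
    vecs = [vectorize(toks, word_ids) for toks, _ in samples]
    m_max = max(len(v) for v in vecs)          # M_max of the category
    X = [pad_front(v, m_max) for v in vecs]
    # Step 5): class vector is [1, 0] for positive, [0, 1] for negative samples.
    y = [[1, 0] if pos else [0, 1] for _, pos in samples]
    # Step 6): divide into training and test sets, here in a 10:1 proportion.
    cut = len(X) * 10 // 11
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])
```

The front-end padding (rather than the more common tail padding) follows the claim text literally; either choice works as long as it is applied consistently to training and inference data.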
2. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein the collection and labeling of harmful and harmless information and the establishment of the data label set <S> in step 1) form the basis of harmful-information feature mining, and wherein the harmful information covers languages including but not limited to: Chinese, English, Vietnamese, Korean, Japanese, Arabic, German, French.
3. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 2) is the process of word-segmenting or tokenizing each language's harmful-information texts and harmless-information texts in the data label set <S>; different languages require different processing, which can be carried out according to the following steps:
2a) Judging the language of the text: if the language is Chinese, Korean, Japanese, or the like, go to step 2b); if it is a Latin-script language such as English or French, go to step 2c); if it is Uyghur, Arabic, or the like, go to step 2d).
2b) Performing word segmentation on Chinese, Korean, Japanese, and similar languages, i.e., segmenting the character sequence into a word sequence, and then removing stop words and punctuation marks;
2c) Tokenizing Latin-script languages such as English and French: decomposing the words contained in each sentence according to the language's rules, splitting mainly on spaces and punctuation, and converting all uppercase letters to lowercase;
2d) Tokenizing Uyghur, Arabic, and similar languages: decomposing the words contained in each sentence according to the language's rules, splitting mainly on spaces and on Uyghur/Arabic punctuation marks, and also segmenting conjoined words. In addition, the different spelling forms within the language are converted into a single spelling form; for example, current Uyghur script, Latin Uyghur, Cyrillic Uyghur, new-script Uyghur, and non-standard Latin Uyghur are all converted into the current Uyghur spelling form.
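The language dispatch of steps 2a) to 2d) can be sketched as a routing function. This is a simplified stand-in: the per-character split for CJK languages is only a placeholder for a real dictionary-based word segmenter, the language codes are illustrative, and script normalization for step 2d) is omitted.

```python
import re

CJK = {"zh", "ko", "ja"}      # step 2b): languages needing word segmentation
LATIN = {"en", "fr", "de"}    # step 2c): Latin-script languages
STEP_2D = {"ug", "ar"}        # step 2d): e.g. Uyghur, Arabic

def tokenize(text, lang, stop_words=frozenset()):
    if lang in CJK:
        # Step 2b): a real system would call a word segmenter here;
        # single-character splitting is only a placeholder.
        tokens = [ch for ch in text if not ch.isspace()]
    elif lang in LATIN:
        # Step 2c): split on spaces/punctuation, convert to lowercase.
        tokens = re.findall(r"[^\W\d_]+", text.lower())
    elif lang in STEP_2D:
        # Step 2d): split on spaces and punctuation; spelling-form
        # normalization would also happen here.
        tokens = re.findall(r"[^\W\d_]+", text)
    else:
        raise ValueError(f"unsupported language: {lang}")
    # Common post-processing: remove stop words (punctuation is already gone).
    return [t for t in tokens if t not in stop_words]
```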
4. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 3) describes a language-independent method for dimensionality-reduced representation of text, the RNSW (Remove Negative Sample Words) method, which selects candidate words from the texts of each category of each language, establishes a word-pair set, and uses that set for dimensionality-reduced feature representation of the original text. The word-pair set is established only within that category of that language, and the number of words in the set is determined by the specific dimensionality-reduction requirements, including but not limited to 10000, 100000, and 1000000. The specific process of selecting candidate words, shown in FIG. 3, can be carried out according to the following steps:
3a) Establishing the word dictionary dict_neg = {W_1:V_1, ..., W_n:V_n} from the negative samples of that category of that language, where W_i is a word, V_i is the number of times the word appears in the negative samples, and n is the total number of distinct words in the negative samples after step 2) has been performed.
3b) Sorting by the values of V_i in descending order and taking the k largest words gives dict_neg_MAX = {W_1:V_1, ..., W_k:V_k}, where the value of k includes but is not limited to 100, 1000, and 10000. Taking W_1 to W_k gives the set:

set(W_neg_MAX) = {W_1, W_2, ..., W_k} (1)
3c) Establishing the word dictionary dict_pos = {W_1:V_1, ..., W_m:V_m} from the positive samples of that category of that language, where W_x is a word, V_x is the number of times the word appears in the positive samples, and m is the total number of distinct words in the positive samples after step 2) has been performed. Taking W_1 to W_m gives the set:

set(W_pos) = {W_1, W_2, ..., W_m} (2)
3d) Performing the set-difference operation between formula (2) and formula (1) gives the candidate word set set(W_candidates), where s is the number of words in the candidate set:

set(W_candidates) = set(W_pos) - set(W_neg_MAX) = {W_1, W_2, ..., W_s} (3).
5. The method as claimed in claim 1, wherein step 4) is the process of vectorizing the sample data of each category of each language, and the vectorization of the positive and negative samples of a category of a language uses only the word-pair set of that category of that language.
6. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 5) comprises the vector dimension expansion of the data vector X of each category of each language, the dimension being expanded only according to the word count M_max of the largest sample of that category of that language.
7. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 6) is the process of dividing the training set and the test set, the division ratio being set according to the effect achieved in the specific implementation; for example, the ratio can be, but is not limited to, 10:1, i.e., 10 parts training set to 1 part test set.
8. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein in the model training process of step 7) the batch size can be, but is not limited to, 16, 32, or 64; the CNN neural network model shown in FIG. 4 is one embodiment, and during implementation the input parameters of the CNN neural network are adjusted according to the word count M_max of each category of each language.
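The batch feeding of claim 8 and the final softmax normalization of step 7) can be sketched without the CNN itself (the claims give only the FIG. 4 architecture, so the model call is omitted; the generator and function names are assumptions for the sketch):

```python
import math

def batches(train_x, train_y, batch_size=32):
    """Yield successive (x, y) mini-batches of the chosen size
    (e.g. 16, 32, or 64 per claim 8), as fed to the CNN in step 7)."""
    for i in range(0, len(train_x), batch_size):
        yield train_x[i:i + batch_size], train_y[i:i + batch_size]

def softmax(logits):
    """Step 7)'s final normalized classification over the two classes.
    Subtracting the max keeps the exponentials numerically stable."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]
```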
9. The method as claimed in claim 1, wherein step 8) is the process of obtaining the harmful-information features of a category of a language from the trained model, these features being used only for identifying the harmful information of that category of that language.
10. The method as claimed in claim 1, wherein step 9) is the process of filtering the harmful-information features with a genetic algorithm, the genetic algorithm being applied only to feature selection over the harmful information of that category of that language. The genetic-algorithm feature selection proceeds according to the following steps:
9a) Taking the p-word set {W_1, W_2, ..., W_p} generated in step 8) as the total population; the initial population is generated by a random method and contains t individuals;
9b) Using the data label set <S> of step 1), or a larger external data corpus, computing word vectors for the set {W_1, W_2, ..., W_p} by the word2vec algorithm, and establishing a similar-word set for each word according to a set similarity threshold;
9c) Calculating the fitness of each feature word of the population using the samples of the data label set <S> of step 1); the fitness is calculated by the following formula:

f(x) = α × (N_ip − N_fn) / (N_pos + N_neg) − β × t

where N_ip is the number of positive samples whose word set contains the feature word or one of its similar words; N_fn is the number of negative samples whose word set contains the feature word or one of its similar words; α is a parameter controlling the contribution of classification-accuracy improvement to the evaluation function; β is a parameter controlling the contribution of feature-count reduction to the evaluation function; t is the number of individuals in step 9a); N_pos is the number of harmful-information text samples of step 1); and N_neg is the number of harmless-information text samples of step 1).
9d) Calculating selection probabilities proportional to fitness, determining the male and female parents, and generating each offspring sample by crossover and mutation of the parents;
9e) Calculating the fitness difference between new and old individuals, accepting new individuals probabilistically according to the Boltzmann rule, and determining the new population;
9f) Termination condition: the algorithm terminates when the limited maximum number of generations is reached, or when the fitness of the individuals changes only slightly, below a set threshold, over several consecutive generations; when the termination condition is not met, steps 9c) to 9e) are iterated.
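The fitness function of step 9c) can be written out directly. This sketch reduces the word2vec similar-word lookup of step 9b) to a precomputed dict, and treats α, β, and t as free parameters per the claim; the function signature is an assumption, not from the patent.

```python
def fitness(feature_word, pos_samples, neg_samples, similar, alpha, beta, t):
    """f(x) = alpha * (N_ip - N_fn) / (N_pos + N_neg) - beta * t
    pos_samples / neg_samples: one word set per sample;
    similar: dict mapping a feature word to its word2vec similar-word set."""
    match = {feature_word} | similar.get(feature_word, set())
    n_ip = sum(1 for s in pos_samples if s & match)  # identified positives
    n_fn = sum(1 for s in neg_samples if s & match)  # falsely matched negatives
    return alpha * (n_ip - n_fn) / (len(pos_samples) + len(neg_samples)) - beta * t
```

Note how the two terms pull in opposite directions: the first rewards feature words (plus their similar words) that match positive samples without matching negative ones, while the β × t term penalizes population size, pushing the genetic algorithm toward a smaller final feature set.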
11. The method as claimed in claim 1, wherein step 10) is the process of identifying harmful information using the harmful-information features, the features and their corresponding weights being used only for identifying the harmful information of that category of that language.
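Claim 11 leaves the decision rule of step 10) open; one plausible sketch is a weighted match score compared against a threshold. The summation rule and the threshold are assumptions for illustration, not stated in the claims.

```python
def is_harmful(tokens, features, weights, threshold):
    """Sum the weights M_x of the final feature words {W_1..W_q} that occur
    in the text's token set; flag the text when the score reaches a threshold.
    features / weights: the {W_1..W_q} and {M_1..M_q} of step 9)."""
    present = set(tokens)
    score = sum(w for f, w in zip(features, weights) if f in present)
    return score >= threshold, score
```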
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911063979.6A CN111626318A (en) | 2019-11-04 | 2019-11-04 | Multi-language harmful information feature intelligent mining method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111626318A true CN111626318A (en) | 2020-09-04 |
Family
ID=72258790
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911063979.6A Pending CN111626318A (en) | 2019-11-04 | 2019-11-04 | Multi-language harmful information feature intelligent mining method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111626318A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112837677A (en) * | 2020-10-13 | 2021-05-25 | 讯飞智元信息科技有限公司 | Harmful audio detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN107943784B (en) | Relationship extraction method based on generation of countermeasure network | |
CN108090070B (en) | Chinese entity attribute extraction method | |
CN107273358B (en) | End-to-end English chapter structure automatic analysis method based on pipeline mode | |
CN109670041A (en) | A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods | |
CN112818694A (en) | Named entity recognition method based on rules and improved pre-training model | |
CN105022725A (en) | Text emotional tendency analysis method applied to field of financial Web | |
Huang et al. | Rethinking chinese word segmentation: tokenization, character classification, or wordbreak identification | |
Layton et al. | Recentred local profiles for authorship attribution | |
CN113704416B (en) | Word sense disambiguation method and device, electronic equipment and computer-readable storage medium | |
CN103020167B (en) | A kind of computer Chinese file classification method | |
CN114153971B (en) | Error correction recognition and classification equipment for Chinese text containing errors | |
CN111476036A (en) | Word embedding learning method based on Chinese word feature substrings | |
CN116361472B (en) | Method for analyzing public opinion big data of social network comment hot event | |
CN112069307B (en) | Legal provision quotation information extraction system | |
CN114912453A (en) | Chinese legal document named entity identification method based on enhanced sequence features | |
CN107451116B (en) | Statistical analysis method for mobile application endogenous big data | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN113420548A (en) | Entity extraction sampling method based on knowledge distillation and PU learning | |
CN111444720A (en) | Named entity recognition method for English text | |
CN114969294A (en) | Expansion method of sound-proximity sensitive words | |
CN111626318A (en) | Multi-language harmful information feature intelligent mining method based on deep learning | |
Premaratne et al. | Lexicon and hidden Markov model-based optimisation of the recognised Sinhala script | |
CN109284392B (en) | Text classification method, device, terminal and storage medium | |
CN112990388B (en) | Text clustering method based on concept words |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||