CN111626318A - Multi-language harmful information feature intelligent mining method based on deep learning - Google Patents

Multi-language harmful information feature intelligent mining method based on deep learning

Info

Publication number
CN111626318A
CN111626318A
Authority
CN
China
Prior art keywords
language
word
words
category
harmful
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911063979.6A
Other languages
Chinese (zh)
Inventor
赵全军
吴敬征
段旭
陈宏江
伊克拉木·伊力哈木
刘立力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sinosoft Co ltd
Original Assignee
Sinosoft Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sinosoft Co ltd filed Critical Sinosoft Co ltd
Priority to CN201911063979.6A priority Critical patent/CN111626318A/en
Publication of CN111626318A publication Critical patent/CN111626318A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2111 Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Physiology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep-learning-based method for intelligently mining the features of multilingual harmful information. Harmful-information and harmless-information texts of each category in each language are labeled; candidate words are selected from the words of each category of each language with an RNSW method and assigned unique one-hot codes; the sample data are fed into a CNN neural network model for training, and each word's score for the harmful category of that language is obtained and used as its weight; a genetic algorithm then screens the harmful-information features selected by machine learning to form the final harmful-information features and weights. The invention provides a language-independent RNSW method for reducing the dimensionality of the text, which effectively reduces the number of model-training parameters, accelerates training, and improves recognition accuracy; furthermore, intelligent mining of harmful-information features is achieved with a deep-learning method, and the features are screened by a genetic algorithm, making the identification of harmful information more interpretable.

Description

Multi-language harmful information feature intelligent mining method based on deep learning
Technical field:
The invention relates to text-analysis technology in the Internet field, and in particular to a method for recognizing harmful text: a deep-learning-based method for intelligently mining the features of multilingual harmful information.
Background art:
Two methods are commonly used to identify harmful information: one based on keyword and rule matching, the other based on machine learning. The keyword-and-rule approach requires a manually curated lexicon of harmful words; the rules must sometimes be quite complex to achieve a good effect, new harmful words appear on the network endlessly with short update-iteration cycles, and maintaining the lexicon and designing new rules consumes considerable cost. The machine-learning approach, gradually adopted in recent years, has the advantage that technicians need neither deep domain knowledge of harmful information nor a large manually built lexicon of harmful words; instead, harmful words in network text are extracted automatically by optimizing a machine-learning algorithm, which improves the accuracy of harmful-information identification.
In the patent "A web-page harmful-information identification method based on machine learning" (application number 201811302974.X), Zhang Jialiang et al. proposed classifying crawled web pages through machine learning, model training, and text-classification techniques; from the category assigned to a page, the method decides whether the page, and by extension the website, contains harmful information. The model is trained on collected corpus data and can only identify text in the corresponding language; moreover, the method cannot extract harmful-information features, so its classification results are poorly interpretable.
The patent "A harmful-information identification and web-page classification method based on multi-instance learning" (application number CN201410609728) by Kuweiming et al. proposes a web-page classification method based on multi-instance learning, which treats the images and associated texts contained in a page as the instances of the page's bag, so that the algorithm better matches the actual distribution of page content; it deeply mines the complementarity of image and text information and ultimately performs better than classification using single-modality information alone. However, the method needs the images in a page as auxiliary identification information, while most web texts contain no pictures and some pictures may be unrelated to the text, so the method is inconvenient on large-scale text data.
To address these problems, this patent proposes a deep-learning-based method for intelligently mining the features of multilingual harmful information. The samples of a labeled training data set are tokenized; a candidate word list for each category's samples is built with the RNSW (Remove Negative Sample Words) method for language-independent dimensionality reduction of text features; and each word is paired with a unique one-hot encoding, forming word-code data pairs. These pairs map the training data set into a vector space; a CNN model is then trained, the candidate words are fed into the best trained model to obtain each word's weight, the initial range of harmful-information features is determined from those weights, and a genetic algorithm selects the final number of harmful-information feature words. The method can mine harmful-text features for every category of every language. Moreover, because the RNSW dimension-reduction method is used and candidate words are restricted to the training samples of that category in that language, the number of word vectors, and hence of model parameters, is greatly reduced and training is fast; the adaptivity of the genetic algorithm automatically screens out the optimal number of feature words representing harmful information, so feature selection is accurate and the method is suited to processing large-scale text data.
Summary of the invention:
The invention aims to provide a deep-learning-based method for intelligently mining the features of multilingual harmful information: a general method for mining harmful-information features, independent of any specific language, that can mine the features of harmful information in different languages and use them to identify harmful text data.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A multilingual harmful-information-feature intelligent mining method based on deep learning, as shown in FIG. 1, comprises the following steps:
1) Collect harmful-information texts and harmless-information texts of each category in each language, and establish a data-labeling set <S>. Label positive and negative sample data for the harmful-information text data of each category of each language: the positive samples are the harmful-information texts of that category in that language, their number being N_positive; the negative samples are the harmless-information texts of that category in that language, their number being N_negative.
2) Tokenize the text of the positive and negative samples of each category of each language in the data-labeling set <S> of step 1).
3) Select n candidate words from the words of each category of each language in step 2) with the RNSW method, and establish the category's word-ID pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn>, where n is the number of word pairs in the category's set, Wx is a word (or a token after word segmentation) of the language, and IDx is the word's ID, represented by a one-hot encoding of an integer value unique within the set.
4) Convert each data sample of each category of each language, according to the category's word-pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> of step 3), into a data vector of the corresponding IDs, X: {Vec1, Vec2, ..., Vecm}.
5) Take the word count M_max of the largest sample in each category of each language of step 4) as the word count of that category; for data vectors X: {Vec1, Vec2, ..., Vecm} whose dimension is less than M_max, fill with 0 at the front end. Set the class vector of the corresponding data according to whether each sample is positive or negative, y: {y1, y2, ..., ym}: for a positive sample of the class, yx is [1, 0]; for a negative sample, yx is [0, 1].
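The ID mapping, front-padding, and class vectors of steps 4) and 5) can be sketched in a few lines of Python; the word-pair set and the samples below are hypothetical stand-ins for the RNSW output of step 3):

```python
# Sketch of steps 4)-5): map tokens to IDs, front-pad to M_max, build class
# vectors. The word-pair set and the two samples are hypothetical.
word_to_id = {"w_a": 1, "w_b": 2, "w_c": 3}   # <Wx, IDx> pairs; 0 reserved for padding

def vectorize(tokens):
    """Convert a token list into the data vector of corresponding IDs
    (words outside the candidate set are dropped)."""
    return [word_to_id[t] for t in tokens if t in word_to_id]

samples = [["w_a", "w_b", "w_c"], ["w_b"]]    # one positive, one negative sample
vectors = [vectorize(s) for s in samples]

m_max = max(len(v) for v in vectors)          # word count of the largest sample
padded = [[0] * (m_max - len(v)) + v for v in vectors]   # fill 0 at the front end

labels = [[1, 0], [0, 1]]                     # positive -> [1, 0], negative -> [0, 1]
print(padded)                                 # [[1, 2, 3], [0, 0, 2]]
```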
6) Divide the data vectors X and y of each category of each language in step 5) into training sets train_x, train_y and test sets dev_x, dev_y in a certain proportion.
7) Feed train_x and train_y of step 6) in batches of a certain batch size into the CNN neural network model shown in FIG. 4, train with the Adam optimizer and a cross-entropy loss function, and finally apply softmax for normalized classification to obtain the final classification result.
8) Input each word Wx of the word-pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> into the final optimized model of step 7) to obtain each word's score Mx for the harmful category of that language. Take Mx as the word's weight and sort the words by weight in descending order to obtain the set of p words {W1, W2, ..., Wp}; this word set is the harmful-information feature of that category of that language selected by machine learning.
9) Use a genetic algorithm to perform feature selection on the machine-learned harmful-information features {W1, W2, ..., Wp} of step 8), selecting an optimal number of harmful-information feature words to form the final harmful-information features {W1, W2, ..., Wq} and weights {M1, M2, ..., Mq}.
10) Use the harmful-information features {W1, W2, ..., Wq} of step 9) and the corresponding weights {M1, M2, ..., Mq} to judge whether a text is harmful information.
Step 1) above describes the process of collecting and labeling harmful and harmless information and of establishing the data-labeling set <S>, which is the basis for harmful-information feature mining. The languages of the harmful information include but are not limited to: Chinese, English, Uyghur, Korean, Japanese, Arabic, German, French.
Step 2) above describes the process of word-segmenting or tokenizing the harmful-information and harmless-information texts of each language in the data-labeling set <S>. Different languages require different processing, as shown in FIG. 2, which proceeds as follows:
2a) Judge the language of the text: if it is Chinese, Korean, Japanese, or a similar language, go to step 2b); if it is a Latin-script language such as English or French, go to step 2c); if it is Uyghur, Arabic, or a similar language, go to step 2d).
2b) Word-segment Chinese, Korean, Japanese, and similar languages, i.e., cut the character sequence into a word sequence, then remove stop words and punctuation marks;
2c) Tokenize Latin-script languages such as English and French: decompose the words contained in a sentence according to the language's rules, splitting mainly on spaces and punctuation, and convert all capital letters to lowercase;
2d) Tokenize Uyghur, Arabic, and similar languages: decompose the words contained in a sentence according to the language's rules, splitting mainly on spaces and on the punctuation marks of Uyghur, Arabic, etc., and segment connected words. Characters of the language that have different spelling forms are converted into a single form; for example, Latin-script Uyghur, Cyrillic-script Uyghur, New-Script Uyghur, and non-standard Latin spellings are converted into the current Uyghur spelling form.
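The language routing of steps 2a)-2d) can be sketched as below. This is only an illustration of the branching under simplifying assumptions: the regex rules stand in for real tokenizers, and production Chinese/Korean/Japanese segmentation would use a dedicated segmenter (the embodiment later names Jieba for Chinese):

```python
import re

def tokenize(text, lang):
    """Illustrative sketch of steps 2a)-2d): route by language family,
    then tokenize. The regex rules are stand-ins, not production
    segmentation; stop-word removal is omitted."""
    if lang in ("zh", "ko", "ja"):
        # 2b) segment the character sequence (naively: one token per
        # word character)
        return [ch for ch in text if re.match(r"\w", ch)]
    if lang in ("en", "fr", "de"):
        # 2c) Latin-script languages: split on spaces/punctuation, lowercase
        return re.findall(r"[^\W\d_]+", text.lower())
    # 2d) Uyghur, Arabic, etc.: split on spaces and punctuation marks
    return re.findall(r"\w+", text)

print(tokenize("Hello, World!", "en"))   # ['hello', 'world']
print(tokenize("你好世界", "zh"))         # ['你', '好', '世', '界']
```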
Step 3) above describes the language-independent dimension-reduced representation of text: the RNSW method selects candidate words from the texts of each category of each language, builds a word-pair set, and uses it to give the original text a reduced-dimension feature representation. The word-pair set is built only within that category of that language, and the number of words in the set is determined by the concrete needs of feature dimension reduction, including but not limited to 10000, 100000, and 1000000. The specific candidate-word selection process, shown in FIG. 3, proceeds as follows:
3a) From the negative samples of that category of that language, build a dictionary of the negative-sample words, dict_negative = {W1:V1, ..., Wn:Vn}, where Wi is a word, Vi is the number of times the word appears in the negative samples, and n is the total number of distinct words in the negative samples after step 2).
3b) Sort by the values of Vi in descending order and take the k largest words, obtaining dict_negative_max = {W1:V1, ..., Wk:Vk}, where the value of k includes but is not limited to 100, 1000, and 10000. Taking W1 through Wk yields the set:
set(W_negative_max) = {W1, W2, ..., Wk} (1)
3c) From the positive samples of that category of that language, build a dictionary of the positive-sample words, dict_positive = {W1:V1, ..., Wm:Vm}, where Wx is a word, Vx is the number of times the word appears in the positive samples, and m is the total number of distinct words in the positive samples after step 2). Taking W1 through Wm yields the set:
set(W_positive) = {W1, W2, ..., Wm} (2)
3d) Take the set difference of formula (2) and formula (1) to obtain the candidate word set set(W_candidates), where s is the number of words in the candidate set:
set(W_candidates) = set(W_positive) − set(W_negative_max) = {W1, W2, ..., Ws} (3)
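Assuming the samples are already tokenized by step 2), the RNSW selection of steps 3a)-3d) can be sketched directly from formulas (1)-(3); the sample tokens are hypothetical:

```python
from collections import Counter

def rnsw_candidates(pos_samples, neg_samples, k):
    """RNSW sketch (steps 3a)-3d)): remove the k most frequent
    negative-sample words from the positive-sample vocabulary.
    Each sample is a list of tokens from step 2)."""
    neg_counts = Counter(w for s in neg_samples for w in s)   # dict_negative
    neg_max = {w for w, _ in neg_counts.most_common(k)}       # set(W_negative_max), formula (1)
    pos_vocab = {w for s in pos_samples for w in s}           # set(W_positive), formula (2)
    return pos_vocab - neg_max                                # set(W_candidates), formula (3)

# Toy, hypothetical tokens:
pos = [["attack", "the", "plan"], ["attack", "weapon"]]
neg = [["the", "weather", "the", "weather"], ["plan", "the"]]
print(sorted(rnsw_candidates(pos, neg, k=2)))   # ['attack', 'plan', 'weapon']
```

Note that frequent neutral words ("the") are removed even though they also occur in positive samples, which is the dimension-reduction effect the patent attributes to RNSW.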
Step 4) above describes the process of vectorizing the sample data of each category of each language: the positive and negative samples of that category of that language are vectorized using only the word-pair set of that category of that language.
Step 5) above describes the vector-dimension expansion of the data vectors X of each category of each language: the dimension is expanded only according to the word count M_max of the largest sample of that category of that language, taken as the word count of the category.
Step 6) above describes the division into training and test sets; the split ratio is set according to the effect in the concrete implementation, and can for example be set to, but is not limited to, 10:1, with 10 parts for the training set and 1 part for the test set.
Step 7) above describes the model-training process, where the batch size can be, but is not limited to, 16, 32, or 64; the CNN neural network model shown in FIG. 4 is that of a specific embodiment, and in a concrete implementation its input parameters are adjusted according to the word count M_max of each category of each language. Step 8) above describes obtaining the harmful-information features of that category of that language from the trained model; these features are used only to identify harmful information of that category of that language, and the value of p is set manually according to the number of words Wx that fit the category.
Step 9) above describes screening the harmful-information features with the genetic algorithm; the features selected by the genetic algorithm are used only for feature selection of harmful information of that category of that language. Feature selection by the genetic algorithm proceeds as follows:
9a) Take the population of the p words {W1, W2, ..., Wp} generated in step 8) as the total population, and generate t individuals for the initial population by a random method.
9b) Using the data-labeling set <S> of step 1) or another, larger external data corpus, compute word vectors for the set {W1, W2, ..., Wp} with the word2vec algorithm, and build each word's similar-word set according to a set similarity threshold.
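The similar-word sets of step 9b) can be sketched as below. Plain cosine similarity over tiny hand-made vectors stands in here for real word2vec embeddings, and the threshold value is illustrative:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similar_word_sets(vectors, threshold):
    """For each word, the set of other words whose cosine similarity to it
    reaches the threshold (step 9b); `vectors` maps word -> embedding."""
    return {w: {o for o in vectors
                if o != w and cosine(vectors[w], vectors[o]) >= threshold}
            for w in vectors}

# Tiny hand-made 2-d "embeddings" standing in for word2vec output:
vecs = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0]}
sim_sets = similar_word_sets(vecs, threshold=0.95)
print(sim_sets)   # w1 and w2 are mutually similar; w3 has no similar word
```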
9c) Compute the fitness of each feature word of the population using the samples of the data-labeling set <S> of step 1); the fitness is computed with the following formula:
f(x) = α × (N_identified_positive − N_misidentified_negative) / (N_positive + N_negative) − β × t
where N_identified_positive is the number of samples in the positive-sample set whose word set contains the feature word or one of its similar words, N_misidentified_negative is the number of samples in the negative-sample set whose word set contains the feature word or one of its similar words, α is a parameter controlling the contribution of improved classification accuracy to the evaluation function, β is a parameter controlling the contribution of a reduced feature count to the evaluation function, t is the number of individuals of step 9a), N_positive is the number of harmful-information text samples of step 1), and N_negative is the number of harmless-information text samples of step 1).
9d) Compute selection probabilities proportional to fitness, determine the male and female parents, and generate each offspring sample by crossover and mutation of the parents.
9e) Compute the difference of the fitness functions of new and old individuals, accept new individuals probabilistically according to the Boltzmann rule, and determine the new population.
9f) Algorithm termination condition: terminate when the limited maximum number of generations is reached, or when the fitness of individuals changes by less than a set threshold over successive generations. While the termination condition is not met, repeat the iteration of steps 9c) to 9e).
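The fitness formula of step 9c), together with the matching rule behind its N terms, can be sketched as follows; the data, parameter values, and variable names are all illustrative:

```python
def fitness(feature_words, similar, pos_samples, neg_samples, alpha, beta, t):
    """Fitness of step 9c): f(x) = alpha * (N_identified_positive −
    N_misidentified_negative) / (N_positive + N_negative) − beta * t.
    A sample counts as matched if its word set contains a feature word
    or one of that word's similar words."""
    def matched(sample):
        tokens = set(sample)
        return any(w in tokens or tokens & similar.get(w, set())
                   for w in feature_words)

    n_identified_pos = sum(matched(s) for s in pos_samples)   # N_identified_positive
    n_misident_neg = sum(matched(s) for s in neg_samples)     # N_misidentified_negative
    n_total = len(pos_samples) + len(neg_samples)             # N_positive + N_negative
    return alpha * (n_identified_pos - n_misident_neg) / n_total - beta * t

# Toy, hypothetical data:
pos = [["bomb", "plan"], ["attack"]]
neg = [["weather", "plan"]]
sim = {"bomb": {"explosive"}}
print(fitness({"bomb", "attack"}, sim, pos, neg, alpha=1.0, beta=0.01, t=5))
```

Here both positive samples match and no negative sample does, so the first term is 2/3, from which the size penalty β·t = 0.05 is subtracted.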
Step 10) above describes the process of identifying harmful information with the harmful-information features; the features and the corresponding weights are used only for identifying harmful information of that category of that language.
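As an illustration of step 10), one simple judgment rule is to sum the weights of the matched feature words and compare against a decision threshold. The patent states only that the features and weights are used for the judgment, so the thresholding rule below is an assumption, and the features and weights are hypothetical:

```python
def is_harmful(tokens, features, weights, threshold):
    """Sketch of step 10): sum the weights of the harmful-information
    feature words the text contains and compare with a decision threshold.
    The thresholding rule is an assumption made for illustration."""
    score = sum(weights[w] for w in set(tokens) if w in features)
    return score >= threshold

feats = {"w1", "w2"}              # final features {W1, ..., Wq} (hypothetical)
wts = {"w1": 0.8, "w2": 0.5}      # corresponding weights {M1, ..., Mq}
print(is_harmful(["w1", "w2", "x"], feats, wts, threshold=1.0))   # True
print(is_harmful(["w2", "x"], feats, wts, threshold=1.0))         # False
```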
The invention provides a language-independent RNSW method for reducing the dimensionality of text, which effectively reduces the number of model-training parameters, accelerates training, and improves recognition accuracy; furthermore, intelligent mining of harmful-information features is achieved with a deep-learning method, and the features are screened with a genetic algorithm, making harmful-information category identification more accurate and its results more interpretable.
Drawings
FIG. 1 is a flow diagram of the deep-learning-based intelligent mining method for harmful-information features
FIG. 2 is a schematic diagram of the per-language word-segmentation or tokenization process
FIG. 3 is a schematic diagram of candidate-word selection with the RNSW method
FIG. 4 is a schematic diagram of the CNN network structure used in the invention
Specific embodiments:
The invention is now further illustrated by way of example with reference to the accompanying drawings, without limiting its scope in any way.
The general flow of the deep-learning-based intelligent mining method for multilingual harmful-information features is shown in FIG. 1. Taking intelligent mining of the features of Chinese violent-terrorist harmful information as an example, the method comprises the following steps:
1) Collect violent-terrorist harmful-information texts and harmless-information texts in various languages, including Chinese, and establish a data-labeling set <S>. Label positive and negative sample data of the harmful-information text data: the positive samples are the harmful-information texts of that category in that language, their number being N_positive; the negative samples are the harmless-information texts of that category in that language, their number being N_negative.
2) Following the processing flow of FIG. 2, when the language is judged to be Chinese, execute the Chinese word-segmentation flow and segment the harmful-information and harmless-information texts in the data-labeling set <S> using, but not limited to, the Jieba word-segmentation tool.
3) Following the processing flow of FIG. 3, build a dictionary of the negative-sample words from the negative samples of the Chinese violent-terrorist category, dict_negative = {W1:V1, ..., Wn:Vn}, where Wi is a word, Vi is the number of times the word appears in the negative samples, and n is the total number of distinct words in the negative samples after step 2).
4) Sort by the values of Vi in descending order and take the k largest words, obtaining dict_negative_max = {W1:V1, ..., Wk:Vk}, where the value of k includes but is not limited to 100, 1000, and 10000. Taking W1 through Wk yields the set:
set(W_negative_max) = {W1, W2, ..., Wk} (1)
5) Build a dictionary of the positive-sample words from the positive samples of the category, dict_positive = {W1:V1, ..., Wm:Vm}, where Wx is a word, Vx is the number of times the word appears in the positive samples, and m is the total number of distinct words in the positive samples after step 2). Taking W1 through Wm yields the set:
set(W_positive) = {W1, W2, ..., Wm} (2)
6) Take the set difference of formula (2) and formula (1) to obtain the candidate word set set(W_candidates), where s is the number of words in the candidate set:
set(W_candidates) = set(W_positive) − set(W_negative_max) = {W1, W2, ..., Ws} (3)
7) From the candidate word set set(W_candidates) generated in step 6), establish the word-ID pair set of the Chinese violent-terrorist category, <W1,ID1>, <W2,ID2>, ..., <Wn,IDn>, where n is the number of word pairs in the category's set, Wx is a candidate word, and IDx is the word's ID, represented by a one-hot encoding of an integer value unique within the set.
8) Convert each data sample of the Chinese violent-terrorist category, according to the word pairs <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> of step 7), into a data vector of the corresponding IDs, X: {Vec1, Vec2, ..., Vecm}.
9) Take the word count M_max of the largest sample of step 8) as the word count of the Chinese violent-terrorist category; for data vectors X: {Vec1, Vec2, ..., Vecm} whose dimension is less than M_max, fill with 0 at the front end. Set the class vector of the corresponding data according to whether each sample is positive or negative, y: {y1, y2, ..., ym}: for a positive sample of the Chinese violent-terrorist category, yx is [1, 0]; for a negative sample, yx is [0, 1].
10) Divide the data vectors X and y of step 9) into training sets train_x, train_y and test sets dev_x, dev_y in a ratio of 10:1.
11) Feed train_x and train_y of step 10) in batches of batch size 32 into the CNN neural network model shown in FIG. 4, train with the Adam optimizer and a cross-entropy loss function, and finally apply softmax for normalized classification to obtain the final classification result.
12) Input each word Wx of the word-pair set <W1,ID1>, <W2,ID2>, ..., <Wn,IDn> into the final optimized model of step 11) to obtain each word's score Mx for the Chinese violent-terrorist category. Take Mx as the word's weight and sort the words by weight in descending order to obtain the set of p words {W1, W2, ..., Wp}; this word set is the harmful-information feature of the Chinese violent-terrorist category selected by machine learning.
13) Take the population of the p words {W1, W2, ..., Wp} generated in step 12) as the total population, and generate t individuals for the initial population by a random method.
14) Using the data-labeling set <S> of step 1) or another, larger external data corpus, compute word vectors for the set {W1, W2, ..., Wp} with the word2vec algorithm, and build each word's similar-word set according to a set similarity threshold.
15) Compute the fitness of each feature word of the population using the samples of the data-labeling set <S> of step 1); the fitness is computed with the following formula:
f(x) = α × (N_identified_positive − N_misidentified_negative) / (N_positive + N_negative) − β × t
where N_identified_positive is the number of samples in the positive-sample set whose word set contains the feature word or one of its similar words, N_misidentified_negative is the number of samples in the negative-sample set whose word set contains the feature word or one of its similar words, α is a parameter controlling the contribution of improved classification accuracy to the evaluation function, and β is a parameter controlling the contribution of a reduced feature count to the evaluation function.
16) Compute selection probabilities proportional to fitness, determine the male and female parents, and generate each offspring sample by crossover and mutation of the parents.
17) Compute the difference of the fitness functions of new and old individuals, accept new individuals probabilistically according to the Boltzmann rule, and determine the new population.
18) Algorithm termination condition: terminate when the limited maximum number of generations is reached, or when the fitness of individuals changes by less than a set threshold over successive generations. While the termination condition is not met, repeat the iteration of steps 15) to 17).
19) After the algorithm ends, the feature words in the last generation's population are the selected optimal harmful-information feature words, forming the final harmful-information features {W1, W2, ..., Wq} and weights {M1, M2, ..., Mq}.
20) Use the harmful-information features {W1, W2, ..., Wq} of step 19) and the corresponding weights {M1, M2, ..., Mq} to judge whether a text is harmful information of the Chinese violent-terrorist category.

Claims (11)

1. A multilingual harmful-information-feature intelligent mining method based on deep learning, comprising the following steps:
1) collecting various language harmful information texts and noneHarmful information text, establishing data label set<S>Labeling positive and negative sample data of harmful information text data of each language and each category, wherein the positive sample is the harmful information text of the language of the category, and the number of the samples is NPositive sampleThe negative sample is the harmless information text of the language of the category, and the number of samples is NNegative sample
2) Tokenizing each language harmful information text and harmless information text in the data tagging set < S > in the step 1), and then removing stop words and punctuation marks.
3) Selecting n candidate words from the words of each category of each language in the step 2) by using an RNSW (remove Negative Sample words) method, and establishing a word pair set of the words of the category, namely the ID (identity)<W1,ID1>,<W2,ID2>,......,<Wn,IDn>N is the number of word pairs in the set of word pairs of the category, WxRepresenting words of various languages or words after word segmentation, IDxThe ID representing the word is represented by a unique One-Hot Encoding of an integer value in the set.
4) converting each sample datum of each category of each language in step 3) into a data vector X: {Vec1, Vec2, ..., Vecm} of the corresponding IDs, according to the category's word-pair set <W1, ID1>, <W2, ID2>, ..., <Wn, IDn>;
5) taking the word count M_max of the largest sample in each category of each language of step 4) as the word count of that category; vectors in the data vector X: {Vec1, Vec2, ..., Vecm} whose dimension is less than M_max are padded with 0 at the front end; setting a class vector y: {y1, y2, ..., ym} for the corresponding data according to whether each sample is a positive or negative sample, where yx is [1, 0] for a positive sample of the category and [0, 1] for a negative sample;
6) dividing the data vectors X and y of each category of each language of step 5) into a training set (train_x, train_y) and a test set (dev_x, dev_y) in a certain proportion;
7) inputting train_x and train_y of step 6) in batches, according to the batch size, into the CNN neural network model shown in FIG. 4 for training and learning, using the Adam optimizer and a cross-entropy loss function, and finally applying softmax for normalized classification to obtain the final classification result;
8) inputting each word Wx of the word-pair set <W1, ID1>, <W2, ID2>, ..., <Wn, IDn> into the final model trained and optimized in step 7) to obtain each word's score Mx for the harmful category of the language; taking Mx as the word's weight and sorting the words by weight from large to small to obtain the top-p word set {W1, W2, ..., Wp}, which is the machine-learning-selected harmful-information feature set for that category of that language;
9) performing feature selection on the machine-learning-selected harmful-information features {W1, W2, ..., Wp} of step 8) using a genetic algorithm, selecting an optimal number of harmful-information feature words to form the final harmful-information features {W1, W2, ..., Wq} and weights {M1, M2, ..., Mq};
10) using the harmful-information features {W1, W2, ..., Wq} and corresponding weights {M1, M2, ..., Mq} of step 9) to judge whether a text is harmful information.
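Steps 4) and 5) of claim 1 can be sketched as follows; the word-to-ID mapping and front zero-padding follow the claim, while the handling of out-of-vocabulary words (dropping them) is an assumption not stated there:

```python
def vectorize(samples, word_to_id, m_max):
    """Map each tokenized sample to its ID vector (step 4), zero-padded
    at the front to length m_max (step 5). Out-of-vocabulary words are
    dropped -- an assumption, not stated in the claim."""
    X = []
    for tokens in samples:
        ids = [word_to_id[w] for w in tokens if w in word_to_id][:m_max]
        X.append([0] * (m_max - len(ids)) + ids)
    return X

# word-pair set <W_x, ID_x> with illustrative words and integer IDs
word_to_id = {"a": 1, "b": 2, "c": 3}
X = vectorize([["a", "c"], ["b", "a", "c"]], word_to_id, m_max=4)
# class vector y (step 5): [1, 0] for a positive sample, [0, 1] for a negative
y = [[1, 0], [0, 1]]
print(X)  # [[0, 0, 1, 3], [0, 2, 1, 3]]
```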
2. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein the harmful- and harmless-information collection and labeling process and the establishment of the data annotation set <S> in step 1) are the basis for harmful-information feature mining, and the languages of the harmful information include but are not limited to: Chinese, English, Vietnamese, Korean, Japanese, Arabic, German, and French.
3. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 2) is the process of segmenting or tokenizing each language's harmful-information texts and harmless-information texts in the data annotation set <S>; different languages require different processing, which can proceed according to the following steps:
2a) judging the language of the text: if the language is Chinese, Korean, Japanese, or the like, go to step 2b); if it is a Latin-script language such as English or French, go to step 2c); if it is Uyghur, Arabic, or the like, go to step 2d);
2b) performing word segmentation on Chinese, Korean, Japanese, and similar languages, i.e., segmenting the character sequence into a word sequence, then removing stop words and punctuation marks;
2c) tokenizing Latin-script languages such as English and French, decomposing the words contained in a sentence according to the language's rules, mainly splitting on spaces and punctuation, and converting all uppercase letters to lowercase;
2d) tokenizing Uyghur, Arabic, and similar languages, decomposing the words contained in a sentence according to the language's rules, mainly splitting on spaces and on Uyghur or Arabic punctuation marks, with run-together words also segmented; and normalizing the script, converting the different written forms of the language into a single spelling form: for Uyghur, the current Uyghur script, Latin Uyghur, Cyrillic Uyghur, new-script Uyghur, and non-standard Latin Uyghur are all converted into the current Uyghur spelling form.
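A minimal sketch of the language dispatch in steps 2a)–2d); the language codes and the placeholder tokenizers (a character-level split standing in for a real CJK segmenter, a regex word split for the other scripts) are illustrative assumptions, since the claim names no concrete segmenter or normalizer:

```python
import re

# step 2b): languages needing word segmentation (placeholder: char split)
CJK = {"zh", "ko", "ja"}
# step 2c): Latin-script languages split on spaces/punctuation, lowercased
SPACE_SEGMENTED = {"en", "fr", "de"}
# step 2d): Uyghur, Arabic, etc.; real script normalization is omitted here
ARABIC_SCRIPT = {"ug", "ar"}

def tokenize(text, lang):
    if lang in CJK:
        # placeholder: per-character split stands in for a real segmenter
        return [ch for ch in text if not ch.isspace()]
    if lang in SPACE_SEGMENTED:
        return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    if lang in ARABIC_SCRIPT:
        # placeholder for spelling-form normalization + space/punct split
        return re.findall(r"\S+", text)
    raise ValueError(f"unsupported language: {lang}")

print(tokenize("Hello, World", "en"))  # ['hello', 'world']
```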
4. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 3) describes a language-independent method for a dimensionality-reduced representation of text, the RNSW (Remove Negative Sample Words) method, which selects candidate words from the texts of each category of each language, establishes a word-pair set, and uses the word-pair set for a reduced-dimension feature representation of the original text; the word-pair set is established only within the category of the language, and the number of words in the set is determined by the specific feature-reduction situation, including but not limited to 10000, 100000, and 1000000; the specific candidate-word selection process, shown in FIG. 3, can proceed according to the following steps:
3a) establishing a dictionary of negative-sample words dict_neg = {W1:V1, ..., Wn:Vn} from the negative samples of the category of the language, where Wi is a word, Vi is the number of times the word appears in the negative samples, and n is the maximum number of negative-sample words after step 2);
3b) sorting by the values of Vi from large to small and taking the largest k words to obtain dict_neg_MAX = {W1:V1, ..., Wk:Vk}, where the value of k includes but is not limited to 100, 1000, and 10000; taking W1 to Wk gives the set:
set(W_neg_MAX) = {W1, W2, ..., Wk} (1)
3c) establishing a dictionary of positive-sample words dict_pos = {W1:V1, ..., Wm:Vm} from the positive samples of the category of the language, where Wx is a word, Vx is the number of times the word appears in the positive samples, and m is the maximum number of positive-sample words after step 2); taking W1 to Wm gives the set:
set(W_pos) = {W1, W2, ..., Wm} (2)
3d) performing a set-difference operation between formula (2) and formula (1) to obtain the candidate word set set(W_candidates), where s is the number of words in the candidate set:
set(W_candidates) = set(W_pos) - set(W_neg_MAX) = {W1, W2, ..., Ws} (3).
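Formulas (1)–(3) amount to a frequency count followed by a set difference; a sketch, where the sample texts and the small value of k are illustrative only:

```python
from collections import Counter

def rnsw_candidates(pos_samples, neg_samples, k):
    """RNSW sketch (formulas (1)-(3)): remove the k most frequent
    negative-sample words from the positive-sample vocabulary."""
    neg_counts = Counter(w for s in neg_samples for w in s)   # dict_neg, step 3a)
    neg_max = {w for w, _ in neg_counts.most_common(k)}       # set (1), step 3b)
    pos_words = {w for s in pos_samples for w in s}           # set (2), step 3c)
    return pos_words - neg_max                                # set (3), step 3d)

# illustrative tokenized samples; in the patent k is e.g. 100, 1000, or 10000
pos = [["bomb", "attack", "the"], ["attack", "plan"]]
neg = [["the", "plan", "weather"], ["the", "plan", "news"]]
print(sorted(rnsw_candidates(pos, neg, k=2)))  # ['attack', 'bomb']
```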
5. The method as claimed in claim 1, wherein step 4) is the process of vectorizing the sample data of each category of each language, and the vectorization of the positive and negative samples of a category uses only that category's word-pair set for that language.
6. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein step 5) includes a vector-dimension-expansion process for the data vector X of each category of each language, expanding dimensions only according to the word count M_max of the largest sample of that category of that language, which is taken as the category's word count.
7. The method as claimed in claim 1, wherein step 6) is the process of dividing the training set and the test set, the division ratio being set according to the effect in the specific implementation; for example, the ratio can be, but is not limited to, 10:1, i.e., 10 parts training set to 1 part test set.
8. The deep-learning-based multilingual harmful-information-feature intelligent mining method according to claim 1, wherein the model-training process of step 7) can use a batch size including but not limited to 16, 32, and 64; the CNN neural network model shown in FIG. 4 is one embodiment of a CNN model, and during implementation the input parameters of the CNN neural network are adjusted according to the word count M_max of each category of each language.
9. The method as claimed in claim 1, wherein step 8) is the process of obtaining the harmful-information features of the category of the language from the trained model, wherein the harmful-information features are used only to identify harmful information of that category of that language.
10. The method as claimed in claim 1, wherein step 9) is the process of filtering harmful-information features with a genetic algorithm, wherein the selected harmful-information features are used only for feature selection on the harmful information of that category of that language; the genetic-algorithm feature-selection process proceeds according to the following steps:
9a) taking the p-word set {W1, W2, ..., Wp} generated in step 8) as the total population, the initial population being t individuals generated by a random method;
9b) using the data annotation set <S> of step 1), or another larger external corpus, computing word similarities over the set {W1, W2, ..., Wp} with the word2vec algorithm, and establishing a similar-word set for each word according to a set similarity threshold;
9c) calculating the fitness of each feature word of the population using the samples of the data annotation set <S> of step 1); the fitness is calculated by the following formula:
f(x) = α × (N_identified_pos - N_misidentified_neg) / (N_pos + N_neg) - β × t
where N_identified_pos is the number of positive samples whose word sets contain the feature word or one of its similar words; N_misidentified_neg is the number of negative samples whose word sets contain the feature word or one of its similar words; α is a parameter controlling the contribution of improved classification accuracy to the evaluation function; β is a parameter controlling the contribution of a reduced feature count to the evaluation function; t is the number of individuals of step 9a); N_pos is the number of harmful-information text samples of step 1); and N_neg is the number of harmless-information text samples of step 1);
9d) calculating selection probabilities proportional to fitness, determining the male and female parents, and generating each offspring sample by crossover and mutation of the parents;
9e) calculating the difference in the fitness function between new and old individuals, probabilistically accepting new individuals according to the Boltzmann rule, and determining the new population;
9f) algorithm termination condition: the algorithm terminates when the limited maximum number of generations is reached, or when the change in individual fitness over successive generations is small, below a set threshold; while the termination condition is not met, repeat the iteration of steps 9c) to 9e).
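The fitness of step 9c) can be sketched as below. The similar-word mapping is an illustrative stand-in for the word2vec neighbour sets of step 9b), and t is taken here as the number of selected feature words — an assumption, since the claim reads "individuals of step 9a)" while describing β as rewarding a reduced feature count:

```python
def fitness(feature_words, pos_samples, neg_samples,
            alpha=1.0, beta=0.01, similar=None):
    """Fitness of step 9c):
        f(x) = alpha*(N_id_pos - N_misid_neg)/(N_pos + N_neg) - beta*t
    `similar` stands in for the word2vec similar-word sets of step 9b);
    t is taken as the number of selected feature words (an assumption)."""
    similar = similar or {}
    # expand the feature words with their similar words
    expanded = set(feature_words)
    for w in feature_words:
        expanded.update(similar.get(w, ()))
    # a sample counts as identified if its word set intersects the features
    n_id_pos = sum(1 for s in pos_samples if expanded & set(s))
    n_misid_neg = sum(1 for s in neg_samples if expanded & set(s))
    t = len(feature_words)
    total = len(pos_samples) + len(neg_samples)
    return alpha * (n_id_pos - n_misid_neg) / total - beta * t

# illustrative tokenized samples
pos = [["bomb", "attack", "plan"], ["attack", "scheme"]]
neg = [["plan", "weather"], ["news"]]
print(fitness(["attack"], pos, neg))  # 0.49
```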
11. The method as claimed in claim 1, wherein step 10) is the process of identifying harmful information using the harmful-information features, wherein the harmful-information features and corresponding weights are used only to identify harmful information of that category of that language.
CN201911063979.6A 2019-11-04 2019-11-04 Multi-language harmful information feature intelligent mining method based on deep learning Pending CN111626318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911063979.6A CN111626318A (en) 2019-11-04 2019-11-04 Multi-language harmful information feature intelligent mining method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911063979.6A CN111626318A (en) 2019-11-04 2019-11-04 Multi-language harmful information feature intelligent mining method based on deep learning

Publications (1)

Publication Number Publication Date
CN111626318A true CN111626318A (en) 2020-09-04

Family

ID=72258790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911063979.6A Pending CN111626318A (en) 2019-11-04 2019-11-04 Multi-language harmful information feature intelligent mining method based on deep learning

Country Status (1)

Country Link
CN (1) CN111626318A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112837677A (en) * 2020-10-13 2021-05-25 讯飞智元信息科技有限公司 Harmful audio detection method and device


Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN108090070B (en) Chinese entity attribute extraction method
CN107273358B (en) End-to-end English chapter structure automatic analysis method based on pipeline mode
CN109670041A (en) A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods
CN112818694A (en) Named entity recognition method based on rules and improved pre-training model
CN105022725A (en) Text emotional tendency analysis method applied to field of financial Web
Huang et al. Rethinking chinese word segmentation: tokenization, character classification, or wordbreak identification
Layton et al. Recentred local profiles for authorship attribution
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN103020167B (en) A kind of computer Chinese file classification method
CN114153971B (en) Error correction recognition and classification equipment for Chinese text containing errors
CN111476036A (en) Word embedding learning method based on Chinese word feature substrings
CN116361472B (en) Method for analyzing public opinion big data of social network comment hot event
CN112069307B (en) Legal provision quotation information extraction system
CN114912453A (en) Chinese legal document named entity identification method based on enhanced sequence features
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN113420548A (en) Entity extraction sampling method based on knowledge distillation and PU learning
CN111444720A (en) Named entity recognition method for English text
CN114969294A (en) Expansion method of sound-proximity sensitive words
CN111626318A (en) Multi-language harmful information feature intelligent mining method based on deep learning
Premaratne et al. Lexicon and hidden Markov model-based optimisation of the recognised Sinhala script
CN109284392B (en) Text classification method, device, terminal and storage medium
CN112990388B (en) Text clustering method based on concept words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination