CN113868417A - Sensitive comment identification method and device, terminal equipment and storage medium

Sensitive comment identification method and device, terminal equipment and storage medium

Info

Publication number
CN113868417A
CN113868417A
Authority
CN
China
Prior art keywords
sensitive
comment
target
Chinese character
word
Prior art date
Legal status
Pending
Application number
CN202111134559.XA
Other languages
Chinese (zh)
Inventor
宋威
Current Assignee
Shenzhen Ping An Smart Healthcare Technology Co ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202111134559.XA
Publication of CN113868417A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 16/335: Information retrieval of unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06F 40/194: Handling natural language data; text processing; calculation of difference between files
    • G06F 40/242: Handling natural language data; natural language analysis; lexical tools; dictionaries
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/045: Computing arrangements based on biological models; neural networks; combinations of networks

Abstract

The application belongs to the technical field of artificial intelligence and provides a sensitive comment identification method and device, terminal equipment and a storage medium. The method comprises the following steps: obtaining comment information issued by a target user; extracting text features of the comment information; fusing the text features with the user portrait features of the target user to obtain target feature data; and inputting the target feature data into a trained sensitive comment recognition model, which outputs a result indicating whether the comment information is a sensitive comment. According to the method and the device, after the text features of the comment information are extracted, the user portrait features are fused with the text features and the comment is classified on the basis of the fused feature data, so that deep feature mining can be performed on the comment information in both depth and breadth, which improves the accuracy of sensitive comment identification.

Description

Sensitive comment identification method and device, terminal equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, and provides a sensitive comment identification method and device, terminal equipment and a storage medium.
Background
With the development of network communication technology, people can conveniently post their own comments on current popular events through social platforms such as microblogs or friend circles. However, some malicious users may post negative comments containing sensitive words, which can have an adverse effect on society.
To address this problem, social platforms usually process user comments with a dictionary filtering method that filters out comments containing sensitive words. However, dictionary filtering cannot capture the semantic features of a comment, and a preset sensitive word dictionary can hardly cover all the variant forms of sensitive words, so the accuracy of sensitive comment recognition is low.
Disclosure of Invention
In view of this, the application provides a sensitive comment identification method, a sensitive comment identification device, a terminal device and a storage medium, which can improve the accuracy of sensitive comment identification.
In a first aspect, an embodiment of the present application provides a sensitive comment identification method, including:
obtaining comment information issued by a target user;
extracting text features of the comment information;
fusing the text features and the user portrait features of the target user to obtain target feature data;
inputting the target feature data into a trained sensitive comment recognition model for processing, and outputting, through the sensitive comment recognition model, a result indicating whether the comment information is a sensitive comment;
wherein the sensitive comment recognition model is trained on a sample set composed of sensitive comment samples and non-sensitive comment samples; a sensitive comment sample is sample data carrying a sensitive comment label, obtained by fusing the text features of the sample with the user portrait features of the sample, and a non-sensitive comment sample is sample data carrying a non-sensitive comment label, obtained by fusing the text features of the sample with the user portrait features of the sample.
According to the embodiment of the application, after the text features of the comment information are extracted, the user portrait features are fused with the text features and the comment is classified on the basis of the fused feature data, so that deep feature mining can be performed on the comment information in both depth and breadth, which improves the accuracy of sensitive comment identification.
In an embodiment of the application, before extracting the text feature of the comment information, the method may further include:
performing word segmentation processing on the comment information to obtain each target word contained in the comment information;
detecting whether any sensitive word recorded in a pre-constructed sensitive word dictionary exists among the target words;
if a sensitive word recorded in the sensitive word dictionary exists among the target words, determining that the comment information is a sensitive comment;
and if no sensitive word recorded in the sensitive word dictionary exists among the target words, triggering the step of extracting the text features of the comment information.
In one embodiment of the present application, the sensitive words contained in the sensitive word dictionary may be expanded by:
expanding each sensitive word in the sensitive word dictionary according to the phonetic code of the Chinese character to obtain phonetic code expanded sensitive words;
expanding each sensitive word in the sensitive word dictionary according to the shape code of the Chinese character to obtain shape code expanded sensitive words;
expanding each sensitive word in the sensitive word dictionary according to the pinyin of the Chinese character to obtain pinyin expanded sensitive words;
and adding the phonetic code expanded sensitive words, the shape code expanded sensitive words and the pinyin expanded sensitive words to the sensitive word dictionary.
Further, expanding each sensitive word in the sensitive word dictionary according to the phonetic code of the Chinese character to obtain the phonetic code expanded sensitive words may include:
respectively calculating the phonetic code similarity between each Chinese character recorded in a Chinese character dictionary and a target Chinese character, wherein the target Chinese character is a Chinese character contained in a target sensitive word, and the target sensitive word is any sensitive word in the sensitive word dictionary;
determining the Chinese characters recorded in the Chinese character dictionary whose phonetic code similarity is greater than a set threshold as first Chinese characters to be replaced;
and replacing the target Chinese character in the target sensitive word with each first Chinese character to be replaced respectively to obtain the phonetic code expanded sensitive words;
expanding each sensitive word in the sensitive word dictionary according to the shape code of the Chinese character to obtain the shape code expanded sensitive words may include:
respectively calculating the shape code similarity between each Chinese character recorded in the Chinese character dictionary and the target Chinese character;
determining the Chinese characters recorded in the Chinese character dictionary whose shape code similarity is greater than a set threshold as second Chinese characters to be replaced;
and replacing the target Chinese character in the target sensitive word with each second Chinese character to be replaced respectively to obtain the shape code expanded sensitive words;
expanding each sensitive word in the sensitive word dictionary according to the pinyin of the Chinese character to obtain the pinyin expanded sensitive words may include:
replacing the target Chinese character in the target sensitive word with its pinyin to obtain the pinyin expanded sensitive word corresponding to the target sensitive word.
Furthermore, after determining the Chinese characters recorded in the Chinese character dictionary whose phonetic code similarity is greater than a set threshold as the first Chinese characters to be replaced, the method may further include:
respectively counting the number of occurrences of each first Chinese character to be replaced in a specified text;
sorting the first Chinese characters to be replaced in descending order of the number of occurrences;
screening out the first Chinese characters to be replaced that are ranked after a specified position;
after determining the Chinese characters recorded in the Chinese character dictionary whose shape code similarity is greater than a set threshold as the second Chinese characters to be replaced, the method may further include:
respectively counting the number of occurrences of each second Chinese character to be replaced in the specified text;
sorting the second Chinese characters to be replaced in descending order of the number of occurrences;
and screening out the second Chinese characters to be replaced that are ranked after the specified position.
In an embodiment of the present application, the text feature takes the form of a text feature vector, and fusing the text feature with the user portrait features of the target user to obtain the target feature data may include:
acquiring user portrait data of the target user;
normalizing the user portrait data;
constructing a user portrait feature vector according to the user portrait data after normalization processing;
splicing the user portrait feature vector and the text feature vector to obtain a target feature vector;
and performing dimensionality reduction on the target feature vector by using an auto-encoder to obtain the target feature data.
Further, the encoding network of the auto-encoder includes N convolution layers, activation function layers and down-sampling pooling layers cascaded in sequence, the decoding network of the auto-encoder includes N convolution layers, activation function layers and up-sampling pooling layers cascaded in sequence, and N is an integer greater than 2; performing dimensionality reduction on the target feature vector with the auto-encoder to obtain the target feature data may include:
inputting the target feature vector sequentially into the N convolution layers, activation function layers and down-sampling pooling layers of the encoding network to obtain intermediate feature data;
and inputting the intermediate feature data sequentially into the N convolution layers, activation function layers and up-sampling pooling layers of the decoding network to obtain the target feature data.
In a second aspect, an embodiment of the present application provides a sensitive comment identification apparatus, including:
the comment information acquisition module is used for acquiring comment information issued by a target user;
the text feature extraction module is used for extracting text features of the comment information;
the feature fusion module is used for fusing the text features and the user portrait features of the target user to obtain target feature data;
the sensitive comment identification module is used for inputting the target feature data into a trained sensitive comment recognition model for processing, and outputting, through the sensitive comment recognition model, a result indicating whether the comment information is a sensitive comment; the sensitive comment recognition model is trained on a sample set composed of sensitive comment samples and non-sensitive comment samples, where a sensitive comment sample is sample data carrying a sensitive comment label, obtained by fusing the text features of the sample with the user portrait features of the sample, and a non-sensitive comment sample is sample data carrying a non-sensitive comment label, obtained in the same way.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the sensitive comment identification method as set forth in the first aspect of the embodiment of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the sensitive comment identification method as set forth in the first aspect of the present application.
In a fifth aspect, the present application provides a computer program product, which when running on a terminal device, causes the terminal device to execute the sensitive comment identification method as set forth in the first aspect of the present application.
For the advantageous effects of the second to fifth aspects, reference may be made to the description of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of one embodiment of a sensitive review identification method provided by embodiments of the present application;
FIG. 2 is a schematic diagram of the sound-shape code encoding rules;
FIG. 3 is a block diagram of one embodiment of a sensitive comment identification device provided by an embodiment of the present application;
fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail. Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
With the development of network communication technology, more and more users post comments online. In order to reduce the influence of harmful comments on the network environment, social platforms usually adopt sensitive word filtering. At present, sensitive word filtering generally involves manually maintaining a sensitive word dictionary and filtering out a comment when a sensitive word contained in the dictionary is detected in it. However, this approach cannot capture the semantic features of the comment, and the preset sensitive word dictionary can hardly cover all the variant forms of sensitive words, so the accuracy of sensitive comment recognition is low.
In view of this, the application provides a sensitive comment identification method, a sensitive comment identification device, a terminal device and a storage medium, which can improve the accuracy of sensitive comment identification.
The embodiments of the application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
It should be understood that an execution subject of the sensitive comment identification method provided in the embodiment of the present application may be a mobile phone, a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), a large screen television, or other terminal device or server, and the embodiment of the present application does not limit the specific types of the terminal device and the server. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Referring to fig. 1, a first embodiment of a sensitive comment identification method in the embodiment of the present application includes:
101. obtaining comment information issued by a target user;
the execution subject of the embodiment of the application may be a server of a certain social platform, and the target user is any user registered in the social platform. After the target user posts comments through the social contact platform, the server can acquire corresponding comment information, and then executes subsequent processing steps to identify whether the comment information is sensitive comment. In practical operation, the comment information may be a length of Chinese or foreign language information.
102. Extracting text features of the comment information;
after the comment information is obtained, text features of the comment information need to be extracted, and the text features can be extracted by constructing deep neural networks of various types and structures.
In one implementation of the present application, a BERT model may be used to extract the text features of the comment information.
BERT is a neural network model proposed by Google. Unlike the traditional approach of pre-training with a one-way language model, or with a shallow concatenation of two one-way language models, it adopts a masked language model (MLM) so as to generate deep bidirectional language representations. Structurally, the BERT model consists of an input layer, an encoding layer and an output layer. The input to the BERT model is the superposition of the following three types of features of the comment information:
(1) Token embedding: word-level feature embedding (word vectors); for Chinese, only character-level feature embedding is currently supported;
(2) Segment embedding: sentence-level feature embedding; for a two-sentence input task both sentence A and sentence B are embedded, while a single-sentence input task only embeds sentence A;
(3) Position embedding: the position feature of each token; for Chinese, the maximum sequence length is currently 512.
BERT's encoding layer uses the Transformer encoder, which has strong feature extraction capability: it can capture long-distance dependencies, as an RNN does, while also allowing parallel computation, as a CNN does. Both abilities come mainly from the self-attention structure in the Transformer encoder: when the representation of the current token is computed, its context tokens are used at the same time, which extracts long-distance dependencies between tokens; and because the computation for each token is independent of the others, the features of all tokens can be computed in parallel.
After the comment information is input into the trained BERT model, each token of the comment information is encoded into a vector of a specified dimension (e.g., 768 dimensions); these vectors are referred to as the feature vectors of the comment information, i.e., the corresponding text features.
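As an illustration only, the following sketch shows how a comment could be encoded into a 768-dimensional text feature vector with a pre-trained BERT model. The HuggingFace transformers package, the bert-base-chinese checkpoint and the mean-pooling step are assumptions made for the sketch and are not prescribed by the embodiment.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def extract_text_feature(comment: str) -> torch.Tensor:
    """Encode one comment into a 768-dimensional text feature vector."""
    inputs = tokenizer(comment, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors into a single sentence-level vector;
    # taking the [CLS] vector would be an equally valid choice.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape: (768,)
```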
In an implementation manner of the present application, before extracting the text feature of the comment information, the method may further include:
(1) performing word segmentation processing on the comment information to obtain each target word contained in the comment information;
(2) detecting whether any sensitive word recorded in a pre-constructed sensitive word dictionary exists among the target words;
(3) if a sensitive word recorded in the sensitive word dictionary exists among the target words, determining that the comment information is a sensitive comment;
(4) if no sensitive word recorded in the sensitive word dictionary exists among the target words, performing the step of extracting the text features of the comment information and the subsequent steps.
First, the comment information is segmented with any of various existing word segmentation methods (such as jieba segmentation) to obtain the target words it contains. Then, it is detected whether any sensitive word recorded in the pre-constructed sensitive word dictionary exists among the target words. If so, the comment information can be directly determined to be a sensitive comment, and there is no need to continue with the subsequent step of extracting text features for sensitive comment identification; if not, the subsequent step of extracting text features of the comment information for sensitive comment identification is carried out. This processing fuses dictionary filtering with semantic recognition filtering, which further improves the accuracy of sensitive comment identification.
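The dictionary filtering step described above could be sketched as follows; jieba is only one possible word segmentation tool, and the dictionary entries are placeholders.

```python
import jieba

# Pre-constructed sensitive word dictionary (placeholder entries).
sensitive_word_dict = {"placeholder_sensitive_word_1", "placeholder_sensitive_word_2"}

def is_sensitive_by_dictionary(comment: str) -> bool:
    """Return True if any segmented target word appears in the sensitive word dictionary."""
    target_words = jieba.lcut(comment)  # word segmentation of the comment information
    return any(word in sensitive_word_dict for word in target_words)

# Only comments that pass this check go on to text feature extraction
# and model-based sensitive comment recognition.
```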
In the embodiment of the application, known sensitive words are collected in advance to form the sensitive word dictionary. However, to evade filtering, a malicious user often changes the form of a sensitive word, for example by replacing some of its characters with homophones, similar-looking characters or pinyin. Therefore, technical means are needed to expand the sensitive word dictionary, for example by expanding the sensitive words in the dictionary with homophones, similar-looking characters and pinyin, so that sensitive words in these changed forms can also be filtered out.
In one implementation of the present application, the sensitive words contained in the sensitive word dictionary may be augmented by:
(1) expanding each sensitive word in the sensitive word dictionary according to the phonetic code of the Chinese character to obtain phonetic code expanded sensitive words;
(2) expanding each sensitive word in the sensitive word dictionary according to the shape code of the Chinese character to obtain shape code expanded sensitive words;
(3) expanding each sensitive word in the sensitive word dictionary according to the pinyin of the Chinese character to obtain pinyin expanded sensitive words;
(4) adding the phonetic code expanded sensitive words, the shape code expanded sensitive words and the pinyin expanded sensitive words to the sensitive word dictionary.
The sound-shape code is a Chinese character encoding scheme that converts a Chinese character into a ten-digit sequence carrying certain pronunciation and glyph features. Fig. 2 is a schematic diagram of the sound-shape code encoding rules: in the ten-digit sequence shown in fig. 2, the first four digits encode the pronunciation of the character and respectively represent the vowel, the initial consonant, the consonant and the tone of the pronunciation; the last six digits encode the glyph and respectively represent the structure, the form and the stroke count of the Chinese character. The similarity between two single Chinese characters derived from the sound-shape code is calculated as follows:
d = w1·P + w2·S
w1 + w2 = 1
where P denotes the phonetic code similarity and w1 the weight of the phonetic code, and S denotes the shape code similarity and w2 the weight of the shape code. P can be obtained as a weighted sum of the similarities of the 4 phonetic code digits between the two compared Chinese characters, and S as a weighted sum of the similarities of the 6 shape code digits.
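A minimal sketch of this single-character similarity, assuming the ten-digit sound-shape codes are already available as strings and treating each digit comparison as a simple match; the per-digit weights are illustrative, since the text only fixes the overall form d = w1·P + w2·S.

```python
def segment_similarity(code_a: str, code_b: str, weights) -> float:
    """Weighted sum of per-digit matches between two code segments."""
    return sum(w * (a == b) for w, a, b in zip(weights, code_a, code_b))

def char_similarity(ssc_a: str, ssc_b: str, w1: float = 0.5, w2: float = 0.5) -> float:
    """ssc_a, ssc_b: ten-digit sound-shape codes (first 4 digits sound, last 6 shape)."""
    assert abs(w1 + w2 - 1.0) < 1e-9            # w1 + w2 = 1
    P = segment_similarity(ssc_a[:4], ssc_b[:4], [0.25] * 4)   # phonetic code similarity
    S = segment_similarity(ssc_a[4:], ssc_b[4:], [1 / 6] * 6)  # shape code similarity
    return w1 * P + w2 * S                      # d = w1*P + w2*S
```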
Further, the expanding each sensitive word in the sensitive word dictionary according to the phonetic code of the chinese character to obtain the phonetic code expanded sensitive word may include:
(1.1) respectively calculating the sound code similarity between each Chinese character recorded and received in a Chinese character dictionary and a target Chinese character, wherein the target Chinese character is a Chinese character contained in a target sensitive word, and the target sensitive word is any one sensitive word in the sensitive word dictionary;
(1.2) determining the Chinese character recorded in the Chinese character dictionary with the phonetic code similarity larger than a set threshold as a first Chinese character to be replaced;
and (1.3) replacing the target Chinese character in the target sensitive word by using each first Chinese character to be replaced respectively to obtain each sound code expanded sensitive word.
For each original sensitive word in the sensitive word dictionary, the corresponding phonetic code expanded sensitive words can be obtained in the same way. For example, for a certain target sensitive word, a target Chinese character contained in it is selected first, and then the phonetic code similarity between each Chinese character recorded in a Chinese character dictionary (such as the Xinhua dictionary) and the target Chinese character is calculated; the specific calculation follows the sound-shape code encoding rules described above. Next, the Chinese characters in the Chinese character dictionary whose phonetic code similarity is greater than a set threshold (for example, 70%) are determined as the Chinese characters to be replaced. Finally, the target Chinese character in the target sensitive word is replaced with each Chinese character to be replaced in turn, each replacement yielding a phonetic code expanded sensitive word. In addition, the number of target Chinese characters selected from the target sensitive word can be limited to a certain proportion (for example, 50%) of the total number of characters in the word. For example, the sensitive word dictionary contains the sensitive word "evil education"; since its length is 2, each expansion replaces a single character, and phonetic code expanded sensitive words such as "evil sedan" and "crab education" can be obtained.
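A sketch of the phonetic code expansion of one target sensitive word, replacing one character at a time as in the example above; the similarity function, the threshold and the character dictionary are assumed inputs.

```python
def expand_by_phonetic_code(target_word, chinese_dict, phonetic_similarity, threshold=0.7):
    """For each character of the target sensitive word, substitute every dictionary
    character whose phonetic code similarity with it exceeds the threshold."""
    expanded = set()
    for i, target_char in enumerate(target_word):
        candidates = [c for c in chinese_dict
                      if phonetic_similarity(c, target_char) > threshold]  # first Chinese characters to be replaced
        for c in candidates:
            expanded.add(target_word[:i] + c + target_word[i + 1:])
    return expanded
```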
Furthermore, after determining the Chinese characters recorded in the Chinese character dictionary whose phonetic code similarity is greater than a set threshold as the first Chinese characters to be replaced, the method may further include:
respectively counting the number of occurrences of each first Chinese character to be replaced in a specified text;
sorting the first Chinese characters to be replaced in descending order of the number of occurrences;
and screening out the first Chinese characters to be replaced that are ranked after a specified position.
Because the number of Chinese characters in the Chinese character dictionary whose phonetic code is highly similar to that of a target Chinese character can be large, a certain proportion of them are rare characters that are seldom used, and malicious users generally rarely use rare characters to disguise sensitive words. To avoid expanding a large number of meaningless sensitive words and to reduce the storage space occupied by the sensitive word dictionary, word frequency statistics can therefore be used, after the candidate replacement characters are determined, to screen out the rarely used ones. Specifically, the number of occurrences of each Chinese character to be replaced in a specified text (regarded as its usage frequency) can be counted; the specified text can be a publication such as the People's Daily or the Xinhua Daily, using data samples within a certain period (for example, 5 years). The Chinese characters to be replaced are then sorted in descending order of the number of occurrences, and those ranked after a specified position (for example, 10) are screened out; the screened-out characters are not used to replace the target Chinese character to form phonetic code expanded sensitive words.
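A sketch of the word frequency screening step, assuming the specified text is available as one string; the cut-off of 10 follows the example in the text.

```python
from collections import Counter

def keep_frequent_candidates(candidates, specified_text: str, top_k: int = 10):
    """Keep only the top_k candidate replacement characters by number of occurrences
    in the specified text; the rest are screened out."""
    counts = Counter(ch for ch in specified_text if ch in set(candidates))
    ranked = sorted(candidates, key=lambda ch: counts[ch], reverse=True)
    return ranked[:top_k]
```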
Further, expanding each sensitive word in the sensitive word dictionary according to the shape code of the Chinese character to obtain the shape code expanded sensitive words may include:
(2.1) respectively calculating the shape code similarity between each Chinese character recorded in the Chinese character dictionary and the target Chinese character;
(2.2) determining the Chinese characters recorded in the Chinese character dictionary whose shape code similarity is greater than a set threshold as the second Chinese characters to be replaced;
(2.3) replacing the target Chinese character in the target sensitive word with each second Chinese character to be replaced respectively to obtain the shape code expanded sensitive words.
Similar to the phonetic code expansion, the corresponding shape code expanded sensitive words can be obtained in the same way for each original sensitive word in the sensitive word dictionary. For a certain target sensitive word, a target Chinese character contained in it is selected first, and then the shape code similarity between each Chinese character recorded in the Chinese character dictionary (such as the Xinhua dictionary) and the target Chinese character is calculated; the specific calculation follows the sound-shape code encoding rules described above. Next, the Chinese characters in the Chinese character dictionary whose shape code similarity is greater than a set threshold (for example, 80%) are determined as the Chinese characters to be replaced. Finally, the target Chinese character in the target sensitive word is replaced with each Chinese character to be replaced in turn, each replacement yielding a shape code expanded sensitive word. In addition, the number of target Chinese characters selected from the target sensitive word can be limited to a certain proportion (for example, 50%) of the total number of characters in the word. For example, the sensitive word dictionary contains the sensitive word "evil education"; since its length is 2, each expansion replaces a single character with a similar-looking character, and shape code expanded sensitive words such as "chen education" and "jow education" can be obtained.
Further, after determining the Chinese characters recorded in the Chinese character dictionary whose shape code similarity is greater than a set threshold as the second Chinese characters to be replaced, the method may further include:
respectively counting the number of occurrences of each second Chinese character to be replaced in the specified text;
sorting the second Chinese characters to be replaced in descending order of the number of occurrences;
and screening out the second Chinese characters to be replaced that are ranked after the specified position.
Similar to the phonetic code expansion, in order to avoid expanding a large number of meaningless sensitive words and to reduce the storage space occupied by the sensitive word dictionary, word frequency statistics can also be used after the candidate replacement characters of the shape code expansion are determined, so as to screen out the rarely used ones. Specifically, the number of occurrences of each Chinese character to be replaced in the specified text (regarded as its usage frequency) is counted, the characters are sorted in descending order of the number of occurrences, and those ranked after the specified position (for example, 10) are screened out; the screened-out characters are not used to replace the target Chinese character to form shape code expanded sensitive words.
Further, expanding each sensitive word in the sensitive word dictionary according to the pinyin of the Chinese character to obtain the pinyin expanded sensitive words may include:
replacing the target Chinese character in the target sensitive word with its pinyin to obtain the pinyin expanded sensitive word corresponding to the target sensitive word.
For each sensitive word in the sensitive word dictionary, a certain number of the Chinese characters it contains can be selected and converted into pinyin to obtain pinyin expanded sensitive words. For example, the sensitive word "evil education" can be expanded by pinyin into "xie education" or "evil jiao".
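A sketch of the pinyin expansion, assuming the third-party pypinyin package is used to convert a character to pinyin; the embodiment does not prescribe a particular conversion tool.

```python
from pypinyin import lazy_pinyin

def expand_by_pinyin(target_word: str) -> set:
    """Replace one Chinese character of the sensitive word with its pinyin at a time."""
    expanded = set()
    for i, ch in enumerate(target_word):
        py = "".join(lazy_pinyin(ch))  # convert the single character to toneless pinyin
        expanded.add(target_word[:i] + py + target_word[i + 1:])
    return expanded
```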
After the sensitive word dictionary is expanded in this way, sensitive words in different forms can also be detected, which effectively improves the sensitive word filtering effect.
103. Fusing the text features and the user portrait features of the target user to obtain target feature data;
after the text features of the comment information are extracted, the text features and the user portrait features of the target user are fused to obtain target feature data. The user portrait characteristics can include the number of total violations of the target user, the frequency of posting comments, the user level, the user activity, the user violation type, the number of people concerned by the user, and the like.
In an implementation manner of the present application, the text feature takes the form of a text feature vector, and fusing the text feature with the user portrait features of the target user to obtain the target feature data may include:
(1) acquiring the user portrait data of the target user;
(2) normalizing the user portrait data;
(3) constructing a user portrait feature vector from the normalized user portrait data;
(4) splicing the user portrait feature vector and the text feature vector to obtain a target feature vector;
(5) performing dimensionality reduction on the target feature vector with an auto-encoder to obtain the target feature data.
After the user portrait data of the target user is acquired, it is normalized. Normalization scales the data proportionally so that it falls into a small specific interval, removes the units of the data and converts it into dimensionless pure values, so that indicators of different units or orders of magnitude can be compared and weighted; here the data are uniformly mapped onto the [0, 1] interval. For example, if 10 user portrait data values are obtained, the 10 values are first normalized and a 10-dimensional user portrait feature vector is then constructed, such as [0.12, 0.35, 0.78, ..., 0.45]. The text feature vector extracted with the BERT model can be a 768-dimensional vector (for example, [0.32, 0.55, 0.38, ..., 0.15]); splicing the two feature vectors horizontally yields a 778-dimensional target feature vector. Finally, the target feature vector is reduced in dimensionality with an auto-encoder to obtain lower-dimensional target feature data; for example, after the 778-dimensional target feature vector is processed by the auto-encoder, 50-dimensional target feature data can be obtained.
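A sketch of the normalization and splicing steps, assuming per-field minimum and maximum values are known for the user portrait data; the dimensions follow the example above (10 + 768 = 778).

```python
import numpy as np

def normalize_portrait(values: np.ndarray, field_min: np.ndarray, field_max: np.ndarray) -> np.ndarray:
    """Min-max scale each user portrait field onto the [0, 1] interval."""
    return (values - field_min) / (field_max - field_min + 1e-12)

def splice_features(portrait_vec: np.ndarray, text_vec: np.ndarray) -> np.ndarray:
    """Horizontally splice the 10-dim portrait vector with the 768-dim text vector -> 778 dims."""
    return np.concatenate([portrait_vec, text_vec])
```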
In an implementation manner of the embodiment of the present application, the encoding network of the auto-encoder includes N convolution layers, activation function layers and down-sampling pooling layers cascaded in sequence, the decoding network of the auto-encoder includes N convolution layers, activation function layers and up-sampling pooling layers cascaded in sequence, and N is an integer greater than 2; performing dimensionality reduction on the target feature vector with the auto-encoder to obtain the target feature data may include:
(1) inputting the target feature vector sequentially into the N convolution layers, activation function layers and down-sampling pooling layers of the encoding network to obtain intermediate feature data;
(2) inputting the intermediate feature data sequentially into the N convolution layers, activation function layers and up-sampling pooling layers of the decoding network to obtain the target feature data.
The auto-encoder used in the embodiment of the present application may be an AutoEncoder network, a type of unsupervised neural network. Its encoding network structure can be N sequentially cascaded convolution layers, ReLU activation function layers and down-sampling pooling layers; its decoding network structure can be N sequentially cascaded convolution layers, ReLU activation function layers and up-sampling pooling layers. Compared with networks such as VGG and ResNet, this network is shallow, has few parameters, is unsupervised, and does not require a large amount of training data. After the target feature vector is processed by the AutoEncoder network, the features are fused into lower-dimensional target feature data.
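A sketch of such an AutoEncoder with N = 3 stages in each network, written with PyTorch; the channel counts and kernel sizes are illustrative assumptions, and any additional projection needed to reach, e.g., 50 dimensions is not specified by the text and is omitted here.

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoding network: N = 3 cascaded convolution / ReLU / down-sampling pooling stages.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        # Decoding network: N = 3 cascaded convolution / ReLU / up-sampling pooling stages.
        self.decoder = nn.Sequential(
            nn.Conv1d(32, 16, kernel_size=3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
            nn.Conv1d(16, 8, kernel_size=3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
            nn.Conv1d(8, 1, kernel_size=3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a (batch, 1, 778) tensor holding the spliced target feature vector.
        intermediate = self.encoder(x)      # intermediate feature data
        return self.decoder(intermediate)   # used as the target feature data in the text
```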
104. Inputting the target feature data into a trained sensitive comment recognition model for processing, and outputting, through the sensitive comment recognition model, a result indicating whether the comment information is a sensitive comment.
After the text features and the user portrait features are fused into the target feature data, whether the comment information is a sensitive comment can be identified based on the target feature data. Since the target feature data contains both semantic features and user portrait features, identification based on it is more accurate. Specifically, the target feature data can be input into the trained sensitive comment recognition model to obtain a binary classification result, i.e., whether the comment information is a sensitive comment.
The sensitive comment recognition model is trained on a sample set composed of sensitive comment samples and non-sensitive comment samples, where a sensitive comment sample is sample data carrying a sensitive comment label, obtained by fusing the text features of the sample with the user portrait features of the sample, and a non-sensitive comment sample is sample data carrying a non-sensitive comment label, obtained in the same way. For example, a large number of comments posted by different people can be collected in advance and classified manually to obtain the corresponding sensitive or non-sensitive comment labels; the text features of these comments are then extracted and fused with the user portrait features of the corresponding people (the feature vector splicing method described above can be used for the fusion), yielding the corresponding sensitive comment samples and non-sensitive comment samples. A binary classification model (i.e., the sensitive comment recognition model) is then obtained through neural network training. In an implementation manner of the embodiment of the application, the sensitive comment recognition model may adopt an XGBoost model.
The XGBoost model is a gradient boosting tree model: the base models are generated sequentially, and the sum of all models is taken as the output. XGBoost performs a second-order Taylor expansion of the loss function, uses second-order derivative information to optimize it, and greedily decides whether to split a node according to whether the loss decreases. To prevent overfitting, XGBoost also adds regularization, a learning rate, column sampling, approximate optimal split points and other means, and it handles missing values with certain optimizations. When training the XGBoost model, a large number of sample comments (including sensitive comment samples and non-sensitive comment samples) can be collected, part of which serves as a training set and the rest as a test set; the XGBoost model is trained on the training set, then evaluated and tuned on the test set, and its internal parameters are updated continuously until the classification accuracy meets the set requirement.
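A sketch of training the binary sensitive comment recognition model with XGBoost on already-fused feature data; the train/test split ratio and the hyperparameters are illustrative assumptions.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_recognition_model(X: np.ndarray, y: np.ndarray):
    """X: fused feature data, one row per sample comment; y: 1 = sensitive, 0 = non-sensitive."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = xgb.XGBClassifier(
        n_estimators=200,
        learning_rate=0.1,        # shrinkage / learning rate
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,     # column sampling
        reg_lambda=1.0,           # L2 regularization
    )
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))  # evaluate on the test set
    return model, accuracy
```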
In summary, the application uses an artificial intelligence algorithm to extract semantic information from the comment text and combines it with the user's portrait data, so that features of the comment data can be mined in both depth and breadth; the mined features are then classified with the well-performing XGBoost model, which greatly improves the accuracy of sensitive comment identification.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 3 shows a block diagram of a sensitive comment identification apparatus provided in the embodiment of the present application, which corresponds to the sensitive comment identification method described in the above embodiment, and for convenience of explanation, only the relevant parts of the embodiment of the present application are shown.
Referring to fig. 3, the apparatus includes:
a comment information acquisition module 301, configured to acquire comment information issued by a target user;
a text feature extraction module 302, configured to extract a text feature of the comment information;
a feature fusion module 303, configured to fuse the text feature and the user portrait feature of the target user to obtain target feature data;
the sensitive comment identification module 304 is configured to input the target feature data into a trained sensitive comment identification model for processing, and output a result of whether the comment information is a sensitive comment or not through the sensitive comment identification model; the sensitive comment identification model is obtained by training a sensitive comment sample and a non-sensitive comment sample which are used as sample sets, the sensitive comment sample is sample data which is fused with the text features and the user portrait features of the sample and is provided with a sensitive comment label, and the non-sensitive comment sample is sample data which is fused with the text features and the user portrait features of the sample and is provided with a non-sensitive comment label.
In an implementation manner of the embodiment of the present application, the sensitive comment identifying apparatus may further include:
the word segmentation module is used for carrying out word segmentation processing on the comment information to obtain each target word contained in the comment information;
the sensitive word detection module is used for detecting whether sensitive words recorded in a pre-constructed sensitive word dictionary exist in each target word;
the sensitive comment judging module is used for judging that the comment information is a sensitive comment if the sensitive words recorded in the sensitive word dictionary exist in the target words;
and the triggering module is used for triggering the step of extracting the text features of the comment information if the sensitive words recorded in the sensitive word dictionary do not exist in the target words.
In an implementation manner of the embodiment of the present application, the sensitive comment identifying apparatus may further include:
the phonetic code expansion module is used for expanding each sensitive word in the sensitive word dictionary according to the phonetic code of the Chinese character to obtain the phonetic code expanded sensitive words;
the shape code expansion module is used for expanding each sensitive word in the sensitive word dictionary according to the shape code of the Chinese character to obtain the shape code expanded sensitive words;
the pinyin expansion module is used for expanding each sensitive word in the sensitive word dictionary according to the pinyin of the Chinese character to obtain the pinyin expanded sensitive words;
and the sensitive word adding module is used for adding the phonetic code expanded sensitive words, the shape code expanded sensitive words and the pinyin expanded sensitive words to the sensitive word dictionary.
Further, the phonetic code expansion module may include:
the phonetic code similarity calculation unit, which is used for respectively calculating the phonetic code similarity between each Chinese character recorded in the Chinese character dictionary and a target Chinese character, wherein the target Chinese character is a Chinese character contained in a target sensitive word and the target sensitive word is any sensitive word in the sensitive word dictionary;
the first Chinese character to be replaced determining unit, which is used for determining the Chinese characters recorded in the Chinese character dictionary whose phonetic code similarity is greater than a set threshold as the first Chinese characters to be replaced;
the first Chinese character replacing unit, which is used for replacing the target Chinese character in the target sensitive word with each first Chinese character to be replaced respectively to obtain the phonetic code expanded sensitive words;
the shape code expansion module may include:
the shape code similarity calculation unit is used for respectively calculating the shape code similarity between each Chinese character recorded in the Chinese character dictionary and the target Chinese character;
a second Chinese character to be replaced determining unit, configured to determine a Chinese character included in the Chinese character dictionary with the shape code similarity larger than a set threshold as a second Chinese character to be replaced;
the second Chinese character replacing unit is used for replacing the target Chinese character in the target sensitive word by using each second Chinese character to be replaced respectively to obtain each sensitive word expanded by the shape code;
the pinyin expansion module may include:
and the pinyin replacement unit is used for replacing the target Chinese characters in the target sensitive words with pinyin to obtain pinyin expansion sensitive words corresponding to the target sensitive words.
Furthermore, the phonetic code expansion module may further include:
the first counting unit, which is used for respectively counting the number of occurrences of each first Chinese character to be replaced in the specified text;
the first Chinese character sorting unit, which is used for sorting the first Chinese characters to be replaced in descending order of the number of occurrences;
the first Chinese character screening unit, which is used for screening out the first Chinese characters to be replaced that are ranked after the specified position;
the shape code expansion module may further include:
the second counting unit, which is used for respectively counting the number of occurrences of each second Chinese character to be replaced in the specified text;
the second Chinese character sorting unit, which is used for sorting the second Chinese characters to be replaced in descending order of the number of occurrences;
and the second Chinese character screening unit, which is used for screening out the second Chinese characters to be replaced that are ranked after the specified position.
In an implementation manner of the embodiment of the present application, the feature fusion module may include:
a portrait data acquisition unit for acquiring user portrait data of the target user;
the normalization processing unit is used for performing normalization processing on the user portrait data;
the feature vector construction unit is used for constructing a user portrait feature vector according to the user portrait data after normalization processing;
the feature vector splicing unit is used for splicing the user portrait feature vector and the text feature vector to obtain a target feature vector;
and the feature vector dimension reduction unit is used for performing dimension reduction processing on the target feature vector by using an autoencoder to obtain the target feature data.
Further, the encoding network of the auto-encoder includes N convolution layers, activation function layers and down-sampling pooling layers cascaded in sequence, the decoding network of the auto-encoder includes N convolution layers, activation function layers and up-sampling pooling layers cascaded in sequence, and N is an integer greater than 2; the feature vector dimension reduction unit may include:
the first processing subunit is used for sequentially inputting the target feature vector into the N convolution layers, the activation function layer and the down-sampling pooling layer of the coding network to obtain intermediate feature data;
and the second processing subunit is used for sequentially inputting the intermediate characteristic data into the N convolution layers, the activation function layer and the up-sampling pooling layer of the decoding network to obtain the target characteristic data.
Embodiments of the present application further provide a computer-readable storage medium, which stores computer-readable instructions, and when executed by a processor, the computer-readable instructions implement any one of the sensitive comment identification methods as shown in fig. 1.
Embodiments of the present application further provide a computer program product which, when run on a server, causes the server to execute any one of the sensitive comment identification methods as shown in fig. 1.
Fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 4, the terminal device 4 of this embodiment includes: a processor 40, a memory 41, and computer readable instructions 42 stored in the memory 41 and executable on the processor 40. The processor 40, when executing the computer readable instructions 42, implements the steps in the various sensitive comment identification method embodiments described above, such as steps 101 through 104 shown in fig. 1. Alternatively, the processor 40, when executing the computer readable instructions 42, implements the functions of the modules/units in the above device embodiments, such as the functions of the modules 301 to 304 shown in fig. 3.
Illustratively, the computer readable instructions 42 may be partitioned into one or more modules/units that are stored in the memory 41 and executed by the processor 40 to implement the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the segments are used for describing the execution process of the computer readable instructions 42 in the terminal device 4.
The terminal device 4 may be a computing device such as a smart phone, a notebook computer, a palmtop computer, or a cloud terminal device. The terminal device 4 may include, but is not limited to, the processor 40 and the memory 41. It will be understood by those skilled in the art that fig. 4 is only an example of the terminal device 4 and does not constitute a limitation on the terminal device 4, which may include more or fewer components than those shown, combine some components, or use different components; for example, the terminal device 4 may further include input-output devices, network access devices, a bus, etc.
The processor 40 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 41 may be an internal storage unit of the terminal device 4, such as a hard disk or an internal memory of the terminal device 4. The memory 41 may also be an external storage device of the terminal device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) equipped on the terminal device 4. Further, the memory 41 may include both an internal storage unit and an external storage device of the terminal device 4. The memory 41 is used for storing the computer readable instructions and other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
It should be noted that the information interaction and execution processes between the above devices/units, as well as their specific functions and technical effects, are based on the same concept as the method embodiments of the present application; for details, reference may be made to the method embodiments, which are not repeated here.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the functional units and modules described above is illustrated. In practical applications, the functions may be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit, and the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the above embodiments can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing apparatus/terminal apparatus, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash disk, a removable hard disk, a magnetic disk or an optical disk.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A sensitive comment identification method, comprising:
obtaining comment information issued by a target user;
extracting text features of the comment information;
fusing the text features and the user portrait features of the target user to obtain target feature data;
inputting the target feature data into a trained sensitive comment recognition model for processing, and outputting, through the sensitive comment recognition model, a result indicating whether the comment information is a sensitive comment;
wherein the sensitive comment recognition model is obtained by training on a sample set of sensitive comment samples and non-sensitive comment samples, a sensitive comment sample being sample data that fuses the text features and the user portrait features of the sample and carries a sensitive comment label, and a non-sensitive comment sample being sample data that fuses the text features and the user portrait features of the sample and carries a non-sensitive comment label.
2. The method of claim 1, further comprising, prior to extracting the text features of the comment information:
performing word segmentation processing on the comment information to obtain each target word contained in the comment information;
detecting whether any of the target words is a sensitive word recorded in a pre-constructed sensitive word dictionary;
if any of the target words is a sensitive word recorded in the sensitive word dictionary, determining that the comment information is a sensitive comment;
and if none of the target words is a sensitive word recorded in the sensitive word dictionary, triggering the step of extracting the text features of the comment information.
3. The method of claim 2, wherein the sensitive words in the sensitive word dictionary are expanded by:
expanding each sensitive word in the sensitive word dictionary according to the sound codes of Chinese characters to obtain sound code expanded sensitive words;
expanding each sensitive word in the sensitive word dictionary according to the shape codes of Chinese characters to obtain shape code expanded sensitive words;
expanding each sensitive word in the sensitive word dictionary according to the pinyin of Chinese characters to obtain pinyin expanded sensitive words;
and adding the sound code expanded sensitive words, the shape code expanded sensitive words and the pinyin expanded sensitive words to the sensitive word dictionary.
4. The method of claim 3, wherein the expanding each sensitive word in the sensitive word dictionary according to the sound codes of Chinese characters to obtain the sound code expanded sensitive words comprises:
respectively calculating the sound code similarity between each Chinese character recorded in a Chinese character dictionary and a target Chinese character, wherein the target Chinese character is a Chinese character contained in a target sensitive word, and the target sensitive word is any sensitive word in the sensitive word dictionary;
determining the Chinese characters recorded in the Chinese character dictionary whose sound code similarity is larger than a set threshold as first Chinese characters to be replaced;
and replacing the target Chinese character in the target sensitive word with each first Chinese character to be replaced respectively to obtain the sound code expanded sensitive words;
the expanding each sensitive word in the sensitive word dictionary according to the shape codes of Chinese characters to obtain the shape code expanded sensitive words comprises:
respectively calculating the shape code similarity between each Chinese character recorded in the Chinese character dictionary and the target Chinese character;
determining the Chinese characters recorded in the Chinese character dictionary whose shape code similarity is larger than a set threshold as second Chinese characters to be replaced;
and replacing the target Chinese character in the target sensitive word with each second Chinese character to be replaced respectively to obtain the shape code expanded sensitive words;
the expanding each sensitive word in the sensitive word dictionary according to the pinyin of Chinese characters to obtain the pinyin expanded sensitive words comprises:
replacing the target Chinese character in the target sensitive word with its pinyin to obtain the pinyin expanded sensitive word corresponding to the target sensitive word.
5. The method of claim 4, wherein after determining the Chinese characters recorded in the Chinese character dictionary whose sound code similarity is larger than a set threshold as the first Chinese characters to be replaced, the method further comprises:
respectively counting the number of occurrences of each first Chinese character to be replaced in a specified text;
sorting the first Chinese characters to be replaced in descending order of their occurrence counts;
and screening out, from the first Chinese characters to be replaced, the Chinese characters ranked after a specified position;
and after determining the Chinese characters recorded in the Chinese character dictionary whose shape code similarity is larger than a set threshold as the second Chinese characters to be replaced, the method further comprises:
respectively counting the number of occurrences of each second Chinese character to be replaced in the specified text;
sorting the second Chinese characters to be replaced in descending order of their occurrence counts;
and screening out, from the second Chinese characters to be replaced, the Chinese characters ranked after the specified position.
6. The method of any one of claims 1 to 5, wherein the text features are in the form of a text feature vector, and the fusing the text features and the user portrait features of the target user to obtain target feature data comprises:
acquiring user portrait data of the target user;
normalizing the user portrait data;
constructing a user portrait feature vector according to the user portrait data after normalization processing;
splicing the user portrait feature vector and the text feature vector to obtain a target feature vector;
and performing dimension reduction processing on the target feature vector by using an autoencoder to obtain the target feature data.
7. The method of claim 6, wherein the encoding network of the autoencoder comprises N convolutional layers, activation function layers and down-sampling pooling layers which are sequentially cascaded, the decoding network of the autoencoder comprises N convolutional layers, activation function layers and up-sampling pooling layers which are sequentially cascaded, N is an integer greater than 2, and the performing dimension reduction processing on the target feature vector by using the autoencoder to obtain the target feature data comprises:
inputting the target feature vector sequentially into the N convolutional layers, activation function layers and down-sampling pooling layers of the encoding network to obtain intermediate feature data;
and inputting the intermediate feature data sequentially into the N convolutional layers, activation function layers and up-sampling pooling layers of the decoding network to obtain the target feature data.
8. A sensitive comment recognition apparatus comprising:
the comment information acquisition module is used for acquiring comment information issued by a target user;
the text feature extraction module is used for extracting text features of the comment information;
the feature fusion module is used for fusing the text features and the user portrait features of the target user to obtain target feature data;
the sensitive comment identification module is used for inputting the target feature data into a trained sensitive comment recognition model for processing, and outputting, through the sensitive comment recognition model, a result indicating whether the comment information is a sensitive comment; wherein the sensitive comment recognition model is obtained by training on a sample set of sensitive comment samples and non-sensitive comment samples, a sensitive comment sample being sample data that fuses the text features and the user portrait features of the sample and carries a sensitive comment label, and a non-sensitive comment sample being sample data that fuses the text features and the user portrait features of the sample and carries a non-sensitive comment label.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the sensitive comment identification method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a sensitive comment identification method according to any one of claims 1 to 7.
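For illustration only, the dictionary-based pre-screening of claim 2 and the pinyin and sound code expansion of claims 3 and 4 can be sketched as follows. The third-party libraries jieba (word segmentation) and pypinyin (pinyin conversion), the toy sensitive word dictionary, and the use of toneless-pinyin equality as a stand-in for sound code similarity are all assumptions of this sketch, not the claimed implementation.

import jieba                      # third-party Chinese word segmentation
from pypinyin import lazy_pinyin  # third-party pinyin conversion

sensitive_word_dictionary = {"暴力"}  # toy dictionary with a single sensitive word

def is_sensitive_by_dictionary(comment):
    # Claim 2: segment the comment and check whether any target word is a recorded sensitive word.
    return any(word in sensitive_word_dictionary for word in jieba.lcut(comment))

def expand_by_pinyin(target_sensitive_word):
    # Claim 4, pinyin branch: replace the Chinese characters of the target sensitive word with pinyin.
    return "".join(lazy_pinyin(target_sensitive_word))

def expand_by_sound_code(target_sensitive_word, chinese_character_dictionary):
    # Simplified stand-in for sound code similarity: accept a candidate character when its
    # toneless pinyin equals that of the target Chinese character, then substitute it in.
    expanded = set()
    for i, target_char in enumerate(target_sensitive_word):
        target_pinyin = lazy_pinyin(target_char)[0]
        for candidate in chinese_character_dictionary:
            if candidate != target_char and lazy_pinyin(candidate)[0] == target_pinyin:
                expanded.add(target_sensitive_word[:i] + candidate + target_sensitive_word[i + 1:])
    return expanded

# Toy usage
print(is_sensitive_by_dictionary("这条评论宣扬暴力"))       # True
print(expand_by_pinyin("暴力"))                             # baoli
print(expand_by_sound_code("暴力", ["爆", "立", "利"]))      # {'爆力', '暴立', '暴利'}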
CN202111134559.XA 2021-09-27 2021-09-27 Sensitive comment identification method and device, terminal equipment and storage medium Pending CN113868417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111134559.XA CN113868417A (en) 2021-09-27 2021-09-27 Sensitive comment identification method and device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111134559.XA CN113868417A (en) 2021-09-27 2021-09-27 Sensitive comment identification method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113868417A true CN113868417A (en) 2021-12-31

Family

ID=78991109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111134559.XA Pending CN113868417A (en) 2021-09-27 2021-09-27 Sensitive comment identification method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113868417A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114358007A (en) * 2022-01-11 2022-04-15 平安科技(深圳)有限公司 Multi-label identification method and device, electronic equipment and storage medium
WO2023134084A1 (en) * 2022-01-11 2023-07-20 平安科技(深圳)有限公司 Multi-label identification method and apparatus, electronic device, and storage medium

Similar Documents

Publication Publication Date Title
CN110363084A (en) A kind of class state detection method, device, storage medium and electronics
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112347367A (en) Information service providing method, information service providing device, electronic equipment and storage medium
CN111475622A (en) Text classification method, device, terminal and storage medium
CN112580328A (en) Event information extraction method and device, storage medium and electronic equipment
CN114626097A (en) Desensitization method, desensitization device, electronic apparatus, and storage medium
CN113315789A (en) Web attack detection method and system based on multi-level combined network
CN112347223A (en) Document retrieval method, document retrieval equipment and computer-readable storage medium
CN114490953B (en) Method for training event extraction model, method, device and medium for extracting event
CN113221553A (en) Text processing method, device and equipment and readable storage medium
CN114021582A (en) Spoken language understanding method, device, equipment and storage medium combined with voice information
CN113868417A (en) Sensitive comment identification method and device, terminal equipment and storage medium
CN111797247B (en) Case pushing method and device based on artificial intelligence, electronic equipment and medium
CN114118398A (en) Method and system for detecting target type website, electronic equipment and storage medium
CN114741468B (en) Text deduplication method, device, equipment and storage medium
CN113255331B (en) Text error correction method, device and storage medium
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN115357711A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
CN114595357A (en) Video searching method and device, electronic equipment and storage medium
CN113706207A (en) Order transaction rate analysis method, device, equipment and medium based on semantic analysis
CN113836297A (en) Training method and device for text emotion analysis model
CN113688232A (en) Method and device for classifying bidding texts, storage medium and terminal
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
CN114818644B (en) Text template generation method, device, equipment and storage medium
CN111680513B (en) Feature information identification method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220921

Address after: Room 2601 (Unit 07), Qianhai Free Trade Building, No. 3048, Xinghai Avenue, Nanshan Street, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000

Applicant after: Shenzhen Ping An Smart Healthcare Technology Co.,Ltd.

Address before: 1-34 / F, Qianhai free trade building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong 518000

Applicant before: Ping An International Smart City Technology Co.,Ltd.