CN112487149B - Text auditing method, model, equipment and storage medium - Google Patents

Text auditing method, model, equipment and storage medium

Info

Publication number
CN112487149B
Authority
CN
China
Prior art keywords
text
training
sensitive
classification model
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011439157.6A
Other languages
Chinese (zh)
Other versions
CN112487149A (en)
Inventor
班涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Nuonuo Network Technology Co ltd
Original Assignee
Zhejiang Nuonuo Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Nuonuo Network Technology Co ltd filed Critical Zhejiang Nuonuo Network Technology Co ltd
Priority to CN202011439157.6A priority Critical patent/CN112487149B/en
Publication of CN112487149A publication Critical patent/CN112487149A/en
Application granted granted Critical
Publication of CN112487149B publication Critical patent/CN112487149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/35 Clustering; Classification (information retrieval of unstructured textual data)
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/216 Parsing using statistical methods
    • G06F40/242 Dictionaries (lexical tools)
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a text auditing method, model, equipment and storage medium, comprising the following steps: acquiring a sensitive word list and a training set; constructing a text classification model based on a machine learning algorithm, and extracting feature information of the training text with the text classification model to obtain a target corpus vector; performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text; training the text classification model with the target corpus vector and the target vocabulary vector to obtain a trained text classification model; and inputting the text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive category and the confidence output by the trained text classification model. By fusing sensitive word matching with text classification, the method provides an end-to-end text auditing function that reduces manual review cost and improves text auditing efficiency and accuracy.

Description

Text auditing method, model, equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text auditing method, a model, equipment and a storage medium.
Background
The growth of network information media such as mobile phones allows more users to participate in spreading information online, which increases the speed and reach of information transmission and poses great challenges for website content management. Most current text review approaches combine a matching model with manual review: information is first screened by the model to relieve the pressure on human reviewers, and the matching model is mainly either sensitive word matching or a classification model. Sensitive word matching is symbol-based and produces many misjudgments: it cannot recognize variant sensitive content such as sensitive words with inserted symbols, Chinese-English mixtures, or pinyin-character mixtures, nor can it correctly handle sentences that contain a sensitive word but are not actually sensitive. The classification model is mainly based on statistical analysis and suffers from drawbacks such as providing no concrete basis for its judgment and insufficient flexibility, so its overall auditing capability is limited. In summary, the prior art has at least the technical problems of low text review accuracy and high manual review cost.
Disclosure of Invention
In view of this, the present invention provides a text review method, a model, a device, and a storage medium, which can reduce the cost of manual review and improve the efficiency and accuracy of text review. The specific scheme is as follows:
a first aspect of the present application provides a text auditing method, including:
acquiring a sensitive word list and a training set; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
constructing a text classification model based on a machine learning algorithm, and extracting feature information of the training text by using the text classification model to obtain a target corpus vector;
performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text;
training the text classification model by using the target corpus vector and the target vocabulary vector to obtain a trained text classification model;
inputting the text to be detected into the trained text classification model, and determining the auditing result of the text to be detected based on the sensitive category and the confidence of the text to be detected output by the trained text classification model.
Optionally, the constructing a text classification model based on a machine learning algorithm, and extracting feature information of the training text by using the text classification model to obtain a target corpus vector includes:
constructing a text classification model by using a text convolution neural network with an activation function as a linear rectification function;
mapping the training text into a random vector through an embedding layer of the text classification model to obtain a vectorized text;
and extracting the characteristic information of the vectorization text by using the convolution layer and the maximum pooling layer to obtain a target corpus vector.
Optionally, the performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to a sensitive word in the training text includes:
constructing a dictionary tree by using a finite automaton algorithm according to the sensitive word list;
and extracting target sensitive words in the training text by using the dictionary tree, and processing the target sensitive words by using the one-hot coding to obtain target vocabulary vectors.
Optionally, after performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector, the method further includes:
and judging whether the dimension of the target vocabulary vector is consistent with the dimension of the target corpus vector or not, and if the dimension of the target vocabulary vector is inconsistent with the dimension of the target corpus vector, adjusting the dimension of the target vocabulary vector to be consistent with the dimension of the target corpus vector in an addition or dot product mode.
Optionally, the training the text classification model by using the target corpus vector and the target vocabulary vector includes:
splicing the target corpus vector and the target vocabulary vector to obtain a spliced vector;
and training a full connection layer and a classifier in the text classification model by using the splicing vector.
Optionally, the determining an audit result of the text to be detected based on the output result of the trained text classification model includes:
acquiring a sensitive word matching table corresponding to the text to be detected; the sensitive word matching table comprises the text to be detected and sensitive words obtained by matching the sensitive words of the text to be detected based on the sensitive word table;
and determining the auditing result of the text to be detected according to the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model and the sensitive word matching table.
Optionally, the obtaining the training set includes:
and performing sensitive word matching on the unmarked training texts by using a dictionary tree constructed based on the sensitive word list, and performing sensitive category marking on the corresponding training texts according to matching results to obtain the training set containing the training texts and corresponding label information.
A second aspect of the present application provides a text audit model, comprising:
the data acquisition interface is used for acquiring the sensitive word list, the training set and the text to be detected; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
the text classification model is used for outputting the sensitivity category and the confidence coefficient of the text to be detected;
the training device is used for extracting the characteristic information of the training text by using a text classification model constructed based on a machine learning algorithm to obtain a target corpus vector of the training text, performing sensitive word matching on the training text based on the sensitive vocabulary to obtain a target vocabulary vector corresponding to a sensitive word in the training text, and training the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model.
A third aspect of the application provides an electronic device comprising a processor and a memory; wherein the memory is used for storing a computer program which is loaded and executed by the processor to implement the aforementioned text auditing method.
A fourth aspect of the present application provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are loaded and executed by a processor, the foregoing text auditing method is implemented.
In the application, a sensitive word list and a training set are obtained first, wherein the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts. And secondly, extracting the characteristic information of the training text by using a text classification model to obtain a target corpus vector, and performing sensitive word matching on the training text based on a sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text. And then, training the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model. And finally, inputting the text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive class and the confidence of the text to be detected output by the trained text classification model. The method and the device have the advantages that the end-to-end text auditing function of the fusion mode is realized by using the sensitive word matching and the text classification, the manual review cost is reduced, and the text auditing efficiency and accuracy are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a text auditing method provided in the present application;
FIG. 2 is a schematic diagram of a specific text auditing method provided in the present application;
FIG. 3 is a flowchart of a specific text auditing method provided by the present application;
fig. 4 is a schematic diagram of a training set construction provided in the present application;
FIG. 5 is a schematic diagram of a text audit model provided herein;
fig. 6 is a block diagram of an electronic device for text auditing according to the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Most current text review approaches combine a matching model with manual review: information is screened by the model to relieve the pressure on human reviewers, and the matching model mainly relies on sensitive word matching or a classification model. However, these approaches produce many misjudgments: sensitive word matching cannot recognize variant sensitive content such as sensitive words with inserted symbols, Chinese-English mixtures, or pinyin-character mixtures, nor can it correctly handle sentences that contain a sensitive word but are not actually sensitive; the classification model cannot provide a concrete basis for its judgment and lacks flexibility, so the overall auditing capability is limited. To overcome these technical problems, the text review scheme provided by the application can reduce the manual review cost and improve the text review efficiency and accuracy.
Fig. 1 is a flowchart of a text auditing method according to an embodiment of the present application. Referring to fig. 1, the text auditing method includes:
S11: Acquiring a sensitive word list and a training set; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts.
In this embodiment, a sensitive word list and a training set need to be obtained first. The sensitive word list is used for sensitive word matching: the more comprehensive the words in the list are, the more accurate the resulting auditing results will be. It may be a comprehensive word library covering multiple categories such as violence, corruption, livelihood and the like, or a category-specific sensitive word list chosen according to business requirements. The training set is used for training the text classification model and comprises training texts together with the label information obtained after sensitive category labeling. A training text may be a sentence, a paragraph and so on; the sensitive categories are "sensitive" and "not sensitive", which can be represented by 0 and 1 respectively, with texts labeled sensitive serving as positive samples and texts labeled not sensitive as negative samples. The training texts are used to construct corpus vectors and the sensitive word list is used to construct vocabulary vectors; by associating the information before and after each word in the labeled corpus, the model learns the contextual relationships between words, which provides strong text auditing capability and improves the ability to catch sensitive content hidden by simple substitutions. An illustrative data layout is sketched below.
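As an illustrative aid only, the sensitive word list and training set described above might be laid out as follows; the concrete texts and words are placeholders, and the 0/1 label encoding (here 1 stands for the sensitive category) is an assumption of this sketch rather than a value fixed by the disclosure.

```python
# Illustrative data layout; texts, words and the label encoding are placeholders.
sensitive_word_list = ["敏感词A", "敏感词B"]

training_set = [
    {"text": "这句话里出现了敏感词A", "label": 1},  # positive sample (labeled sensitive)
    {"text": "这是一条普通的业务文本", "label": 0},  # negative sample (labeled not sensitive)
]
```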
In one embodiment, the sensitive word list and a training set already labeled with sensitive categories are provided at the same time, which reduces the computational load on the system. It should be noted that the sensitive category labels of the training texts should be correlated with the sensitive word list as much as possible, i.e. construction of the training set should take the sensitive words in the word list into account. In another embodiment, only the sensitive word list and unlabeled training texts are provided, and the training set is obtained by processing the training texts based on the sensitive word list; a training set obtained in this way is consistent with the sensitive word list, so the corpus vectors and vocabulary vectors obtained in the subsequent steps are relatively well unified.
S12: and constructing a text classification model based on a machine learning algorithm, and extracting the characteristic information of the training text by using the text classification model to obtain a target corpus vector.
In this embodiment, the text classification model includes an embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a classifier, where the embedding layer, convolutional layer and pooling layer perform feature extraction on the vectorized training texts. The text classification model constructed here based on a machine learning algorithm is trained on the training set, and the network used for feature extraction, i.e. for obtaining the target corpus vector of the training text, may be a text convolutional network (Text-CNN), a long short-term memory network (LSTM), a gated recurrent unit network (GRU), or the like.
It should be noted that before the feature information of the training text is extracted with the text classification model, the training text needs to be cleaned. There are many cleaning methods, including but not limited to case conversion, traditional-simplified Chinese conversion, full-width to half-width conversion, and stop-word removal. Feature information is then extracted from the cleaned training text with the text classification model to obtain the target corpus vector, which is used to train the fully connected layer and the classifier in the text classification model.
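As an illustrative aid only, a minimal Python sketch of such a cleaning step follows; it covers lower-casing, full-width to half-width folding and stop-word removal, while traditional-simplified conversion would need an external converter and is omitted. The function name and the character-level treatment of stop words are assumptions of this sketch.

```python
import unicodedata

def clean_text(text: str, stopwords: set) -> str:
    """Minimal cleaning sketch: lower-case, fold full-width characters to half-width,
    and drop stop characters."""
    # NFKC normalization folds full-width ASCII forms into their half-width equivalents
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in text if ch not in stopwords)

# usage sketch
print(clean_text("ＡＢＣ，这是Ｔｅｓｔ文本！", stopwords={",", "!", " "}))  # -> "abc这是test文本"
```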
S13: and performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text.
In this embodiment, the sensitive words in the training text are mainly matched according to the sensitive word list, and vectorization processing is further performed on the sensitive words to obtain target vocabulary vectors corresponding to the sensitive words in the training text, where the target vocabulary vectors are used to train the full connection layer and the classifier in the text classification model. In this embodiment, sensitive word matching is performed on the training text by using a method of a dictionary tree, and further, the dictionary tree is constructed by using a Deterministic Finite Automaton (DFA) algorithm.
Sensitive word matching alone is suitable for auditing texts with obvious characteristic words: it can deliver good auditing capability with relatively little data and runs and deploys quickly. However, it cannot accurately identify content whose characteristic words are not obvious or variant sensitive content, such as sensitive words with inserted symbols, Chinese-English mixtures, or pinyin-character mixtures. Meanwhile, sentences that contain a sensitive word without being sensitive content may be misidentified, for example sentences that negate the sensitive content, or sentences that merely happen to match a sensitive word without being related to it, especially when the sensitive word is short. The auditing capability of sensitive word matching on its own is therefore limited; after obtaining the target vocabulary vector corresponding to the sensitive words in the training text, combining it with the corpus information of the training text can considerably improve text auditing accuracy.
S14: and training the text classification model by using the target corpus vector and the target vocabulary vector to obtain a trained text classification model.
In this embodiment, after the training text has been processed into its target corpus vector and target vocabulary vector, these vectors are input into the fully connected layer of the text classification model for training, so that the trained model can directly output the sensitive category and corresponding confidence of a text to be detected. Training the text classification model with the target corpus vector and the target vocabulary vector may also be called model aggregation. In the fully connected layer, dropout (random deactivation) reduces the number of active neurons and curbs over-fitting, while the classification result output by the classifier, a normalized exponential function (Softmax), is optimized with the adaptive moment estimation optimizer (Adam) to obtain the final classification; the loss function to be optimized is the cross entropy. The Softmax function is commonly added to the output layer of a neural network to convert its outputs into relative probabilities, i.e. the confidence of the text classification model on the classification task.
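As an illustrative aid only, a minimal PyTorch sketch of this aggregation stage follows; the feature dimension (assumed sizes of the corpus and vocabulary vectors), dropout rate and learning rate are assumptions of this sketch, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class AuditHead(nn.Module):
    """Fully connected layer with dropout; Softmax confidences are taken at inference time."""
    def __init__(self, feature_dim: int, num_classes: int = 2, dropout: float = 0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)            # random deactivation to curb over-fitting
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, spliced_vector: torch.Tensor) -> torch.Tensor:
        return self.fc(self.dropout(spliced_vector))  # raw logits

def train_step(head, spliced_batch, labels, optimizer):
    """One optimization step with cross entropy as the loss to optimize."""
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = criterion(head(spliced_batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# usage sketch (dimensions assumed): Adam optimizes the classifier, Softmax gives the confidence
head = AuditHead(feature_dim=300 + 128)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
# confidence = torch.softmax(head(spliced_batch), dim=1)
```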
S15: inputting the text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model.
In this embodiment, a text to be detected is input into the trained text classification model, the trained text classification model firstly performs feature extraction on the text to be detected to obtain a corpus vector of the text to be detected, then performs sensitive word matching on the text to be detected to obtain a vocabulary vector of the text to be detected, and finally outputs a sensitive category and a confidence of the text to be detected after auditing and predicting the text to be detected based on the corpus vector and the vocabulary vector of the text to be detected. And determining an auditing result of the text to be detected based on the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model.
Therefore, the sensitive word list and the training set are obtained firstly, wherein the training set comprises the training texts and the label information obtained after sensitive category labeling is carried out on the training texts. And secondly, extracting the characteristic information of the training text by using a text classification model to obtain a target corpus vector, and performing sensitive word matching on the training text based on a sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text. And then, training the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model. And finally, inputting the text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive class and the confidence of the text to be detected output by the trained text classification model. According to the method and the device, the end-to-end text auditing function of the fusion mode is realized by using sensitive word matching and text classification, the manual review cost is reduced, and the text auditing efficiency and accuracy are improved.
Fig. 2 is a flowchart of a specific text auditing method according to an embodiment of the present application. Referring to fig. 2, the text auditing method includes:
S21: Acquiring a sensitive word list and a training set; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts.
In this embodiment, as to the specific process of the step S21, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
S22: and constructing a text classification model for the text convolution neural network of the linear rectification function by using the activation function.
S23: mapping the training text into a random vector through an embedding layer of the text classification model to obtain a vectorized text; and extracting the characteristic information of the vectorization text by utilizing the convolutional layer and the maximum pooling layer to obtain a target corpus vector.
In this embodiment, the text classification model is built with a text convolutional neural network, i.e. a convolutional neural network applied to the text classification task, which is fast, extracts features efficiently, and has the properties of local connectivity and weight sharing. Specifically, the convolutional layer of the text convolutional neural network is combined with a 1-max pooling layer to extract features from the vectorized training text. First, the embedding layer of the text classification model converts the characters or words of the training text into numerical values via one-hot encoding and then maps them into 128-dimensional random vectors, yielding the vectorized text corresponding to the training text. Next, the vectorized text output by the embedding layer is passed in parallel through three convolution kernels of different sizes and activated; the kernel sizes are n x 3, n x 4 and n x 5 respectively (n being the upper limit of the sentence length of the training text), and a rectified linear unit (ReLU) activation function is used to accelerate convergence and reduce the risk of over-fitting. Finally, the max pooling layer splices the activated outputs of the convolutional layers serially into one vector and performs a dimensionality-reducing extraction over the feature weights, yielding the target corpus vector of the training text.
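As an illustrative aid only, a minimal PyTorch sketch of this feature-extraction branch follows, using the standard Text-CNN arrangement in which the 128-dimensional embedding forms the channel axis and the kernels span 3, 4 and 5 tokens; the vocabulary size and filter count are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TextCnnFeatures(nn.Module):
    """Embedding -> three parallel convolutions -> ReLU -> 1-max pooling -> serial splice."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, num_filters: int = 100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # token ids -> 128-d random vectors
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in (3, 4, 5)]
        )
        self.relu = nn.ReLU()                                 # linear rectification activation

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids).transpose(1, 2)         # (batch, embed_dim, sentence_len)
        pooled = [self.relu(conv(x)).max(dim=2).values for conv in self.convs]  # 1-max pooling
        return torch.cat(pooled, dim=1)                       # target corpus vector, (batch, 300)

# usage sketch: a batch of 2 sentences, each padded/truncated to 50 token ids
features = TextCnnFeatures(vocab_size=5000)(torch.randint(0, 5000, (2, 50)))
```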
S24: constructing a dictionary tree by using a finite automaton algorithm according to the sensitive word list; and extracting target sensitive words in the training text by using the dictionary tree, and processing the target sensitive words by using the one-hot coding to obtain target vocabulary vectors corresponding to the sensitive words in the training text.
In this embodiment, the sensitive words in the training text are extracted by building a dictionary tree, and the extraction result is vectorized. First, a DFA dictionary tree is built from the sensitive word list with a deterministic finite automaton (DFA) algorithm; a DFA tree is a variant of a hash tree that shortens the search distance by sharing common prefixes, so lookups are fast. When the training text is read, the DFA tree is used to search for and extract the sensitive words it contains, yielding the sensitive word list of that training text. The matched sensitive words are converted into numerical values by one-hot encoding and further mapped into one or more random vectors (their number equal to the number of matched words), giving the target vocabulary vector corresponding to the sensitive words in the training text.
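As an illustrative aid only, a minimal Python sketch of the dictionary-tree matching and the one-hot step follows; the function names and the nested-dict representation are assumptions of this sketch, and the subsequent mapping of one-hot values into random vectors (an embedding lookup) is omitted.

```python
def build_trie(sensitive_words):
    """Nested-dict dictionary tree: shared prefixes shorten the search path."""
    root = {}
    for word in sensitive_words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#end"] = True                      # marks the end of a complete sensitive word
    return root

def match_sensitive_words(text, trie):
    """Return every sensitive word found in text (longest match at each start position)."""
    hits = []
    for i in range(len(text)):
        node, j, found = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "#end" in node:
                found = text[i:j]
        if found:
            hits.append(found)
    return hits

def one_hot_vectors(hits, sensitive_words):
    """One vector per matched word, one-hot over the full sensitive word list."""
    index = {w: k for k, w in enumerate(sensitive_words)}
    vectors = []
    for word in hits:
        vec = [0] * len(sensitive_words)
        vec[index[word]] = 1
        vectors.append(vec)
    return vectors
```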
S25: and judging whether the dimension of the target vocabulary vector is consistent with the dimension of the target corpus vector or not, and if the dimension of the target vocabulary vector is inconsistent with the dimension of the target corpus vector, adjusting the dimension of the target vocabulary vector to be consistent with the dimension of the target corpus vector in an addition or dot product mode.
S26: and splicing the target corpus vector and the target vocabulary vector to obtain a spliced vector, and training a full connection layer and a classifier in the text classification model by using the spliced vector to obtain a trained text classification model.
In this embodiment, after the target corpus vector and the target vocabulary vector of the training text are obtained, their dimensions must agree so that they can be spliced. Before splicing, it is therefore determined whether the dimension of the target vocabulary vector is consistent with that of the target corpus vector; if not, the target vocabulary vectors are merged by addition so that the dimension matches that of the target corpus vector. Merging by addition reflects the assumption that a sentence matching more sensitive words is more likely to be sensitive; of course, the addition may be replaced by a dot product, which is not limited in this embodiment. The target corpus vector and the target vocabulary vector are then spliced to obtain the spliced vector of the training text, and the spliced vector is input into the fully connected layer of the text classifier for training, yielding the trained text classification model.
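As an illustrative aid only, a minimal PyTorch sketch of the dimension alignment and splicing follows; merging the per-word vectors by addition is shown, with the dot-product alternative noted in a comment. The tensor shapes (per-word vectors of the same dimension d as the corpus vector) are assumptions of this sketch.

```python
import torch

def fuse(corpus_vector: torch.Tensor, vocab_vectors: torch.Tensor) -> torch.Tensor:
    """Align the vocabulary vectors with the corpus vector, then splice the two.

    corpus_vector : (batch, d)    target corpus vector from the Text-CNN branch
    vocab_vectors : (batch, m, d) one d-dimensional vector per matched sensitive word
    """
    # merge the per-word vectors by addition so the result has the same dimension d;
    # a dot-product style merge could be substituted here instead
    vocab_vector = vocab_vectors.sum(dim=1)                  # zero vector when no word matched (m = 0)
    return torch.cat([corpus_vector, vocab_vector], dim=1)   # spliced vector fed to the FC layer
```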
S27: inputting the text to be detected into the trained text classification model, and determining the auditing result of the text to be detected based on the sensitive category and the confidence of the text to be detected output by the trained text classification model.
In this embodiment, for the specific process of the step S27, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated herein.
As can be seen, in this embodiment a text classification model is built with a text convolutional neural network whose activation function is a linear rectification function; the training text is vectorized by the embedding layer of the model, and the convolutional layer and max pooling layer then extract feature information from the vectorized training text. A dictionary tree is built from the sensitive word list with a finite automaton algorithm to extract the target sensitive words in the training text, and one-hot encoding is used to process the feature information of the training text and the target sensitive words, yielding the target corpus vector of the training text and the target vocabulary vector corresponding to its sensitive words. The text to be detected is then input into the text classification model obtained by training on the spliced vector formed from the target corpus vector and the target vocabulary vector, so as to audit the content of the text to be detected. By fusing sensitive word matching with text classification, this embodiment realizes an end-to-end text auditing function, reducing manual review cost and improving text auditing efficiency and accuracy.
Fig. 3 is a flowchart of a specific text auditing method provided in an embodiment of the present application. Referring to fig. 3, the text auditing method includes:
S31: Acquiring a sensitive word list, performing sensitive word matching on the unlabeled training texts by using a dictionary tree constructed based on the sensitive word list, and performing sensitive category labeling on the corresponding training texts according to the matching results to obtain a training set containing the training texts and corresponding label information.
In this embodiment, only the sensitive word list is acquired during the data preparation stage. A dictionary tree is built from the sensitive word list, and sentences from an external pool of unlabeled text are retrieved through it: a text containing a sensitive word from the list is labeled as content-sensitive, and a text containing none is labeled as content-insensitive. The assigned labels are then verified and confirmed through a human-computer interaction interface; whenever the category received through the interface differs from the assigned one, the sentence is relabeled, taking the category received through the interface as authoritative, so as to improve accuracy. The relabeled sentences are then added to the training set, giving a training set that contains the training texts and their corresponding label information; the specific process is shown in fig. 4.
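As an illustrative aid only, a minimal Python sketch of this weak-labeling step follows; plain substring matching stands in for the dictionary-tree lookup, and the 0/1 encoding is the same assumption used in the earlier data-layout sketch.

```python
def weak_label(unlabeled_texts, sensitive_words):
    """Label each text by whether it contains any word from the sensitive word list."""
    training_set = []
    for text in unlabeled_texts:
        label = 1 if any(w in text for w in sensitive_words) else 0  # 1 = sensitive, 0 = not sensitive
        training_set.append({"text": text, "label": label})
    return training_set

# The assigned labels would then be confirmed or corrected through the human-computer
# interaction interface, and the corrected samples added to the training set.
```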
S32: and constructing a text classification model based on a machine learning algorithm, and extracting the characteristic information of the training text by using the text classification model to obtain a target corpus vector.
S33: and performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text.
S34: and training the text classification model by using the target corpus vector and the target vocabulary vector to obtain a trained text classification model.
In this embodiment, for the specific processes from step S32 to step S34, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and details are not repeated herein.
S35: inputting a text to be detected into the trained text classification model, and acquiring a sensitive word matching table corresponding to the text to be detected; the sensitive word matching table comprises the text to be detected and sensitive words obtained by matching the sensitive words of the text to be detected based on the sensitive word table.
S36: and determining the auditing result of the text to be detected according to the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model and the sensitive word matching table.
In this embodiment, when a text to be detected is input into the trained text classification model to obtain its sensitive category and confidence, a sensitive word matching table corresponding to the text is also obtained. The matching table contains the text to be detected and the sensitive words found by matching it against the sensitive word list; it may be produced by matching the text against the sensitive word list again, or output directly by the trained text classification model. The sensitive category output by the trained model is then checked against the sensitive word matching table to obtain the final auditing result for the text to be detected.
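As an illustrative aid only, a minimal PyTorch sketch of this auditing decision follows; the `model` and `tokenizer` wrappers, the substring-based matching table and the threshold rule are all assumptions of this sketch, not part of the disclosure.

```python
import torch

def audit(text, model, tokenizer, sensitive_word_list, threshold: float = 0.5):
    """Combine the classifier's category/confidence with the sensitive word matching table."""
    # simplified matching table: the text plus the sensitive words found in it
    match_table = {"text": text,
                   "matches": [w for w in sensitive_word_list if w in text]}
    with torch.no_grad():
        probs = torch.softmax(model(tokenizer(text)), dim=-1).squeeze(0)
    category = int(probs.argmax())            # assumed encoding: 1 = sensitive, 0 = not sensitive
    confidence = float(probs[category])
    # cross-check against the matching table: a sensitive prediction backed by matched words
    # and a high confidence is rejected; everything else is passed on or sent for review
    if category == 1 and confidence >= threshold and match_table["matches"]:
        verdict = "reject"
    else:
        verdict = "pass_or_review"
    return {"category": category, "confidence": confidence,
            "match_table": match_table, "verdict": verdict}
```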
Therefore, in the embodiment of the application, under the condition that only the sensitive word list is provided, the sensitive word matching is performed on the unmarked training text by using the dictionary tree constructed based on the sensitive word list, and the sensitive category marking is performed on the corresponding training text according to the matching result to obtain the training set containing the training text and the corresponding label information, so that the corpus vector and the vocabulary vector of the text to be detected are relatively unified. Meanwhile, the auditing result of the text to be detected is determined according to the sensitive category and the confidence of the text to be detected output by the trained text classification model and the sensitive word matching table, so that the auditing result has specific judgment basis and is more accurate and specific.
Referring to fig. 5, the embodiment of the present application further discloses a text auditing model correspondingly, including:
the data acquisition interface 11 is used for acquiring a sensitive word list, a training set and a text to be detected; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
the text classification model 12 is used for outputting the sensitive category and the confidence coefficient of the text to be detected;
the trainer 13 is configured to extract feature information of the training text by using a text classification model constructed based on a machine learning algorithm to obtain a target corpus vector of the training text, perform sensitive word matching on the training text based on the sensitive vocabulary to obtain a target vocabulary vector corresponding to a sensitive word in the training text, and train the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model.
Furthermore, the embodiment of the application also provides electronic equipment. Fig. 6 is a block diagram illustrating an electronic device 20 according to an exemplary embodiment, which should not be construed as limiting the scope of the application in any way.
Fig. 6 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the text auditing method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may be specifically a server.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the storage 22 is used as a carrier for resource storage, and may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., and the resources stored thereon may include an operating system 221, a computer program 222, text data 223, etc., and the storage may be a transient storage or a permanent storage.
The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the electronic device 20, so as to realize the operation and processing of the processor 21 on the mass text data 223 in the memory 22, and may be Windows Server, NetWare, Unix, Linux, and the like. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the text auditing method performed by the electronic device 20 disclosed in any of the foregoing embodiments. Data 223 may include various textual information collected by electronic device 20.
Further, an embodiment of the present application further discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the steps of the text auditing method disclosed in any of the foregoing embodiments are implemented.
In the present specification, the embodiments are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts between the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The text auditing method, the model, the equipment and the storage medium provided by the invention are described in detail, and the principle and the implementation mode of the invention are explained by applying specific examples, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (9)

1. A text auditing method is characterized by comprising the following steps:
acquiring a sensitive word list and a training set; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
constructing a text classification model based on a machine learning algorithm, and extracting feature information of the training text by using the text classification model to obtain a target corpus vector;
performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text;
training the text classification model by using the target corpus vector and the target vocabulary vector to obtain a trained text classification model;
inputting a text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive class and the confidence of the text to be detected output by the trained text classification model;
wherein, the determining the auditing result of the text to be detected based on the output result of the trained text classification model comprises:
acquiring a sensitive word matching table corresponding to the text to be detected; the sensitive word matching table comprises the text to be detected and sensitive words obtained by matching the sensitive words of the text to be detected based on the sensitive word table;
determining an auditing result of the text to be detected according to the sensitive category and the confidence of the text to be detected output by the trained text classification model and the sensitive word matching table;
before extracting the feature information of the training text by using the text classification model, the method further comprises the following steps:
and cleaning the data of the training text based on case and case conversion, simplified and traditional conversion, full half-angle conversion and stop word deletion.
2. The text auditing method according to claim 1, wherein the constructing a text classification model based on a machine learning algorithm and extracting feature information of the training text using the text classification model to obtain a target corpus vector comprises:
constructing a text classification model by using a text convolution neural network with an activation function as a linear rectification function;
mapping the training text into a random vector through an embedding layer of the text classification model to obtain a vectorized text;
and extracting the characteristic information of the vectorization text by using the convolution layer and the maximum pooling layer to obtain a target corpus vector.
3. The method of claim 2, wherein performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to a sensitive word in the training text comprises:
constructing a dictionary tree by using a finite automaton algorithm according to the sensitive word list;
and extracting target sensitive words in the training text by using the dictionary tree, and processing the target sensitive words by using the one-hot coding to obtain target vocabulary vectors corresponding to the sensitive words in the training text.
4. The method of claim 3, wherein after performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to a sensitive word in the training text, the method further comprises:
and judging whether the dimension of the target vocabulary vector is consistent with the dimension of the target corpus vector or not, and if the dimension of the target vocabulary vector is inconsistent with the dimension of the target corpus vector, adjusting the dimension of the target vocabulary vector to be consistent with the dimension of the target corpus vector in a mode of addition or dot product.
5. The text review method of claim 4, wherein the training of the text classification model using the target corpus vector and the target vocabulary vector comprises:
splicing the target corpus vector and the target vocabulary vector to obtain a spliced vector;
and training a full connection layer and a classifier in the text classification model by using the splicing vector.
6. The text auditing method of any one of claims 1-5 where obtaining the training set comprises:
and performing sensitive word matching on the unmarked training texts by using the dictionary tree constructed based on the sensitive word list, and performing sensitive category marking on the corresponding training texts according to matching results to obtain the training set containing the training texts and corresponding label information.
7. A text audit model, comprising:
the data acquisition interface is used for acquiring the sensitive word list, the training set and the text to be detected; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
the text classification model is used for outputting the sensitivity category and the confidence coefficient of the text to be detected;
the training device is used for extracting the characteristic information of the training text by using a text classification model constructed based on a machine learning algorithm to obtain a target corpus vector of the training text, performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to a sensitive word in the training text, and training the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model;
the trainer is specifically used for acquiring a sensitive word matching table corresponding to the text to be detected; the sensitive word matching table comprises the text to be detected and sensitive words obtained by matching the sensitive words of the text to be detected based on the sensitive word table;
determining an auditing result of the text to be detected according to the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model and the sensitive word matching table;
the text auditing model is specifically used for carrying out data cleaning on the training text based on case conversion, simplified and simplified conversion, full half-angle conversion and stop word deletion.
8. An electronic device, wherein the electronic device comprises a processor and a memory; wherein the memory is for storing a computer program that is loaded and executed by the processor to implement a text auditing method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement a text review method as claimed in any one of claims 1 to 6.
CN202011439157.6A 2020-12-10 2020-12-10 Text auditing method, model, equipment and storage medium Active CN112487149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011439157.6A CN112487149B (en) 2020-12-10 2020-12-10 Text auditing method, model, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011439157.6A CN112487149B (en) 2020-12-10 2020-12-10 Text auditing method, model, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112487149A CN112487149A (en) 2021-03-12
CN112487149B true CN112487149B (en) 2023-04-07

Family

ID=74941336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011439157.6A Active CN112487149B (en) 2020-12-10 2020-12-10 Text auditing method, model, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112487149B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989789B (en) * 2021-03-15 2024-05-17 京东科技信息技术有限公司 Test method and device of text auditing model, computer equipment and storage medium
CN113553844B (en) * 2021-08-11 2023-07-25 四川长虹电器股份有限公司 Domain identification method based on prefix tree features and convolutional neural network
CN114266247A (en) * 2021-12-20 2022-04-01 中国农业银行股份有限公司 Sensitive word filtering method and device, storage medium and electronic equipment
CN114637896B (en) * 2022-05-23 2022-09-09 杭州闪马智擎科技有限公司 Data auditing method and device, storage medium and electronic device
CN114943228B (en) * 2022-06-06 2023-11-24 北京百度网讯科技有限公司 Training method of end-to-end sensitive text recall model and sensitive text recall method
CN115002508A (en) * 2022-06-07 2022-09-02 中国工商银行股份有限公司 Live data stream method and device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766441B (en) * 2018-12-28 2021-07-09 奇安信科技集团股份有限公司 Text classification method, device and system
CN110442875A (en) * 2019-08-12 2019-11-12 北京思维造物信息科技股份有限公司 A kind of text checking method, apparatus and system
CN111061881A (en) * 2019-12-27 2020-04-24 浪潮通用软件有限公司 Text classification method, equipment and storage medium
CN111462735B (en) * 2020-04-10 2023-11-28 杭州网易智企科技有限公司 Voice detection method, device, electronic equipment and storage medium
CN111782811A (en) * 2020-07-03 2020-10-16 湖南大学 E-government affair sensitive text detection method based on convolutional neural network and support vector machine

Also Published As

Publication number Publication date
CN112487149A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112487149B (en) Text auditing method, model, equipment and storage medium
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN110020424B (en) Contract information extraction method and device and text information extraction method
CN111753060A (en) Information retrieval method, device, equipment and computer readable storage medium
CN108345672A (en) Intelligent response method, electronic device and storage medium
WO2020108063A1 (en) Feature word determining method, apparatus, and server
US20120136812A1 (en) Method and system for machine-learning based optimization and customization of document similarities calculation
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
US11580119B2 (en) System and method for automatic persona generation using small text components
CN113010638B (en) Entity recognition model generation method and device and entity extraction method and device
CN112084334B (en) Label classification method and device for corpus, computer equipment and storage medium
CN113095080B (en) Theme-based semantic recognition method and device, electronic equipment and storage medium
CN110580308A (en) information auditing method and device, electronic equipment and storage medium
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN110737774A (en) Book knowledge graph construction method, book recommendation method, device, equipment and medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN111782793A (en) Intelligent customer service processing method, system and equipment
KR102193228B1 (en) Apparatus for evaluating non-financial information based on deep learning and method thereof
CN110826315B (en) Method for identifying timeliness of short text by using neural network system
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN114266255B (en) Corpus classification method, apparatus, device and storage medium based on clustering model
CN112364649B (en) Named entity identification method and device, computer equipment and storage medium
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant