CN112487149B - Text auditing method, model, equipment and storage medium - Google Patents

Text auditing method, model, equipment and storage medium

Info

Publication number
CN112487149B
Authority
CN
China
Prior art keywords
text
training
sensitive
classification model
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011439157.6A
Other languages
Chinese (zh)
Other versions
CN112487149A (en)
Inventor
班涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Nuonuo Network Technology Co ltd
Original Assignee
Zhejiang Nuonuo Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Nuonuo Network Technology Co ltd filed Critical Zhejiang Nuonuo Network Technology Co ltd
Priority to CN202011439157.6A priority Critical patent/CN112487149B/en
Publication of CN112487149A publication Critical patent/CN112487149A/en
Application granted granted Critical
Publication of CN112487149B publication Critical patent/CN112487149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/35 Clustering; Classification (information retrieval of unstructured textual data)
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/216 Parsing using statistical methods
    • G06F40/242 Dictionaries (lexical tools)
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a text auditing method, model, equipment and storage medium, comprising the following steps: acquiring a sensitive word list and a training set; constructing a text classification model based on a machine learning algorithm, and extracting feature information of the training text with the text classification model to obtain a target corpus vector; performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text; training the text classification model with the target corpus vector and the target vocabulary vector to obtain a trained text classification model; and inputting the text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive category and the confidence output by the trained text classification model. By fusing sensitive word matching with text classification, the method provides an end-to-end text auditing function that reduces manual review cost and improves text auditing efficiency and accuracy.

Description

Text auditing method, model, equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text auditing method, a model, equipment and a storage medium.
Background
The growth of network information media such as mobile phones allows more users to participate in spreading information online, which increases the speed and reach of information transmission and poses great challenges for website content management. Most current text review approaches combine a matching model with manual review: information is first screened by the model to relieve the pressure on human reviewers, and the matching model is mainly either sensitive word matching or a classification model. Sensitive word matching is symbol-based and produces many misjudgments: it cannot recognize variant sensitive content such as sensitive words with inserted symbols, Chinese-English mixtures, or pinyin-character mixtures, nor can it correctly handle sentences that contain a sensitive word but are not actually sensitive. The classification model is mainly based on statistical analysis and suffers from drawbacks such as providing no concrete basis for its judgment and insufficient flexibility, so its overall auditing capability is limited. In summary, the prior art has at least the technical problems of low text review accuracy and high manual review cost.
Disclosure of Invention
In view of this, the present invention provides a text review method, a model, a device, and a storage medium, which can reduce the cost of manual review and improve the efficiency and accuracy of text review. The specific scheme is as follows:
a first aspect of the present application provides a text auditing method, including:
acquiring a sensitive word list and a training set; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
constructing a text classification model based on a machine learning algorithm, and extracting feature information of the training text by using the text classification model to obtain a target corpus vector;
performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text;
training the text classification model by using the target corpus vector and the target vocabulary vector to obtain a trained text classification model;
inputting the text to be detected into the trained text classification model, and determining the auditing result of the text to be detected based on the sensitive category and the confidence of the text to be detected output by the trained text classification model.
Optionally, the constructing a text classification model based on a machine learning algorithm, and extracting feature information of the training text by using the text classification model to obtain a target corpus vector includes:
constructing a text classification model by using a text convolution neural network with an activation function as a linear rectification function;
mapping the training text into a random vector through an embedding layer of the text classification model to obtain a vectorized text;
and extracting the characteristic information of the vectorization text by using the convolution layer and the maximum pooling layer to obtain a target corpus vector.
Optionally, the performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to a sensitive word in the training text includes:
constructing a dictionary tree by using a finite automaton algorithm according to the sensitive word list;
and extracting target sensitive words in the training text by using the dictionary tree, and processing the target sensitive words by using the one-hot coding to obtain target vocabulary vectors.
Optionally, after performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector, the method further includes:
and judging whether the dimension of the target vocabulary vector is consistent with the dimension of the target corpus vector or not, and if the dimension of the target vocabulary vector is inconsistent with the dimension of the target corpus vector, adjusting the dimension of the target vocabulary vector to be consistent with the dimension of the target corpus vector in an addition or dot product mode.
Optionally, the training the text classification model by using the target corpus vector and the target vocabulary vector includes:
splicing the target corpus vector and the target vocabulary vector to obtain a spliced vector;
and training a full connection layer and a classifier in the text classification model by using the splicing vector.
Optionally, the determining an audit result of the text to be detected based on the output result of the trained text classification model includes:
acquiring a sensitive word matching table corresponding to the text to be detected; the sensitive word matching table comprises the text to be detected and sensitive words obtained by matching the sensitive words of the text to be detected based on the sensitive word table;
and determining the auditing result of the text to be detected according to the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model and the sensitive word matching table.
Optionally, the obtaining the training set includes:
and performing sensitive word matching on the unmarked training texts by using a dictionary tree constructed based on the sensitive word list, and performing sensitive category marking on the corresponding training texts according to matching results to obtain the training set containing the training texts and corresponding label information.
A second aspect of the present application provides a text audit model, comprising:
the data acquisition interface is used for acquiring the sensitive word list, the training set and the text to be detected; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
the text classification model is used for outputting the sensitivity category and the confidence coefficient of the text to be detected;
the training device is used for extracting the characteristic information of the training text by using a text classification model constructed based on a machine learning algorithm to obtain a target corpus vector of the training text, performing sensitive word matching on the training text based on the sensitive vocabulary to obtain a target vocabulary vector corresponding to a sensitive word in the training text, and training the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model.
A third aspect of the application provides an electronic device comprising a processor and a memory; wherein the memory is used for storing a computer program which is loaded and executed by the processor to implement the aforementioned text auditing method.
A fourth aspect of the present application provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are loaded and executed by a processor, the foregoing text auditing method is implemented.
In the application, a sensitive word list and a training set are obtained first, wherein the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts. And secondly, extracting the characteristic information of the training text by using a text classification model to obtain a target corpus vector, and performing sensitive word matching on the training text based on a sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text. And then, training the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model. And finally, inputting the text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive class and the confidence of the text to be detected output by the trained text classification model. The method and the device have the advantages that the end-to-end text auditing function of the fusion mode is realized by using the sensitive word matching and the text classification, the manual review cost is reduced, and the text auditing efficiency and accuracy are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a text auditing method provided in the present application;
FIG. 2 is a schematic diagram of a specific text auditing method provided in the present application;
FIG. 3 is a flowchart of a specific text auditing method provided by the present application;
fig. 4 is a schematic diagram of a training set construction provided in the present application;
FIG. 5 is a schematic diagram of a text audit model provided herein;
fig. 6 is a block diagram of an electronic device for text auditing according to the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Most current text review approaches combine a matching model with manual review: information is screened by the model to relieve the pressure on human reviewers, and the matching model mainly relies on sensitive word matching or a classification model. However, these approaches produce many misjudgments: sensitive word matching cannot recognize variant sensitive content such as sensitive words with inserted symbols, Chinese-English mixtures, or pinyin-character mixtures, nor can it correctly handle sentences that contain a sensitive word but are not actually sensitive; the classification model cannot provide a concrete basis for its judgment and lacks flexibility, so the overall auditing capability is limited. To overcome these technical problems, the text review scheme provided by the application can reduce the manual review cost and improve the text review efficiency and accuracy.
Fig. 1 is a flowchart of a text auditing method according to an embodiment of the present application. Referring to fig. 1, the text auditing method includes:
S11: Acquiring a sensitive word list and a training set; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts.
In this embodiment, a sensitive word list and a training set need to be obtained first. The sensitive word list is used for sensitive word matching: the more comprehensive the words in the list are, the more accurate the resulting auditing results will be. It may be a comprehensive word library covering multiple categories such as violence, corruption, livelihood and the like, or a category-specific sensitive word list chosen according to business requirements. The training set is used for training the text classification model and comprises training texts together with the label information obtained after sensitive category labeling. A training text may be a sentence, a paragraph and so on; the sensitive categories are "sensitive" and "not sensitive", which can be represented by 0 and 1 respectively, with texts labeled sensitive serving as positive samples and texts labeled not sensitive as negative samples. The training texts are used to construct corpus vectors and the sensitive word list is used to construct vocabulary vectors; by associating the information before and after each word in the labeled corpus, the model learns the contextual relationships between words, which provides strong text auditing capability and improves the ability to catch sensitive content hidden by simple substitutions. An illustrative data layout is sketched below.
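As an illustrative aid only, the sensitive word list and training set described above might be laid out as follows; the concrete texts and words are placeholders, and the 0/1 label encoding (here 1 stands for the sensitive category) is an assumption of this sketch rather than a value fixed by the disclosure.

```python
# Illustrative data layout; texts, words and the label encoding are placeholders.
sensitive_word_list = ["敏感词A", "敏感词B"]

training_set = [
    {"text": "这句话里出现了敏感词A", "label": 1},  # positive sample (labeled sensitive)
    {"text": "这是一条普通的业务文本", "label": 0},  # negative sample (labeled not sensitive)
]
```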
In one embodiment, the sensitive word list and a training set already labeled with sensitive categories are provided at the same time, which reduces the computational load on the system. It should be noted that the sensitive category labels of the training texts should be correlated with the sensitive word list as much as possible, i.e. construction of the training set should take the sensitive words in the word list into account. In another embodiment, only the sensitive word list and unlabeled training texts are provided, and the training set is obtained by processing the training texts based on the sensitive word list; a training set obtained in this way is consistent with the sensitive word list, so the corpus vectors and vocabulary vectors obtained in the subsequent steps are relatively well unified.
S12: and constructing a text classification model based on a machine learning algorithm, and extracting the characteristic information of the training text by using the text classification model to obtain a target corpus vector.
In this embodiment, the text classification model includes an embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a classifier, where the embedding layer, convolutional layer and pooling layer perform feature extraction on the vectorized training texts. The text classification model constructed here based on a machine learning algorithm is trained on the training set, and the network used for feature extraction, i.e. for obtaining the target corpus vector of the training text, may be a text convolutional network (Text-CNN), a long short-term memory network (LSTM), a gated recurrent unit network (GRU), or the like.
It should be noted that before the feature information of the training text is extracted with the text classification model, the training text needs to be cleaned. There are many cleaning methods, including but not limited to case conversion, traditional-simplified Chinese conversion, full-width to half-width conversion, and stop-word removal. Feature information is then extracted from the cleaned training text with the text classification model to obtain the target corpus vector, which is used to train the fully connected layer and the classifier in the text classification model.
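As an illustrative aid only, a minimal Python sketch of such a cleaning step follows; it covers lower-casing, full-width to half-width folding and stop-word removal, while traditional-simplified conversion would need an external converter and is omitted. The function name and the character-level treatment of stop words are assumptions of this sketch.

```python
import unicodedata

def clean_text(text: str, stopwords: set) -> str:
    """Minimal cleaning sketch: lower-case, fold full-width characters to half-width,
    and drop stop characters."""
    # NFKC normalization folds full-width ASCII forms into their half-width equivalents
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in text if ch not in stopwords)

# usage sketch
print(clean_text("ＡＢＣ，这是Ｔｅｓｔ文本！", stopwords={",", "!", " "}))  # -> "abc这是test文本"
```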
S13: and performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text.
In this embodiment, the sensitive words in the training text are mainly matched according to the sensitive word list, and vectorization processing is further performed on the sensitive words to obtain target vocabulary vectors corresponding to the sensitive words in the training text, where the target vocabulary vectors are used to train the full connection layer and the classifier in the text classification model. In this embodiment, sensitive word matching is performed on the training text by using a method of a dictionary tree, and further, the dictionary tree is constructed by using a Deterministic Finite Automaton (DFA) algorithm.
Sensitive word matching alone is suitable for auditing texts with obvious characteristic words: it can deliver good auditing capability with relatively little data and runs and deploys quickly. However, it cannot accurately identify content whose characteristic words are not obvious or variant sensitive content, such as sensitive words with inserted symbols, Chinese-English mixtures, or pinyin-character mixtures. Meanwhile, sentences that contain a sensitive word without being sensitive content may be misidentified, for example sentences that negate the sensitive content, or sentences that merely happen to match a sensitive word without being related to it, especially when the sensitive word is short. The auditing capability of sensitive word matching on its own is therefore limited; after obtaining the target vocabulary vector corresponding to the sensitive words in the training text, combining it with the corpus information of the training text can considerably improve text auditing accuracy.
S14: and training the text classification model by using the target corpus vector and the target vocabulary vector to obtain a trained text classification model.
In this embodiment, after the training text has been processed into its target corpus vector and target vocabulary vector, these vectors are input into the fully connected layer of the text classification model for training, so that the trained model can directly output the sensitive category and corresponding confidence of a text to be detected. Training the text classification model with the target corpus vector and the target vocabulary vector may also be called model aggregation. In the fully connected layer, dropout (random deactivation) reduces the number of active neurons and curbs over-fitting, while the classification result output by the classifier, a normalized exponential function (Softmax), is optimized with the adaptive moment estimation optimizer (Adam) to obtain the final classification; the loss function to be optimized is the cross entropy. The Softmax function is commonly added to the output layer of a neural network to convert its outputs into relative probabilities, i.e. the confidence of the text classification model on the classification task.
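As an illustrative aid only, a minimal PyTorch sketch of this aggregation stage follows; the feature dimension (assumed sizes of the corpus and vocabulary vectors), dropout rate and learning rate are assumptions of this sketch, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class AuditHead(nn.Module):
    """Fully connected layer with dropout; Softmax confidences are taken at inference time."""
    def __init__(self, feature_dim: int, num_classes: int = 2, dropout: float = 0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)            # random deactivation to curb over-fitting
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, spliced_vector: torch.Tensor) -> torch.Tensor:
        return self.fc(self.dropout(spliced_vector))  # raw logits

def train_step(head, spliced_batch, labels, optimizer):
    """One optimization step with cross entropy as the loss to optimize."""
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = criterion(head(spliced_batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# usage sketch (dimensions assumed): Adam optimizes the classifier, Softmax gives the confidence
head = AuditHead(feature_dim=300 + 128)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
# confidence = torch.softmax(head(spliced_batch), dim=1)
```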
S15: inputting the text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model.
In this embodiment, a text to be detected is input into the trained text classification model, the trained text classification model firstly performs feature extraction on the text to be detected to obtain a corpus vector of the text to be detected, then performs sensitive word matching on the text to be detected to obtain a vocabulary vector of the text to be detected, and finally outputs a sensitive category and a confidence of the text to be detected after auditing and predicting the text to be detected based on the corpus vector and the vocabulary vector of the text to be detected. And determining an auditing result of the text to be detected based on the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model.
Therefore, the sensitive word list and the training set are obtained firstly, wherein the training set comprises the training texts and the label information obtained after sensitive category labeling is carried out on the training texts. And secondly, extracting the characteristic information of the training text by using a text classification model to obtain a target corpus vector, and performing sensitive word matching on the training text based on a sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text. And then, training the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model. And finally, inputting the text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive class and the confidence of the text to be detected output by the trained text classification model. According to the method and the device, the end-to-end text auditing function of the fusion mode is realized by using sensitive word matching and text classification, the manual review cost is reduced, and the text auditing efficiency and accuracy are improved.
Fig. 2 is a flowchart of a specific text auditing method according to an embodiment of the present application. Referring to fig. 2, the text auditing method includes:
S21: Acquiring a sensitive word list and a training set; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts.
In this embodiment, as to the specific process of the step S21, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
S22: and constructing a text classification model for the text convolution neural network of the linear rectification function by using the activation function.
S23: mapping the training text into a random vector through an embedding layer of the text classification model to obtain a vectorized text; and extracting the characteristic information of the vectorization text by utilizing the convolutional layer and the maximum pooling layer to obtain a target corpus vector.
In this embodiment, the text classification model is built with a text convolutional neural network, i.e. a convolutional neural network applied to the text classification task, which is fast, extracts features efficiently, and has the properties of local connectivity and weight sharing. Specifically, the convolutional layer of the text convolutional neural network is combined with a 1-max pooling layer to extract features from the vectorized training text. First, the embedding layer of the text classification model converts the characters or words of the training text into numerical values via one-hot encoding and then maps them into 128-dimensional random vectors, yielding the vectorized text corresponding to the training text. Next, the vectorized text output by the embedding layer is passed in parallel through three convolution kernels of different sizes and activated; the kernel sizes are n x 3, n x 4 and n x 5 respectively (n being the upper limit of the sentence length of the training text), and a rectified linear unit (ReLU) activation function is used to accelerate convergence and reduce the risk of over-fitting. Finally, the max pooling layer splices the activated outputs of the convolutional layers serially into one vector and performs a dimensionality-reducing extraction over the feature weights, yielding the target corpus vector of the training text.
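As an illustrative aid only, a minimal PyTorch sketch of this feature-extraction branch follows, using the standard Text-CNN arrangement in which the 128-dimensional embedding forms the channel axis and the kernels span 3, 4 and 5 tokens; the vocabulary size and filter count are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TextCnnFeatures(nn.Module):
    """Embedding -> three parallel convolutions -> ReLU -> 1-max pooling -> serial splice."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, num_filters: int = 100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # token ids -> 128-d random vectors
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in (3, 4, 5)]
        )
        self.relu = nn.ReLU()                                 # linear rectification activation

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids).transpose(1, 2)         # (batch, embed_dim, sentence_len)
        pooled = [self.relu(conv(x)).max(dim=2).values for conv in self.convs]  # 1-max pooling
        return torch.cat(pooled, dim=1)                       # target corpus vector, (batch, 300)

# usage sketch: a batch of 2 sentences, each padded/truncated to 50 token ids
features = TextCnnFeatures(vocab_size=5000)(torch.randint(0, 5000, (2, 50)))
```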
S24: constructing a dictionary tree by using a finite automaton algorithm according to the sensitive word list; and extracting target sensitive words in the training text by using the dictionary tree, and processing the target sensitive words by using the one-hot coding to obtain target vocabulary vectors corresponding to the sensitive words in the training text.
In this embodiment, the sensitive words in the training text are extracted by building a dictionary tree, and the extraction result is vectorized. First, a DFA dictionary tree is built from the sensitive word list with a deterministic finite automaton (DFA) algorithm; a DFA tree is a variant of a hash tree that shortens the search distance by sharing common prefixes, so lookups are fast. When the training text is read, the DFA tree is used to search for and extract the sensitive words it contains, yielding the sensitive word list of that training text. The matched sensitive words are converted into numerical values by one-hot encoding and further mapped into one or more random vectors (their number equal to the number of matched words), giving the target vocabulary vector corresponding to the sensitive words in the training text.
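As an illustrative aid only, a minimal Python sketch of the dictionary-tree matching and the one-hot step follows; the function names and the nested-dict representation are assumptions of this sketch, and the subsequent mapping of one-hot values into random vectors (an embedding lookup) is omitted.

```python
def build_trie(sensitive_words):
    """Nested-dict dictionary tree: shared prefixes shorten the search path."""
    root = {}
    for word in sensitive_words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#end"] = True                      # marks the end of a complete sensitive word
    return root

def match_sensitive_words(text, trie):
    """Return every sensitive word found in text (longest match at each start position)."""
    hits = []
    for i in range(len(text)):
        node, j, found = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "#end" in node:
                found = text[i:j]
        if found:
            hits.append(found)
    return hits

def one_hot_vectors(hits, sensitive_words):
    """One vector per matched word, one-hot over the full sensitive word list."""
    index = {w: k for k, w in enumerate(sensitive_words)}
    vectors = []
    for word in hits:
        vec = [0] * len(sensitive_words)
        vec[index[word]] = 1
        vectors.append(vec)
    return vectors
```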
S25: and judging whether the dimension of the target vocabulary vector is consistent with the dimension of the target corpus vector or not, and if the dimension of the target vocabulary vector is inconsistent with the dimension of the target corpus vector, adjusting the dimension of the target vocabulary vector to be consistent with the dimension of the target corpus vector in an addition or dot product mode.
S26: and splicing the target corpus vector and the target vocabulary vector to obtain a spliced vector, and training a full connection layer and a classifier in the text classification model by using the spliced vector to obtain a trained text classification model.
In this embodiment, after the target corpus vector and the target vocabulary vector of the training text are obtained, their dimensions must agree so that they can be spliced. Before splicing, it is therefore determined whether the dimension of the target vocabulary vector is consistent with that of the target corpus vector; if not, the target vocabulary vectors are merged by addition so that the dimension matches that of the target corpus vector. Merging by addition reflects the assumption that a sentence matching more sensitive words is more likely to be sensitive; of course, the addition may be replaced by a dot product, which is not limited in this embodiment. The target corpus vector and the target vocabulary vector are then spliced to obtain the spliced vector of the training text, and the spliced vector is input into the fully connected layer of the text classifier for training, yielding the trained text classification model.
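As an illustrative aid only, a minimal PyTorch sketch of the dimension alignment and splicing follows; merging the per-word vectors by addition is shown, with the dot-product alternative noted in a comment. The tensor shapes (per-word vectors of the same dimension d as the corpus vector) are assumptions of this sketch.

```python
import torch

def fuse(corpus_vector: torch.Tensor, vocab_vectors: torch.Tensor) -> torch.Tensor:
    """Align the vocabulary vectors with the corpus vector, then splice the two.

    corpus_vector : (batch, d)    target corpus vector from the Text-CNN branch
    vocab_vectors : (batch, m, d) one d-dimensional vector per matched sensitive word
    """
    # merge the per-word vectors by addition so the result has the same dimension d;
    # a dot-product style merge could be substituted here instead
    vocab_vector = vocab_vectors.sum(dim=1)                  # zero vector when no word matched (m = 0)
    return torch.cat([corpus_vector, vocab_vector], dim=1)   # spliced vector fed to the FC layer
```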
S27: inputting the text to be detected into the trained text classification model, and determining the auditing result of the text to be detected based on the sensitive category and the confidence of the text to be detected output by the trained text classification model.
In this embodiment, for the specific process of the step S27, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated herein.
As can be seen, in this embodiment a text classification model is built with a text convolutional neural network whose activation function is a linear rectification function; the training text is vectorized by the embedding layer of the model, and the convolutional layer and max pooling layer then extract feature information from the vectorized training text. A dictionary tree is built from the sensitive word list with a finite automaton algorithm to extract the target sensitive words in the training text, and one-hot encoding is used to process the feature information of the training text and the target sensitive words, yielding the target corpus vector of the training text and the target vocabulary vector corresponding to its sensitive words. The text to be detected is then input into the text classification model obtained by training on the spliced vector formed from the target corpus vector and the target vocabulary vector, so as to audit the content of the text to be detected. By fusing sensitive word matching with text classification, this embodiment realizes an end-to-end text auditing function, reducing manual review cost and improving text auditing efficiency and accuracy.
Fig. 3 is a flowchart of a specific text auditing method provided in an embodiment of the present application. Referring to fig. 3, the text auditing method includes:
S31: Acquiring a sensitive word list, performing sensitive word matching on the unlabeled training texts by using a dictionary tree constructed based on the sensitive word list, and performing sensitive category labeling on the corresponding training texts according to the matching results to obtain a training set containing the training texts and corresponding label information.
In this embodiment, only the sensitive word list is acquired during the data preparation stage. A dictionary tree is built from the sensitive word list, and sentences from an external pool of unlabeled text are retrieved through it: a text containing a sensitive word from the list is labeled as content-sensitive, and a text containing none is labeled as content-insensitive. The assigned labels are then verified and confirmed through a human-computer interaction interface; whenever the category received through the interface differs from the assigned one, the sentence is relabeled, taking the category received through the interface as authoritative, so as to improve accuracy. The relabeled sentences are then added to the training set, giving a training set that contains the training texts and their corresponding label information; the specific process is shown in fig. 4.
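As an illustrative aid only, a minimal Python sketch of this weak-labeling step follows; plain substring matching stands in for the dictionary-tree lookup, and the 0/1 encoding is the same assumption used in the earlier data-layout sketch.

```python
def weak_label(unlabeled_texts, sensitive_words):
    """Label each text by whether it contains any word from the sensitive word list."""
    training_set = []
    for text in unlabeled_texts:
        label = 1 if any(w in text for w in sensitive_words) else 0  # 1 = sensitive, 0 = not sensitive
        training_set.append({"text": text, "label": label})
    return training_set

# The assigned labels would then be confirmed or corrected through the human-computer
# interaction interface, and the corrected samples added to the training set.
```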
S32: and constructing a text classification model based on a machine learning algorithm, and extracting the characteristic information of the training text by using the text classification model to obtain a target corpus vector.
S33: and performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text.
S34: and training the text classification model by using the target corpus vector and the target vocabulary vector to obtain a trained text classification model.
In this embodiment, for the specific processes from step S32 to step S34, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and details are not repeated herein.
S35: inputting a text to be detected into the trained text classification model, and acquiring a sensitive word matching table corresponding to the text to be detected; the sensitive word matching table comprises the text to be detected and sensitive words obtained by matching the sensitive words of the text to be detected based on the sensitive word table.
S36: and determining the auditing result of the text to be detected according to the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model and the sensitive word matching table.
In this embodiment, when a text to be detected is input into the trained text classification model to obtain its sensitive category and confidence, a sensitive word matching table corresponding to the text is also obtained. The matching table contains the text to be detected and the sensitive words found by matching it against the sensitive word list; it may be produced by matching the text against the sensitive word list again, or output directly by the trained text classification model. The sensitive category output by the trained model is then checked against the sensitive word matching table to obtain the final auditing result for the text to be detected.
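As an illustrative aid only, a minimal PyTorch sketch of this auditing decision follows; the `model` and `tokenizer` wrappers, the substring-based matching table and the threshold rule are all assumptions of this sketch, not part of the disclosure.

```python
import torch

def audit(text, model, tokenizer, sensitive_word_list, threshold: float = 0.5):
    """Combine the classifier's category/confidence with the sensitive word matching table."""
    # simplified matching table: the text plus the sensitive words found in it
    match_table = {"text": text,
                   "matches": [w for w in sensitive_word_list if w in text]}
    with torch.no_grad():
        probs = torch.softmax(model(tokenizer(text)), dim=-1).squeeze(0)
    category = int(probs.argmax())            # assumed encoding: 1 = sensitive, 0 = not sensitive
    confidence = float(probs[category])
    # cross-check against the matching table: a sensitive prediction backed by matched words
    # and a high confidence is rejected; everything else is passed on or sent for review
    if category == 1 and confidence >= threshold and match_table["matches"]:
        verdict = "reject"
    else:
        verdict = "pass_or_review"
    return {"category": category, "confidence": confidence,
            "match_table": match_table, "verdict": verdict}
```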
Therefore, in the embodiment of the application, under the condition that only the sensitive word list is provided, the sensitive word matching is performed on the unmarked training text by using the dictionary tree constructed based on the sensitive word list, and the sensitive category marking is performed on the corresponding training text according to the matching result to obtain the training set containing the training text and the corresponding label information, so that the corpus vector and the vocabulary vector of the text to be detected are relatively unified. Meanwhile, the auditing result of the text to be detected is determined according to the sensitive category and the confidence of the text to be detected output by the trained text classification model and the sensitive word matching table, so that the auditing result has specific judgment basis and is more accurate and specific.
Referring to fig. 5, the embodiment of the present application further discloses a text auditing model correspondingly, including:
the data acquisition interface 11 is used for acquiring a sensitive word list, a training set and a text to be detected; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
the text classification model 12 is used for outputting the sensitive category and the confidence coefficient of the text to be detected;
the trainer 13 is configured to extract feature information of the training text by using a text classification model constructed based on a machine learning algorithm to obtain a target corpus vector of the training text, perform sensitive word matching on the training text based on the sensitive vocabulary to obtain a target vocabulary vector corresponding to a sensitive word in the training text, and train the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model.
Furthermore, the embodiment of the application also provides electronic equipment. Fig. 6 is a block diagram illustrating an electronic device 20 according to an exemplary embodiment, which should not be construed as limiting the scope of the application in any way.
Fig. 6 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the text auditing method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may be specifically a server.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the storage 22 is used as a carrier for resource storage, and may be a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., and the resources stored thereon may include an operating system 221, a computer program 222, text data 223, etc., and the storage may be a transient storage or a permanent storage.
The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the electronic device 20, so as to realize the operation and processing of the processor 21 on the mass text data 223 in the memory 22, and may be Windows Server, NetWare, Unix, Linux, and the like. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the text auditing method performed by the electronic device 20 disclosed in any of the foregoing embodiments. Data 223 may include various textual information collected by electronic device 20.
Further, an embodiment of the present application further discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the steps of the text auditing method disclosed in any of the foregoing embodiments are implemented.
In the present specification, the embodiments are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts between the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The text auditing method, the model, the equipment and the storage medium provided by the invention are described in detail, and the principle and the implementation mode of the invention are explained by applying specific examples, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (9)

1. A text auditing method is characterized by comprising the following steps:
acquiring a sensitive word list and a training set; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
constructing a text classification model based on a machine learning algorithm, and extracting feature information of the training text by using the text classification model to obtain a target corpus vector;
performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to the sensitive words in the training text;
training the text classification model by using the target corpus vector and the target vocabulary vector to obtain a trained text classification model;
inputting a text to be detected into the trained text classification model, and determining an auditing result of the text to be detected based on the sensitive class and the confidence of the text to be detected output by the trained text classification model;
wherein, the determining the auditing result of the text to be detected based on the output result of the trained text classification model comprises:
acquiring a sensitive word matching table corresponding to the text to be detected; the sensitive word matching table comprises the text to be detected and sensitive words obtained by matching the sensitive words of the text to be detected based on the sensitive word table;
determining an auditing result of the text to be detected according to the sensitive category and the confidence of the text to be detected output by the trained text classification model and the sensitive word matching table;
before extracting the feature information of the training text by using the text classification model, the method further comprises the following steps:
and cleaning the data of the training text based on case and case conversion, simplified and traditional conversion, full half-angle conversion and stop word deletion.
2. The text auditing method according to claim 1, wherein the constructing a text classification model based on a machine learning algorithm and extracting feature information of the training text using the text classification model to obtain a target corpus vector comprises:
constructing a text classification model by using a text convolution neural network with an activation function as a linear rectification function;
mapping the training text into a random vector through an embedding layer of the text classification model to obtain a vectorized text;
and extracting the characteristic information of the vectorization text by using the convolution layer and the maximum pooling layer to obtain a target corpus vector.
3. The method of claim 2, wherein performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to a sensitive word in the training text comprises:
constructing a dictionary tree by using a finite automaton algorithm according to the sensitive word list;
and extracting target sensitive words in the training text by using the dictionary tree, and processing the target sensitive words by using the one-hot coding to obtain target vocabulary vectors corresponding to the sensitive words in the training text.
4. The method of claim 3, wherein after performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to a sensitive word in the training text, the method further comprises:
and judging whether the dimension of the target vocabulary vector is consistent with the dimension of the target corpus vector or not, and if the dimension of the target vocabulary vector is inconsistent with the dimension of the target corpus vector, adjusting the dimension of the target vocabulary vector to be consistent with the dimension of the target corpus vector in a mode of addition or dot product.
5. The text review method of claim 4, wherein the training of the text classification model using the target corpus vector and the target vocabulary vector comprises:
splicing the target corpus vector and the target vocabulary vector to obtain a spliced vector;
and training a full connection layer and a classifier in the text classification model by using the splicing vector.
6. The text auditing method of any one of claims 1-5 where obtaining the training set comprises:
and performing sensitive word matching on the unmarked training texts by using the dictionary tree constructed based on the sensitive word list, and performing sensitive category marking on the corresponding training texts according to matching results to obtain the training set containing the training texts and corresponding label information.
7. A text audit model, comprising:
the data acquisition interface is used for acquiring the sensitive word list, the training set and the text to be detected; the training set comprises training texts and label information obtained after sensitive category labeling is carried out on the training texts;
the text classification model is used for outputting the sensitivity category and the confidence coefficient of the text to be detected;
the training device is used for extracting the characteristic information of the training text by using a text classification model constructed based on a machine learning algorithm to obtain a target corpus vector of the training text, performing sensitive word matching on the training text based on the sensitive word list to obtain a target vocabulary vector corresponding to a sensitive word in the training text, and training the text classification model by using the target corpus vector and the target vocabulary vector to obtain the trained text classification model;
the trainer is specifically used for acquiring a sensitive word matching table corresponding to the text to be detected; the sensitive word matching table comprises the text to be detected and sensitive words obtained by matching the sensitive words of the text to be detected based on the sensitive word table;
determining an auditing result of the text to be detected according to the sensitive category and the confidence coefficient of the text to be detected output by the trained text classification model and the sensitive word matching table;
the text auditing model is specifically used for carrying out data cleaning on the training text based on case conversion, simplified and simplified conversion, full half-angle conversion and stop word deletion.
8. An electronic device, wherein the electronic device comprises a processor and a memory; wherein the memory is for storing a computer program that is loaded and executed by the processor to implement a text auditing method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement a text review method as claimed in any one of claims 1 to 6.
CN202011439157.6A 2020-12-10 2020-12-10 Text auditing method, model, equipment and storage medium Active CN112487149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011439157.6A CN112487149B (en) 2020-12-10 2020-12-10 Text auditing method, model, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011439157.6A CN112487149B (en) 2020-12-10 2020-12-10 Text auditing method, model, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112487149A CN112487149A (en) 2021-03-12
CN112487149B true CN112487149B (en) 2023-04-07

Family

ID=74941336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011439157.6A Active CN112487149B (en) 2020-12-10 2020-12-10 Text auditing method, model, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112487149B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989789B (en) * 2021-03-15 2024-05-17 京东科技信息技术有限公司 Test method and device of text auditing model, computer equipment and storage medium
CN113553844B (en) * 2021-08-11 2023-07-25 四川长虹电器股份有限公司 Domain identification method based on prefix tree features and convolutional neural network
CN114266247A (en) * 2021-12-20 2022-04-01 中国农业银行股份有限公司 Sensitive word filtering method and device, storage medium and electronic equipment
CN114637896B (en) * 2022-05-23 2022-09-09 杭州闪马智擎科技有限公司 Data auditing method and device, storage medium and electronic device
CN114943228B (en) * 2022-06-06 2023-11-24 北京百度网讯科技有限公司 Training method of end-to-end sensitive text recall model and sensitive text recall method
CN115002508A (en) * 2022-06-07 2022-09-02 中国工商银行股份有限公司 Live data stream method and device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766441B (en) * 2018-12-28 2021-07-09 奇安信科技集团股份有限公司 Text classification method, device and system
CN110442875A (en) * 2019-08-12 2019-11-12 北京思维造物信息科技股份有限公司 A kind of text checking method, apparatus and system
CN111061881A (en) * 2019-12-27 2020-04-24 浪潮通用软件有限公司 Text classification method, equipment and storage medium
CN111462735B (en) * 2020-04-10 2023-11-28 杭州网易智企科技有限公司 Voice detection method, device, electronic equipment and storage medium
CN111782811A (en) * 2020-07-03 2020-10-16 湖南大学 E-government affair sensitive text detection method based on convolutional neural network and support vector machine

Also Published As

Publication number Publication date
CN112487149A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112487149B (en) Text auditing method, model, equipment and storage medium
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN110020424B (en) Contract information extraction method and device and text information extraction method
CN111753060A (en) Information retrieval method, device, equipment and computer readable storage medium
CN108345672A (en) Intelligent response method, electronic device and storage medium
WO2020108063A1 (en) Feature word determining method, apparatus, and server
US20120136812A1 (en) Method and system for machine-learning based optimization and customization of document similarities calculation
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
US11580119B2 (en) System and method for automatic persona generation using small text components
CN113010638B (en) Entity recognition model generation method and device and entity extraction method and device
CN112084334B (en) Label classification method and device for corpus, computer equipment and storage medium
CN113095080B (en) Theme-based semantic recognition method and device, electronic equipment and storage medium
CN110580308A (en) information auditing method and device, electronic equipment and storage medium
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN110737774A (en) Book knowledge graph construction method, book recommendation method, device, equipment and medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN111782793A (en) Intelligent customer service processing method, system and equipment
KR102193228B1 (en) Apparatus for evaluating non-financial information based on deep learning and method thereof
CN110826315B (en) Method for identifying timeliness of short text by using neural network system
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN114266255B (en) Corpus classification method, apparatus, device and storage medium based on clustering model
CN112364649B (en) Named entity identification method and device, computer equipment and storage medium
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant