CN115587588B - Text content auditing method and device and electronic equipment - Google Patents

Publication number
CN115587588B
Authority
CN
China
Prior art keywords
text
preset
prediction
violation
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211552965.2A
Other languages
Chinese (zh)
Other versions
CN115587588A (en
Inventor
李文举
李海峰
吴一超
支蕴倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Deepctrl Co ltd
Original Assignee
Beijing Deepctrl Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deepctrl Co ltd
Priority to CN202211552965.2A
Publication of CN115587588A
Application granted
Publication of CN115587588B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text content auditing method, a text content auditing device and electronic equipment, and relates to the technical field of computers. The method comprises the following steps: acquiring a text to be audited; preprocessing the text to be audited to obtain a preprocessed text; inputting the preprocessed text into a pre-trained text prediction model, and outputting a text auditing result of the text to be audited, wherein the text auditing result comprises a first prediction result for a first clause of the preprocessed text and second prediction results for a plurality of character segments of the first clause; and determining whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result. In this method, semantic prediction is performed separately on the sentence meaning of each clause of the text to be audited and on the character segments of each clause, and the prediction results are fused, so that the accuracy of text content auditing is improved.

Description

Text content auditing method and device and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a text content auditing method and device and electronic equipment.
Background
In text content auditing, it is necessary to judge whether a text violates the rules.
At present, there are two common technical schemes. The first is a rule detection method based on keyword matching: the sensitive words of each violation category are collected in advance to construct a sensitive word bank; in actual use, the text to be detected is searched for sensitive words, and if a sensitive word of a certain category is found, the text is judged to be a violation of that category. The second is a text classification method based on a machine learning model: a large number of texts are collected in advance and manually labeled as to whether each text violates the rules and which violation type it belongs to; a text classification model is then trained to predict whether a text violates the rules and which violation type it belongs to. The text classification model can be a traditional machine learning model or a deep-learning-based neural network model.
However, when a sentence to be audited contains a violating word but the sentence as a whole does not carry a violating meaning, the text auditing result obtained by these methods is inaccurate.
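The failure mode described above can be illustrated with a minimal sketch of keyword matching. The word bank, the category name and the function `keyword_audit` are hypothetical and serve only to show how a harmless sentence that mentions a sensitive word is still flagged.

```python
# Hypothetical sensitive word bank (word -> violation category); illustrative only.
SENSITIVE_WORDS = {"gambling": "gambling", "casino": "gambling"}


def keyword_audit(text: str) -> set[str]:
    """Return the violation categories whose sensitive words occur in the text."""
    lowered = text.lower()
    return {category for word, category in SENSITIVE_WORDS.items() if word in lowered}


# A news-style sentence that only mentions the word is flagged anyway,
# because keyword matching ignores the meaning of the sentence.
benign = "The police shut down an illegal gambling ring last week."
flagged = keyword_audit(benign)
```

Here `flagged` contains the `gambling` category although the sentence reports on enforcement rather than promoting a violation.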
Disclosure of Invention
The invention aims to provide a text content auditing method and device and electronic equipment so as to improve the accuracy of text content auditing.
In a first aspect, an embodiment of the present invention provides a text content auditing method, where the method includes: acquiring a text to be audited; preprocessing the text to be audited to obtain a preprocessed text; inputting the preprocessed text into a pre-trained text prediction model, and outputting a text auditing result of the text to be audited, where the text auditing result comprises a first prediction result for a first clause of the preprocessed text and second prediction results for a plurality of character segments of the first clause; and determining whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the text prediction model is constructed in the following manner: acquiring a preset training data set, where the training data set comprises a preset text and labels of the preset text, and the labels include a first violation type label for a second clause of the preset text and second violation type labels for a plurality of character segments of the second clause; training a preset initial transformer model according to the training data set until a preset training end condition is met, to obtain a trained transformer model; and constructing the text prediction model based on the trained transformer model, a preset classifier and a preset sequence header processing program.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the step of training a preset initial transformer model according to the training data set until a preset training end condition is met, to obtain a trained transformer model, includes: after the training data set is input into the initial transformer model, adding a first embedded vector of a preset character to the second clause of the training data set to obtain a first intermediate training data set, and determining second embedded vectors of the plurality of character segments of the second clause to obtain a second intermediate training data set; determining a first violation prediction category according to the first intermediate training data set, and determining a second violation prediction category according to the second intermediate training data set; computing a cross-entropy loss for the first violation prediction category and for the second violation prediction category respectively, to obtain a first loss and a second loss; determining a loss function based on the first loss and the second loss; and performing back propagation according to the loss function, and updating the parameters of the initial transformer model until the preset training end condition is met, to obtain the trained transformer model.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the step of determining a first violation prediction category according to the first intermediate training data set and determining a second violation prediction category according to the second intermediate training data set includes: determining a first intermediate violation prediction category according to the first intermediate training data set, and determining a second intermediate violation prediction category according to the second intermediate training data set; and performing normalization processing on the first intermediate violation prediction category and the second intermediate violation prediction category respectively, to obtain the first violation prediction category and the second violation prediction category.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the determining a loss function according to the first loss and the second loss includes: determining a sum of the first loss and the second loss as the loss function.
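The loss described in the second to fourth implementation manners can be written out as follows. This is a sketch under stated assumptions: the normalization is taken to be a softmax, the second loss is averaged over the $T$ character segments of a clause, and the symbols $C_1$, $C_2$ (numbers of clause-level and segment-level violation categories), $y$ (one-hot labels) and $p$ (normalized predictions) are introduced here for illustration and do not appear in the source.

```latex
\begin{aligned}
L_1 &= -\sum_{c=1}^{C_1} y^{(1)}_{c}\,\log p^{(1)}_{c}, \\
L_2 &= -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C_2} y^{(2)}_{t,c}\,\log p^{(2)}_{t,c}, \\
L   &= L_1 + L_2 .
\end{aligned}
```

Only the last line, the sum of the first loss and the second loss, is stated by the source; whether the segment-level term is summed or averaged is not specified.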
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the step of acquiring a preset training data set includes: crawling original text of a preset type from a website through a web crawler; preprocessing the original text to obtain a preset text; splitting the preset text into clauses according to punctuation to obtain an intermediate preset text; adding a first violation type label to each third clause of the intermediate preset text, and adding second violation type labels to the plurality of character segments of each third clause of the intermediate preset text, to obtain a labeled intermediate preset text; and determining the training data set according to the preset text and the labeled intermediate preset text.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where the step of preprocessing the text to be audited to obtain a preprocessed text includes: performing data cleaning on the text to be audited based on a preset cleaning rule to obtain the preprocessed text.
With reference to the first aspect, an embodiment of the present invention provides a seventh possible implementation manner of the first aspect, where the step of determining whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result includes: merging the second prediction results based on a preset merging rule to obtain a merged second prediction result; and determining whether the text to be audited violates the rules according to the merged second prediction result and the first prediction result.
In a second aspect, an embodiment of the present invention provides a text content auditing apparatus, where the apparatus includes: a text acquisition module, used for acquiring a text to be audited; a text preprocessing module, used for preprocessing the text to be audited to obtain a preprocessed text; a model prediction module, used for inputting the preprocessed text into a pre-trained text prediction model and outputting a text auditing result of the text to be audited, where the text auditing result comprises a first prediction result for a first clause of the preprocessed text and second prediction results for a plurality of character segments of the first clause; and a violation result output module, used for determining whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes a processor and a memory, where the memory stores machine-executable instructions executable by the processor, and the processor executes the machine-executable instructions to implement the text content auditing method according to the first aspect or any one of the first to seventh possible implementation manners of the first aspect.
The embodiments of the invention bring the following beneficial effects:
The embodiments of the invention provide a text content auditing method, a text content auditing device and electronic equipment, wherein the text content auditing method comprises the following steps: acquiring a text to be audited; preprocessing the text to be audited to obtain a preprocessed text; inputting the preprocessed text into a pre-trained text prediction model, and outputting a text auditing result of the text to be audited, wherein the text auditing result comprises a first prediction result for a first clause of the preprocessed text and second prediction results for a plurality of character segments of the first clause; and determining whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result. In this method, semantic prediction is performed separately on the sentence meaning of each clause of the text to be audited and on the character segments of each clause, and the prediction results are fused, so that the accuracy of text content auditing is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flowchart of a text content auditing method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a training method of a text prediction model according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a text content auditing apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Reference numerals: 31 - text acquisition module; 32 - text preprocessing module; 33 - model prediction module; 34 - violation result output module; 41 - memory; 42 - processor; 43 - bus; 44 - communication interface.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, in the process of text content auditing, it is necessary to judge whether a text violates the rules. There are two common technical schemes. The first is a rule detection method based on keyword matching: the sensitive words of each violation category are collected in advance to construct a sensitive word bank; in actual use, the text to be detected is searched for sensitive words, and if a sensitive word of a certain category is found, the text is judged to be a violation of that category. The second is a text classification method based on a machine learning model: a large number of texts are collected in advance and manually labeled as to whether each text violates the rules and which violation type it belongs to; a text classification model is then trained to predict whether a text violates the rules and which violation type it belongs to. The text classification model can be a traditional machine learning model or a deep-learning-based neural network model. However, when a sentence to be audited contains a violating word but the sentence as a whole does not carry a violating meaning, the text auditing result obtained by these methods is inaccurate.
Based on this, the embodiment of the invention provides a text content auditing method and device and electronic equipment, and the technology can improve the accuracy of text content auditing. In order to facilitate understanding of the embodiment of the present invention, a text content auditing method disclosed in the embodiment of the present invention is first described in detail.
Example 1
Fig. 1 is a schematic flowchart of a text content auditing method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S101: acquiring the text to be audited.
Step S102: preprocessing the text to be audited to obtain a preprocessed text.
In this embodiment, step S102 specifically includes: performing data cleaning on the text to be audited based on a preset cleaning rule to obtain the preprocessed text.
Further, the data cleaning deletes HTML tags, garbled characters and codes in a preset format from the text to be audited to obtain the preprocessed text.
Step S103: inputting the preprocessed text into a pre-trained text prediction model, and outputting a text auditing result of the text to be audited, where the text auditing result comprises a first prediction result for a first clause of the preprocessed text and second prediction results for a plurality of character segments of the first clause.
Step S104: determining whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result.
In this embodiment, step S104 includes: first, merging the second prediction results based on a preset merging rule to obtain a merged second prediction result; then, determining whether the text to be audited violates the rules according to the merged second prediction result and the first prediction result.
Here, the second prediction results may be merged based on a preset order.
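The data cleaning of step S102 can be sketched as follows. The source does not define the preset cleaning rule, so the three patterns below are assumptions: HTML tags are matched by a simple regular expression, "garbled characters" are taken to be the Unicode replacement character and control characters, and the "preset format" is taken to be BBCode-style codes such as `[b]`. The function name `clean_text` is hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # HTML tags
# Assumed notion of garbled characters: U+FFFD and ASCII control characters.
GARBLED_RE = re.compile(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]")
# Assumed "preset format": BBCode-style codes such as [b], [/b], [url=...].
FORMAT_CODE_RE = re.compile(r"\[/?[a-z]+(?:=[^\]]*)?\]", re.IGNORECASE)


def clean_text(raw: str) -> str:
    """Apply the assumed cleaning rules and collapse runs of whitespace."""
    text = TAG_RE.sub("", raw)           # delete HTML tags
    text = html.unescape(text)           # decode entities such as &amp;
    text = GARBLED_RE.sub("", text)      # delete garbled characters
    text = FORMAT_CODE_RE.sub("", text)  # delete codes in the preset format
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_text("<p>Tom &amp; <b>Jerry</b></p> [b]ok[/b]")` yields `"Tom & Jerry ok"`.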
The embodiment of the invention provides a text content auditing method, which comprises the following steps: acquiring a text to be audited; preprocessing the text to be audited to obtain a preprocessed text; inputting the preprocessed text into a pre-trained text prediction model, and outputting a text auditing result of the text to be audited, where the text auditing result comprises a first prediction result for a first clause of the preprocessed text and second prediction results for a plurality of character segments of the first clause; and determining whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result. In this method, semantic prediction is performed separately on the sentence meaning of each clause of the text to be audited and on the character segments of each clause, and the prediction results are fused, so that the accuracy of text content auditing is improved.
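The merging and fusion in step S104 can be sketched as follows. The source only says that the second prediction results are merged by a preset rule, possibly in a preset order, so everything concrete here is an assumption: each character is treated as one segment, adjacent characters with the same label are merged into spans in text order, and a clause is flagged only when the clause-level label and at least one merged span agree. `Span`, `merge_segments`, `is_violation` and the label `"O"` for "no violation" are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: int  # index of the first character (inclusive)
    end: int    # index past the last character (exclusive)
    label: str  # violation category of the span


def merge_segments(char_labels: list[str], ok_label: str = "O") -> list[Span]:
    """Merge runs of adjacent characters that share a violation label, in text order."""
    spans: list[Span] = []
    for i, label in enumerate(char_labels):
        if label == ok_label:
            continue
        if spans and spans[-1].label == label and spans[-1].end == i:
            spans[-1].end = i + 1  # extend the current run
        else:
            spans.append(Span(i, i + 1, label))
    return spans


def is_violation(clause_label: str, spans: list[Span], ok_label: str = "O") -> bool:
    """Assumed fusion rule: clause-level and segment-level predictions must agree."""
    return clause_label != ok_label and any(s.label == clause_label for s in spans)
```

Under this assumed rule, a clause that contains a flagged span but whose clause-level prediction is `"O"` is not reported, which is the case that pure keyword matching gets wrong.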
Example 2
On the basis of the method shown in Fig. 1, the present invention further provides another text content auditing method, which focuses on the training process of the text prediction model used in step S103 of Example 1. Fig. 2 is a schematic flowchart of a training method of a text prediction model according to an embodiment of the present invention. As shown in Fig. 2, the text prediction model is obtained through the following steps:
Step S201: acquiring a preset training data set, where the training data set comprises a preset text and labels of the preset text, and the labels include a first violation type label for a second clause of the preset text and second violation type labels for a plurality of character segments of the second clause.
In one embodiment, the preset training data set is obtained through the following steps A1 to A5:
Step A1: crawling original text of a preset type from a website through a web crawler.
Step A2: preprocessing the original text to obtain a preset text.
Step A3: splitting the preset text into clauses according to punctuation to obtain an intermediate preset text.
Step A4: adding a first violation type label to each third clause of the intermediate preset text, and adding second violation type labels to the plurality of character segments of each third clause of the intermediate preset text, to obtain a labeled intermediate preset text.
Step A5: determining the training data set according to the preset text and the labeled intermediate preset text.
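Steps A3 and A4 can be sketched as follows. The set of punctuation marks, the record layout and the choice of one label per character are assumptions made for illustration, and `split_clauses` and `make_example` are hypothetical names. Note that treating `.` as a delimiter would also split decimal numbers, which a production rule would need to handle.

```python
import re

# Assumed clause delimiters: common Chinese and Western sentence-ending punctuation.
CLAUSE_RE = re.compile(r"[^。！？!?；;.]+[。！？!?；;.]?")


def split_clauses(text: str) -> list[str]:
    """Split text into clauses at punctuation, keeping the punctuation mark."""
    clauses = (m.group().strip() for m in CLAUSE_RE.finditer(text))
    return [c for c in clauses if c]


def make_example(clause: str, clause_label: str, segment_labels: list[str]) -> dict:
    """Build one labeled record: a clause-level label plus one label per character."""
    if len(segment_labels) != len(clause):
        raise ValueError("expected one segment label per character")
    return {
        "clause": clause,
        "clause_label": clause_label,      # first violation type label
        "segment_labels": segment_labels,  # second violation type labels
    }
```

For example, `split_clauses("今天天气好。你去哪里？Fine!")` returns the three clauses `"今天天气好。"`, `"你去哪里？"` and `"Fine!"`.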
Step S202: training a preset initial transformer model according to the training data set until a preset training end condition is met, to obtain a trained transformer model.
In practice, step S202 includes the following steps B1 to B5:
Step B1: after the training data set is input into the initial transformer model, adding a first embedded vector of a preset character to the second clause of the training data set to obtain a first intermediate training data set, and determining second embedded vectors of the plurality of character segments of the second clause to obtain a second intermediate training data set.
Step B2: determining a first violation prediction category according to the first intermediate training data set, and determining a second violation prediction category according to the second intermediate training data set.
Step B2 specifically includes: first, determining a first intermediate violation prediction category according to the first intermediate training data set, and determining a second intermediate violation prediction category according to the second intermediate training data set; then, performing normalization processing on the first intermediate violation prediction category and the second intermediate violation prediction category respectively, to obtain the first violation prediction category and the second violation prediction category.
Step B3: computing a cross-entropy loss for the first violation prediction category and for the second violation prediction category respectively, to obtain a first loss and a second loss.
Step B4: determining a loss function based on the first loss and the second loss.
Here, the sum of the first loss and the second loss is determined as the loss function.
Step B5: performing back propagation according to the loss function, and updating the parameters of the initial transformer model until the preset training end condition is met, to obtain the trained transformer model.
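Steps B2 to B4 can be sketched numerically as follows. The sketch assumes that the normalization is a softmax over raw scores and that the second loss is averaged over the character segments; neither detail is stated in the source. The function names are hypothetical, and step B5 (back propagation) is not shown because it would normally be delegated to an automatic differentiation framework.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Normalize raw scores into probabilities (assumed normalization of step B2)."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: list[float], target: int) -> float:
    """Cross-entropy of a probability vector against a single target class."""
    return -math.log(probs[target])


def joint_loss(
    clause_logits: list[float],
    clause_target: int,
    segment_logits: list[list[float]],
    segment_targets: list[int],
) -> float:
    """Loss function of step B4: the sum of the first loss and the second loss."""
    first_loss = cross_entropy(softmax(clause_logits), clause_target)
    per_segment = [
        cross_entropy(softmax(logits), target)
        for logits, target in zip(segment_logits, segment_targets)
    ]
    second_loss = sum(per_segment) / len(per_segment)  # averaging is an assumption
    return first_loss + second_loss
```

With uniform scores over two classes, each term equals log 2, so `joint_loss([0.0, 0.0], 0, [[0.0, 0.0]], [0])` equals `2 * math.log(2)`.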
Step S203: constructing the text prediction model based on the trained transformer model, a preset classifier and a preset sequence header processing program.
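The composition in step S203 can be sketched as follows. This is one possible reading, not the source's definition: the "preset character" is assumed to be a `[CLS]`-style marker prepended to each clause, the preset classifier is assumed to consume the vector at that position, and the "sequence header processing program" is assumed to be a head that labels each character vector. The encoder is a toy stand-in for the trained transformer model, and all names are hypothetical.

```python
from typing import Callable, Sequence

Vector = list[float]
# Hypothetical encoder interface: one output vector per input token.
Encoder = Callable[[Sequence[str]], list[Vector]]

PRESET_CHAR = "[CLS]"  # assumed preset character prepended to each clause


class TextPredictionModel:
    """Combine an encoder, a clause classifier and a per-character sequence head."""

    def __init__(
        self,
        encoder: Encoder,
        clause_classifier: Callable[[Vector], str],
        sequence_head: Callable[[list[Vector]], list[str]],
    ) -> None:
        self.encoder = encoder
        self.clause_classifier = clause_classifier
        self.sequence_head = sequence_head

    def predict(self, clause: str) -> tuple[str, list[str]]:
        tokens = [PRESET_CHAR, *clause]             # one token per character
        vectors = self.encoder(tokens)
        first = self.clause_classifier(vectors[0])  # clause-level prediction
        second = self.sequence_head(vectors[1:])    # per-character predictions
        return first, second


def toy_encoder(tokens: Sequence[str]) -> list[Vector]:
    """Toy stand-in for the transformer: marks the character 'x' as suspicious."""
    scores = [1.0 if t == "x" else 0.0 for t in tokens]
    scores[0] = max(scores[1:], default=0.0)  # preset character summarizes the clause
    return [[s] for s in scores]


model = TextPredictionModel(
    toy_encoder,
    clause_classifier=lambda v: "violation" if v[0] > 0.5 else "ok",
    sequence_head=lambda vs: ["violation" if v[0] > 0.5 else "ok" for v in vs],
)
```

Calling `model.predict("axb")` returns a clause-level label together with one label per character, which correspond to the first and second prediction results.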
The embodiment of the invention provides a text content auditing method, which comprises the following steps: acquiring a text to be audited; preprocessing the text to be audited to obtain a preprocessed text; inputting the preprocessed text into a pre-trained text prediction model, and outputting a text auditing result of the text to be audited, where the text auditing result comprises a first prediction result for a first clause of the preprocessed text and second prediction results for a plurality of character segments of the first clause; and determining whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result. The text prediction model is trained through the following steps: acquiring a preset training data set, where the training data set comprises a preset text and labels of the preset text, and the labels include a first violation type label for a second clause of the preset text and second violation type labels for a plurality of character segments of the second clause; training a preset initial transformer model according to the training data set until a preset training end condition is met, to obtain a trained transformer model; and constructing the text prediction model based on the trained transformer model, a preset classifier and a preset sequence header processing program. In this method, the initial transformer model is trained with the preset text and the labels of the preset text, which further improves the accuracy of text content auditing.
Example 3
An embodiment of the present invention further provides a text content auditing apparatus. Fig. 3 is a schematic structural diagram of the text content auditing apparatus according to the embodiment of the present invention. As shown in Fig. 3, the apparatus includes:
The text acquisition module 31, configured to acquire a text to be audited.
The text preprocessing module 32, configured to preprocess the text to be audited to obtain a preprocessed text.
The model prediction module 33, configured to input the preprocessed text into a pre-trained text prediction model and output a text auditing result of the text to be audited, where the text auditing result comprises a first prediction result for a first clause of the preprocessed text and second prediction results for a plurality of character segments of the first clause.
The violation result output module 34, configured to determine whether the text to be audited violates the rules based on the first prediction result and the second prediction results in the text auditing result.
The text acquisition module 31, the text preprocessing module 32, the model prediction module 33, and the violation result output module 34 are connected in sequence.
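The sequential connection of the four modules can be sketched as a simple function pipeline. The function `build_audit_pipeline` and the trivial stand-in callables are hypothetical; the sketch only shows that each module consumes the output of the previous one.

```python
from typing import Callable


def build_audit_pipeline(
    acquire: Callable[[str], str],
    preprocess: Callable[[str], str],
    predict: Callable[[str], tuple[str, list[str]]],
    decide: Callable[[str, list[str]], bool],
) -> Callable[[str], bool]:
    """Chain the four modules so each one consumes the previous one's output."""

    def audit(source: str) -> bool:
        text = acquire(source)            # text acquisition module 31
        cleaned = preprocess(text)        # text preprocessing module 32
        first, second = predict(cleaned)  # model prediction module 33
        return decide(first, second)      # violation result output module 34

    return audit


# Trivial stand-ins, for illustration only.
audit = build_audit_pipeline(
    acquire=lambda source: source,
    preprocess=str.strip,
    predict=lambda text: ("ok", ["ok"] * len(text)),
    decide=lambda first, second: first != "ok",
)
```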
In one embodiment, the model prediction module 33 is further configured to: acquire a preset training data set, where the training data set comprises a preset text and labels of the preset text, and the labels include a first violation type label for a second clause of the preset text and second violation type labels for a plurality of character segments of the second clause; train a preset initial transformer model according to the training data set until a preset training end condition is met, to obtain a trained transformer model; and construct the text prediction model based on the trained transformer model, a preset classifier and a preset sequence header processing program.
In one embodiment, the model prediction module 33 is further configured to: after the training data set is input into the initial transformer model, add a first embedded vector of a preset character to the second clause of the training data set to obtain a first intermediate training data set, and determine second embedded vectors of the plurality of character segments of the second clause to obtain a second intermediate training data set; determine a first violation prediction category according to the first intermediate training data set, and determine a second violation prediction category according to the second intermediate training data set; compute a cross-entropy loss for the first violation prediction category and for the second violation prediction category respectively, to obtain a first loss and a second loss; determine a loss function based on the first loss and the second loss; and perform back propagation according to the loss function, and update the parameters of the initial transformer model until the preset training end condition is met, to obtain the trained transformer model.
In one embodiment, the model prediction module 33 is further configured to: determine a first intermediate violation prediction category according to the first intermediate training data set, and determine a second intermediate violation prediction category according to the second intermediate training data set; and perform normalization processing on the first intermediate violation prediction category and the second intermediate violation prediction category respectively, to obtain the first violation prediction category and the second violation prediction category.
In one embodiment, the model prediction module 33 is further configured to determine the sum of the first loss and the second loss as the loss function.
In one embodiment, the model prediction module 33 is further configured to: crawl original text of a preset type from a website through a web crawler; preprocess the original text to obtain a preset text; split the preset text into clauses according to punctuation to obtain an intermediate preset text; add a first violation type label to each third clause of the intermediate preset text, and add second violation type labels to the plurality of character segments of each third clause of the intermediate preset text, to obtain a labeled intermediate preset text; and determine the training data set according to the preset text and the labeled intermediate preset text.
In one embodiment, the text preprocessing module 32 is further configured to perform data cleaning on the text to be audited based on a preset cleaning rule to obtain the preprocessed text.
In one embodiment, the violation result output module 34 is further configured to: merge the second prediction results based on a preset merging rule to obtain a merged second prediction result; and determine whether the text to be audited violates the rules according to the merged second prediction result and the first prediction result.
The text content auditing device provided by the embodiment of the invention has the same technical characteristics as the text content auditing method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Example 4
The present embodiment provides an electronic device comprising a processor and a memory, the memory storing computer-executable instructions capable of being executed by the processor, the processor executing the computer-executable instructions to implement the steps of the text content auditing method.
The present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the text content auditing method.
Referring to fig. 4, a schematic structural diagram of an electronic device is shown, where the electronic device includes: the memory 41 and the processor 42, wherein the memory 41 stores a computer program operable on the processor 42, and the processor implements the steps provided by the text content auditing method when executing the computer program.
As shown in fig. 4, the electronic device further includes: a bus 43 and a communication interface 44, the processor 42, the communication interface 44 and the memory 41 being connected by the bus 43; the processor 42 is configured to execute executable modules, such as computer programs, stored in the memory 41.
The memory 41 may include a random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. Communication between the network element of the system and at least one other network element is established through at least one communication interface 44 (which may be wired or wireless), over the Internet, a wide area network, a local area network, a metropolitan area network, or the like.
The bus 43 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 4, but that does not indicate only one bus or one type of bus.
The memory 41 is used for storing a program, and the processor 42 executes the program after receiving an execution instruction. The method performed by the text content auditing apparatus disclosed in any of the foregoing embodiments of the invention may be applied to the processor 42 or implemented by the processor 42. The processor 42 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 42 or by instructions in the form of software. The processor 42 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor can implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 41, and the processor 42 reads the information in the memory 41 and completes the steps of the method in combination with its hardware.
Further, embodiments of the present invention also provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by processor 42, cause processor 42 to implement the text content auditing method described above.
The electronic device and the computer-readable storage medium provided by the embodiment of the invention have the same technical characteristics, so the same technical problems can be solved, and the same technical effects can be achieved.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to the specific situation.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Claims (9)

1. A text content auditing method is characterized by comprising the following steps:
acquiring a text to be audited;
preprocessing the text to be audited to obtain a preprocessed text;
inputting the preprocessed text into a pre-trained text prediction model, and outputting a text auditing result of the text to be audited; the text auditing result comprises a first prediction result for a first clause of the preprocessed text and a second prediction result for a plurality of character segments of the first clause; the text prediction model is constructed in the following way:
acquiring a preset training data set; the training data set comprises a preset text and labels of the preset text; the labels comprise: a first violation type label for a second clause of the preset text, and second violation type labels for a plurality of character segments of the second clause;
training a preset initial transformer model according to the training data set until a preset training end condition is met, and obtaining a trained transformer model;
constructing the text prediction model based on the trained transformer model, a preset classifier and a preset sequence header processing program;
and determining whether the text to be audited is in violation based on the first prediction result and the second prediction result in the text auditing result.
2. The text content auditing method according to claim 1, wherein the step of training a preset initial transformer model according to the training data set until a preset training end condition is met to obtain a trained transformer model comprises:
after the training data set is input into the initial transformer model, adding a first embedded vector of a preset character to the second clause of the training data set to obtain a first intermediate training data set, and determining second embedded vectors of the plurality of character segments corresponding to the second clause to obtain a second intermediate training data set;
determining a first violation prediction category according to the first intermediate training data set and a second violation prediction category according to the second intermediate training data set;
computing a cross-entropy loss for the first violation prediction category and for the second violation prediction category respectively, to obtain a first loss and a second loss;
determining a loss function according to the first loss and the second loss;
and performing back propagation according to the loss function, and updating the parameters of the initial transformer model until the preset training end condition is met, to obtain the trained transformer model.
3. The method for auditing text content according to claim 2, wherein the step of determining a first violation prediction category from the first intermediate training data set and a second violation prediction category from the second intermediate training data set comprises:
determining a first intermediate violation prediction category according to the first intermediate training data set, and determining a second intermediate violation prediction category according to the second intermediate training data set;
and respectively carrying out normalization processing on the first intermediate violation prediction category and the second intermediate violation prediction category to obtain a first violation prediction category and a second violation prediction category.
4. The text content auditing method of claim 2 where the step of determining a loss function based on the first loss and the second loss comprises:
determining a sum of the first loss and the second loss as the loss function.
5. The text content auditing method of claim 1 where the step of obtaining a preset training data set includes:
crawling a preset type of original text on a website through a web crawler;
preprocessing the original text to obtain a preset text;
performing sentence division processing on the preset text according to punctuation marks to obtain a middle preset text;
adding the first violation type labels to third sentences corresponding to the middle preset text respectively, and adding the second violation type labels to a plurality of character segments corresponding to each third sentence in the middle preset text to obtain a labeled middle preset text;
and determining the training data set according to the preset text and the labeled middle preset text.
6. The method for auditing the contents of a text according to claim 1, wherein the step of preprocessing the text to be audited to obtain a preprocessed text comprises:
and carrying out data cleaning on the text to be audited based on a preset cleaning rule to obtain a preprocessed text.
7. The text content auditing method according to claim 1, characterized in that the step of determining whether the text to be audited is in violation based on the first prediction result and the second prediction result in the text auditing result comprises:
merging the second prediction results based on a preset merging rule to obtain a merged second prediction result;
and determining whether the text to be audited is in violation according to the merged second prediction result and the first prediction result.
8. A text content auditing apparatus, characterized by comprising:
the text acquisition module is used for acquiring a text to be audited;
the text preprocessing module is used for preprocessing the text to be audited to obtain a preprocessed text;
the model prediction module is used for inputting the preprocessed text into a pre-trained text prediction model and outputting a text auditing result of the text to be audited; the text auditing result comprises a first prediction result for a first clause of the preprocessed text and a second prediction result for a plurality of character segments of the first clause; the text prediction model is constructed in the following way: acquiring a preset training data set; the training data set comprises a preset text and labels of the preset text; the labels comprise: a first violation type label for a second clause of the preset text, and second violation type labels for a plurality of character segments of the second clause; training a preset initial transformer model according to the training data set until a preset training end condition is met, and obtaining a trained transformer model; constructing the text prediction model based on the trained transformer model, a preset classifier and a preset sequence header processing program;
and the violation result output module is used for determining whether the text to be audited is in violation based on the first prediction result and the second prediction result in the text auditing result.
9. An electronic device, comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the text content auditing method of any one of claims 1 to 7.
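The model construction recited in claim 1 combines three parts: a trained transformer model, a preset classifier, and a preset sequence header processing program. A minimal structural sketch of that composition follows. It assumes the "sequence header" is a per-character labeling head and that the encoder returns one vector for the preset character followed by one vector per character. The toy encoder and heads are stand-ins for illustration, not the trained components, and every name is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class TextPredictionModel:
    encoder: Callable[[str], List[Vector]]                  # stand-in for the trained transformer
    classifier: Callable[[Vector], str]                     # clause-level head
    sequence_head: Callable[[Sequence[Vector]], List[str]]  # character-level head

    def predict(self, clause: str) -> Tuple[str, List[str]]:
        vectors = self.encoder(clause)
        first = self.classifier(vectors[0])       # first prediction result
        second = self.sequence_head(vectors[1:])  # second prediction results
        return first, second


def toy_encoder(clause: str) -> List[Vector]:
    """Toy features: 1.0 for digits, 0.0 otherwise; the first vector is their mean."""
    chars = [[1.0 if c.isdigit() else 0.0] for c in clause]
    summary = [sum(v[0] for v in chars) / max(len(chars), 1)]
    return [summary] + chars


model = TextPredictionModel(
    encoder=toy_encoder,
    classifier=lambda v: "contact" if v[0] > 0.3 else "normal",
    sequence_head=lambda vs: ["contact" if v[0] == 1.0 else "O" for v in vs],
)
print(model.predict("call 555"))
```

Sharing one encoder between both heads is what allows the clause-level and character-level predictions to be produced in a single forward pass.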
CN202211552965.2A 2022-12-06 2022-12-06 Text content auditing method and device and electronic equipment Active CN115587588B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211552965.2A CN115587588B (en) 2022-12-06 2022-12-06 Text content auditing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211552965.2A CN115587588B (en) 2022-12-06 2022-12-06 Text content auditing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115587588A CN115587588A (en) 2023-01-10
CN115587588B true CN115587588B (en) 2023-02-28

Family

ID=84782994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211552965.2A Active CN115587588B (en) 2022-12-06 2022-12-06 Text content auditing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115587588B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738011A (en) * 2020-05-09 2020-10-02 完美世界(北京)软件科技发展有限公司 Illegal text recognition method and device, storage medium and electronic device
CN113312449A (en) * 2021-05-17 2021-08-27 华南理工大学 Text auditing method, system and medium based on keywords and deep learning
CN114579693A (en) * 2021-12-02 2022-06-03 广州趣丸网络科技有限公司 NLP text security audit multistage retrieval system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11461552B2 (en) * 2020-07-06 2022-10-04 Sap Se Automated document review system combining deterministic and machine learning algorithms for legal document review

Also Published As

Publication number Publication date
CN115587588A (en) 2023-01-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant