CN113688630B - Text content auditing method, device, computer equipment and storage medium - Google Patents


Info

Publication number: CN113688630B
Application number: CN202111012089.XA
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN113688630A
Inventor: 程相
Original and current assignee: Ping An Life Insurance Company of China Ltd
Legal status: Active (application granted)
Prior art keywords: detection, text, sensitive, checked, word

Classifications

    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (under G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities)
    • G06F18/22 Matching criteria, e.g. proximity measures (under G06F18/00 Pattern recognition; G06F18/20 Analysing)
    • G06F40/242 Dictionaries (under G06F40/237 Lexical tools)
    • G06Q10/10 Office automation; Time management (under G06Q10/00 Administration; Management)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application relates to the field of artificial intelligence and provides a text content auditing method, apparatus, computer device and storage medium. A text to be audited is obtained; word segmentation is performed on the text through a preset word segmentation method; the segmented text is input into preset detection units to detect whether corresponding sensitive words exist; if a detection unit detects no corresponding sensitive word, the text is input into a detection model that detects the same category of sensitive words as that unit, yielding a corresponding detection result; and whether the text to be audited is qualified is determined according to the detection result. With this text content auditing method, apparatus, computer device and storage medium, text content can be audited more accurately.

Description

Text content auditing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a text content auditing method, apparatus, computer device, and storage medium.
Background
In recent years, text content has become an important carrier for marketing and customer acquisition, which puts great pressure on content auditing. Traditionally, text content is audited piece by piece by large teams of trained reviewers, who must verify it word by word and sentence by sentence. This is labor intensive and inefficient, and it copes poorly with changing content, with time-sensitive content, and with misjudgments.
Moreover, manual reviewers audit text content based on their own experience, so different reviewers may reach different conclusions about the same text. This makes misjudgment likely, keeps accuracy low, restricts the development of content auditing, and hinders the distribution of text content.
Disclosure of Invention
The main purpose of the application is to provide a text content auditing method, apparatus, computer device and storage medium, aiming to solve the current technical problem of low accuracy in text content auditing.
In order to achieve the above object, the present application provides a text content auditing method, comprising the steps of:
acquiring a text to be checked;
performing word segmentation processing on the text to be checked through a preset word segmentation processing method;
respectively inputting the text to be checked after word segmentation into a preset detection unit to detect whether corresponding sensitive words exist;
if the detection unit does not detect the corresponding sensitive word, inputting the text to be checked into a detection model for detecting the sensitive word of the same kind as the detection unit to obtain a corresponding detection result; the detection model is trained based on a neural network model;
and determining whether the text to be checked is qualified or not according to the detection result.
Further, the step of determining whether the text to be checked is qualified according to the detection result includes:
calculating the similarity between the detected sensitive words in the detection result obtained by the detection model and all the sensitive words of the dictionary tree of the corresponding detection unit;
selecting the sensitive word with the highest similarity with the detection sensitive word as the target sensitive word of the detection sensitive word;
acquiring the priority of the target sensitive word;
determining the auditing score of the detection sensitive word according to the priority;
and determining whether the text to be checked is qualified or not according to the checking score.
Further, the step of determining whether the text to be checked is qualified according to the checking score includes:
summing the auditing scores of all the detection sensitive words corresponding to the detection results to obtain corresponding target auditing scores of the detection results;
comparing the target audit score with a preset score threshold;
and if the target audit score is smaller than the preset score threshold, determining that the text to be checked is qualified.
Further, after the step of comparing the target audit score with a preset score threshold, the method includes:
if the target audit score is greater than the preset score threshold, marking the text to be audited according to the detection sensitive word;
and inputting the marked text to be checked into a manual checking channel.
Further, the step of inputting the text to be checked after the word segmentation processing into a preset detection unit to detect whether corresponding sensitive words exist or not respectively includes:
detecting a description scene of the text to be checked;
acquiring a scene sensitive dictionary of the detection unit under the description scene;
and determining whether the corresponding sensitive word exists in the text to be checked according to the scene sensitive dictionary.
Further, after the step of inputting the text to be checked after the word segmentation processing into a preset detection unit to detect whether corresponding sensitive words exist, the method comprises the following steps:
if a political sensitive word or a drug-related sensitive word is detected, ending the auditing flow.
Further, the detection models include a terrorism detection model, a pornography detection model, an abuse detection model and an advertisement detection model, and the step of inputting the text to be checked into a preset detection model for detection to obtain a corresponding detection result includes:
inputting the text to be checked into the terrorism detection model, the pornography detection model, the abuse detection model and the advertisement detection model respectively for detection to obtain corresponding detection results; the terrorism detection model is trained based on a FastText model; the pornography detection model, the abuse detection model and the advertisement detection model are each trained based on an Albert model.
The application also provides a text content auditing device, which comprises:
the acquisition unit is used for acquiring the text to be checked;
the word segmentation processing unit is used for carrying out word segmentation processing on the text to be checked through a preset word segmentation processing method;
the first detection unit is used for respectively inputting the text to be checked after word segmentation into a preset detection unit to detect whether corresponding sensitive words exist or not;
the second detection unit is used for inputting the text to be checked into a detection model for detecting the same kind of sensitive words as the detection unit if the detection unit does not detect the corresponding sensitive words, so as to obtain a corresponding detection result; the detection model is trained based on a neural network model;
and the determining unit is used for determining whether the text to be checked is qualified or not according to the detection result.
The application also provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of any one of the text content auditing methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text content auditing method of any of the above.
According to the text content auditing method, apparatus, computer device and storage medium, the detection units perform a first screening to hit sensitive words with high sensitivity confidence. A detection model then performs a second screening, detecting sensitive variants and variant content so as to hit sensitive words that do not exist in the detection units. The detection result is therefore more accurate, and the labor cost of content auditing is greatly reduced.
Drawings
FIG. 1 is a schematic diagram showing steps of a text content auditing method according to an embodiment of the present application;
FIG. 2 is a block diagram of a text content auditing apparatus according to an embodiment of the present application;
fig. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiments of the application can acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Referring to fig. 1, an embodiment of the present application provides a text content auditing method, including the following steps:
s1, acquiring a text to be checked;
s2, performing word segmentation on the text to be checked through a preset word segmentation processing method;
step S3, respectively inputting the text to be checked after word segmentation into a preset detection unit to detect whether corresponding sensitive words exist; the detection unit is used for detecting different kinds of sensitive words;
step S4, if the detection unit does not detect the corresponding sensitive word, inputting the text to be checked into a detection model which is used for detecting the sensitive word of the same kind as the detection unit to detect, and obtaining a corresponding detection result; the detection model is trained based on a neural network model;
and S5, determining whether the text to be checked is qualified or not according to the detection result.
In this embodiment, as described in step S1, the text to be checked refers to text that a user wants to publish on a platform for other users to view. Before it can be viewed, its content must be audited, to prevent the user from publishing inappropriate language.
As described in step S2 above, the preset word segmentation methods include jieba, SnowNLP, THULAC and the like. Word segmentation is performed on the text to be audited through the chosen segmentation method, each resulting word is tagged with its part of speech, and stop words are deleted. Stop words are mainly the function words of human language, such as modal particles, which are extremely common and carry no concrete meaning compared with other words.
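The segmentation-plus-stop-word step above can be sketched as follows. This is a minimal illustration using forward maximum matching over a toy dictionary; in practice a segmenter such as jieba, SnowNLP or THULAC would be used, and the dictionary and stop-word list here are placeholders.

```python
# Minimal sketch of the word segmentation step: forward maximum
# matching over a toy dictionary, followed by stop-word removal.
# The dictionary and stop-word list are illustrative only.

STOP_WORDS = {"the", "of", "a"}  # assumed stop-word list

def segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary entry; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:  # no dictionary entry starts at this position
            words.append(text[i])
            i += 1
    return words

def remove_stop_words(words, stop_words=STOP_WORDS):
    return [w for w in words if w not in stop_words]
```

Maximum matching prefers the longest entry, so `segment("abcd", {"ab", "abc", "cd"})` yields `["abc", "d"]` rather than `["ab", "cd"]`.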
As described in step S3, the detection units include a political detection unit, a drug detection unit, a terrorism detection unit, a pornography detection unit, an abuse detection unit and an advertisement detection unit. The political detection unit identifies forbidden information in the text such as sensitive events, political figures and reactionary propaganda; the terrorism detection unit identifies illicit content such as violent behavior, descriptions of terrorism, and guns and ammunition; the drug detection unit identifies illegal content related to drugs, such as drug substances, drug manufacturing and drug dealing; the pornography detection unit identifies pornographic descriptions, links to pornographic resources, vulgar dating content and other obscene material in the text; the abuse detection unit identifies bad content such as insults, personal attacks and negative venting; and the advertisement detection unit identifies promotional content in the text intended to divert traffic to third parties.
Each detection unit has a corresponding dictionary tree containing its sensitive words. Inputting the segmented text to be checked into a detection unit means matching the text against the sensitive words in that unit's dictionary tree; if a corresponding sensitive word is matched, the text contains content unsuitable for release and is forbidden from publication on the public platform. The sensitive words in the dictionary tree are those currently and commonly forbidden, so whenever a detection unit detects one, release on the public platform is prohibited.
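The dictionary tree described above can be sketched as a simple trie. The words used here are placeholders; each real detection unit would hold its own category's sensitive-word list.

```python
# A minimal dictionary-tree (trie) sketch for the first-pass screen:
# each detection unit holds a trie of known sensitive words, and the
# segmented text is scanned against it.

class Trie:
    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

def detect_sensitive(words, trie):
    """Return the segmented words that match the unit's trie."""
    return [w for w in words if trie.contains(w)]
```

A trie shares prefixes between entries, so lookup cost depends on the word's length rather than on the size of the sensitive-word list.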
As described in steps S4-S5 above, when a detection unit detects no corresponding sensitive word, the text is input into the corresponding preset detection model to obtain a detection result, which may be empty or may contain specific detected sensitive words. The detection models are trained based on neural network models; specifically, different detection models may be trained based on different neural networks. Each detection model corresponds to one detection unit and detects sensitive words of the same category as that unit. The detection unit matches directly against its dictionary tree; when no sensitive word is matched, the detection model is used to detect variants, i.e. sensitive words that do not belong to the dictionary tree. When the detection result contains no detected sensitive word, i.e. the result is empty, the text to be audited is determined to be qualified.
In this embodiment, the detection unit performs a first filtering to hit some sensitive words with higher sensitive confidence. And then setting a detection model to perform second screening, and detecting sensitive words on some sensitive variants and contents to hit sensitive words which do not exist in some detection units, so that the detection result is more accurate, and the labor cost of content auditing is greatly saved.
In an embodiment, the step S5 of determining whether the text to be checked is qualified according to the detection result includes:
S5A, calculating the similarity between the detected sensitive words in the detection result obtained by the detection model and all the sensitive words of the dictionary tree of the corresponding detection unit;
S5B, selecting a sensitive word with the highest similarity with the detection sensitive word as a target sensitive word of the detection sensitive word;
S5C, acquiring the priority of the target sensitive word;
S5D, determining the auditing score of the detection sensitive word according to the priority;
and S5E, determining whether the text to be checked is qualified or not according to the checking score.
In this embodiment, each detection unit corresponds to a detection model; for example, the terrorism detection unit corresponds to the terrorism detection model, both detecting illicit content such as violent behavior, descriptions of terrorism, and guns and ammunition. The similarity between each detected sensitive word in a detection result and the sensitive words of the corresponding detection unit is calculated. Taking the terrorism detection model as an example, its detection result may contain one or more detected sensitive words, none of which belong to the existing sensitive words of the terrorism detection unit; therefore the similarity between each detected word and the unit's sensitive words is calculated, using, for example, cosine similarity, Euclidean distance or Manhattan distance.
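The similarity step can be sketched as below. The patent leaves the word representation open; as an assumption, character-count vectors stand in here for the word embeddings a real system would compare.

```python
import math

# Sketch of the similarity step: compare a detected variant word with
# each known sensitive word of the corresponding unit, using cosine
# similarity over character-count vectors (an assumed representation).

def char_vector(word):
    vec = {}
    for ch in word:
        vec[ch] = vec.get(ch, 0) + 1
    return vec

def cosine_similarity(a, b):
    va, vb = char_vector(a), char_vector(b)
    dot = sum(va[k] * vb.get(k, 0) for k in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def closest_sensitive_word(detected, dictionary_words):
    """Pick the dictionary word with the highest similarity: the
    'target sensitive word' of the detected word."""
    return max(dictionary_words, key=lambda w: cosine_similarity(detected, w))
```

Euclidean or Manhattan distance could be substituted for cosine similarity with only the `max`/`min` direction changing.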
And selecting the sensitive word with the highest similarity as the target sensitive word of the detection sensitive word, indicating that the sensitive word is closest to the detection sensitive word, and respectively calculating the target sensitive word of each detection sensitive word when a plurality of detection sensitive words exist.
The sensitive words in the terrorism detection unit are divided into priority levels in advance, each level corresponding to a certain score; the higher the priority, the higher the score. The priority of the target sensitive word is taken as the priority of the detected sensitive word, which yields the detected word's audit score. When there are several detected sensitive words, each obtains its own target sensitive word and hence its own audit score. For example, if the sensitive words of the terrorism unit are divided into 10 levels, with level 1 scoring 10 points and level 2 scoring 9 points, then three detected sensitive words at levels 1, 4 and 8 receive audit scores of 10, 7 and 3 respectively. The detected sensitive words in each detection model's result are scored in the same way.
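The priority-to-score mapping in the example above follows directly: with ten levels, level 1 scores 10 and level 10 scores 1. A minimal sketch:

```python
# Sketch of the priority-to-score mapping described above: ten
# priority levels, level 1 scoring 10 points down to level 10
# scoring 1 point. The number of levels is taken from the example.

NUM_LEVELS = 10

def audit_score(priority_level, num_levels=NUM_LEVELS):
    """Map a priority level (1 = most sensitive) to an audit score."""
    if not 1 <= priority_level <= num_levels:
        raise ValueError("priority level out of range")
    return num_levels + 1 - priority_level

# Three detected words at levels 1, 4 and 8 score 10, 7 and 3 points.
```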
In this embodiment, computing the similarity between the sensitive words detected by a detection model and the sensitive words of the corresponding detection unit determines how sensitive the detected words are, so that the text to be audited can be reviewed more reliably according to them.
In one embodiment, the step S5E of determining whether the text to be checked is qualified according to the checking score includes:
step S5E1, the auditing scores of all detection sensitive words corresponding to the detection result are summed up to obtain a corresponding target auditing score of the detection result;
S5E2, comparing the target audit score with a preset score threshold;
and S5E3, if the target audit score is smaller than the preset score threshold, determining that the text to be checked is qualified.
In this embodiment, each detection model yields one detection result, and one detection result may contain one or more detected sensitive words, each with its own audit score. The audit scores belonging to the same detection result are summed to obtain that result's target audit score, and each target audit score is compared with a preset score threshold; when every target audit score is less than or equal to the threshold, the text to be audited is qualified. In another embodiment, each detection result may have its own preset score threshold with a different value, and each target audit score is compared with its corresponding threshold to determine whether the text to be audited is qualified.
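The qualification decision above can be sketched in a few lines. The threshold value is an assumed placeholder; the patent does not fix one.

```python
# Sketch of the qualification decision: sum the audit scores within
# each detection result and compare each total with a threshold.
# The threshold value is an assumed placeholder.

SCORE_THRESHOLD = 15  # assumed value for illustration

def is_text_qualified(results, threshold=SCORE_THRESHOLD):
    """results: mapping of detection-model name -> list of audit
    scores for that model's detected sensitive words. The text passes
    only if every model's total stays at or below the threshold."""
    return all(sum(scores) <= threshold for scores in results.values())
```

The per-result variant mentioned above would simply pass a per-model threshold mapping instead of one shared value.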
In an embodiment, after the step S5E2 of comparing the target audit score with a preset score threshold, the method includes:
S5E4, if the target audit score is greater than the preset score threshold, marking the text to be audited according to the detection sensitive word;
and S5E5, inputting the marked text to be checked into a manual checking channel.
In this embodiment, the target audit score is derived from the audit score of each detected sensitive word, so the sensitivity of the text to be audited can be evaluated from four aspects: terrorism, pornography, abuse and advertising. When the target audit score exceeds the preset score threshold, the text is highly sensitive and manual review is required. The detected sensitive words are marked in the text, which is then placed into the manual review channel, so reviewers can locate each detected word directly. When a reviewer confirms a detected sensitive word, it can be added to the corresponding detection unit, so that a new text containing it is caught directly by the unit without invoking the model. Further, manual review may find sensitive words that none of the detection models detected; such words are added to the training set of the corresponding detection model, which is retrained and thereby further optimized.
In an embodiment, the step S3 of inputting the text to be checked after the word segmentation process into a preset detection unit to detect whether a corresponding sensitive word exists includes:
step S31, detecting a description scene of the text to be checked;
step S32, acquiring a scene sensitive dictionary of the detection unit under the description scene;
and step S33, determining whether the corresponding sensitive words exist in the text to be checked according to the scene sensitive dictionary.
In this embodiment, business data comes from widely differing sources with differing sensitivity standards; what counts as sensitive in scene A is not necessarily sensitive in scene B. Different description scenes are therefore given different scene-sensitive dictionaries: the description scene of the text to be audited is determined first, and the corresponding scene dictionary is used for sensitive-word detection, which makes detection more accurate. Specifically, each detection unit holds one large dictionary tree containing its sensitive words, and each sensitive word carries scene tags for its applicable scenes; the scene-sensitive dictionary is assembled from those tags according to the description scene of the text, so the unit's dictionary adapts to every description scene. When uploading the text to be checked, the user selects a corresponding scene description tag, such as social, entertainment, sports or health, and the description scene is determined from that tag. In another embodiment, the text is input into a preset scene detection model, which detects the description scene and outputs the corresponding scene label; this determines the description scene more accurately and also covers the case where the user selects no scene description tag.
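Assembling the scene-sensitive dictionary from tags can be sketched as a simple filter. The tags and words below are illustrative placeholders.

```python
# Sketch of scene-aware dictionaries: every sensitive word carries
# scene tags, and the detection unit assembles a per-scene dictionary
# before matching. Words and tags here are placeholders.

SENSITIVE_WORDS = {
    "wordA": {"social", "entertainment"},
    "wordB": {"health"},
    "wordC": {"social", "sports"},
}

def scene_dictionary(scene, tagged_words=None):
    """Collect the sensitive words applicable to one description scene."""
    tagged = SENSITIVE_WORDS if tagged_words is None else tagged_words
    return {w for w, scenes in tagged.items() if scene in scenes}
```

The unit then matches the segmented text against `scene_dictionary(scene)` instead of its full word list.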
In an embodiment, after the step S3 of inputting the text to be checked after the word segmentation process into a preset detection unit to detect whether a corresponding sensitive word exists, the method includes:
and step S3A, if a political sensitive word or a drug-related sensitive word is detected, ending the auditing process.
In this embodiment, when either the political detection unit or the drug detection unit detects a sensitive word, no subsequent detection is needed. The requirements on politics and drugs are stricter; these two units contain the currently forbidden sensitive words, and a direct match in them means the text to be checked is clearly non-compliant, so further auditing is unnecessary. Therefore, during unit-based auditing, the political and drug detection units are applied first, and subsequent detection proceeds only when they detect no corresponding sensitive word. The subsequent detection can run serially: after the political and drug units finish, the terrorism detection unit is applied, and if it detects no sensitive word, the terrorism detection model is used. Finally, the pornography, abuse and advertisement detection units detect their corresponding sensitive words, and where none are found, the corresponding models are used for detection.
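The serial order described above can be sketched as a small pipeline. The detector callables are stand-ins; in the patent's design the unit detectors are trie matchers and the model detectors are trained classifiers.

```python
# Sketch of the serial review order: politics and drugs first (a hit
# ends the review immediately), then terrorism, then pornography,
# abuse and advertising, each with a model fallback when the unit
# finds nothing. Detectors are placeholder callables returning hits.

def review_pipeline(words, unit_detectors, model_detectors):
    """unit_detectors / model_detectors: mappings of
    category -> callable(words) -> list of sensitive words found."""
    hits = {}
    for category in ("politics", "drugs"):
        found = unit_detectors[category](words)
        if found:  # strict categories: a unit hit ends the review
            return {"qualified": False, "hits": {category: found}}
    for category in ("terrorism", "porn", "abuse", "advertising"):
        found = unit_detectors[category](words)
        if not found:  # unit missed: fall back to the trained model
            found = model_detectors[category](words)
        if found:
            hits[category] = found
    return {"qualified": not hits, "hits": hits}
```

A scoring step (summing audit scores per category) would replace the simple `not hits` test in the full method.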
In an embodiment, the detection model includes a terrorism detection model, a yellow detection model, a # 35881 detection model and an advertisement detection model, and the step of inputting the text to be checked into a preset detection model to detect, and obtaining a corresponding detection result includes:
inputting the text to be checked into the terrorism detection model, the pornography detection model, the abuse detection model and the advertisement detection model respectively for detection to obtain corresponding detection results; the terrorism detection model is trained based on a FastText model; the pornography detection model, the abuse detection model and the advertisement detection model are each trained based on an Albert model.
In this embodiment, no detection model is set for political or drug-related detection, since a hit in those units ends the audit directly. The terrorism detection model is trained based on the FastText model, a fast text classification algorithm with two advantages: it increases training and testing speed while keeping high precision, and it needs no pre-trained word vectors. Training the terrorism detection model with FastText therefore allows accurate detection of violent and terrorism-related content in the text to be checked. When training the FastText model: collect corpora containing sensitive information; collect terrorism-related sensitive words and add them to a terrorism dictionary tree; match and filter the corpora with the terrorism-related sensitive words to construct a preliminary training corpus; manually extract part of the training corpus for manual verification and sort out problematic data features; select part of the normal corpus data and add it to the preliminary training corpus to construct a training test set; and input the training corpus in the training test set into the FastText model for training until training is completed.
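The corpus-construction steps above can be sketched as follows, using the standard FastText supervised label format (`__label__X text`); the sample texts and label names are invented for illustration:

```python
def build_fasttext_corpus(raw_corpus, terror_words, normal_corpus):
    lines = []
    for text in raw_corpus:
        # keep only texts matched by a terrorism-related sensitive word,
        # forming the preliminary positive training corpus
        if any(w in text for w in terror_words):
            lines.append("__label__terror " + text)
    for text in normal_corpus:
        # mix in normal samples so the classifier sees both classes
        lines.append("__label__normal " + text)
    return lines


raw = ["buy guns and bombs now", "nice weather today", "join the attack"]
corpus = build_fasttext_corpus(raw, ["bombs", "attack"], ["nice weather today"])
```

Written to a file one line per sample, this format is what FastText's supervised training mode expects; the manual verification step in between is omitted here.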
When the pornography detection unit, the abuse detection unit or the advertisement detection unit does not detect the corresponding sensitive word, the text to be checked is input into the corresponding model for detection. The pornography detection model, the abuse detection model and the advertisement detection model are all trained based on the Albert model. The Albert model reduces the embedding parameter size of the BERT model through matrix factorization; specifically, the factorization may be based on triangular decomposition, QR decomposition, Jordan decomposition and the like. The fully connected layers and the attention layers in the Albert model share parameters, that is, all parameters in the encoder are shared, which greatly reduces the parameter count of the Albert model and greatly increases the training speed. Meanwhile, the Albert model proposes a new task, SOP (sentence-order prediction): a positive SOP sample is obtained in the same way as an NSP positive sample, and a negative sample reverses the sentence order of a positive sample. SOP samples are selected from the same document because the task focuses only on sentence order and is unaffected by topic. The Albert model also removes the dropout layer so that the model fits the data better. When the corresponding training sets are constructed, data enhancement is adopted: digits and letters are mixed into the corresponding sensitive words, and phonetic-character mixed variants, homophone variants, pinyin-abbreviation variants, front/back nasal-sound and flat/retroflex tongue-sound variants, reverse-reading variants, character-insertion variants, character-omission variants, character-decomposition variants, similar-character variants and synonym variants are added to improve the robustness of the Albert model to sensitive words.
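A minimal sketch of such data enhancement; the variant generators here are simplified illustrations of a few of the listed variant types (look-alike substitution, character insertion, character omission, reverse reading), not the patent's actual augmentation rules:

```python
import random

def augment(word, seed=0):
    rng = random.Random(seed)
    variants = set()
    # digit/letter mixing via look-alike character substitutions
    lookalike = {"o": "0", "i": "1", "e": "3", "a": "@"}
    variants.add("".join(lookalike.get(c, c) for c in word))
    pos = rng.randrange(1, len(word)) if len(word) > 1 else 0
    # character-insertion variant: pad with a separator character
    variants.add(word[:pos] + "*" + word[pos:])
    # character-omission variant
    if len(word) > 2:
        variants.add(word[:pos] + word[pos + 1:])
    # reverse-reading variant
    variants.add(word[::-1])
    variants.discard(word)
    return sorted(variants)


vs = augment("casino")
```

Each sensitive word in the training set would be expanded with such variants so the model learns to recognize obfuscated spellings.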
Referring to fig. 2, an embodiment of the present application provides a text content auditing apparatus, including:
an acquiring unit 10 for acquiring a text to be checked;
the word segmentation processing unit 20 is used for performing word segmentation processing on the text to be checked through a preset word segmentation processing method;
the first detection unit 30 is configured to input the text to be checked after word segmentation into preset detection units respectively to detect whether corresponding sensitive words exist;
the second detection unit 40 is configured to input the text to be checked into a detection model for detecting the same kind of sensitive words as the detection unit to obtain a corresponding detection result if the detection unit does not detect the corresponding sensitive words; the detection model is trained based on a neural network model;
and the determining unit 50 is used for determining whether the text to be checked is qualified or not according to the detection result.
In an embodiment, the determining unit 50 includes:
the calculating subunit is used for calculating the similarity between the detected sensitive words in the detection result obtained by the detection model and all the sensitive words of the dictionary tree of the corresponding detection unit;
a selecting subunit, configured to select a sensitive word with the highest similarity to the detected sensitive word as a target sensitive word of the detected sensitive word;
the first acquisition subunit is used for acquiring the priority of the target sensitive word;
the first determining subunit is used for determining the auditing score of the detection sensitive word according to the priority;
and the second determining subunit is used for determining whether the text to be checked is qualified or not according to the checking score.
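The similarity-and-priority scoring performed by these subunits can be sketched as follows; `SequenceMatcher` is a stand-in for whatever similarity measure the implementation uses, and the priority-to-score table is assumed:

```python
from difflib import SequenceMatcher

# assumed mapping from sensitive-word priority to audit score
PRIORITY_SCORE = {1: 10, 2: 5, 3: 2}

def score_detected_word(detected, dictionary):
    # dictionary: sensitive word (from the detection unit's tree) -> priority
    target = max(dictionary,
                 key=lambda w: SequenceMatcher(None, detected, w).ratio())
    # the most similar dictionary word becomes the target sensitive word,
    # and its priority determines the audit score
    return target, PRIORITY_SCORE[dictionary[target]]


dictionary = {"casino": 1, "lottery": 2}
target, score = score_detected_word("cas1no", dictionary)
```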
In an embodiment, the second determining subunit includes:
the summing module is used for summing the auditing scores of all the detection sensitive words corresponding to the detection result to obtain a corresponding target auditing score of the detection result;
the comparison module is used for comparing the target audit score with a preset score threshold;
and the determining module is used for determining that the text to be checked is qualified if the target audit score is smaller than the preset score threshold.
In an embodiment, the second determining subunit includes:
the labeling module is used for labeling the detection sensitive words in the text to be checked if the target audit score is larger than the preset score threshold;
and the input module is used for inputting the marked text to be checked into the manual checking channel.
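The second determining subunit's flow (sum the audit scores, compare with the threshold, and either pass the text or mark the hits for the manual audit channel) admits a minimal sketch; all names are illustrative:

```python
def decide(text, detected, threshold):
    # detected: list of (detection sensitive word, audit score) pairs
    total = sum(score for _, score in detected)
    if total < threshold:
        return {"qualified": True, "score": total}
    marked = text
    for word, _ in detected:
        # mark each hit so manual reviewers can locate it quickly
        marked = marked.replace(word, "[" + word + "]")
    return {"qualified": False, "score": total, "marked_text": marked}


res = decide("buy casino chips", [("casino", 10)], 5)
```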
In one embodiment, the first detecting unit 30 includes:
the first detection subunit is used for detecting the description scene of the text to be checked;
the second acquisition subunit is used for acquiring the scene sensitive dictionary of the detection unit under the description scene;
and the third determining subunit is used for determining whether the corresponding sensitive word exists in the text to be checked according to the scene sensitive dictionary.
In an embodiment, the text content auditing apparatus further includes:
and the ending unit is used for ending the auditing flow if a politically sensitive word or a drug-related sensitive word is detected.
In one embodiment, the second detecting unit 40 includes:
a second detection subunit, used for respectively inputting the text to be checked into the terrorism detection model, the pornography detection model, the abuse detection model and the advertisement detection model for detection to obtain corresponding detection results; the terrorism detection model is trained based on a FastText model; the pornography detection model, the abuse detection model and the advertisement detection model are each trained based on an Albert model.
In this embodiment, the specific implementation of each unit, sub-unit and module described in the above method embodiment is referred to in the above method embodiment, and will not be described herein again.
Referring to fig. 3, in an embodiment of the present application, there is further provided a computer device, which may be a server, and an internal structure thereof may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text content auditing method.
It will be appreciated by those skilled in the art that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture related to the present application and does not limit the computer devices to which the present application is applicable.
An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a text content auditing method.
In summary, the text content auditing method, apparatus, computer device and storage medium provided by the embodiments of the application acquire a text to be audited; perform word segmentation on the text to be audited through a preset word segmentation method; input the word-segmented text into preset detection units to detect whether corresponding sensitive words exist; if a detection unit does not detect the corresponding sensitive word, input the text to be audited into the corresponding preset detection model for detection to obtain a detection result, the detection model being trained based on a neural network model; and determine whether the text to be audited is qualified according to the detection result. In the text content auditing method provided by the application, the detection units perform a first screening that hits sensitive words with higher sensitive confidence. The detection models then perform a second screening that detects sensitive words in sensitive variants and in content, hitting sensitive words absent from the detection units' dictionaries, so that the detection result is more accurate and the labor cost of content auditing is greatly reduced.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by instructing the relevant hardware through a computer program stored on a non-volatile computer-readable storage medium; when executed, the program may comprise the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided by the present application may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or direct or indirect application in other related technical fields are included in the scope of the present application.

Claims (8)

1. A text content auditing method, comprising the steps of:
acquiring a text to be checked;
performing word segmentation processing on the text to be checked through a preset word segmentation processing method;
inputting the text to be checked after word segmentation into a preset detection unit to detect whether corresponding sensitive words exist;
if the detection unit does not detect the corresponding sensitive word, inputting the text to be checked into a detection model that detects sensitive words in sensitive variants and/or content, and obtaining a corresponding detection result; the detection model is trained based on a neural network model;
determining whether the text to be checked is qualified or not according to the detection result;
the step of determining whether the text to be checked is qualified according to the detection result comprises the following steps:
calculating the similarity between the detected sensitive words in the detection result obtained by the detection model and all the sensitive words of the dictionary tree of the corresponding detection unit;
selecting the sensitive word with the highest similarity with the detection sensitive word as the target sensitive word of the detection sensitive word;
acquiring the priority of the target sensitive word;
determining the auditing score of the detection sensitive word according to the priority;
determining whether the text to be checked is qualified or not according to the checking score;
the step of inputting the text to be checked after word segmentation into a preset detection unit to detect whether corresponding sensitive words exist or not, comprises the following steps:
detecting a description scene of the text to be checked;
acquiring a scene sensitive dictionary of the detection unit under the description scene;
and determining whether the corresponding sensitive word exists in the text to be checked according to the scene sensitive dictionary.
2. The text content auditing method according to claim 1, wherein the step of determining whether the text to be audited is acceptable according to the auditing score includes:
summing the auditing scores of all the detection sensitive words corresponding to the detection results to obtain corresponding target auditing scores of the detection results;
comparing the target audit score with a preset score threshold;
and if the target audit score is smaller than the preset score threshold, determining that the text to be checked is qualified.
3. The text content auditing method of claim 2, wherein after the step of comparing the target audit score with a preset score threshold, comprising:
if the target audit score is greater than the preset score threshold, marking the text to be audited according to the detection sensitive word;
and inputting the marked text to be checked into a manual checking channel.
4. The text content auditing method according to claim 1, wherein after the step of inputting the text to be audited after the word segmentation processing into a preset detection unit to detect whether a corresponding sensitive word exists, the method comprises the steps of:
if a politically sensitive word or a drug-related sensitive word is detected, ending the auditing flow.
5. The text content auditing method according to claim 1, wherein the detection model includes a terrorism detection model, a pornography detection model, an abuse detection model and an advertisement detection model, and the step of inputting the text to be audited into a preset detection model for detection to obtain a corresponding detection result includes:
inputting the text to be checked into the terrorism detection model, the pornography detection model, the abuse detection model and the advertisement detection model respectively for detection to obtain corresponding detection results; the terrorism detection model is trained based on a FastText model; the pornography detection model, the abuse detection model and the advertisement detection model are each trained based on an Albert model.
6. A text content auditing apparatus, comprising:
the acquisition unit is used for acquiring the text to be checked;
the word segmentation processing unit is used for carrying out word segmentation processing on the text to be checked through a preset word segmentation processing method;
the first detection unit is used for inputting the text to be checked after word segmentation into a preset detection unit to detect whether corresponding sensitive words exist;
the first detection subunit is used for detecting the description scene of the text to be checked;
the second acquisition subunit is used for acquiring the scene sensitive dictionary of the detection unit under the description scene;
a third determining subunit, configured to determine, according to the scene sensitivity dictionary, whether the corresponding sensitive word exists in the text to be checked;
the second detection unit is used for inputting the text to be checked into a detection model for detecting the same kind of sensitive words as the detection unit if the detection unit does not detect the corresponding sensitive words, so as to obtain a corresponding detection result; the detection model is trained based on a neural network model;
the determining unit is used for determining whether the text to be checked is qualified or not according to the detection result;
the calculating subunit is used for calculating the similarity between the detected sensitive words in the detection result obtained by the detection model and all the sensitive words of the dictionary tree of the corresponding detection unit;
a selecting subunit, configured to select a sensitive word with the highest similarity to the detected sensitive word as a target sensitive word of the detected sensitive word;
the first acquisition subunit is used for acquiring the priority of the target sensitive word;
the first determining subunit is used for determining the auditing score of the detection sensitive word according to the priority;
and the second determining subunit is used for determining whether the text to be checked is qualified or not according to the checking score.
7. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the text content auditing method of any of claims 1 to 5.
8. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the text content auditing method of any of claims 1 to 5.
CN202111012089.XA 2021-08-31 2021-08-31 Text content auditing method, device, computer equipment and storage medium Active CN113688630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111012089.XA CN113688630B (en) 2021-08-31 2021-08-31 Text content auditing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111012089.XA CN113688630B (en) 2021-08-31 2021-08-31 Text content auditing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113688630A CN113688630A (en) 2021-11-23
CN113688630B true CN113688630B (en) 2023-09-12

Family

ID=78584393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111012089.XA Active CN113688630B (en) 2021-08-31 2021-08-31 Text content auditing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113688630B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580397A * 2022-03-14 2022-06-03 中国工商银行股份有限公司 Method and system for detecting abusive comments
CN115129867A (en) * 2022-05-23 2022-09-30 广州趣丸网络科技有限公司 Text content auditing method, device, equipment and storage medium
CN115238044A (en) * 2022-09-21 2022-10-25 广州市千钧网络科技有限公司 Sensitive word detection method, device and equipment and readable storage medium
CN117332039A (en) * 2023-09-20 2024-01-02 鹏城实验室 Text detection method, device, equipment and storage medium
CN117435692A (en) * 2023-11-02 2024-01-23 北京云上曲率科技有限公司 Variant-based antagonism sensitive text recognition method and system
CN117349407B (en) * 2023-12-04 2024-01-30 江苏君立华域信息安全技术股份有限公司 Automatic detection method and system for content security

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299975A (en) * 2018-09-03 2019-02-01 拉扎斯网络科技(上海)有限公司 Characteristics of objects parameter determination method, device, electronic equipment and readable storage medium storing program for executing
CN110460636A (en) * 2019-07-05 2019-11-15 中国平安人寿保险股份有限公司 Data response method, device, computer equipment and storage medium
CN111506708A (en) * 2020-04-22 2020-08-07 上海极链网络科技有限公司 Text auditing method, device, equipment and medium
CN113282746A (en) * 2020-08-08 2021-08-20 西北工业大学 Novel network media platform variant comment confrontation text generation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9406040B2 (en) * 2014-08-13 2016-08-02 Sap Se Classification and modelling of exception types for integration middleware systems

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299975A (en) * 2018-09-03 2019-02-01 拉扎斯网络科技(上海)有限公司 Characteristics of objects parameter determination method, device, electronic equipment and readable storage medium storing program for executing
CN110460636A (en) * 2019-07-05 2019-11-15 中国平安人寿保险股份有限公司 Data response method, device, computer equipment and storage medium
CN111506708A (en) * 2020-04-22 2020-08-07 上海极链网络科技有限公司 Text auditing method, device, equipment and medium
CN113282746A (en) * 2020-08-08 2021-08-20 西北工业大学 Novel network media platform variant comment confrontation text generation method

Also Published As

Publication number Publication date
CN113688630A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN113688630B (en) Text content auditing method, device, computer equipment and storage medium
CN110209764B (en) Corpus annotation set generation method and device, electronic equipment and storage medium
CN109302410B (en) Method and system for detecting abnormal behavior of internal user and computer storage medium
CN109902285B (en) Corpus classification method, corpus classification device, computer equipment and storage medium
CN111738011A (en) Illegal text recognition method and device, storage medium and electronic device
WO2021047341A1 (en) Text classification method, electronic device and computer-readable storage medium
CN111046670B (en) Entity and relationship combined extraction method based on drug case legal documents
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN111858242A (en) System log anomaly detection method and device, electronic equipment and storage medium
CN112016313B (en) Spoken language element recognition method and device and warning analysis system
CN109271627A (en) Text analyzing method, apparatus, computer equipment and storage medium
CN113723288B (en) Service data processing method and device based on multi-mode hybrid model
CN111723870B (en) Artificial intelligence-based data set acquisition method, apparatus, device and medium
CN110543920B (en) Performance detection method and device of image recognition model, server and storage medium
CN113988061A (en) Sensitive word detection method, device and equipment based on deep learning and storage medium
CN112132238A (en) Method, device, equipment and readable medium for identifying private data
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN109660621A (en) A kind of content delivery method and service equipment
CN111786999B (en) Intrusion behavior detection method, device, equipment and storage medium
CN115982388B (en) Case quality control map establishment method, case document quality inspection method, case quality control map establishment equipment and storage medium
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device
CN112990147A (en) Method and device for identifying administrative-related images, electronic equipment and storage medium
CN117332084B (en) Machine learning method suitable for detecting malicious comments and false news simultaneously
CN112861757B (en) Intelligent record auditing method based on text semantic understanding and electronic equipment
CN114358153A (en) Data classification method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant