CN112052424A

CN112052424A - Content auditing method and device

Info

Publication number: CN112052424A
Application number: CN202011086925.4A
Authority: CN
Inventors: 刘飞
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-10-12
Filing date: 2020-10-12
Publication date: 2020-12-08

Abstract

The application provides a content auditing method and device, belongs to the technical field of computers, and relates to artificial intelligence and natural language processing technology. The content auditing method comprises the following steps: acquiring a text unit to be audited from the content to be audited; determining the auditing result of the text unit to be audited according to the semantic information of the text unit to be audited and the correspondence between semantic information and auditing results, where the correspondence is determined according to the semantic information of training text units and the audit labels of the training text units; determining an audit mark of the text unit to be audited according to its auditing result; and adding the audit mark to the text unit to be audited and displaying it.

Description

Content auditing method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a content auditing method and apparatus.

Background

Today, the number of media outlets keeps growing and the content of news and information apps (APP, Application) has become rich and varied; at the same time, a large amount of harmful information is produced, which seriously affects the user's reading experience. To filter out such harmful information effectively, the content generated every day must be audited.

Articles are generally checked for problems by rules or by a classification model. These approaches, however, usually judge the article as a whole and classify it as either a normal article or a problem article. This yields low auditing accuracy and makes omissions likely, and an auditor who rechecks the article must read it from beginning to end, which seriously reduces auditing efficiency.

Disclosure of Invention

In order to solve technical problems in the related art, embodiments of the present application provide a content auditing method and apparatus, which can improve the efficiency and accuracy of content auditing.

In order to achieve the above purpose, the technical solution of the embodiment of the present application is implemented as follows:

in one aspect, an embodiment of the present application provides a content auditing method, where the method includes:

acquiring a text unit to be checked in the content to be checked;

determining the auditing result of the text unit to be audited according to the semantic information of the text unit to be audited and the corresponding relation between the semantic information and the auditing result; the corresponding relation between the semantic information and the verification result is determined according to the semantic information of the training text unit and the verification label of the training text unit;

determining an audit mark of the text unit to be audited according to the audit result of the text unit to be audited;

and adding an audit mark in the text unit to be audited and displaying.

On the other hand, an embodiment of the present application further provides a content auditing apparatus, where the apparatus includes:

the acquiring unit is used for acquiring a text unit to be checked in the content to be checked;

the auditing unit is used for determining the auditing result of the text unit to be audited according to the semantic information of the text unit to be audited and the corresponding relation between the semantic information and the auditing result; the corresponding relation between the semantic information and the verification result is determined according to the semantic information of the training text unit and the verification label of the training text unit;

the marking unit is used for determining an audit mark of the text unit to be audited according to the audit result of the text unit to be audited; and adding an audit mark in the text unit to be audited.

In an optional embodiment, the auditing unit is specifically configured to:

segmenting the text unit to be audited into words to obtain a word segmentation set to be audited;

determining semantic information of each target word segmentation text in the word segmentation set to be audited and semantic information of each target single-character text in the text unit to be audited;

and determining the auditing result of the text unit to be audited according to the semantic information of each target single-character text in the text unit to be audited, the semantic information of the target word segmentation text in which the target single-character text is located, and the correspondence between semantic information and auditing results.

In an optional embodiment, the auditing unit is specifically configured to:

determining the word weight of each target word segmentation text in the text unit to be audited; the word weight reflects the importance of the target word segmentation text to the text unit to be audited;

determining the character weight of each target single-character text in the target word segmentation text according to the word weight of the target word segmentation text;

and determining the auditing result of the text unit to be audited according to each target single-character text in the text unit to be audited and its character weight, the target word segmentation text in which the target single-character text is located and its word weight, and the correspondence between semantic information and auditing results.

In an optional embodiment, the auditing unit is specifically configured to:

inputting the text unit to be checked into a word segmentation model for word segmentation to obtain a target word segmentation text;

inputting the text unit to be checked and the target word segmentation text in the text unit to be checked into a weight determination model, and determining the word weight of each target word segmentation text according to the occurrence frequency of each target word segmentation text in a text database and the occurrence frequency of each target word segmentation text in the text unit to be checked;

determining the input features of a text auditing model according to each target single-character text in the text unit to be audited and its character weight, and the target word segmentation text in which the target single-character text is located and its word weight;

inputting all input features into a trained text auditing model, and determining an auditing result of the text unit to be audited;

the text audit model is trained according to a training text unit and an audit label of the training text unit, model parameters are obtained through learning, and the training text unit comprises a positive training sample with an audit label being a positive label and a negative training sample with an audit label being a negative label.
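The feature construction described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `CharFeature` record and the `build_features` helper are hypothetical names, and the rule that a character's weight is its word's weight divided evenly over the word's characters is an assumption, since the text only states that the character weight is determined from the word weight.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharFeature:
    char: str           # target single-character text
    char_weight: float  # weight derived from the enclosing word's weight
    word: str           # target word segmentation text containing the character
    word_weight: float  # importance of that word to the text unit


def build_features(words: List[str], word_weights: List[float]) -> List[CharFeature]:
    """Pair every character with its weight and its enclosing segmented word.

    Assumption: the word weight is shared evenly among the word's characters.
    """
    features = []
    for word, w_weight in zip(words, word_weights):
        c_weight = w_weight / len(word)
        for ch in word:
            features.append(CharFeature(ch, c_weight, word, w_weight))
    return features


# Hypothetical segmentation result and word weights for one text unit.
words = ["内容", "审核", "方法"]
weights = [0.2, 0.6, 0.2]
feats = build_features(words, weights)
for f in feats:
    print(f)
```

Each record keeps the character together with the word it came from, so a downstream model receives both character-level and word-level information for the same position.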

In an optional embodiment, the system further includes a training unit, configured to jointly train the word segmentation model, the weight determination model, and the text review model.

In an optional embodiment, the training unit is specifically configured to train the word segmentation model, the weight determination model, and the text review model according to the following manners:

acquiring a training text sample, wherein the training text sample comprises a training text unit and an audit tag corresponding to the training text unit;

performing iterative training according to the training text sample and a text auditing model to be trained until iteration is terminated to obtain a text auditing model, wherein the text auditing model comprises the word segmentation model, the weight determination model and the text auditing model; wherein each iterative training process comprises:

inputting the training text unit into the word segmentation model for word segmentation to obtain a training word segmentation text;

inputting the training text unit and each training word segmentation text in the training text unit into the weight determination model, and determining the word weight of each training word segmentation text in the training text unit according to the occurrence frequency of the training word segmentation text in the text database and the occurrence frequency of the training word segmentation text in the training text unit;

determining the character weight of each training single character text in the training word segmentation text according to the weight of the training word segmentation text;

determining each training single character text in the training text unit and a training word segmentation text where the training single character text is located;

inputting, into the text auditing model, each training single-character text in the training text unit and its character weight, together with the training word segmentation text in which the training single-character text is located and its word weight;

and determining a loss function according to the audit tag output by the text audit model and the actual audit tag of the training text unit, and adjusting the model parameters of the text audit model based on the loss function.

In an optional embodiment, the auditing result of the text unit to be audited is the violation probability of the text unit to be audited;

the marking unit is specifically configured to:

determining a target probability interval corresponding to the violation probability of the text unit to be audited;

determining a target audit mark corresponding to the text unit to be audited according to the corresponding relation between the probability interval and the audit mark;

and marking the text unit to be checked by using the target checking mark.

On the other hand, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the content auditing method is implemented.

On the other hand, the embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and when the computer program is executed by the processor, the processor is enabled to implement the content auditing method.

In the content auditing method of the embodiments of the present application, after the content to be audited is obtained, it is divided into a plurality of text units to be audited. For each text unit to be audited, the auditing result is determined according to the semantic information of the text unit and the correspondence between semantic information and auditing results; this correspondence is determined according to the semantic information of training text units and their audit labels. An audit mark is then determined according to the auditing result, added to the text unit to be audited, and displayed. Because the content to be audited is divided into text units to be audited, for example by paragraph or by sentence with one paragraph or sentence forming one text unit, and each text unit is audited separately, it can be judged whether each individual text unit has a problem. When an article has a problem, the position of the problem is therefore identified explicitly, down to a specific paragraph or sentence, which improves auditing efficiency and accuracy.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without inventive effort.

Fig. 1 is a schematic diagram of an application architecture of a content auditing method in an embodiment of the present application;

Fig. 2 is a flowchart of a content auditing method according to an embodiment of the present application;

Fig. 3 is a flowchart of a content auditing method in a specific implementation process according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a content auditing apparatus according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The word "exemplary" is used hereinafter to mean "serving as an example, embodiment, or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The terms "first" and "second" are used herein for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature, and in the description of embodiments of the application, unless stated otherwise, "plurality" means two or more.

Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.

Artificial Intelligence (AI): a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, machine learning/deep learning, and the like.

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.

Machine Learning (ML): a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to give computers intelligence, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.

Pre-trained model: a large-scale neural network model obtained by unsupervised training on a very large pre-training corpus. When such a model is applied to a specific task, it needs to be fine-tuned on new data, possibly with new modules added. Examples include BERT (Bidirectional Encoder Representations from Transformers), OpenAI GPT (Generative Pre-Training), and GPT-2.

BERT: a language model generally used as the underlying infrastructure for tasks such as question answering, sentiment analysis, spam filtering, named entity recognition, and document clustering. The innovation of BERT is that it applies a bidirectional Transformer to language modeling, whereas earlier models read a text sequence from left to right or combined left-to-right and right-to-left training. BERT reads the entire text sequence at once, which enables the model to learn from the context on both sides of a word, making it effectively bidirectional. Experimental results show that a bidirectionally trained language model understands context more deeply than a unidirectional one.

Attention mechanism: a mechanism that imitates the internal process of biological observation, aligning internal experience with external sensation to increase the fineness of observation of a particular region. Attention mechanisms can quickly extract important features from sparse data, and are therefore widely used in natural language processing tasks, especially machine translation.

TF-IDF (term frequency-inverse document frequency): a common weighting technique used in information retrieval and data mining. TF is the term frequency and IDF is the inverse document frequency. TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between a document and a user query.
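The weighting described in this definition can be illustrated with a small sketch. It uses one common TF-IDF variant (relative term frequency and a logarithmic IDF with add-one smoothing in the denominator); the text does not say which variant is intended, and the function name `tf_idf` and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import List


def tf_idf(term: str, doc_tokens: List[str], corpus: List[List[str]]) -> float:
    """TF-IDF of `term` for one tokenized document within a tokenized corpus."""
    # Term frequency: share of the document's tokens that equal `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of corpus documents containing `term`.
    docs_with_term = sum(1 for d in corpus if term in d)
    # Add-one smoothing in the denominator avoids division by zero.
    idf = math.log(len(corpus) / (1 + docs_with_term))
    return tf * idf


# Toy corpus, already tokenized; the contents are made up for illustration.
corpus = [
    ["free", "prize", "click", "now"],
    ["weather", "report", "for", "today"],
    ["click", "here", "for", "the", "report"],
    ["meeting", "notes", "for", "today"],
]
doc = corpus[0]
print(tf_idf("prize", doc, corpus))  # appears in one document only
print(tf_idf("click", doc, corpus))  # appears in two documents
```

Both words occur once in the first document, but "prize" is rarer across the corpus and therefore receives the higher weight.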

The present application will be described in further detail with reference to the following drawings and specific embodiments.

In order to effectively filter spam in network content, a commonly used strategy at present mainly identifies problem articles according to the following two ways, and then manually reviews the problem articles:

1. Rule-based: keywords with obvious characteristics are summarized manually or mined by machine, a keyword library is built from them, and articles are filtered and audited against the keyword library.

2. Machine classification model: the article is judged by a machine learning method. After passing through the machine classification model, an article is determined to be either a problem article or a normal article; if it is identified as a problem article, it is sent to a designated module for manual review.

In the auditing process, the articles are only divided into normal articles and articles with problems, and the specific location of the problems is not explicitly pointed out for the articles with problems, so that when auditors audit, the articles need to be read throughout, the efficiency of the auditors is seriously influenced, and the misjudgment probability of the auditors in the auditing process is increased. Meanwhile, some problematic articles are easy to miss, and the problematic articles can be misjudged as normal articles because the identified score does not reach the specified threshold. In addition, in the process of examining and verifying the article, whether the article has a problem or not can be marked only on the whole article, a sentence or a paragraph of a specific problem cannot be clearly marked, and the marking data of the problem article is not detailed enough, so that the data cannot be fully utilized.

In order to solve the problem of low auditing efficiency and accuracy caused by a content auditing method in the related art, the embodiment of the application provides a content auditing method and device. The embodiment of the application relates to artificial intelligence and machine learning technology, and is designed based on natural language processing technology and machine learning in the artificial intelligence.

In the process of auditing the content to be audited, the content is divided to obtain text units to be audited, each text unit is audited separately, and the text unit is input into a text auditing model to determine its auditing result. To increase auditing accuracy, before the text unit to be audited is input into the text auditing model, the text unit is segmented into words, the word weight of each segmented word and the character weight of each character in the segmented word are calculated, and the input of the text auditing model is set to the combination of each character and its corresponding segmented word. This increases the amount of text information available to the text auditing model and improves its accuracy in recognizing text semantics. After the auditing result of the text unit is determined, the marking mode of the text unit is determined according to the auditing result. Different marking modes indicate different violation probabilities, so the position of violating content within the content to be audited is clearly identified, the final auditing decision is easier to make, and auditing efficiency and accuracy are improved.

Fig. 1 is a schematic view of an application architecture of a content auditing method according to an embodiment of the present application, including a server 100 and a terminal device 200.

The terminal device 200 may be a mobile or a fixed electronic device. For example, a mobile phone, a tablet computer, a notebook computer, a desktop computer, various wearable devices, a smart television, a vehicle-mounted device, or other electronic devices capable of implementing the above functions may be used. The terminal device 200 can display the contents of articles, short messages and the like to be checked to the user, send the contents to be checked to the server 100, receive the checking result sent by the server 100 and display the result to the user.

The terminal device 200 and the server 100 can be connected via a network to communicate with each other. Optionally, the network uses standard communication techniques and/or protocols. The network is typically the Internet, but can be any network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), or any combination of mobile, wireline or wireless networks, private networks, or virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.

The server 100 may provide various network services for the terminal device 200, and the server 100 may perform information processing using a cloud computing technology. The server 100 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

Specifically, the server 100 may include a processor 110 (CPU), a memory 120, an input device 130, an output device 140, and the like, the input device 130 may include a keyboard, a mouse, a touch screen, and the like, and the output device 140 may include a Display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), and the like.

Memory 120 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides processor 110 with program instructions and data stored in memory 120. In the embodiment of the present invention, the memory 120 may be used to store a program of the content auditing method in the embodiment of the present invention.

The processor 110 is configured to execute the steps of any content auditing method according to the embodiments of the present invention according to the obtained program instructions by calling the program instructions stored in the memory 120.

It should be noted that, in the embodiment of the present invention, the content auditing method is mainly executed by the server 100, for example, for the content auditing method, the terminal device 200 may send the content to be audited to the server 100, and the server 100 audits the content to be audited, labels in the content to be audited according to the auditing result, and returns the labeling result to the terminal device 200. As shown in fig. 1, the application architecture is described by taking application to the server 100 side as an example, of course, the content auditing method in the embodiment of the present invention may also be executed by the terminal device 200, for example, the terminal device 200 may obtain a trained text auditing model from the server 100 side, so as to audit the content to be audited based on the text auditing model, and show the auditing result to the user in a form of a label, which is not limited in the embodiment of the present invention.

In addition, the application architecture diagram in the embodiment of the present invention is intended to illustrate the technical solution in the embodiment of the present invention more clearly and does not limit the technical solution provided in the embodiment of the present invention. The solution is of course not limited to the content auditing service application; for other application architectures and service applications, the technical solution provided in the embodiment of the present invention is also applicable to similar problems.

The various embodiments of the present invention are schematically illustrated as applied to the application architecture diagram shown in fig. 1.

Fig. 2 shows a flowchart of a content auditing method according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:

step S201, a to-be-audited text unit in the to-be-audited content is obtained.

If the content to be audited is not the text content, the content to be audited in other forms can be converted into the text for content audit, and then the audit is performed.

In the embodiment of the application, the obtaining mode of the content to be checked is not limited, for example, the content to be checked can be input into the terminal equipment by a user through an input device such as a keyboard, and then the content to be checked is sent to the server by the terminal equipment; or the server may acquire the article from the network and check the acquired text.

In a specific implementation process, the content to be audited is split according to a set rule to obtain a plurality of text units to be audited. For example, the content to be audited may be divided by paragraph to obtain a plurality of paragraphs to be audited, where each paragraph is a text unit to be audited; or the content may be divided by sentence to obtain a plurality of sentences to be audited, where each sentence is a text unit to be audited. In the embodiment of the application, because the paragraph granularity is large, the text units to be audited are determined by splitting on sentences, that is, on sentence-ending punctuation such as a period or an exclamation mark.
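Sentence-level splitting of this kind can be sketched with a regular expression. The set of sentence-ending punctuation below (Chinese and Western period, exclamation mark, and question mark) and the function name `split_into_units` are assumptions for illustration; the text only names the period and the exclamation mark as examples.

```python
import re
from typing import List

# One unit = a run of non-terminator characters plus any terminators after it.
# The chosen terminator set is an assumption, not taken from the source.
_UNIT = re.compile(r"[^。！？.!?]+[。！？.!?]*")


def split_into_units(content: str) -> List[str]:
    """Split content into sentence-level text units to be audited."""
    return [m.strip() for m in _UNIT.findall(content) if m.strip()]


units = split_into_units("Buy now! This is fine. 今天天气很好。点击领奖！")
for u in units:
    print(u)
```

A pattern this simple also splits on periods inside numbers or abbreviations (for example "3.5"), so a production splitter would need additional rules.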

Step S202, according to the semantic information of the text unit to be audited and the corresponding relation between the semantic information and the audit result, the audit result of the text unit to be audited is determined.

The correspondence between semantic information and auditing results is determined according to the semantic information of the training text units and the audit labels of the training text units.

In a specific implementation process, a text auditing model may be used to determine the auditing result of a text unit to be audited. The text auditing model may be any classification algorithm, such as fastText (a fast text classifier), an LR (logistic regression) classifier, or a Support Vector Machine (SVM), or a deep learning algorithm, such as TextCNN (Text Convolutional Neural Network), LSTM (Long Short-Term Memory network), or BERT. Preferably, BERT is used: it can fully learn the semantic features of an article and deeply understand its meaning, and, compared with traditional models, it requires no manual feature screening, which reduces the interference of human factors.

Step S203, according to the auditing result of the text unit to be audited, the auditing mark of the text unit to be audited is determined.

In a specific implementation process, the audit mark may take various forms, such as a highlight color, an added comment, or a bold font. Generally, when a text unit to be audited is problematic or violates the rules, it is marked so that the position of the problematic text unit in the article can be located quickly.

Step S204, adding the audit mark to the text unit to be audited and displaying it.

Further, the auditing result of the text unit to be audited is the violation probability of the text unit to be audited.

The determining an audit mark of the to-be-audited text unit according to the audit result of the to-be-audited text unit includes:

and marking and displaying the text unit to be checked by using the target checking mark.

In a specific implementation process, in order to distinguish the degree of violation of the text unit to be audited, violation probability intervals can be set and a correspondence between the probability intervals and audit marks established, with different probability intervals corresponding to different audit marks. Different audit marks may be different background colors, different font sizes, and so on, without limitation. When a problematic text unit to be audited is marked, the target probability interval is determined from the violation probability of the text unit, the target audit mark corresponding to that interval is then determined, and the text unit is marked with the target audit mark and displayed to the user, so that the user can quickly distinguish how severely the text unit violates the rules.
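The correspondence between probability intervals and audit marks can be sketched as a simple lookup. The interval boundaries (0.3 and 0.7), the color names and the function name `target_audit_mark` are hypothetical; the embodiment does not fix any of them.

```python
from bisect import bisect_right

# Hypothetical interval boundaries and marks; the embodiment does not fix these values.
BOUNDARIES = [0.3, 0.7]            # intervals: [0, 0.3), [0.3, 0.7), [0.7, 1]
MARKS = ["blue", "yellow", "red"]  # one audit mark per probability interval

def target_audit_mark(violation_probability: float) -> str:
    """Return the audit mark of the interval that contains the violation probability."""
    if not 0.0 <= violation_probability <= 1.0:
        raise ValueError("violation probability must lie in [0, 1]")
    # bisect_right counts the boundaries that are <= the probability,
    # which is exactly the index of the interval it falls into.
    return MARKS[bisect_right(BOUNDARIES, violation_probability)]
```

Keeping the boundaries and marks in two parallel lists means a further interval can be added by extending both lists, without touching the lookup itself.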

Generally, the input to BERT is in the form of single characters, that is, the text is divided into individual characters that serve as the model input. This input mode weakens the role of phrases and easily introduces ambiguity. To avoid the problem of ambiguous word boundaries, the embodiment of the application introduces word segmentation results into the model input. In step S202, determining the audit result of the text unit to be audited according to the semantic information of the text unit to be audited and the correspondence between the semantic information and the audit result includes:

determining semantic information of each target participle text in a participle set to be checked and semantic information of each target single character text in a text unit to be checked;

The text unit to be audited is segmented into words, and auditing is performed on both the segmented words and the single characters, which increases the information available for semantic recognition and improves detection accuracy. In a specific implementation process, before the text unit to be audited is input into the text audit model, it is segmented into words; the specific segmentation method is not limited. Segmenting the text unit to be audited yields a word segmentation set to be audited, which comprises a plurality of target word segmentation texts. For each target single character text, the target single character text and the target word segmentation text in which it is located are used together as input to the text audit model.

The specific input mode may be to splice the target word segmentation text after the target single character text to form one input, to splice the target word segmentation text before the target single character text to form one input, or to combine the target single character text and the target word segmentation text into one input in some other way. For example, when the word is spliced after the character, the four-character target word segmentation text "living habits" yields four inputs, each consisting of one of its characters followed by the whole word.
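The first splicing mode (word placed after the character) can be sketched as follows. The segmentation result is assumed to be given, since the embodiment does not limit the segmentation method; the function name `build_inputs` is hypothetical, and "生活习惯" is an assumed original-language form of the word rendered above as "living habits".

```python
from typing import List

def build_inputs(segmented_words: List[str]) -> List[str]:
    """For every character, emit the character followed by the word that contains it."""
    inputs = []
    for word in segmented_words:
        for char in word:
            # Splice the target word segmentation text after the single character text.
            inputs.append(char + word)
    return inputs

# Assumed segmentation result for one short text unit.
inputs = build_inputs(["生活习惯", "很", "重要"])
```

Under these assumptions, the four-character word contributes four inputs, one per character, and a single-character word contributes one input in which the character and the word coincide.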

For specific target word segmentation texts and target single character texts, the corresponding input features can be determined by word embedding. Word embedding is a method of converting the words in a text into numeric vectors; standard machine learning algorithms require numeric input, so the converted vectors are what the algorithms take as input. Embedding is in effect a mapping from semantic space to vector space, in which two words with similar semantics lie relatively close to each other. The word embedding process embeds a high-dimensional space, whose dimension equals the number of all words, into a continuous vector space of much lower dimension; each word or phrase is mapped to a vector over the real numbers, and the resulting word vectors are the output of word embedding.

Word embedding methods include one-hot encoding, the Word2Vec (word to vector) algorithm, the GloVe (Global Vectors for Word Representation) algorithm, and the like. The embodiment of the present application may convert text into input features using any word embedding method, which is not limited here.

Furthermore, the embodiment of the application also assigns weights to the segmented words and the single characters, so that the dominant role of key words is highlighted and the accuracy of semantic judgment is improved. In an optional embodiment, before the audit result of the text unit to be audited is determined in step S202, the method further includes:

determining the word weight of each target word segmentation text in a text unit to be checked; the word weight reflects the importance degree of the target word segmentation text to the text unit to be audited;

and determining the character weight of each target single character text in the target word segmentation text according to the weight of the target word segmentation text.

Generally, when semantic recognition is performed on a text unit to be audited, if the text is merely split into segmented words or single characters, every segmented word or single character carries the same weight, so the dominant role of core words in the semantics cannot be highlighted. Since core words best express the viewpoint or subject of an article, the embodiment of the present application assigns weights to the target word segmentation texts by using the TFIDF method in order to highlight their dominant role.

The main idea of TFIDF is: if a word or phrase appears with a high term frequency (TF) in one article and rarely appears in other articles, the word or phrase is considered to have good category-distinguishing capability and to be suitable for classification. Here, TF (Term Frequency) is the frequency with which a term occurs in document d. IDF (Inverse Document Frequency) measures the general importance of a term, and its main idea is: the fewer documents contain the term t, that is, the smaller n is, where n denotes the number of documents containing t, the larger the IDF and the better the category-distinguishing capability of t. If m documents of a certain class C contain the term t and k documents of the other classes contain t, then the number of documents containing t is n = m + k. When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which suggests that the term t does not distinguish categories well. In practice, however, if a term appears frequently in the documents of one class, it represents the characteristics of that class of text well; such terms should be given a higher weight and selected as feature words of that class to distinguish it from documents of other classes.
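The TF and IDF quantities described above can be computed as in the following minimal sketch. The embodiment does not state which IDF formula it uses; the smoothed form log((1 + N) / (1 + n)) + 1, with N the total number of documents and n the number of documents containing the term, is one common variant and is an assumption here, as is the function name `tfidf_weights`.

```python
import math
from typing import Dict, List

def tfidf_weights(unit_words: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Return a TFIDF weight for every distinct word of one text unit.

    TF  = occurrences of the word in the unit / number of words in the unit
    IDF = log((1 + N) / (1 + n)) + 1  (assumed smoothed variant, always positive)
    """
    total_docs = len(corpus)
    weights = {}
    for word in set(unit_words):
        tf = unit_words.count(word) / len(unit_words)
        containing = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + total_docs) / (1 + containing)) + 1
        weights[word] = tf * idf
    return weights

# Toy corpus: "a" occurs in every document, "b" in only one.
corpus = [["a", "b"], ["a", "c"], ["a", "d"]]
weights = tfidf_weights(["a", "b"], corpus)
```

In this toy corpus both words have the same term frequency in the unit, but "b" receives the larger weight because fewer documents contain it.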

In the embodiment of the application, the word weight of each target word segmentation text is determined by the TFIDF method. The character weight of each target single character text in the target word segmentation text is then derived from the word weight of that target word segmentation text. For example, the word weight of the target word segmentation text may be assigned directly as the character weight of each target single character text, that is, the character weight of a target single character text equals the word weight of the target word segmentation text in which it is located. Preferably, in the embodiment of the present application, the word weight of the target word segmentation text is divided evenly among the target single character texts in it. For example, if the word weight of the target word segmentation text "living habits" is 0.4, the character weight of each of its four target single character texts is 0.1.
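Both ways of deriving a character weight from a word weight (direct copy and even division) can be sketched as follows. The function name `character_weights` and the `evenly` flag are hypothetical, and "生活习惯" is an assumed original-language form of the four-character word rendered above as "living habits".

```python
from typing import Dict, List, Tuple

def character_weights(word_weights: Dict[str, float],
                      evenly: bool = True) -> List[Tuple[str, str, float]]:
    """Return (character, containing word, character weight) triples.

    evenly=True  divides the word weight equally among the characters of the word;
    evenly=False copies the word weight to every character unchanged.
    """
    result = []
    for word, weight in word_weights.items():
        if not word:
            continue  # an empty word has no characters to weight
        char_weight = weight / len(word) if evenly else weight
        for char in word:
            result.append((char, word, char_weight))
    return result

triples = character_weights({"生活习惯": 0.4})
```

With a word weight of 0.4 on a four-character word, even division gives each character 0.1, matching the example above.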

In this case, determining the audit result of the text unit to be audited according to the semantic information of each target single character text in the text unit to be audited, the semantic information of the target word segmentation text in which the target single character text is located, and the correspondence between the semantic information and the audit result includes:

determining the audit result of the text unit to be audited according to each target single character text in the text unit to be audited and its character weight, the target word segmentation text in which the target single character text is located and its word weight, and the correspondence between the semantic information and the audit result.

In a specific implementation process, the word weight of the target word segmentation text can be multiplied by the word vector of the target word segmentation text, and the character weight of the target single character text can be multiplied by the character vector of the target single character text, so that the word weight and the character weight are introduced into the model input.
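Multiplying each vector by its weight and splicing the two results into one input can be sketched in plain Python. The vectors below are toy values rather than real embeddings, the function names are hypothetical, and the splice order (weighted word vector after weighted character vector) is one of the input modes the description allows.

```python
from typing import List

def scale(vector: List[float], weight: float) -> List[float]:
    """Multiply every component of a vector by a scalar weight."""
    return [weight * component for component in vector]

def weighted_input(char_vector: List[float], char_weight: float,
                   word_vector: List[float], word_weight: float) -> List[float]:
    """Splice the weighted word vector after the weighted character vector."""
    return scale(char_vector, char_weight) + scale(word_vector, word_weight)

# Toy two-dimensional vectors with a character weight of 0.1 and a word weight of 0.4.
feature = weighted_input([1.0, 2.0], 0.1, [3.0, 4.0], 0.4)
```

The resulting feature has the combined length of both vectors, so the model input dimension in this sketch is the sum of the character and word embedding dimensions.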

Further, the embodiment of the application uses models to perform word segmentation, weight assignment and text auditing. Specifically, segmenting the text unit to be audited into words to obtain the target word segmentation text includes:

inputting a text unit to be checked into a word segmentation model for word segmentation to obtain a target word segmentation text;

determining the word weight of the target word segmentation text in the text unit to be checked, wherein the word weight comprises the following steps:

determining an auditing result of a text unit to be audited, comprising:

determining the input characteristics of a text auditing model according to each target single word text in a text unit to be audited, the word weight corresponding to the target single word text, and the word weight corresponding to the target word segmentation text and the target word segmentation text where the target single word text is located;

inputting all input features into the trained text auditing model, and determining the auditing result of the text unit to be audited;

the text auditing model is trained according to a training text unit and an auditing label of the training text unit, model parameters are obtained through learning, and the training text unit comprises a positive training sample with an auditing label being a positive label and a negative training sample with an auditing label being a negative label.

In a specific implementation process, after the server obtains the content to be checked, the divided text units to be checked are directly input into the algorithm model, and the algorithm model calculates and determines the checking result of the text units to be checked and outputs, for example, the violation probability of the text units to be checked. The algorithmic model may include a segmentation model, a weight determination model, and a text review model.

The text auditing model in the embodiment of the application is BERT. BERT reads the whole text unit to be audited at once, and its input is the combination of the character vector of each target single character text in the text unit to be audited and the word vector of the corresponding target word segmentation text. BERT learns from the context on both sides of each input, that is, it is bidirectional. Since text auditing in the embodiment of the application is in essence a classification task, a classification layer is added on the output of the BERT Transformer to classify the text unit to be audited and determine its violation probability.
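The role of the added classification layer can be illustrated with a minimal sketch: a linear layer followed by softmax over two classes, applied to a pooled output vector. The pooled vector, the layer parameters and the function names are all hypothetical; in practice the parameters would be learned during training, and the embodiment does not specify the layer's exact form.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def violation_probability(pooled: List[float],
                          weights: List[List[float]],
                          bias: List[float]) -> float:
    """Linear layer plus softmax; row 0 = compliant class, row 1 = violating class."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)[1]

# Hypothetical pooled encoder output and hand-picked layer parameters.
probability = violation_probability([1.0, 0.0], [[0.0, 0.0], [2.0, 0.0]], [0.0, 0.0])
```

The returned value lies in (0, 1) and can be compared against probability intervals to select an audit mark.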

BERT is trained on training text units and their audit labels, and its model parameters are obtained through learning; the training text units include positive training samples, whose audit label is a positive label, and negative training samples, whose audit label is a negative label.

In a specific implementation process, the training text units are article data that have been audited by auditors and include both positive and negative training samples, which ensures the coverage of the model and the diversity of the data. Unlike earlier training data, the training text units in the embodiment of the application take the same form as the text units to be audited: if the text unit to be audited is a paragraph, the training text unit is also a paragraph; if the text unit to be audited is a sentence, the training text unit is also a sentence. Earlier approaches extracted a series of keywords from an article; by comparison, the training mode in the embodiment of the application preserves the complete semantic information of the text content.

Further, the word segmentation model, the weight determination model and the text auditing model in the embodiment of the application are jointly trained. Specifically, the word segmentation model, the weight determination model and the text review model are trained according to the following modes:

acquiring a training text sample, wherein the training text sample comprises a training text unit and an audit label corresponding to the training text unit;

performing iterative training according to the training text sample and an algorithm model to be trained until the iteration terminates, to obtain the trained algorithm model, wherein the algorithm model comprises the word segmentation model, the weight determination model and the text auditing model; wherein each iterative training process comprises:

inputting the training text unit into a word segmentation model for word segmentation to obtain a training word segmentation text;

inputting a training text unit and each training word segmentation text in the training text unit into a weight determination model, and determining the word weight of each training word segmentation text in the training text unit according to the occurrence frequency of the training word segmentation text in a text database and the occurrence frequency of the training word segmentation text in the training text unit;

determining each training single character text in a training text unit and a training word segmentation text where the training single character text is located;

inputting each training single character text in the training text unit and its character weight, together with the training word segmentation text in which the training single character text is located and its word weight, into the text auditing model;

The following describes, by way of specific examples, implementation processes of the content auditing method provided in the embodiments of the present application. The specific implementation process is shown in fig. 3.

The auditor obtains the article to be audited from the network through the terminal device and sends it to the background server through the terminal device, and the background server audits the article to be audited.

The background server splits the article to be audited into sentences to obtain a plurality of sentences to be audited.

The background server inputs each sentence to be audited into the algorithm model. The algorithm model comprises a word segmentation model, a weight determination model and a text auditing model.

The word segmentation model segments each sentence to be examined to obtain a word segmentation set to be examined, wherein the word segmentation set to be examined comprises a plurality of target word segmentation texts.

The word segmentation set to be audited is input into the weight determination model, which uses the TFIDF method to determine the word weight of each target word segmentation text in the sentence to be audited and then determines the character weight of each target single character text in the target word segmentation text. The weight determination model converts each target single character text into a character vector and multiplies it by the character weight, and likewise converts the target word segmentation text in which the target single character text is located into a word vector and multiplies it by the word weight. The weighted word vector is spliced after the weighted character vector, and the result is used as one input of the text auditing model.

The text auditing model determines the violation probability of the sentence to be audited according to the input.

The background server determines the target audit mark of the sentence to be audited according to the violation probability. For example, if the violation probability of the sentence to be audited is high, the corresponding font color is set to red; if it is medium, the font color is set to yellow; and if it is low, the font color is set to blue.

After the background server determines the target audit mark of each sentence to be audited in the article to be audited, it sends the article to be audited and the audit result to the terminal device, which displays them to the auditor. In addition, the sentences audited by the auditor can be used as training data to train the algorithm model.

Corresponding to the method embodiment, the embodiment of the application also provides a content auditing device. Fig. 4 is a schematic structural diagram of a content auditing apparatus according to an embodiment of the present application; as shown in fig. 4, the content auditing apparatus includes:

an obtaining unit 401, configured to obtain a unit of text to be checked in the content to be checked;

an auditing unit 402, configured to determine an auditing result of the to-be-audited text unit according to the semantic information of the to-be-audited text unit and a corresponding relationship between the semantic information and the auditing result; the corresponding relation between the semantic information and the verification result is determined according to the semantic information of the training text unit and the verification label of the training text unit;

a marking unit 403, configured to determine an audit mark of the to-be-audited text unit according to an audit result of the to-be-audited text unit; and adding an audit mark in the text unit to be audited.

In an optional embodiment, the auditing unit 402 is specifically configured to:

In an optional embodiment, a training unit 404 is further included, configured to train the word segmentation model, the weight determination model, and the text review model jointly.

In an alternative embodiment, the training unit 404 is specifically configured to train the word segmentation model, the weight determination model and the text review model according to the following manners:

the marking unit 403 is specifically configured to:

and marking the text unit to be checked by using the target checking mark.

Corresponding to the method embodiment, the embodiment of the application also provides the electronic equipment.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure; as shown in fig. 5, the electronic device 50 in the embodiment of the present application includes: a processor 51, a display 52, a memory 53, an input device 56, a bus 55, and a communication device 54; the processor 51, the memory 53, the input device 56, the display 52 and the communication device 54 are all connected by a bus 55, the bus 55 being used for data transmission between the processor 51, the memory 53, the display 52, the communication device 54 and the input device 56.

The memory 53 may be configured to store software programs and modules, such as program instructions/modules corresponding to the content auditing method in the embodiment of the present application, and the processor 51 executes various functional applications and data processing of the electronic device 50, such as the content auditing method provided in the embodiment of the present application, by running the software programs and modules stored in the memory 53. The memory 53 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program of at least one application, and the like; the stored data area may store data created from use of the electronic device 50 (e.g., training samples, feature extraction networks), and the like. Further, the memory 53 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

The processor 51 is the control center of the electronic device 50; it connects the various parts of the entire electronic device 50 by using the bus 55 and various interfaces and lines, and performs the various functions of the electronic device 50 and processes data by running or executing the software programs and/or modules stored in the memory 53 and calling the data stored in the memory 53. Optionally, the processor 51 may include one or more processing units, such as a CPU, a GPU (Graphics Processing Unit), a digital processing unit, and the like.

In the embodiment of the present application, the processor 51 presents the text unit to be audited, with the audit mark added, to the user via the display 52.

The input device 56 is mainly used for obtaining input operations of a user, and when the electronic device is different, the input device 56 may be different. For example, when the electronic device is a computer, the input device 56 may be a mouse, a keyboard, or other input device; when the electronic device is a portable device such as a smart phone or a tablet computer, the input device 56 may be a touch screen.

The embodiment of the present application further provides a computer storage medium, where computer-executable instructions are stored in the computer storage medium, and the computer-executable instructions are used to implement the content auditing method according to any embodiment of the present application.

In some possible embodiments, the various aspects of the content auditing method provided in this application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of the content auditing method according to various exemplary embodiments of this application described above in this specification when the program product is run on the computer device, for example, the computer device may perform the content auditing procedure in steps S201-S204 shown in fig. 2.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto; any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application.

Claims

1. A method for content auditing, the method comprising:

acquiring a text unit to be checked in the content to be checked;

according to the semantic information of the text unit to be audited, performing quality analysis on the semantic information of the text unit to be audited, and determining the auditing result of the text unit to be audited;

and adding an audit mark in the text unit to be audited.

2. The method according to claim 1, wherein the performing quality analysis on the semantic information of the text unit to be checked according to the semantic information of the text unit to be checked to determine a checking result of the text unit to be checked comprises:

and determining the auditing result of the text unit to be audited according to the semantic information of each target single word text in the text unit to be audited and the semantic information of the target word segmentation text in which the target single word text is positioned.

3. The method of claim 2,

before determining the auditing result of the text unit to be audited according to the semantic information of each target single character text in the text unit to be audited and the semantic information of the target participle text where the target single character text is located, the method further comprises the following steps:

determining the auditing result of the text unit to be audited according to the semantic information of each target single character text in the text unit to be audited and the semantic information of the target word segmentation text where the target single character text is located, wherein the auditing result comprises the following steps:

and determining the auditing result of the text unit to be audited according to the word weight corresponding to each target single word text and the target single word text in the text unit to be audited, and the word weight corresponding to the target word segmentation text in which the target single word text is located and the target word segmentation text.

4. The method of claim 3,

the segmenting words of the text unit to be audited to obtain a segmented word set to be audited includes:

the determining the word weight of the target word segmentation text in the text unit to be audited includes:

the method for determining the auditing result of the text unit to be audited according to the word weight corresponding to each target single word text and each target single word text in the text unit to be audited and the word weight corresponding to the target word segmentation text and the target word segmentation text where the target single word text is located includes:

5. The method of claim 4, wherein the word segmentation model, the weight determination model, and the text review model are trained jointly.

6. The method of claim 4, wherein the word segmentation model, the weight determination model, and the text review model are trained according to:

7. The method according to claim 6, wherein after determining the audit mark of the text unit to be audited according to the audit result of the text unit to be audited, the method further comprises:

and taking the text unit to be audited as a training text unit, taking the audit mark of the text unit to be audited as an audit label of the training text unit, and inputting the text audit model to perform iterative training.

8. The method according to any one of claims 1 to 7, wherein the result of the examination of the text unit to be examined is the violation probability of the text unit to be examined;

9. A content auditing apparatus, characterized in that the apparatus comprises:

the auditing unit is used for performing quality analysis on the semantic information of the text unit to be audited according to the semantic information of the text unit to be audited and determining the auditing result of the text unit to be audited; the corresponding relation between the semantic information and the verification result is determined according to the semantic information of the training text unit and the verification label of the training text unit;

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1 to 8 are performed when the program is executed by the processor.

11. A computer-readable storage medium, having stored thereon a computer program executable by a computer device, for causing the computer device to perform the steps of the method of any one of claims 1 to 8, when the program is run on the computer device.