CN114547435A - Content quality identification method, device, equipment and readable storage medium

Info

Publication number: CN114547435A
Application number: CN202111664299.7A
Authority: CN
Inventors: 刘刚
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-11-24
Filing date: 2020-11-24
Publication date: 2022-05-27

Abstract

The application is a divisional application of Chinese application 202011329266.2. The application discloses a content quality identification method, apparatus, device and readable storage medium, and relates to the field of machine learning. The method comprises: acquiring target information flow content; obtaining comment data of the target information flow content; performing intention identification on the comment data to obtain a comment intention identification result; and determining a quality result of the target information flow content based on the comment intention identification result. By performing intention identification on the comment data, whether the target information flow content has a quality problem is identified from the intention expressed by the comment data, so that fine-grained quality problems possibly present in the target information flow content are identified. This avoids the heavy analysis workload and low analysis accuracy incurred when fine-grained quality problems must be found through detailed analysis of the specific content, which would otherwise lower the overall quality of the recommended content, and it improves the quality of the content in the recommendation pool.

Description

Content quality identification method, device, equipment and readable storage medium

The application is a divisional application of the Chinese application with application number 202011329266.2, filed on November 24, 2020, and entitled "Method, device, equipment and readable storage medium for identifying content quality".

Technical Field

The embodiment of the application relates to the field of machine learning, in particular to a method, a device, equipment and a readable storage medium for identifying content quality.

Background

In the field of artificial intelligence, before User Generated Content (UGC) is pushed, the UGC needs to be filtered and processed, and homogeneous content needs to be de-duplicated; if the UGC meets a filtering condition, it is removed from the recommendation pool and is not recommended.

In the related art, text such as the title and body of the UGC is analyzed through machine learning and Natural Language Processing (NLP) in order to identify and filter out the UGC that needs to be filtered.

However, UGC also exhibits many fine-grained quality problems that the above method cannot identify, so that the UGC in the recommendation pool is of low quality and the accuracy of content recommendation is low.

Disclosure of Invention

The embodiment of the application provides a method, a device and equipment for identifying content quality and a readable storage medium, which can improve the accuracy of content recommendation. The technical scheme is as follows:

in one aspect, a method for identifying content quality is provided, where the method includes:

acquiring target information flow content, wherein the target information flow content is to-be-quality-identified content;

obtaining comment data of the target information flow content, wherein the comment data are data generated when comment interaction is carried out on the target information flow content through a comment account;

performing intention recognition on the comment data to obtain a comment intention recognition result, wherein the comment intention recognition result is used for indicating a matching relation between the comment data and a candidate quality recognition result, and the candidate quality recognition result is used for indicating the quality condition of the target information stream content;

and determining a quality result of the target information flow content based on the comment intention identification result.

In another aspect, an apparatus for identifying content quality is provided, the apparatus including:

an acquisition module, configured to acquire target information flow content, where the target information flow content is content to be subjected to quality identification;

the acquisition module is further configured to acquire comment data of the target information stream content, where the comment data is data generated when a comment account performs comment interaction on the target information stream content;

the identification module is used for carrying out intention identification on the comment data to obtain a comment intention identification result, the comment intention identification result is used for indicating a matching relation between the comment data and a candidate quality identification result, and the candidate quality identification result is used for indicating the quality condition of the target information flow content;

and the determining module is used for determining the quality result of the target information flow content based on the comment intention identification result.

In another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the content quality identification method according to any of the embodiments of the present application.

In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the content quality identification method according to any one of the embodiments of the present application.

In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the content quality identification method described in any of the above embodiments.

The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:

the comment data of the target information flow content is obtained, and the intention expressed by the comment data is obtained through intention identification of the comment data. Whether the target information flow content has a quality problem is identified from that intention, so that fine-grained quality problems possibly present in the target information flow content are identified. This avoids the heavy analysis workload and low analysis accuracy incurred when fine-grained quality problems must be found through detailed analysis of the specific content, which would otherwise lower the overall quality of the recommended content; it thereby improves the quality of the content in the recommendation pool and the accuracy of content recommendation.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic illustration of comments received by information flow content having a fine-grained quality problem, provided by an exemplary embodiment of the present application;

FIG. 2 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;

FIG. 3 is a flow chart of a method of identifying content quality provided by an exemplary embodiment of the present application;

FIG. 4 is a schematic structural diagram of an intent recognition model provided based on the embodiment shown in FIG. 3;

FIG. 5 is a flow chart of a method for identifying content quality as provided by another exemplary embodiment of the present application;

FIG. 6 is a flow chart of a method of identifying content quality provided by another exemplary embodiment of the present application;

FIG. 7 is a block diagram of a system for identifying content quality provided by an exemplary embodiment of the present application;

FIG. 8 is a block diagram of a content quality recognition apparatus according to an exemplary embodiment of the present application;

FIG. 9 is a block diagram of a content quality recognition apparatus according to another exemplary embodiment of the present application;

FIG. 10 is a block diagram of a server according to an exemplary embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

First, a brief description is given of terms referred to in the embodiments of the present application:

Artificial Intelligence (AI): a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, and the like.

Machine Learning (ML): a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.

Natural Language Processing (NLP): an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable efficient communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

Professional Generated Content (PGC): content that is personalized and presents diverse viewpoints. It is also known as Professionally-Produced Content (PPC). Both UGC and PGC take various forms of expression, such as text, video and audio.

With the rapid development of the Internet, the media landscape has also changed rapidly, and the growing popularity of mobile applications has ushered in a new media era of mobile social interaction. In new media, a platform on which users can speak, share and disseminate content is called a "self-media platform"; content is usually presented as an information stream, and users can interact with it. After consuming content, a user can comment on, like, forward or collect it, among other interactions. The background can also recommend content, and content with a high click-through rate obtains more exposure through the content recommendation strategy. However, inappropriate content seriously damages the brand and quality of the product. Before entering the recommendation pool, content is therefore usually filtered through machine learning and NLP techniques, and homogeneous content is de-duplicated. Nevertheless, because of the complexity and variety of content forms, and in particular the limited capability for semantic understanding of video content, many content quality problems remain that require background knowledge to handle. For example, there are a large number of fine-grained quality types, including typos, discomfort-inducing content, advertisements, low-quality narration videos, incomplete content, and the like.

In view of the above problems, the embodiments of the present application provide a method for identifying content quality, which is mainly used to assist in identifying the quality of content by using the comment data of the content. The core idea is as follows: an intention recognition model performs intention recognition on the comment content to obtain an intention recognition result, and the quality result of the content is determined according to the intention recognition result.

In the training of the intention recognition model, targeted sample cleaning and augmentation are performed on colloquial comment content; a multi-class quality model is then constructed through deep learning, and content quality categories are subdivided using a pre-trained model together with an attention-based multi-quality classification mechanism. A model pre-trained on the comment task corpus can considerably improve recall accuracy, represents semantic information well, and improves the effect of the final model. With this method, content whose quality standard is unclear and hard to define can be discovered in time by identifying and classifying the intention of user comments. The method also enriches the effective samples available to the pre-audit machine algorithm and helps its quality standards converge, which alleviates the difficulty of collecting samples for the pre-audit model. In addition, the cleaning and augmentation of colloquial comment content allow the various special semantic information contained in comments to be represented well, which effectively improves the effect of the model.

Intention recognition: a recognition approach that realizes downstream functions through semantic understanding of natural language. Understanding natural semantics is one of the prerequisites for human-machine dialogue, for example determining from the recognized intention which service the user wants, or whether the user merely wants to chat. Basic intention recognition methods currently include: (1) rule-based classification using dictionary templates; (2) matching based on past logs; (3) intention recognition based on a classification model. Rule-based classification using dictionary templates relies on rule template analysis, which is carried out after word segmentation, part-of-speech tagging, named entity recognition, dependency syntactic analysis and semantic analysis. General rule-based intention recognition comprises: a. determining the domain: an overall entity / main domain / template framework is used to distinguish the domain; b. determining the intention: after the main domain is hit, the template captures an intention verb (download, query, etc.) or an intention question word (what, why, etc.); c. distinguishing weak intentions from strong intentions and providing a targeted solution for each. In current machine learning and deep learning methods, intention recognition is treated as a classification problem: different query intention categories are defined according to the features of the vertical product, the probability of each intention is computed for the query sentence entered by the user using a statistical classification model, and the intention of the query is finally given.
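As an illustration of method (1) only, the following is a minimal sketch of dictionary-template matching. The intent labels, the trigger phrases and the function name `match_intents` are hypothetical and are not taken from the application; a practical system would run after word segmentation and the other analysis steps listed above.

```python
import re
from typing import Dict, List

# Hypothetical dictionary template: intent label -> trigger phrases (illustrative only).
INTENT_TEMPLATES: Dict[str, List[str]] = {
    "typo": ["typo", "misspelled", "wrong word"],
    "advertisement": ["ad", "advertisement", "sponsored"],
    "incomplete_content": ["cut off", "incomplete", "unfinished"],
}


def match_intents(comment: str) -> List[str]:
    """Return every intent whose trigger phrase occurs as a whole word or phrase."""
    text = comment.lower()
    hits = []
    for intent, phrases in INTENT_TEMPLATES.items():
        # Word boundaries keep "ad" from matching inside words such as "bad".
        if any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases):
            hits.append(intent)
    return hits


print(match_intents("There is a typo in the title, and the video is cut off."))
```

Such a rule table is easy to audit but does not generalize to unseen phrasing, which is one reason a classification model may be preferred for colloquial comments.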

It is understood that, in the specific implementation of the present application, when the above embodiments are applied to specific products or technologies, user permission, authorization or consent needs to be obtained for data related to comment content, UGC and the like, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.

With reference to the above description of terms, the application scenario involved in the embodiments of the present application is described below:

In the recommendation of information flow content, content with a high click-through rate or high popularity needs to be pushed to a recommendation pool and recommended to users according to a recommendation distribution algorithm, so that it is displayed as recommended content in the information stream of the user interface. Because content quality is unstable, quality filtering is performed on the content before it is pushed to the recommendation pool.

With the method provided by the present application, when a machine learning model is used to perform quality filtering on information flow content, the comment data of the information flow content is used for auxiliary filtering. That is, intention recognition is performed on the comment data of the information flow content to determine whether the intention expressed by the comment data matches a candidate quality recognition result among the fine-grained quality problems, where the candidate quality recognition results include, for example, typos, discomfort-inducing content, advertisements, low-quality narration videos and incomplete content. When the recognized comment intention indicates that the information flow content has a fine-grained quality problem, the information flow content is rechecked.

Referring to FIG. 1, which schematically shows comments received by information flow content having a fine-grained quality problem according to an exemplary embodiment of the present application: a comment area 110 is displayed in a display interface 100 of the information flow content. The information flow content currently displayed in the display interface 100 has a clickbait ("title party") problem, that is, the title is eye-catching but the content does not actually deliver what the title claims, and the comment displayed in the comment area 110 reads "a small-written title is too much".

In the above example, the method is applied to the identification of the content quality of the information stream, but the method for identifying the content quality provided by the present application may also be applied to other scenarios where the quality of the information stream content is identified, and the present application is not limited to this.

Fig. 2 is a schematic diagram of an implementation environment provided in an exemplary embodiment of the present application, as shown in fig. 2, the implementation environment includes a terminal 210 and a server 220, where the terminal 210 and the server 220 are connected through a communication network 230.

The terminal 210 is installed with an application program capable of publishing and browsing information streams, a user logs in a user account in the application program, browses information stream contents published or forwarded by other accounts in the application program based on the user account, and the user can also perform interactive operations such as forwarding, commenting, collecting, reporting and the like on the information stream contents based on the user account.

In the embodiment of the application, the terminal 210 issues a comment on the target information flow content and sends the comment to the server 220; after receiving the comment on the target information flow content, the server 220 responds to the comment and displays it in the comment area corresponding to the target information flow content. When the quality of the target information flow content is identified, the identification is assisted by the obtained comment data and the intention recognition result based on the comment data. According to the quality identification result of the target information flow content, it is determined whether the target information flow content needs to be filtered out rather than added to the recommendation pool.

It should be noted that the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

With reference to the above description of terms and the application scenario, the method for identifying content quality provided by the present application is described below, taking application of the method to a server as an example. As shown in FIG. 3, the method includes:

step 301, obtaining a target information flow content, where the target information flow content is a content to be subjected to quality identification.

In the embodiment of the present application, the manner of obtaining the content of the target information stream includes at least one of the following manners:

First, when the popularity value of information flow content reaches a required popularity value, the information flow content is determined as the target information flow content for quality identification;

The popularity value of the information flow content is determined by the number of interactions with the information flow content, illustratively by the count of at least one interaction type among browsing, commenting, liking, forwarding and collecting. When the popularity value is determined by the counts of several interaction types, these counts are weighted and summed to obtain the popularity value. For example, the information flow content has been browsed 12000 times, commented on 1400 times, liked 5000 times, forwarded 4200 times and collected 200 times; with the popularity value determined by the browsing count (weight 0.1), the comment count (weight 0.5) and the like count (weight 0.3), the popularity value is 12000 × 0.1 + 1400 × 0.5 + 5000 × 0.3 = 3400.

Secondly, when the information flow content is the content issued by a specified user account, determining the information flow content as the target information flow content for quality identification;

The specified user account is a user account that has been reported during historical interaction; or a user account whose number of followers reaches a required number threshold; or a user account for which the average popularity value of its published information flow content reaches a popularity threshold. The selection criteria for the specified user account are not limited in the embodiments of the present application.

Thirdly, when the information flow content is marked with the specified label, the information flow content is determined as the target information flow content for quality identification.

The specified label is a preset label meeting a popularity requirement; for example, the specified label is a label ranked in the top n of the popularity ranking list within a given time range.

The above determining manner of the target information flow content is only an illustrative example, and in the embodiment of the present application, the target information flow content may also be determined in other manners, which is not limited in the embodiment of the present application.
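The three determination manners can be sketched as follows. This is a minimal illustration under stated assumptions: the `StreamItem` type, the field names and the threshold values are hypothetical, and combining the three manners with a logical OR is one possible reading of "at least one of the following manners". The weights are those of the worked example, for which the weighted sum evaluates to 12000 × 0.1 + 1400 × 0.5 + 5000 × 0.3 = 3400.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Weights from the worked example; interaction types without a weight are ignored.
WEIGHTS: Dict[str, float] = {"views": 0.1, "comments": 0.5, "likes": 0.3}


def popularity_value(counts: Dict[str, int], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of interaction counts."""
    return sum(w * counts.get(kind, 0) for kind, w in weights.items())


@dataclass
class StreamItem:
    # Hypothetical record for one piece of information flow content.
    item_id: str
    author: str
    counts: Dict[str, int]
    labels: Set[str] = field(default_factory=set)


def is_target(item: StreamItem, required_popularity: float,
              specified_accounts: Set[str], specified_labels: Set[str]) -> bool:
    """True if the item qualifies for quality identification by any of the three manners."""
    return (
        popularity_value(item.counts) >= required_popularity  # first manner
        or item.author in specified_accounts                  # second manner
        or bool(item.labels & specified_labels)               # third manner
    )


item = StreamItem(
    item_id="item-1",
    author="account-a",
    counts={"views": 12000, "comments": 1400, "likes": 5000,
            "forwards": 4200, "favorites": 200},
)
print(round(popularity_value(item.counts)))
print(is_target(item, required_popularity=3000,
                specified_accounts=set(), specified_labels=set()))
```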

In some embodiments, the target information stream content is content to be pushed into the recommendation pool; or the target information flow content is the content to be subjected to quality check and issued to the information flow platform after passing the quality check. That is, for the first case, the target information flow content is the content with high popularity and needs to be recommended in the information flow platform; in the second case, the target information flow content is the common content published by the user account on the social platform, but can be published on a specific information flow platform after the quality audit is passed.

That is, the target information flow content is information flow content that has been released by a user account and is to be recommended; or, the target information flow content is content placed in a to-be-published queue after the user account has chosen to publish it.

Step 302, obtaining comment data of the target information flow content.

The comment data is data generated when a comment account performs comment interaction on the target information flow content. The comment account is an account that has comment permission in the information flow platform; or, an account that has comment permission for the target information flow content and has performed comment interaction with the target information flow content.

When the comment data of the target information flow content is acquired, all comment data received by the target information flow content or comment data meeting the comment requirement and received by the target information flow content are acquired.

After all comment data received by the target information flow content are obtained, the comment data are selectively filtered, or all comment data are directly subjected to subsequent processing.

And step 303, performing intention identification on the comment data to obtain a comment intention identification result.

In some embodiments, the comment data is subjected to intention recognition through a pre-trained intention recognition model, and a comment intention recognition result is obtained. And the comment intention identification result is used for indicating a matching relation between the comment data and the candidate quality identification result, and the candidate quality identification result is used for indicating the quality condition of the target information stream content.

Illustratively, the candidate quality recognition result is used to indicate the matching degree of the target information flow content with quality conditions such as typos, discomfort-inducing content, advertisements, low-quality narration videos and incomplete content.

Optionally, the intention recognition results of all comment data of the target information flow content are combined to obtain the comment intention recognition result of the target information flow content. The matching degrees of each comment for each quality condition are accumulated, and the accumulated result is used as the comment intention recognition result corresponding to the target information flow content; or, the matching degrees of each comment for each quality condition are averaged, and the average matching degree is used as the comment intention recognition result; or, the matching degrees of each comment for each quality condition are input into a preset function, and the calculated result is used as the comment intention recognition result.

The above-described manner of calculating the comment intention recognition result is merely an illustrative example, and the present embodiment does not limit the determination manner of the comment intention recognition result.
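The three combination options (accumulation, averaging, and a preset function) can be sketched as below. The function name `aggregate`, the `how` parameter and the use of `max` as the preset function are illustrative assumptions; the text does not specify the preset function.

```python
from typing import Callable, Dict, List, Optional

Scores = Dict[str, float]  # quality condition -> matching degree for one comment


def aggregate(per_comment: List[Scores], how: str = "mean",
              fn: Optional[Callable[[List[float]], float]] = None) -> Scores:
    """Combine per-comment matching degrees into one result per quality condition."""
    labels = sorted({label for scores in per_comment for label in scores})
    result: Scores = {}
    for label in labels:
        # A comment that says nothing about a condition contributes 0.0.
        values = [scores.get(label, 0.0) for scores in per_comment]
        if how == "sum":
            result[label] = sum(values)
        elif how == "mean":
            result[label] = sum(values) / len(values)
        elif how == "custom" and fn is not None:
            result[label] = fn(values)
        else:
            raise ValueError(f"unsupported aggregation: {how}")
    return result


comments = [
    {"typo": 0.6, "clickbait": 0.2},
    {"typo": 0.8, "clickbait": 0.0},
]
print(aggregate(comments, "sum"))
print(aggregate(comments, "mean"))
print(aggregate(comments, "custom", fn=max))  # hypothetical preset function
```

Accumulation grows with the number of comments, so a threshold applied to it would need to scale accordingly; averaging keeps the result in the same range as a single matching degree.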

In some embodiments, comment data is input into an intention recognition model, intention recognition is carried out on the comment data through the intention recognition model, and a comment intention recognition result is output, wherein the intention recognition model is a machine learning model obtained through pre-training of sample comment corpora.

In some embodiments, feature extraction and feature processing are performed on the comment data through the intention recognition model to obtain comment data features; the comment data features are input into an activation function, and the matching degree between each candidate quality recognition result and the comment data is output as the comment intention recognition result.
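A minimal sketch of this last step is given below. The text does not name the activation function; a per-condition sigmoid is assumed here because it yields an independent matching degree in (0, 1) for each quality condition. The raw scores (`logits`) stand in for the output of the feature processing and are made-up values.

```python
import math
from typing import Dict


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matching_degrees(logits: Dict[str, float]) -> Dict[str, float]:
    """Map one raw score per quality condition to a matching degree in [0, 1]."""
    return {label: sigmoid(score) for label, score in logits.items()}


# Hypothetical raw scores produced by the model for a single comment.
logits = {"typo": 0.4, "clickbait": -1.4, "advertisement": -3.0}
print(matching_degrees(logits))
```

If the quality conditions were treated as mutually exclusive, a softmax over the raw scores would be the analogous choice.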

Fig. 4 is a schematic structural diagram of an intention recognition model according to an exemplary embodiment of the present application, and as shown in fig. 4, a sharing layer 410 performs feature extraction and feature processing on a vocabulary 400 in comment data, and a specific task layer 420 classifies features, so as to obtain a matching degree of the comment data with respect to each quality condition. Such as: the probability of the comment data with respect to the case of wrongly written words is 0.6, and the probability with respect to the case of the title party is 0.2.

In the embodiment of the present application, the intention recognition model is implemented, by way of example, as a Bidirectional Encoder Representations from Transformers (BERT) model, and a 2-layer model is selected as the actual model, so that the inference speed is greatly increased at the cost of a small loss of precision, and the model can run on a Central Processing Unit (CPU).

And step 304, determining the quality result of the target information flow content based on the comment intention identification result.

Optionally, when the comment intention recognition result indicates that the target information flow content conforms to a certain quality situation, namely that a quality problem exists, the target information flow content is filtered; or the target information flow content is marked for rechecking and added to a rechecking queue.

When the comment intention recognition result indicates that the target information flow content does not conform to any quality situation, the target information flow content is processed further.

In some embodiments, in response to the comment intention recognition result indicating that the matching degree of a candidate quality recognition result reaches the required matching degree, the target information flow content is given a rechecking mark, and the target information flow content is pushed to a rechecking queue for rechecking based on the rechecking mark, so that a quality result of the target information flow content is obtained; in response to the comment intention recognition result indicating that the matching degrees of the candidate quality recognition results are all smaller than the required matching degree, the target information flow content is determined to be content meeting the quality requirement.

Illustratively, when the comment intention recognition result indicates that the matching degree of the target information flow content with respect to the wrongly-written-word situation is 0.8, which reaches the required matching degree (0.75), the target information flow content is given a rechecking mark and added to the rechecking queue to wait for rechecking.
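Illustratively, the threshold comparison described above can be sketched in Python as follows. This is a minimal sketch and not the implementation of the present embodiment: the function name `determine_quality`, the returned status strings, and the representation of the rechecking queue as a list are assumptions; only the values 0.8 and 0.75 come from the example above.

```python
from typing import Dict, List, Tuple

REQUIRED_MATCHING_DEGREE = 0.75  # required matching degree used in the example above


def determine_quality(
    content_id: str,
    intention_result: Dict[str, float],
    recheck_queue: List[Tuple[str, List[str]]],
    required: float = REQUIRED_MATCHING_DEGREE,
) -> str:
    """Mark content for rechecking when any quality situation reaches the threshold."""
    hits = [label for label, degree in intention_result.items() if degree >= required]
    if hits:
        # Rechecking mark: remember which quality situations triggered it.
        recheck_queue.append((content_id, hits))
        return "pending_recheck"
    # All matching degrees are below the required matching degree.
    return "meets_quality_requirement"


queue: List[Tuple[str, List[str]]] = []
status = determine_quality("content-1", {"typo": 0.8, "clickbait": 0.2}, queue)
```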

In summary, according to the method provided by this embodiment, the comment data of the target information stream content is obtained and intention identification is performed on the comment data to obtain the intention expressed by the comment data, so that whether the target information stream content has a quality problem is identified according to the intention expressed by the comment data, and fine-grained quality problems that may exist in the target information stream content are identified. This avoids the large analysis workload, low analysis accuracy, and low overall quality of recommended content that arise when fine-grained quality problems can only be found by analyzing the specific content of the target information stream content in detail, thereby improving the quality of content in the recommendation pool and the accuracy of content recommendation.

In an alternative embodiment, the training process of the intention recognition model is realized through sample comments published in the information flow content platform. Fig. 5 is a flowchart of a method for identifying content quality according to another exemplary embodiment of the present application, which mainly relates to a training method for the intention recognition model and is described by taking the method being applied to a server as an example. As shown in fig. 5, the method includes:

step 501, obtaining sample comment data, where the sample comment data includes content for commenting information stream content issued by an account in an information stream content platform.

In some embodiments, the manner of obtaining the sample comment data includes at least one of:

firstly, sample comment data is a set of contents which are issued by any account in an information flow content platform and used for commenting information flow contents;

secondly, the sample comment data is a set of contents which are issued by a specified account in the information flow content platform and used for commenting the information flow contents;

thirdly, the sample comment data is a set of comments, issued by accounts, that are received by specified information flow content in the information flow content platform;

fourthly, the sample comment data is a set of comments, issued by accounts, that are received by information flow content issued within a preset time period in the information flow content platform.

It should be noted that the four manners described above are merely illustrative examples, and the manner of obtaining the sample comment data is not limited in the embodiment of the present application.

Step 502, preprocessing the sample comment data to obtain sample data.

Optionally, the preprocessing of the sample comment data includes at least one of cleaning and enhancement.

Step 503, training the intention recognition model based on the sample data until the convergence effect of the intention recognition model reaches the convergence requirement.

Optionally, when a deep neural network is used to train a natural language processing (NLP) model, the text to be processed needs to be converted into word vectors as the input of the neural network, and the quality of the word vectors affects the quality of the final model. The quality of the word vectors depends mainly on the size of the training corpus. In many NLP tasks the limited labeled corpus is not sufficient to train good enough word vectors, so a large-scale unlabeled corpus unrelated to the current task is usually used to pre-train the word vectors. Another benefit of pre-training is that the generalization capability of the model can be enhanced.

In the embodiment of the present application, the key points of the pre-training model include: first, whole-word masking in the Masked Language Model (MLM) task; second, removal of the Next Sentence Prediction (NSP) task. The two points are described below:

First, whole-word masking

A bidirectional language model is trained by randomly masking some words (replacing them with the uniform marker [MASK]) and then predicting the masked words, so that the representation of each word draws on context information. Doing so has two disadvantages: (1) pre-training and fine-tuning become inconsistent, because the [MASK] marker does not appear during fine-tuning; (2) since only a portion of the words in each batch is predicted, the model converges more slowly than a one-way language model, and training takes longer. In this embodiment, a word that needs to be replaced by [MASK] is replaced as a whole word. For example, in the complete sentence "I like the drama in which a certain star plays the lead", "like", "a certain star", "plays the lead", and "drama" are each whole words of the sentence. In addition, a small portion of the whole words is randomly replaced with other words, and the remaining words are kept unchanged. Since the encoder does not know which words need to be predicted and which words have been randomly replaced, the representation of each word is forced to draw on context information.
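Illustratively, whole-word masking can be sketched in Python as follows. This is a minimal sketch and not the implementation of the present embodiment: the function name `whole_word_mask`, the selection probability 0.15, and the 80%/10%/10% split between masking, random replacement, and keeping are assumptions borrowed from common BERT practice and are not stated in the present application. Each word is given as a list of its sub-tokens, and one decision is made per word so that all sub-tokens of a word are treated the same way.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def whole_word_mask(
    words: Sequence[Sequence[str]],
    vocab: Sequence[str],
    rng: random.Random,
    select_prob: float = 0.15,  # assumed share of words chosen for prediction
    mask_prob: float = 0.8,     # assumed share of chosen words replaced by [MASK]
    random_prob: float = 0.1,   # assumed share of chosen words replaced randomly
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (model input tokens, prediction targets); a target of None is not predicted."""
    tokens: List[str] = []
    targets: List[Optional[str]] = []
    for word in words:
        if rng.random() >= select_prob:
            tokens.extend(word)                 # word left untouched
            targets.extend([None] * len(word))
            continue
        action = rng.random()                   # one decision for the whole word
        for sub_token in word:
            if action < mask_prob:
                tokens.append(MASK)
            elif action < mask_prob + random_prob:
                tokens.append(rng.choice(vocab))
            else:
                tokens.append(sub_token)        # kept, but still predicted
            targets.append(sub_token)
    return tokens, targets


# Hypothetical segmentation: "playing" is one whole word made of two sub-tokens.
sentence = [["I"], ["like"], ["play", "##ing"]]
inputs, labels = whole_word_mask(sentence, vocab=["good", "bad"], rng=random.Random(0))
```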

Second, removing the next sentence prediction task

In order to train a model that understands the relationships between sentences, a next sentence prediction task is introduced. The corpus for this task may be generated by extracting pairs of sentences A and B from the corpus, where with 50% probability B is the sentence that follows A, and with 50% probability B is a random sentence from the corpus. The NSP task predicts whether B is the next sentence of A. In the scenario of this embodiment, it is found during pre-training that the model effect is not reduced, and is even improved, after the NSP task is removed, so a strategy of removing the next sentence prediction task is adopted in this embodiment.
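For reference, the way in which NSP sentence pairs are generated, as described above, can be sketched in Python as follows. This sketch only illustrates the task that this embodiment removes; the function name `make_nsp_pairs` and the representation of the corpus as a list of documents, each being a list of sentences, are assumptions.

```python
import random
from typing import List, Tuple


def make_nsp_pairs(
    documents: List[List[str]], rng: random.Random
) -> List[Tuple[str, str, bool]]:
    """Build (A, B, is_next) triples: B is the true next sentence about half the time."""
    all_sentences = [s for doc in documents for s in doc]
    pairs: List[Tuple[str, str, bool]] = []
    for doc in documents:
        for a, true_next in zip(doc, doc[1:]):
            # Exclude the true next sentence so a negative pair is really negative.
            candidates = [s for s in all_sentences if s != true_next]
            if rng.random() < 0.5 or not candidates:
                pairs.append((a, true_next, True))
            else:
                pairs.append((a, rng.choice(candidates), False))
    return pairs


corpus = [["The title is wrong.", "It has a typo."], ["Nice video.", "Thanks."]]
nsp_pairs = make_nsp_pairs(corpus, random.Random(0))
```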

In some embodiments, a multi-classification model is employed that outputs final results and possible probabilities. After the fully-connected layer, a classification function (SoftMax function) is adopted as an activation function, and the SoftMax function is used for normalizing the output components corresponding to each category to make the sum of the components be 1. It is to be understood that any input value can be converted into a probability. The SoftMax function is shown in equation one below:

Formula one:

$$s(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

where $j$ denotes the $j$-th element of the input vector, $n$ denotes the number of elements of the input vector, $i$ denotes the $i$-th class, $e^{x_j}$ denotes $e$ raised to the power of the $j$-th element of the input vector, and $e^{x_i}$ denotes $e$ raised to the power of the $i$-th element of the input vector. The finally obtained $s(x_i)$ represents the probability of the $i$-th class.
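Illustratively, the SoftMax normalization described above can be sketched in Python as follows. This is a minimal sketch: the function name `softmax` and the example input values are assumptions, and subtracting the maximum before exponentiation is a common numerical-stability measure that is not stated in the present application (it does not change the result).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize the outputs of the fully-connected layer so that they sum to 1."""
    shift = max(logits)                            # stability only; cancels out in the ratio
    exps = [math.exp(x - shift) for x in logits]   # e raised to each element
    total = sum(exps)                              # sum over all n elements
    return [e / total for e in exps]               # probability of each class


# Hypothetical outputs for three quality situations.
probabilities = softmax([2.0, 1.0, 0.1])
```

The class with the largest input always receives the largest probability, so the predicted quality situation is unchanged by the normalization; only the scale becomes comparable to a threshold such as the required matching degree.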

According to the method provided by the embodiment, when the intention recognition model is trained, the intention recognition model is trained based on the comment data issued by the account in the information flow platform as the sample comment data, and the comment issued by the information flow platform is a comment actually issued by the account, so that the training accuracy and efficiency of the intention recognition model are improved.

In an optional embodiment, the sample comment data is subjected to data cleaning and enhancement to obtain sample data, and the intention recognition model is trained based on the sample data. Fig. 6 is a flowchart of a content quality recognition method provided in another exemplary embodiment of the present application, and particularly relates to a training method of an intention recognition model, as shown in fig. 6, which is described by way of example as being applied to a server, and the method includes:

step 601, obtaining sample comment data, wherein the sample comment data comprises content which is published by an account in an information flow content platform and used for commenting information flow content.

Step 602, performing sample cleaning on the sample comment data based on a preset cleaning rule to obtain cleaning sample data.

Optionally, the preset cleansing rule includes at least one of the following rules:

firstly, filtering sample comment data in which the number of first designated characters is less than a required number or a required proportion;

illustratively, the first designated character indicates a Chinese character, i.e., the sample comment data is filtered when the number of Chinese characters is less than the required number, or the proportion of Chinese characters is less than the required proportion.

Secondly, filtering sample comment data in which a first character type and a second character type appear alternately;

illustratively, if the first character type indicates emoticons and the second character type indicates Chinese characters, sample comment data in which emoticons and Chinese characters appear alternately is filtered; or, if the first character type indicates simplified Chinese characters and the second character type indicates traditional Chinese characters, sample comment data in which simplified and traditional Chinese characters appear alternately is filtered.

Thirdly, filtering sample comment data with content repetition exceeding a preset number of times;

illustratively, when the number of times a certain vocabulary appears in the sample comment data exceeds 3 times, the sample comment data is filtered.

Fourthly, filtering sample comment data in which the number of second designated characters is greater than a limit number or a limit proportion;

illustratively, the second designated character indicates a non-Chinese character, and the sample comment data is filtered when the number of non-Chinese characters exceeds the limit number, or the proportion of non-Chinese characters exceeds the limit proportion.

Fifthly, filtering sample comment data with the same initial character in the content;

illustratively, the first letter of each word in the sample comment data is extracted, and when the first letters are the same, the sample comment data is filtered.

Sixth, sample comment data for which the meaning of the content cannot be identified is filtered.

Illustratively, semantic recognition is performed on the sample comment data, and when semantic information cannot be obtained through recognition, the sample comment data is filtered.
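Illustratively, some of the cleaning rules above (the first, third, and fourth rules) can be sketched in Python as follows. This is a minimal sketch and not the implementation of the present embodiment: the function name `keep_comment`, the thresholds `min_chinese` and `max_non_chinese_ratio`, and the restriction of "Chinese character" to the CJK Unified Ideographs range are assumptions; repetition is counted per character rather than per vocabulary item, because word segmentation is outside the scope of the sketch. Only the limit of 3 repetitions comes from the example above.

```python
import re
from collections import Counter

# Assumption: a "Chinese character" is a code point in the CJK Unified Ideographs block.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")


def keep_comment(
    text: str,
    min_chinese: int = 2,                 # hypothetical required number (rule 1)
    max_non_chinese_ratio: float = 0.5,   # hypothetical limit proportion (rule 4)
    max_repeats: int = 3,                 # repetition limit from the example (rule 3)
) -> bool:
    """Return True when the sample comment survives the sketched cleaning rules."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    n_chinese = sum(1 for c in chars if CHINESE_CHAR.match(c))
    if n_chinese < min_chinese:                          # rule 1: too few Chinese characters
        return False
    if (len(chars) - n_chinese) / len(chars) > max_non_chinese_ratio:
        return False                                     # rule 4: too many non-Chinese characters
    if max(Counter(chars).values()) > max_repeats:       # rule 3: content repeated too often
        return False
    return True


samples = ["标题有错别字", "666666", "哈哈哈哈哈哈"]
cleaned = [s for s in samples if keep_comment(s)]
```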

For exemplary sample cleaning rules, please refer to Table 1 below.

Table 1

Step 603, performing sample enhancement on the sample comment data based on a preset enhancement rule to obtain sample enhancement data.

Optionally, the preset enhancement rule includes at least one of the following rules:

firstly, modifying and adjusting sample comment data based on a preset modification mode;

illustratively, some words in the sample comment data are repeated; the order of some words among the short sentences is shuffled; stop words are added to the sample comment data; or some words in the sample comment data are randomly deleted, so as to obtain newly added sample comment data.

Secondly, translating the sample comment data belonging to the first language into a second language, and then translating the sample comment data into the first language again;

illustratively, after Chinese sample comment data is translated into English, it is translated back from English into Chinese to obtain newly added sample comment data.

And thirdly, replacing the vocabulary in the sample comment data based on a preset replacement mode.

Illustratively, a synonym is selected from a synonym table to replace a vocabulary in the sample comment data to obtain newly added sample comment data; the vocabulary in the sample comment data is replaced with word vectors to obtain newly added sample comment data; or a random vocabulary in the sample comment data is replaced with the uniform marker to obtain newly added sample comment data.
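Illustratively, several of the enhancement rules above (repeating, shuffling, deleting, and synonym replacement) can be sketched in Python as follows. This is a minimal sketch and not the implementation of the present embodiment: all function names, the deletion probability, and the synonym table are hypothetical, and the comment is assumed to be already segmented into a list of words. Back-translation is not shown because it depends on an external translation service.

```python
import random
from typing import Dict, List


def random_repeat(tokens: List[str], rng: random.Random) -> List[str]:
    """Repeat one randomly chosen word."""
    if not tokens:
        return []
    i = rng.randrange(len(tokens))
    return tokens[: i + 1] + [tokens[i]] + tokens[i + 1 :]


def random_swap(tokens: List[str], rng: random.Random) -> List[str]:
    """Exchange the positions of two randomly chosen words."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_delete(tokens: List[str], rng: random.Random, p: float = 0.1) -> List[str]:
    """Delete each word with probability p, keeping at least one word."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def synonym_replace(
    tokens: List[str], synonyms: Dict[str, List[str]], rng: random.Random
) -> List[str]:
    """Replace every word found in the synonym table with one of its synonyms."""
    return [rng.choice(synonyms[t]) if t in synonyms else t for t in tokens]


rng = random.Random(0)
comment = ["the", "title", "has", "typos"]
augmented = [
    random_repeat(comment, rng),
    random_swap(comment, rng),
    random_delete(comment, rng),
    synonym_replace(comment, {"typos": ["misspellings"]}, rng),
]
```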

For exemplary sample enhancement rules, please refer to Table 2 below.

Table 2

Step 604, obtaining sample data based on the cleaning sample data and the enhancement sample data.

In some embodiments, only sample cleaning or only sample enhancement may be performed. This embodiment is described by taking the case where both sample cleaning and sample enhancement are performed as an example. Sample cleaning may be performed first, and sample enhancement is then performed on the cleaned sample data; or sample enhancement may be performed first, and sample cleaning is then performed on the enhanced sample data.

Step 605, training the intention recognition model based on the sample data until the convergence effect of the intention recognition model meets the convergence requirement.

In the embodiment of the present application, the key points of the pre-training model include: first, whole-word masking; and second, removing the next sentence prediction task.

In summary, in the method provided in this embodiment, when the intention recognition model is trained, the intention recognition model is trained based on the comment data issued by the account in the information flow platform as the sample comment data, and since the comment issued by the account in the information flow platform is a comment actually issued by the account, the training accuracy and efficiency of the intention recognition model are improved.

Fig. 7 is a block diagram schematically illustrating a structure of a content quality recognition system according to an exemplary embodiment of the present application, and as shown in fig. 7, functions of respective functional blocks are described as follows.

First, the content production end 710 and the content consumption end 720: that is, the user terminals. A producer of Professionally Generated Content (PGC), User Generated Content (UGC), Multi-Channel Network (MCN) content, or Professional User Generated Content (PUGC) provides local or captured image-text content, video, or album content through a mobile terminal or a backend Application Programming Interface (API) system, and these are the main sources of the distributed content; through communication with the uplink and downlink content interface service, the production end first obtains the interface address of the upload server and then uploads the local file, and matching music, filter templates, image-text beautification functions, and the like can be selected during shooting; as a consumer, the consumption end communicates with the uplink and downlink content interface server 730 to obtain index information for accessing the image-text or video file, then downloads the corresponding streaming media file and plays it through a local player; behavior data generated during uploading and downloading, such as stalls, loading time, and play clicks, is reported to the server; the interaction information generated when the consumption end consumes the content, such as UGC short-text comments on the content, forwarding, and collecting, is reported through the UGC interaction and statistics reporting interface.

Second, the uplink and downlink content interface server 730: communicates directly with the content production end 710; content submitted from the front end, such as the title, publisher, abstract, cover image, and publishing time of the content, or captured image-text content, enters the server side directly through this server, and the file is stored in the content database 740; meta information of the image-text content, such as the image-text file size, cover image link, code rate, file format, title, release time, and author, is written into the content database 740; the uploaded file is submitted to the scheduling center server 750 for subsequent content processing and circulation.

Third, the content database 740: the core database of content, in which the meta information of the content released by all producers is stored. The meta information includes, for example, file size, cover image link, code rate, file format, title, release time, author, video file size, video format, and whether the content is marked as original or is a first release, and also includes the classification given to the content during manual review (including first-level, second-level, and third-level classifications and label information; for example, for an article explaining a mobile phone, the first-level classification is science and technology, the second-level classification is smartphones, the third-level classification is domestic mobile phones, and the label information is the brand and model of the mobile phone); the information in the content database 740 is read during manual review, and the result and status of the manual review are also written back to the content database 740; the processing results of the scheduling center 750 for the content can be written into the content database 740, so that fully duplicated content is not manually processed a second time; the meta information of the content is read from the content database 740 when labels are subsequently extracted.

Fourth, the scheduling center 750: responsible for the entire scheduling process of content circulation; it receives the content entering the library through the uplink and downlink content interface server 730, and then obtains the meta information of the content from the content database 740; it schedules the manual review system 760 and the machine processing system, and controls the scheduling sequence and priority; content enabled through the manual review system 760 is then provided directly to the content consumption end 720 of the terminal through the content distribution outlet service 713 (usually a recommendation engine, a search engine, or operations), that is, as the content index information obtained by the consumption end.

Fifth, the manual review system 760: needs to read the original information of the image-text content in the content database; it is usually a system with complex business logic developed on a web database, in which the characteristics of the image-text content undergo a first round of preliminary manual filtering; a secondary review is performed on the content on the basis of the preliminary review, mainly to classify the content and to add or confirm labels; it receives the review tasks synchronized by the scheduling center 750, and also receives, through the review queue service 714, the low-quality content screened out by the comment intention identification service 711; after such content is reviewed, the content is taken offline directly, and similar content that has already been enabled online is then taken offline by calling the duplicate and similar content recall service.

Sixth, the content deduplication service 770: provides deduplication services for image-text content, videos, and albums; the image-text content, albums, and videos are vectorized, an index of the vectors is built, and the degree of similarity is then determined by comparing the distances between the vectors; image-text content is generally vectorized through BERT or the locality-sensitive hashing algorithm SimHash, and before all image-text deduplication tasks, deduplication is first performed on the short title text.

Seventh, the UGC interaction and statistics reporting interface 780: communicates with the content consumption end 720 to receive the reported interaction information, such as UGC short-text comments on the content, likes, forwarding, and collecting, and writes the interaction information into the interactive content database 790 as the basic data source for subsequent sample processing, cleaning, and enhancement; it also receives the comment content generated by terminal users and transmits it to the intention identification service 711, which performs intention identification on the comment content, so that fine-grained quality-dimension prediction is made for the original content to which the comment belongs.

Eighth, the interactive content database 790 and the comment sample database 700: store the raw comment data generated by terminal users, including the unique identifier of the content to which a comment corresponds, the time at which the comment was posted, the identity (ID) of the commenting user, and the actual content of the comment; the raw comment data is processed according to the sample cleaning and sample enhancement methods, and after processing is finished the samples for training are stored in the comment sample database 700, which provides the raw training sample data for the comment intention recognition model.

Ninth, the intention recognition service 711: deploys the intention recognition model 712 as a service; it receives the comment information synchronized by the UGC interaction and statistics reporting interface, and then determines the fine-grained quality category of the corresponding content through the emotional tendency recognized by the intention recognition service 711; low-quality content that is determined to satisfy the threshold condition is pushed to the review queue service 714, or the corresponding low-quality content is filtered out directly.

Tenth, the intention recognition model 712: reads the sample data in the comment sample database 700 and constructs the corresponding comment intention recognition model according to the algorithm model.
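Illustratively, the fingerprint comparison mentioned for the content deduplication service 770 can be sketched in Python as follows, using a SimHash fingerprint and the Hamming distance between fingerprints. This is a minimal sketch and not the implementation of the present embodiment: the function names, the use of MD5 as the per-token hash, equal token weights, the 64-bit fingerprint width, and the distance threshold of 3 are all assumptions.

```python
import hashlib
from typing import List


def simhash(tokens: List[str], bits: int = 64) -> int:
    """Compute a SimHash fingerprint; bits must be a multiple of 8 and at most 128."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(t1: List[str], t2: List[str], max_distance: int = 3) -> bool:
    """Treat two token lists as duplicates when their fingerprints are close."""
    return hamming_distance(simhash(t1), simhash(t2)) <= max_distance


title_a = ["star", "drama", "premiere", "tonight"]
title_b = ["star", "drama", "premiere", "tonight"]
duplicate = is_near_duplicate(title_a, title_b)
```

Because similar token sets produce fingerprints that differ in only a few bits, short title texts can be compared cheaply before the more expensive vector comparison of the full image-text content.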

Fig. 8 is a block diagram of a device for identifying content quality according to an exemplary embodiment of the present application, where the device includes, as shown in fig. 8:

an obtaining module 810, configured to obtain target information stream content, where the target information stream content is content to be quality-identified;

the obtaining module 810 is further configured to obtain comment data of the target information stream content, where the comment data is data generated when a comment account performs comment interaction on the target information stream content;

an identifying module 820, configured to perform intent identification on the comment data to obtain a comment intent identification result, where the comment intent identification result is used to indicate a matching relationship between the comment data and a candidate quality identification result, and the candidate quality identification result is used to indicate a quality condition of the target information stream content;

a determining module 830, configured to determine a quality result of the target information stream content based on the comment intention identification result.
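Illustratively, the cooperation of the three modules described above can be sketched in Python as follows. This is a minimal sketch and not the implementation of the present embodiment: the class name `ContentQualityDevice`, the injected callables standing in for the obtaining module 810 and the identifying module 820, the status strings, and the default required matching degree of 0.75 are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ContentQualityDevice:
    # Stands in for the obtaining module 810: content id -> comment data.
    fetch_comments: Callable[[str], List[str]]
    # Stands in for the identifying module 820: comment data -> matching degrees.
    recognize_intention: Callable[[List[str]], Dict[str, float]]
    required_degree: float = 0.75  # assumed required matching degree

    def determine_quality(self, content_id: str) -> str:
        """Stands in for the determining module 830."""
        comments = self.fetch_comments(content_id)
        result = self.recognize_intention(comments)
        if any(degree >= self.required_degree for degree in result.values()):
            return "pending_recheck"
        return "meets_quality_requirement"


# Stub callables used only to exercise the sketch.
device = ContentQualityDevice(
    fetch_comments=lambda content_id: ["the title has a typo"],
    recognize_intention=lambda comments: {"typo": 0.8, "clickbait": 0.2},
)
quality_result = device.determine_quality("content-1")
```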

In an alternative embodiment, as shown in fig. 9, the identification module 820 includes:

an input unit 821 for inputting the comment data into an intention recognition model;

an output unit 822, configured to perform intent recognition on the comment data through the intent recognition model, and output the comment intent recognition result, where the intent recognition model is a machine learning model obtained through sample comment corpus pre-training.

In an optional embodiment, the output unit 822 is further configured to perform feature extraction and feature processing on the comment data through the intention recognition model to obtain comment data features; inputting the feature of the comment data into an activation function, and outputting to obtain the matching degree of the candidate quality identification result and the comment data, wherein the matching degree is used as the comment intention identification result.

In an optional embodiment, the obtaining module 810 is further configured to obtain sample comment data, where the sample comment data includes content that is published by an account in an information flow content platform and used for commenting on information flow content;

the processing module 840 is used for preprocessing the sample comment data to obtain sample data;

a training module 850, configured to train the intention recognition model based on the sample data until a convergence effect of the intention recognition model meets a convergence requirement.

In an optional embodiment, the processing module 840 is further configured to perform sample cleaning on the sample comment data based on a preset cleaning rule to obtain cleaning sample data;

the processing module 840 is further configured to perform sample enhancement on the sample comment data based on a preset enhancement rule to obtain enhanced sample data;

the processing module 840 is further configured to obtain the sample data based on the cleaning sample data and the enhancement sample data.

In an optional embodiment, the preset cleaning rule comprises at least one of the following rules:

filtering sample comment data in which the number of first designated characters is smaller than a required number or a required proportion;

filtering sample comment data in which a first character type and a second character type appear alternately;

filtering sample comment data with content repeated more than a preset number of times;

filtering sample comment data in which the number of second designated characters is greater than a limit number or a limit proportion;

filtering sample comment data with the same initial character in the content;

filtering sample comment data for which the meaning of the content cannot be identified.

In an optional embodiment, the preset enhancement rule includes at least one of the following rules:

modifying and adjusting the sample comment data based on a preset modification mode;

translating the sample comment data belonging to the first language into a second language, and then translating the sample comment data into the first language again;

and replacing the vocabulary in the sample comment data based on a preset replacement mode.

In an optional embodiment, the determining module 830 is further configured to give the target information stream content a rechecking mark in response to the comment intention recognition result indicating that the matching degree of the candidate quality recognition result reaches a required matching degree; and to push the target information stream content to a rechecking queue for rechecking based on the rechecking mark to obtain the quality result of the target information stream content.

In an optional embodiment, the determining module 830 is further configured to determine that the target information stream content is content meeting the quality requirement in response to the comment intention recognition result indicating that the matching degrees of the candidate quality recognition results are all smaller than the required matching degree.

In summary, the apparatus provided in this embodiment obtains the comment data of the target information stream content and performs intention recognition on the comment data to obtain the intention expressed by the comment data, so that whether the target information stream content has a quality problem is recognized according to the intention expressed by the comment data, and fine-grained quality problems that may exist in the target information stream content are recognized. This avoids the large analysis workload, low analysis accuracy, and low overall quality of recommended content that arise when fine-grained quality problems can only be found by analyzing the specific content of the target information stream content in detail, thereby improving the quality of content in the recommendation pool and the accuracy of content recommendation.

It should be noted that: the content quality recognition apparatus provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the content quality identification device and the content quality identification method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.

Fig. 10 shows a schematic structural diagram of a server provided in an exemplary embodiment of the present application. Specifically:

the server 1000 includes a Central Processing Unit (CPU) 1001, a system Memory 1004 including a Random Access Memory (RAM) 1002 and a Read Only Memory (ROM) 1003, and a system bus 1005 connecting the system Memory 1004 and the Central Processing Unit 1001. The server 1000 also includes a mass storage device 1006 for storing an operating system 1013, application programs 1014, and other program modules 1015.

The mass storage device 1006 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1006 and its associated computer-readable media provide non-volatile storage for the server 1000. That is, the mass storage device 1006 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.

Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1006 described above may be collectively referred to as memory.

According to various embodiments of the present application, the server 1000 may also run by connecting, through a network such as the Internet, to a remote computer on that network. That is, the server 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005; the network interface unit 1011 may also be used to connect to other types of networks or remote computer systems (not shown).

The memory further stores one or more programs, which are configured to be executed by the CPU.

Embodiments of the present application further provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method for identifying content quality provided by the above method embodiments.

Embodiments of the present application further provide a computer-readable storage medium, on which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the method for identifying content quality provided by the above method embodiments.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the content quality identification method described in any of the above embodiments.

Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc. The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method for identifying content quality, the method comprising:

performing intention recognition on the comment data to obtain a comment intention recognition result, wherein the comment intention recognition result is used for indicating a matching relation between the comment data and a candidate quality recognition result, and the candidate quality recognition result is used for indicating the quality condition of the target information flow content;

2. The method of claim 1, wherein the performing intent recognition on the comment data to obtain a comment intent recognition result comprises:

inputting the comment data into an intent recognition model;

and performing intention recognition on the comment data through the intention recognition model, and outputting the comment intention recognition result, wherein the intention recognition model is a machine learning model obtained through pre-training on a sample comment corpus.

3. The method according to claim 2, wherein the performing intention recognition on the comment data through the intention recognition model and outputting the comment intention recognition result comprises:

performing feature extraction and feature processing on the comment data through the intention recognition model to obtain comment data features;

inputting the comment data features into an activation function, and outputting the matching degree between the candidate quality recognition result and the comment data, wherein the matching degree serves as the comment intention recognition result.
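
By way of non-limiting illustration, the activation step of claim 3 may be sketched as follows. The sketch assumes that the activation function is a softmax applied after a single linear layer; the claim does not name a particular activation function. The candidate labels, the function names `softmax` and `matching_degrees`, and all numeric values are hypothetical and are not taken from the application.

```python
import math

# Hypothetical candidate quality recognition results (labels are assumptions).
CANDIDATES = ["clickbait", "outdated", "no_issue"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matching_degrees(features, weights, biases):
    """Apply one linear layer, then softmax: one matching degree per candidate."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(CANDIDATES, softmax(logits)))


# Toy comment data features and toy parameters, for illustration only.
features = [1.0, 0.0, 2.0]
weights = [[2.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.1, 0.1, 0.1]]
biases = [0.0, 0.0, 0.0]
degrees = matching_degrees(features, weights, biases)
```

In this sketch the matching degrees sum to one, so each value can be read as the relative degree to which the comment data matches each candidate quality recognition result.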

4. The method of claim 2, further comprising:

obtaining sample comment data, wherein the sample comment data comprises content posted by accounts on an information flow content platform to comment on information flow content;

preprocessing the sample comment data to obtain sample data;

and training the intention recognition model based on the sample data until the convergence effect of the intention recognition model meets the convergence requirement.

5. The method of claim 4, wherein preprocessing the sample comment data to obtain sample data comprises:

performing sample cleaning on the sample comment data based on a preset cleaning rule to obtain cleaned sample data;

performing sample enhancement on the sample comment data based on a preset enhancement rule to obtain enhanced sample data;

and obtaining the sample data based on the cleaned sample data and the enhanced sample data.
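
By way of non-limiting illustration, the preprocessing of claim 5 may be sketched as follows. The two rules shown are placeholders chosen only to make the sketch runnable: the cleaning rule drops blank and exactly duplicated comments, and the enhancement rule swaps the first two tokens of a comment. Neither rule is asserted to be a preset rule of the application, and the function names `clean`, `enhance`, and `build_sample_data` are hypothetical.

```python
def clean(samples):
    """Placeholder cleaning rule: strip whitespace, drop blanks and exact duplicates."""
    seen, out = set(), []
    for s in samples:
        s = s.strip()
        if s and s not in seen:
            seen.add(s)
            out.append(s)
    return out


def enhance(samples):
    """Placeholder enhancement rule: swap the first two tokens of each comment."""
    out = []
    for s in samples:
        tokens = s.split()
        if len(tokens) >= 2:
            tokens[0], tokens[1] = tokens[1], tokens[0]
            out.append(" ".join(tokens))
    return out


def build_sample_data(sample_comments):
    """Combine cleaned and enhanced samples into the final sample data."""
    cleaned = clean(sample_comments)      # cleaned sample data
    enhanced = enhance(sample_comments)   # enhanced sample data
    return cleaned + enhanced


raw = ["title is misleading", "title is misleading", "   ", "old news"]
sample_data = build_sample_data(raw)
```

Both rules are applied to the original sample comment data, as recited in the claim, and the results are then merged; an implementation could equally enhance only the cleaned samples.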

6. The method of claim 5, wherein the preset cleansing rules comprise at least one of the following rules:

filtering sample comment data with the same initial character in the content;

7. The method of claim 5, wherein the preset enhancement rule comprises at least one of the following rules:

8. The method of any of claims 1 to 7, wherein determining a quality result of the target information stream content based on the comment intent recognition result comprises:

marking the target information flow content with a recheck mark in response to the comment intention recognition result indicating that the matching degree of the candidate quality recognition result reaches the required matching degree;

and pushing the target information flow content to a recheck queue for rechecking based on the recheck mark, to obtain the quality result of the target information flow content.

9. The method of claim 8, further comprising:

and in response to the comment intention recognition result indicating that the matching degrees of the candidate quality recognition results are all less than the required matching degree, determining the target information flow content as content meeting the quality requirement.
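
By way of non-limiting illustration, the decision logic of claims 8 and 9 may be sketched as follows. The threshold value, the in-memory `recheck_queue`, the returned status strings, and the function name `determine_quality` are assumptions made for the sketch; the application does not specify them.

```python
from collections import deque

REQUIRED_MATCHING_DEGREE = 0.8  # assumed value of the required matching degree
recheck_queue = deque()         # hypothetical in-memory recheck queue


def determine_quality(content_id, degrees, required=REQUIRED_MATCHING_DEGREE):
    """Route content according to the matching degree of each candidate result.

    `degrees` maps each candidate quality recognition result to its
    matching degree with the comment data.
    """
    flagged = [label for label, d in degrees.items() if d >= required]
    if flagged:
        # At least one candidate reaches the required matching degree:
        # attach a recheck mark and push the content to the recheck queue.
        recheck_queue.append({"content_id": content_id, "recheck_mark": flagged})
        return "pending_recheck"
    # All matching degrees are below the required matching degree.
    return "meets_quality_requirement"


first = determine_quality("content-a", {"clickbait": 0.9, "outdated": 0.1})
second = determine_quality("content-b", {"clickbait": 0.2, "outdated": 0.3})
```

In this sketch the quality result of flagged content is deferred to whatever process consumes the recheck queue, while unflagged content is treated directly as meeting the quality requirement.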

10. An apparatus for identifying a content quality, the apparatus comprising:

an acquisition module, configured to acquire target information flow content, wherein the target information flow content is content to be subjected to quality identification;

11. The apparatus of claim 10, wherein the identification module comprises:

an input unit for inputting the comment data into an intention recognition model;

and an output unit, configured to perform intention recognition on the comment data through the intention recognition model and output the comment intention recognition result, wherein the intention recognition model is a machine learning model obtained through pre-training on a sample comment corpus.

12. The apparatus according to claim 11, wherein the output unit is further configured to: perform feature extraction and feature processing on the comment data through the intention recognition model to obtain comment data features; and input the comment data features into an activation function and output the matching degree between the candidate quality recognition result and the comment data, wherein the matching degree serves as the comment intention recognition result.

13. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to implement the method of identifying content quality as claimed in any one of claims 1 to 9.

14. A computer-readable storage medium having stored thereon at least one instruction, which is loaded and executed by a processor, to implement the method for identifying content quality according to any one of claims 1 to 9.

15. A computer program product comprising a computer program or instructions which, when executed by a processor, implement a method of identifying content quality as claimed in any one of claims 1 to 9.