CN115146613A - Document quality evaluation method and device, electronic equipment and medium

Info

Publication number: CN115146613A
Application number: CN202210764570.2A
Authority: CN
Inventors: 曹秀亭; 吴广发; 薛璐影
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2022-10-04

Abstract

The disclosure provides a document quality evaluation method, a document quality evaluation device, electronic equipment and a medium, and relates to the technical field of computers, in particular to the technical field of artificial intelligence and machine learning. The implementation scheme is as follows: acquiring a plurality of document data of a target document; acquiring a document feature vector of a target document based on a plurality of document data; and performing predictive analysis on the document feature vector to obtain a quality score of the target document.

Description

Document quality evaluation method and device, electronic equipment and medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for evaluating document quality, an electronic device, a computer-readable storage medium, and a computer program product.

Background

Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.

As online knowledge content expands rapidly, open upload permissions allow users to upload documents in large numbers, which encourages user participation. However, as the number of uploaded documents grows, document quality becomes uneven, and it is difficult for users to find a document that meets their needs among a large number of documents on their own. The quality of uploaded documents therefore needs to be evaluated.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.

Disclosure of Invention

The disclosure provides a document quality evaluation method, a document quality evaluation device, an electronic device, a computer-readable storage medium and a computer program product.

According to an aspect of the present disclosure, there is provided a document quality evaluation method including: acquiring a plurality of document data of a target document, wherein the plurality of document data comprise at least one of attribute data, user operation data and uploader data, and the attribute data at least comprise format category data, content category data and file size data of the target document; acquiring a document feature vector of the target document based on the plurality of document data, wherein at least one feature dimension in the document feature vector is a combined feature, and a feature value of the combined feature is determined based on first data and second data in the plurality of document data, wherein the first data and the second data are different document data; and performing predictive analysis on the document feature vector to obtain a quality score of the target document.

According to another aspect of the present disclosure, there is provided a model training method including: acquiring a sample data set, wherein each sample data in the sample data set comprises a plurality of document data of a sample document and a quality label corresponding to the sample data, the plurality of document data comprise at least one of attribute data, user operation data and uploader data, and the attribute data at least comprise format category data, content category data and file size data of the target document; and for each sample data, performing the following operations: acquiring a document feature vector based on the plurality of document data corresponding to the sample data; inputting the document feature vector into a model to obtain a quality prediction score of the sample data; and adjusting parameters of the model based on the quality prediction score of the sample data and the quality label of the sample data.

According to another aspect of the present disclosure, there is provided a document quality evaluation apparatus including: a first acquisition unit configured to acquire a plurality of document data of a target document, the plurality of document data including at least one of attribute data, user operation data, and uploader data, wherein the attribute data includes at least format category data, content category data, and file size data of the target document; a second acquisition unit configured to acquire a document feature vector of the target document based on the plurality of document data, wherein at least one feature dimension in the document feature vector is a combined feature, and a feature value of the combined feature is determined based on first data and second data in the plurality of document data, wherein the first data and the second data are different document data; and a prediction unit configured to perform predictive analysis on the document feature vector to acquire a quality score of the target document.

According to another aspect of the present disclosure, there is provided a model training apparatus including: a third acquisition unit configured to acquire a sample data set, wherein each sample data in the sample data set comprises a plurality of document data of a sample document and a quality label corresponding to the sample data, the plurality of document data comprise at least one of attribute data, user operation data and uploader data, and the attribute data at least comprise format category data, content category data and file size data of the target document; and an execution unit configured to execute, for each sample data, operations of the following subunits: a second acquisition subunit configured to acquire a document feature vector based on the plurality of document data corresponding to the sample data; an input subunit configured to input the document feature vector into a model to obtain a quality prediction score of the sample data; and an adjusting subunit configured to adjust parameters of the model based on the quality prediction score of the sample data and the quality label of the sample data.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the document quality assessment method or the model training method.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the above-described document quality evaluation method or the above-described model training method.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described document quality assessment method or the above-described model training method.

According to one or more embodiments of the disclosure, by introducing more dimensions of feature information (user operation data, uploader data) and combination features among some data into the document quality assessment, the accuracy of score prediction can be further improved, scores are more consistent with user expectations, and user experience is improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;

FIG. 2 shows a flow diagram of a document quality assessment method according to an embodiment of the present disclosure;

FIG. 3 illustrates a flow diagram for obtaining a document feature vector of a target document according to an embodiment of the disclosure;

FIG. 4 shows a flow diagram of a model training method according to an embodiment of the present disclosure;

FIG. 5 shows a block diagram of the structure of a document quality evaluation apparatus according to an embodiment of the present disclosure;

FIG. 6 shows a block diagram of a model training apparatus according to an embodiment of the present disclosure;

FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, unless otherwise specified, the use of the terms "first", "second", and the like to describe various elements is not intended to limit the positional relationship, the temporal relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.

The terminology used in the description of the various described examples in this disclosure is for the purpose of describing the particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.

In the related art, document quality is generally evaluated by scoring the document according to preset rules applied to basic attribute data such as the document format category, document content category, file size, upload date, document name, and character count; for example, when the character count exceeds a preset value, a preset score is added to the document quality score. Because this evaluation relies on feature information of a single dimension (basic attributes only), its accuracy is low, it struggles to meet user needs, and the user experience suffers.

The embodiment of the disclosure provides a document quality evaluation method, and by introducing more-dimensional feature information (user operation data and uploader data) and combination features among some data into document quality evaluation, the accuracy of score prediction can be further improved, so that the score is more in line with the expectation of a user, and the user experience is improved.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120.

Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, server 120 may run one or more services or software applications that enable the document quality assessment methods to be performed.

In some embodiments, the server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein, and is not intended to be limiting.

The user may use the client devices 101, 102, 103, 104, 105, and/or 106 to perform operations such as presenting and downloading documents. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or may include a variety of mobile operating systems, such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. Merely by way of example, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.

In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.

In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or a smart cloud host with artificial intelligence technology. The cloud server is a host product in a cloud computing service system, and addresses the drawbacks of high management difficulty and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

According to an embodiment of the present disclosure, as shown in fig. 2, there is provided a document quality evaluation method including: step S201, acquiring a plurality of document data of the target document, wherein the plurality of document data comprise at least one of attribute data, user operation data and uploader data, and the attribute data at least comprise format category data, content category data and file size data of the target document; step S202, acquiring a document feature vector of a target document based on a plurality of document data, wherein at least one feature dimension in the document feature vector is a combined feature, and a feature value of the combined feature is determined based on first data and second data in the plurality of document data, wherein the first data and the second data are different document data; and step S203, performing predictive analysis on the document feature vector to acquire a quality score of the target document.

Therefore, by introducing feature information of more dimensions (user operation data and uploader data) and combined features among some of the data into the document quality evaluation, the accuracy of score prediction can be further improved, so that scores better match user expectations and the user experience is improved.

In some embodiments, document data of multiple dimensions of a target document to be evaluated may be first acquired, including, for example, one or more of attribute data, user operation data, and uploader data of the target document.

The attribute data may include basic data of the target document such as the document format category (e.g., Word, Excel, or text), the document content category (e.g., education or science), file size, upload date, document name, and character count, and the attribute data may be expressed as feature values. For example, for the document format category and the document content category, a label value may be assigned to each category so that the data is expressed as a feature value; for the upload date, the difference between the date of the current quality evaluation and the upload date may be calculated.
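The encoding of attribute data described above can be sketched as follows. This is a minimal illustration only: the label tables `FORMAT_LABELS` and `CONTENT_LABELS`, the fallback `UNKNOWN_LABEL`, and the helper `encode_attributes` are hypothetical names and values chosen for the example and are not specified by the disclosure.

```python
from datetime import date

# Hypothetical label tables; the disclosure only states that each category gets a label value.
FORMAT_LABELS = {"word": 0, "excel": 1, "text": 2}
CONTENT_LABELS = {"education": 0, "science": 1}
UNKNOWN_LABEL = -1  # assumed fallback for categories missing from the tables


def encode_attributes(fmt, content, size_bytes, upload_date, eval_date):
    """Map raw attribute data of a document to numeric feature values."""
    return {
        "format_label": FORMAT_LABELS.get(fmt.lower(), UNKNOWN_LABEL),
        "content_label": CONTENT_LABELS.get(content.lower(), UNKNOWN_LABEL),
        "file_size": size_bytes,
        # The upload date is expressed as the number of days elapsed at evaluation time.
        "days_since_upload": (eval_date - upload_date).days,
    }


features = encode_attributes("Excel", "education", 20480, date(2022, 6, 1), date(2022, 6, 29))
```

Here the upload date contributes a single numeric value (days elapsed), so that newer and older documents can be compared on the same scale.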

The uploader data may include data related to the uploader of the document, such as account information of the uploader, identity category (e.g., authenticated user, general user, etc.), total number of uploaded documents of the uploader, audit pass rate of uploaded documents of the uploader, total number of times that the document uploaded by the uploader is presented on the day, and the like.

The user operation data may include statistical data for each of a plurality of user operations on the target document, where the plurality of user operations may include, for example, presenting, downloading, liking, favoriting, reposting, saving a copy of, complaining about, scoring, and dwelling on the document.

For sparsely distributed data, such as reposts, saved copies, complaints, likes, and favorites, the historical total number of operations may be counted, which prevents missing data in those dimensions. For presenting and downloading operations, the number of operations on the day of the current quality evaluation, in the previous 7 days, in the previous 30 days, and so on, as well as the historical total, may each be counted. For downloading operations, the download rate in the corresponding time period may further be calculated (for example, as the ratio of the number of downloads to the number of presentations), which further enriches the information available for document quality evaluation.
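The windowed counts and download rate described above might be computed as in the following sketch. The function names, the choice of windows ending on the evaluation day (inclusive), and the convention of returning 0.0 when there were no presentations are all assumptions made for illustration.

```python
from datetime import date, timedelta


def window_count(daily_counts, eval_date, days):
    """Sum daily counts over the `days` days ending on eval_date (inclusive, assumed)."""
    start = eval_date - timedelta(days=days - 1)
    return sum(n for d, n in daily_counts.items() if start <= d <= eval_date)


def download_rate(downloads, presentations):
    """Ratio of downloads to presentations; 0.0 when nothing was presented (assumed)."""
    return downloads / presentations if presentations else 0.0


def operation_stats(daily_presentations, daily_downloads, eval_date, windows=(1, 7, 30)):
    """Collect per-window presentation counts, download counts, and download rates."""
    stats = {}
    for days in windows:
        shown = window_count(daily_presentations, eval_date, days)
        downloaded = window_count(daily_downloads, eval_date, days)
        stats[f"presentations_{days}d"] = shown
        stats[f"downloads_{days}d"] = downloaded
        stats[f"download_rate_{days}d"] = download_rate(downloaded, shown)
    return stats


day = date(2022, 6, 29)
presentations = {day: 100, day - timedelta(days=3): 300, day - timedelta(days=20): 600}
downloads = {day: 10, day - timedelta(days=3): 30, day - timedelta(days=20): 10}
stats = operation_stats(presentations, downloads, day)
```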

For user dwell, the data of this dimension may be obtained by counting the time users stay on the document in the corresponding time period. For user scoring operations, the data of this dimension may be obtained by computing the average of the document's historical scores.

It is understood that the data for performing the document quality evaluation may be selected by the skilled person according to the actual situation, and is not limited herein.

In some embodiments, the multi-dimensional document data can be further mined to build combined features based on the intrinsic relationship between the data, thereby further enriching the information for document quality evaluation.

According to some embodiments, as shown in fig. 3, obtaining the document feature vector of the target document based on the plurality of document data may include: step S301, preprocessing the plurality of document data, wherein the preprocessing comprises outlier processing and missing value processing; step S302, selecting first data and second data from the preprocessed plurality of document data; step S303, determining the feature value of the corresponding combined feature based on the first data and the second data; and step S304, acquiring the document feature vector based on the preprocessed plurality of document data and the feature value of at least one combined feature.

Therefore, because the feature value of the combined feature is determined from the intrinsic association between certain data, information of more dimensions is introduced into the document quality evaluation; at the same time, the combined feature reduces the influence of unreasonable data on the prediction process, improving evaluation accuracy.

After the above document data is acquired, it first needs to be preprocessed. For example, the data of some dimensions of a document may be clearly anomalous: if the number of presentations of a document on the current day is greater than a preset threshold, the data is determined to be anomalous, and the anomalous value may be corrected using the average or the maximum of the number of presentations of all documents presented that day. As another example, if the data of some dimension of a document is missing, the missing value may be handled by setting the data of that dimension to zero.
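A minimal sketch of this preprocessing step is shown below. The function name `preprocess`, its parameters, and the decision to compute the replacement value over the non-anomalous documents only are assumptions of this example; the disclosure states only that the average or maximum over the day's documents may be used and that missing values may be set to zero.

```python
def preprocess(rows, field, threshold, strategy="mean"):
    """Zero-fill missing values of `field`, then correct values above `threshold`.

    rows: list of dicts, one per document presented on the day.
    strategy: "mean" or "max" of the non-anomalous values (an assumption of this sketch).
    """
    # Missing value processing: a missing dimension is set to zero.
    values = [row.get(field) or 0 for row in rows]
    normal = [v for v in values if v <= threshold]
    if not normal:
        replacement = threshold  # assumed fallback when every value is anomalous
    elif strategy == "max":
        replacement = max(normal)
    else:
        replacement = sum(normal) / len(normal)
    # Outlier processing: anomalous values are replaced by the chosen statistic.
    return [
        {**row, field: (v if v <= threshold else replacement)}
        for row, v in zip(rows, values)
    ]


rows = [
    {"id": "a", "presentations": 100},
    {"id": "b", "presentations": 200},
    {"id": "c", "presentations": 1_000_000},  # clearly anomalous
    {"id": "d", "presentations": None},       # missing
]
cleaned = preprocess(rows, "presentations", threshold=10_000)
```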

In some exemplary embodiments, the selected first data and second data may be the number of presentations and the download rate, respectively, in the corresponding time period. Determining the feature value of the corresponding combined feature based on the first data and the second data may include: when the number of presentations is greater than a preset number but the download rate is less than a preset download rate threshold, setting the feature value of the combined feature to a first value (e.g., "0"); and when this condition is not satisfied, setting the feature value of the combined feature to a second value (e.g., "1").

In this way, when the relationship between the number of presentations and the download rate clearly deviates from the normal pattern, the combined feature helps the prediction model recognize the unreasonable presentation data, so that a high presentation count does not unduly influence the prediction process and degrade the accuracy of the quality evaluation.
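The presentation/download-rate combined feature can be sketched as a single comparison. The function name and the default thresholds below are hypothetical; the disclosure leaves the preset number and the preset download rate threshold unspecified.

```python
def presentation_download_feature(presentations, download_rate,
                                  min_presentations=10_000, min_download_rate=0.01):
    """Return 0 when many presentations coincide with a very low download rate, else 1.

    The two thresholds are illustrative placeholders, not values from the disclosure.
    """
    suspicious = presentations > min_presentations and download_rate < min_download_rate
    return 0 if suspicious else 1
```

For example, a document presented 50,000 times with a download rate of 0.001 would receive the first value (0), while the same presentation count with a download rate of 0.2 would receive the second value (1).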

In some exemplary embodiments, the selected first data and second data may be, respectively, dwell time data for the corresponding time period and account information of the users who stayed on the document. Determining the feature value of the corresponding combined feature based on the first data and the second data may include: selecting, from the user account information, users with higher activity or high-quality users (e.g., authenticated users); acquiring the total dwell time of the selected users on the document in the corresponding time period; and comparing the total dwell time against different time thresholds to convert it into a level label value that serves as the feature value of the combined feature. For example, if the total dwell time is longer than 2 hours and shorter than 4 hours, the feature value is set to 1; if the total dwell time is longer than 4 hours and shorter than 6 hours, the feature value is set to 2; and so on.

This reduces the influence of unreasonable dwell time data on the prediction process. For example, if the dwell time on a document is long on a given day only because users left the page open without actually reading it, introducing the combined feature reduces the influence of that dwell time data on the prediction process, thereby improving the accuracy of the predicted document quality score.
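The dwell-time combined feature could be sketched as follows, assuming 2-hour steps as in the example above. The function name, the representation of dwell records as `(user_id, seconds)` pairs, the level 0 for totals under 2 hours, and the handling of totals that fall exactly on a threshold are assumptions of this sketch.

```python
def dwell_level_feature(dwell_records, trusted_users, step_hours=2.0):
    """Convert the total dwell time of trusted users into a level label value.

    dwell_records: iterable of (user_id, seconds) pairs for the time period.
    trusted_users: set of user ids treated as active or high-quality (e.g., authenticated).
    Totals under one step map to level 0; exact boundaries round up a level (assumed).
    """
    total_seconds = sum(sec for user, sec in dwell_records if user in trusted_users)
    total_hours = total_seconds / 3600
    return int(total_hours // step_hours)


records = [("u1", 3 * 3600), ("u2", 10 * 3600), ("u3", 2 * 3600)]
level = dwell_level_feature(records, trusted_users={"u1", "u3"})
```

In this example, user `u2` contributes 10 hours of dwell time but is not in the trusted set, so only the 5 hours from `u1` and `u3` are counted.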

It can be understood that those skilled in the art may define the categories of combined features and the ways of obtaining them by mining, through experiments, the intrinsic associations between different document data, which is not limited herein.

In some embodiments, a document feature vector corresponding to the target document may be constructed from the preprocessed document data of the target document and at least one combined feature obtained as described above, wherein each feature dimension of the document feature vector corresponds to one document data or one combined feature, and the order of the feature dimensions is set in advance.
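A feature vector with a preset dimension order might be assembled as in the sketch below. The names in `FEATURE_ORDER` and the default value for absent features are hypothetical and chosen only to illustrate a fixed arrangement.

```python
# Hypothetical, fixed arrangement of feature dimensions (set in advance).
FEATURE_ORDER = [
    "format_label",
    "content_label",
    "file_size",
    "download_rate_7d",
    "presentation_download_combo",
]


def build_feature_vector(features, order=FEATURE_ORDER, default=0.0):
    """Arrange named feature values into a list following the preset order."""
    return [float(features.get(name, default)) for name in order]


vector = build_feature_vector(
    {"file_size": 2048, "format_label": 1, "presentation_download_combo": 1}
)
```

Fixing the order in advance means the same position always carries the same feature, regardless of the order in which the values were computed.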

In some embodiments, the user operation data may also include comment content of the user for the target document, which may be converted into data that may be used for document quality assessment by preprocessing the comment.

According to some embodiments, preprocessing the plurality of document data may further include: performing sentiment classification on the comment content using a text classification model to obtain the emotion category of the comment content, so as to determine the feature value of at least one feature dimension in the document feature vector.

In some embodiments, the comment content may be classified into three emotion categories, namely negative, neutral, and positive, and the three categories may be assigned the values 0, 1, and 2, respectively, as the feature value of the corresponding feature dimension in the feature vector. Feature information from the comment content dimension is thereby introduced into the document quality evaluation, which can improve the accuracy of the evaluation.
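The mapping from emotion category to feature value can be sketched as below. The disclosure does not specify the text classification model, so `keyword_classifier` is a toy stand-in used only to make the example runnable; any callable returning one of the three category names could be passed instead.

```python
SENTIMENT_VALUES = {"negative": 0, "neutral": 1, "positive": 2}


def sentiment_feature(comment, classify):
    """Return the feature value for a comment.

    classify: any callable returning "negative", "neutral", or "positive".
    """
    return SENTIMENT_VALUES[classify(comment)]


def keyword_classifier(comment):
    """Toy placeholder for a trained text classification model (illustrative only)."""
    text = comment.lower()
    if any(word in text for word in ("useless", "wrong", "bad")):
        return "negative"
    if any(word in text for word in ("helpful", "clear", "great")):
        return "positive"
    return "neutral"


value = sentiment_feature("Very helpful summary", keyword_classifier)
```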

In some embodiments, after obtaining the document feature vector of the target document by the above method, the document feature vector may be input into a trained document quality score prediction model, and the model obtains the quality score of the target document by performing prediction analysis on the document feature vector.

In some embodiments, the prediction model may be a LightGBM tree model, an XGBoost regression prediction model, or a GBDT model. It is understood that those skilled in the art may also select the model based on actual needs, and the present disclosure is not limited thereto.

According to some embodiments, the document quality evaluation method may further include: detecting target content in a target document, wherein the target content at least comprises time information contained in the document content of the target document, document content related to the credibility of the target document and low-quality content, and the low-quality content at least comprises advertisements, websites, sensitive words and forbidden words; and adjusting the quality score of the target document based on the target content.

Therefore, by additionally performing simple detection on the document content, the prediction deviation caused by inaccurate labeling of sample data during model training can be avoided, and the evaluation accuracy is further improved.

In some embodiments, the target content in the document may be detected by setting different rules. For example, it may be detected whether the document content contains time information; when it does, the document usually contains time-sensitive content. By calculating the time difference between the time information and the time of the current evaluation, the timeliness of the document can be judged, and the quality score can be appropriately adjusted based on the judgment result.
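A minimal sketch of such a timeliness rule is given below. It assumes that time information appears as `YYYY-MM-DD` strings; the age limit `max_age_days` and the deduction `penalty` are hypothetical parameters, since the disclosure does not fix the date format or the size of the adjustment.

```python
import re
from datetime import date
from typing import Optional

# Assumption: time information appears in ISO-style YYYY-MM-DD form.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def latest_date_in(text: str) -> Optional[date]:
    """Return the most recent valid date found in the text, or None."""
    found = []
    for year, month, day in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # looks like a date but is not a valid calendar day
    return max(found) if found else None


def adjust_for_timeliness(score: float, text: str, today: date,
                          max_age_days: int = 365, penalty: float = 0.5) -> float:
    """Lower the score when the newest date in the text is older than max_age_days."""
    newest = latest_date_in(text)
    if newest is None:
        return score  # no time information, so no adjustment
    age_days = (today - newest).days
    return score - penalty if age_days > max_age_days else score
```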

For example, it may be detected whether the document content includes content related to the credibility of the document (e.g., financial and newspaper content, etc.). If such content is included, the document is considered to have higher authority and credibility, so that the quality score of the document may be appropriately increased.

Illustratively, whether the document contains related vocabulary can be detected through a pre-constructed sensitive word dictionary and a pre-constructed forbidden word dictionary. Detection can also be performed through a preset detection strategy, such as detecting runs of 11 consecutive digits or detecting content that matches a website format, so as to determine whether the document content contains low-quality content. If the document content contains low-quality content, the quality score of the document can be appropriately reduced.
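The dictionary lookup and the preset detection strategies above might be sketched as follows. The regular expressions are assumptions: the text does not say whether an 11-digit run embedded in a longer number should count (this sketch requires exactly 11 digits), nor what counts as a website format (this sketch matches `http://`, `https://`, or `www.` prefixes).

```python
import re
from typing import Iterable, List

# Assumed rule: a run of exactly 11 digits, not part of a longer number.
ELEVEN_DIGITS = re.compile(r"(?<!\d)\d{11}(?!\d)")
# Assumed rule: text starting with a URL scheme or "www.".
URL_LIKE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def detect_low_quality(text: str,
                       sensitive_words: Iterable[str],
                       forbidden_words: Iterable[str]) -> List[str]:
    """Return the names of the low-quality rules that the text triggers."""
    reasons: List[str] = []
    lowered = text.lower()
    if any(word.lower() in lowered for word in sensitive_words):
        reasons.append("sensitive_word")
    if any(word.lower() in lowered for word in forbidden_words):
        reasons.append("forbidden_word")
    if ELEVEN_DIGITS.search(text):
        reasons.append("eleven_digit_number")
    if URL_LIKE.search(text):
        reasons.append("website")
    return reasons
```

An empty list means that no rule was triggered; a non-empty list can be used both to lower the score and to record why it was lowered.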

In some embodiments, the detection results may also be converted into numerical values and weighted to obtain a comprehensive result of the document content detection, and the document quality score may be adjusted based on the numerical value of the comprehensive result.
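As a rough sketch of this weighting, each detection result can be converted to 0 or 1 and multiplied by a signed weight. The detection names and weights in `WEIGHTS` are hypothetical; the disclosure does not give concrete values.

```python
from typing import Dict

# Hypothetical signed weights: positive raises the score, negative lowers it.
WEIGHTS: Dict[str, float] = {
    "stale_time_info": -0.5,
    "credibility_content": 0.5,
    "low_quality_content": -1.0,
}


def combined_detection_value(detections: Dict[str, bool]) -> float:
    """Convert each detection result to 0 or 1 and sum the weighted values."""
    return sum(WEIGHTS[name] * (1.0 if hit else 0.0)
               for name, hit in detections.items())


def adjust_by_detections(score: float, detections: Dict[str, bool]) -> float:
    """Shift the quality score by the comprehensive detection value."""
    return score + combined_detection_value(detections)
```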

In some embodiments, the value of the quality score may also be first adjusted to be within a preset range of values (e.g., 1-5 points) before the quality score is adjusted based on the document content.
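The text does not state how the score is brought into the preset range. Two simple possibilities, given here only as assumptions, are clamping and linear rescaling from an assumed source range; the function names are hypothetical.

```python
def clamp_score(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Clip the score so that it lies within [low, high]."""
    return max(low, min(high, score))


def rescale_score(score: float, src_low: float, src_high: float,
                  dst_low: float = 1.0, dst_high: float = 5.0) -> float:
    """Linearly map a score from [src_low, src_high] to [dst_low, dst_high]."""
    ratio = (score - src_low) / (src_high - src_low)
    return clamp_score(dst_low + ratio * (dst_high - dst_low), dst_low, dst_high)
```

Normalizing first means that later thresholds and deductions can be expressed on a single, fixed scale.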

According to some embodiments, when the target content is low-quality content, adjusting the quality score of the target document based on the target content may include: in response to determining that the target content is included in the target document and that the quality score of the target document is greater than a preset threshold, decreasing the quality score by a preset numerical value.

When samples are labeled, high-quality content in a document is easy to notice, whereas low-quality content is more easily overlooked; sample labeling errors are therefore more often caused by overlooked low-quality content.

In some embodiments, whether the document content includes low-quality content may be detected through the above method, and when the document content includes low-quality content and the quality score is significantly higher (e.g., higher than a predetermined threshold), the quality score is decreased by a predetermined value. Therefore, the document quality evaluation accuracy can be further improved, and meanwhile, the quality score verification and adjustment efficiency is improved.
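This rule reduces to a single conditional, sketched below. The default `threshold` and `decrement` values are hypothetical; the text only requires that both be preset.

```python
def penalize_low_quality(score: float, has_low_quality: bool,
                         threshold: float = 3.0, decrement: float = 1.0) -> float:
    """Lower the score by a preset value only if low-quality content was found
    and the score is above the preset threshold."""
    if has_low_quality and score > threshold:
        return score - decrement
    return score
```

Scores at or below the threshold are left unchanged, so a document that the model already rated low is not penalized a second time.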

According to some embodiments, the document quality evaluation method may further include: the quality score of the target document is adjusted based on the identity category information of the uploader of the target document.

In some embodiments, when adjusting the quality scores of a plurality of target documents, the quality scores may first be preliminarily verified based on the identity category information of the uploaders of the documents, and when a quality score is significantly abnormal, it may be appropriately adjusted based on that identity category information. For example, when the quality score of a certain document is smaller than a certain score threshold and the uploader of the document is an authenticated user, the quality score may be appropriately raised. Therefore, by referring to the identity category information of the uploader, documents that better meet the user's requirements can be provided, and the user experience is improved.
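The example in the preceding paragraph might be sketched as follows. The category name `"authenticated"`, the score threshold, and the size of the increase are all hypothetical values chosen for illustration.

```python
from typing import Tuple


def adjust_by_uploader(score: float, uploader_category: str,
                       low_threshold: float = 2.0, boost: float = 0.5,
                       trusted: Tuple[str, ...] = ("authenticated",)) -> float:
    """Raise an abnormally low score when the uploader belongs to a trusted category."""
    if score < low_threshold and uploader_category in trusted:
        return score + boost
    return score
```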

In some embodiments, when the quality scores of the target documents are adjusted, the target documents can be classified based on the identity category information of their uploaders, and the scores of the target documents in different categories are adjusted separately, so that the efficiency of quality score verification and adjustment is improved.

According to some embodiments, as shown in fig. 4, there is also provided a model training method, including: step S401, obtaining a sample data set, wherein each sample data in the sample data set comprises a plurality of document data of a sample document and a quality label corresponding to the sample data, wherein the plurality of document data comprises at least one of attribute data, user operation data and uploader data, and the attribute data at least comprises format type data, content type data and file size data of a target document; for each sample data, the following operations are performed: step S402, acquiring a document feature vector based on a plurality of document data corresponding to the sample data; step S403, inputting the document feature vector into a model to obtain a quality prediction score of the sample data; and step S404, adjusting parameters of the model based on the quality prediction score of the sample data and the quality label of the sample data.

In some embodiments, the quality label of the sample data may include multiple labels at different levels, for example, including 4 levels of cheating, low quality, medium quality, high quality (with label values of 0, 1, 2, 3, respectively).

In some embodiments, before the sample data is labeled, the sample documents may be first divided into a plurality of categories by the uploader identity category information, and the documents in each category are labeled respectively, so that the efficiency and accuracy of labeling are improved.

In some embodiments, model training may be accomplished by constructing a loss function (e.g., applying a least mean square error loss function) based on the quality prediction scores and the quality labels, and adjusting model parameters based on the loss function.
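The loop of predicting a score, comparing it with the label, and adjusting parameters can be sketched as below. For brevity this sketch swaps in a plain linear model trained by gradient descent on a mean square error loss; it is a stand-in and not the gradient-boosted tree models mentioned in the disclosure. The learning rate and epoch count are hypothetical.

```python
from typing import List, Sequence, Tuple


def mse(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean square error between predicted scores and quality labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Quality prediction score of a linear stand-in model."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def train(samples: List[List[float]], labels: List[float],
          lr: float = 0.05, epochs: int = 500) -> Tuple[List[float], float]:
    """Adjust model parameters by gradient descent on the MSE loss."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - y for x, y in zip(samples, labels)]
        for j in range(len(weights)):
            grad = 2.0 / n * sum(e * x[j] for e, x in zip(errors, samples))
            weights[j] -= lr * grad
        bias -= lr * (2.0 / n) * sum(errors)
    return weights, bias
```

With a tree-based model the parameter update is performed by the boosting library rather than by an explicit gradient step, but the loss plays the same role.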

According to some embodiments, obtaining the document feature vector based on the plurality of document data corresponding to the sample data may include: preprocessing a plurality of document data, wherein the preprocessing comprises abnormal value processing and missing value processing; selecting first data and second data from the preprocessed multiple document data, wherein the first data and the second data are different document data; determining a feature value of the corresponding combined feature based on the first data and the second data; and acquiring a document feature vector based on the preprocessed multiple document data and the feature value of at least one combined feature.
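The preprocessing and combined-feature steps might look like the following sketch. Filling a missing value with a default, clipping an abnormal value to an expected range, and using a ratio as the combined feature are all assumptions made for illustration; the disclosure leaves the concrete processing open.

```python
from typing import Optional


def preprocess(value: Optional[float], default: float,
               low: float, high: float) -> float:
    """Missing value processing and abnormal value processing for one numeric field."""
    if value is None:
        return default  # missing value: fill with an assumed default
    return max(low, min(high, value))  # abnormal value: clip to an assumed range


def ratio_feature(first: float, second: float) -> float:
    """Combined feature computed from two different document data (here, a ratio)."""
    return first / second if second else 0.0
```

For example, first data such as a download count and second data such as a view count could yield a download-per-view feature, which neither item expresses on its own.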

According to some embodiments, the user operation data may include comment content of the user for the target document, and the preprocessing of the plurality of document data may include: performing emotion classification on the comment content through a text classification model to obtain an emotion category of the comment content, so as to determine the feature value of at least one feature dimension in the document feature vector.

According to some embodiments, as shown in fig. 5, there is provided a document quality evaluation apparatus 500 including: a first acquisition unit 510 configured to acquire a plurality of document data of a target document, the plurality of document data including at least one of attribute data, user operation data, and uploader data, wherein the attribute data includes at least format category data, content category data, and file size data of the target document; a second obtaining unit 520 configured to obtain a document feature vector of the target document based on the plurality of document data, wherein at least one feature dimension in the document feature vector is a combined feature, and a feature value of the combined feature is determined based on first data and second data in the plurality of document data, wherein the first data and the second data are different document data; and a prediction unit 530 configured to perform prediction analysis on the document feature vector to obtain a quality score of the target document.

The operations of the units 510-530 in the document quality evaluation apparatus 500 are similar to the operations of the steps S201-S203 in the above-mentioned document quality evaluation method, and are not described herein again.

According to some embodiments, the second obtaining unit may include: a preprocessing subunit configured to preprocess the plurality of document data, the preprocessing including abnormal value processing and missing value processing; a selection subunit configured to select, among the plurality of document data subjected to the preprocessing, first data and second data; a determining subunit configured to determine, based on the first data and the second data, a feature value of the respective combined feature; and a first acquiring subunit configured to acquire a document feature vector based on the preprocessed plurality of document data and a feature value of the at least one combined feature.

According to some embodiments, the user operation data may include comment content of the user for the target document, and the preprocessing subunit may be further configured to: perform emotion classification on the comment content through a text classification model to obtain an emotion category of the comment content, so as to determine the feature value of at least one feature dimension in the document feature vector.

According to some embodiments, the document quality evaluation apparatus may further include: a detection unit configured to detect target content in the target document, wherein the target content at least comprises time information contained in the document content of the target document, document content related to the credibility of the target document and low-quality content, and the low-quality content at least comprises advertisements, websites, sensitive vocabularies and forbidden vocabularies; and a first adjusting unit configured to adjust the quality score of the target document based on the target content.

According to some embodiments, when the target content is low-quality content, the first adjusting unit may be further configured to: in response to determining that the target content is included in the target document and that the quality score of the target document is greater than a preset threshold, decreasing the quality score by a preset numerical value.

According to some embodiments, the document quality evaluation apparatus may further include: a second adjusting unit configured to adjust the quality score of the target document based on the identity category information of the uploader of the target document.

According to some embodiments, as shown in fig. 6, there is provided a model training apparatus 600 comprising: a third obtaining unit 610, configured to obtain a sample data set, where each sample data in the sample data set includes a plurality of pieces of document data of a sample document and a quality tag corresponding to the sample data, where the plurality of pieces of document data includes at least one of attribute data, user operation data, and uploader data, where the attribute data includes at least format type data, content type data, and file size data of a target document; an execution unit 620 configured to, for each sample data, perform the following sub-unit operations: a second obtaining subunit 621, configured to obtain a document feature vector based on a plurality of document data corresponding to the sample data; an input subunit 622 configured to input the document feature vector into the model to obtain a quality prediction score of the sample data; and an adjusting subunit 623 configured to adjust parameters of the model based on the quality prediction score of the sample data and the quality label of the sample data.

The operations of the units 610 to 620 and the subunits 621 to 623 in the model training apparatus 600 are similar to the operations of the steps S401 to S404 of the above-mentioned model training method, and are not described herein again.

According to some embodiments, the second obtaining subunit may comprise: a preprocessing module configured to preprocess a plurality of document data, the preprocessing including abnormal value processing and missing value processing; a selection module configured to select first data and second data from the plurality of preprocessed document data, wherein the first data and the second data are different document data; a determination module configured to determine feature values of the respective combined features based on the first data and the second data; and an acquisition module configured to acquire the document feature vector based on the preprocessed plurality of document data and the feature value of the at least one combined feature.

According to some embodiments, the user operation data may include comment content of the user for the target document, and the preprocessing module may be further configured to: perform emotion classification on the comment content through a text classification model to obtain an emotion category of the comment content, so as to determine the feature value of at least one feature dimension in the document feature vector.

According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.

Referring to fig. 7, a block diagram of the structure of an electronic device 700 will now be described. The electronic device 700 may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700. The input unit 706 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

The computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the above-described document quality evaluation method or the above-described model training method. For example, in some embodiments, the document quality evaluation method or the model training method described above may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the document quality evaluation method or the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the above-described document quality evaluation method or the above-described model training method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples and that the scope of the present disclosure is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims

1. A document quality assessment method, comprising:

acquiring a plurality of document data of a target document, wherein the plurality of document data comprise at least one of attribute data, user operation data and uploader data, and the attribute data at least comprise format category data, content category data and file size data of the target document;

acquiring a document feature vector of the target document based on the plurality of document data, wherein at least one feature dimension in the document feature vector is a combined feature, and a feature value of the combined feature is determined based on first data and second data in the plurality of document data, wherein the first data and the second data are different document data; and

performing predictive analysis on the document feature vector to obtain a quality score of the target document.

2. The method according to claim 1, wherein the obtaining a document feature vector of the target document based on the plurality of document data comprises:

preprocessing the plurality of document data, wherein the preprocessing comprises abnormal value processing and missing value processing;

selecting the first data and the second data from the plurality of document data subjected to the preprocessing;

determining feature values of respective combined features based on the first data and the second data; and

acquiring the document feature vector based on the preprocessed document data and the feature value of at least one combined feature.

3. The method of claim 2, wherein the user operation data includes comment content of a user for the target document, and the preprocessing the plurality of document data includes:

performing emotion classification on the comment content through a text classification model to obtain an emotion category of the comment content, so as to determine a feature value of at least one feature dimension in the document feature vector.

4. The method of any of claims 1 to 3, further comprising:

detecting target content in the target document, wherein the target content at least comprises time information contained in the document content of the target document, document content related to the credibility of the target document and low-quality content, and the low-quality content at least comprises advertisements, websites, sensitive vocabularies and forbidden vocabularies; and

adjusting the quality score of the target document based on the target content.

5. The method of claim 4, wherein the target content is the low-quality content, and the adjusting the quality score of the target document based on the target content comprises:

in response to determining that the target content is included in the target document and that the quality score of the target document is greater than a preset threshold, decreasing the quality score by a preset numerical value.

6. The method of any of claims 1 to 3, further comprising:

adjusting the quality score of the target document based on identity category information of an uploader of the target document.

7. A model training method, comprising:

acquiring a sample data set, wherein each sample data in the sample data set comprises a plurality of document data of a sample document and a quality tag corresponding to the sample data, wherein the plurality of document data comprise at least one of attribute data, user operation data and uploader data, and the attribute data at least comprise format type data, content type data and file size data of the target document;

for each sample data, performing the following operations:

acquiring a document feature vector based on a plurality of document data corresponding to the sample data;

inputting the document feature vector into the model to obtain a quality prediction score of the sample data; and

adjusting the parameters of the model based on the quality prediction score of the sample data and the quality label of the sample data.

8. The method according to claim 7, wherein the obtaining a document feature vector based on a plurality of document data corresponding to the sample data comprises:

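preprocessing the plurality of document data, wherein the preprocessing comprises abnormal value processing and missing value processing;

selecting first data and second data from the preprocessed document data, wherein the first data and the second data are different document data;

determining a feature value of the corresponding combined feature based on the first data and the second data; and

acquiring the document feature vector based on the preprocessed document data and the feature value of at least one combined feature.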

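9. The method of claim 8, wherein the user operation data includes comment content of the user for the target document, and the preprocessing the plurality of document data includes:

performing emotion classification on the comment content through a text classification model to obtain an emotion category of the comment content, so as to determine a feature value of at least one feature dimension in the document feature vector.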

10. A document quality evaluation apparatus comprising:

a first acquisition unit configured to acquire a plurality of document data of a target document, the plurality of document data including at least one of attribute data, user operation data, and uploader data, wherein the attribute data includes at least format category data, content category data, and file size data of the target document;

a second acquisition unit configured to acquire a document feature vector of the target document based on the plurality of document data, wherein at least one feature dimension in the document feature vector is a combined feature, and a feature value of the combined feature is determined based on first data and second data in the plurality of document data, wherein the first data and the second data are different document data; and

a prediction unit configured to perform prediction analysis on the document feature vector to obtain a quality score of the target document.

11. The apparatus of claim 10, wherein the second acquisition unit comprises:

a preprocessing subunit configured to perform preprocessing on the plurality of document data, the preprocessing including abnormal value processing and missing value processing;

a selection subunit configured to select the first data and the second data in the plurality of document data subjected to the preprocessing;

a determining subunit configured to determine feature values of the respective combined features based on the first data and the second data; and

a first obtaining subunit configured to obtain the document feature vector based on the preprocessed plurality of document data and a feature value of at least one of the combined features.

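12. The apparatus of claim 11, wherein the user operation data comprises comment content of a user for the target document, and the preprocessing subunit is further configured to:

perform emotion classification on the comment content through a text classification model to obtain an emotion category of the comment content, so as to determine a feature value of at least one feature dimension in the document feature vector.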

13. The apparatus of any of claims 10 to 12, further comprising:

a detecting unit configured to detect target content in the target document, wherein the target content at least comprises time information contained in document content of the target document, document content related to the credibility of the target document and low-quality content, and the low-quality content at least comprises advertisements, websites, sensitive vocabularies and forbidden vocabularies; and

a first adjusting unit configured to adjust the quality score of the target document based on the target content.
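A minimal sketch of the detecting unit and first adjusting unit of claim 13, limited to the low-quality content case. The URL pattern, the word list, and the linear penalty are hypothetical choices; the claim does not specify how the score is adjusted.

```python
import re

# Assumed pattern for websites and a placeholder forbidden-word list.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
FORBIDDEN_WORDS = {"casino", "lottery"}


def detect_low_quality(text):
    # Count website mentions plus forbidden-word occurrences.
    hits = len(URL_PATTERN.findall(text))
    words = re.findall(r"\w+", text.lower())
    hits += sum(1 for w in words if w in FORBIDDEN_WORDS)
    return hits


def adjust_score(score, hits, penalty=0.1):
    # Hypothetical rule: subtract a fixed penalty per hit, floored at zero.
    return max(0.0, score - penalty * hits)


adjusted = adjust_score(0.8, detect_low_quality("Visit www.example.com for casino bonus"))
```

Time information and credibility-related content, which the claim also names as target content, would need their own detectors and are not covered here.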

14. The apparatus of claim 13, wherein the target content is the low-quality content, and the first adjusting unit is further configured to:

15. The apparatus of any of claims 10 to 12, further comprising:

a second adjusting unit configured to adjust the quality score of the target document based on identity category information of an uploader of the target document.

16. A model training apparatus comprising:

a third obtaining unit configured to obtain a sample data set, wherein each sample data in the sample data set includes a plurality of document data of a sample document and a quality label corresponding to the sample data, wherein the plurality of document data includes at least one of attribute data, user operation data, and uploader data, and wherein the attribute data includes at least format category data, content category data, and file size data of the target document;

an execution unit configured to perform, for each sample data, the operations of the following subunits:

a second obtaining subunit, configured to obtain a document feature vector based on the plurality of document data corresponding to the sample data;

an input subunit, configured to input the document feature vector into the model to obtain a quality prediction score of the sample data; and

an adjusting subunit configured to adjust a parameter of the model based on the quality prediction score of the sample data and the quality label of the sample data.
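The per-sample operations of claim 16 can be sketched as a training loop. The linear model, the squared-error gradient step, the learning rate, and the sample values are assumptions for illustration; the claim states only that a parameter of the model is adjusted based on the quality prediction score and the quality label.

```python
def predict(weights, bias, vector):
    # Input subunit: feed the document feature vector to a (here linear) model.
    return sum(w * x for w, x in zip(weights, vector)) + bias


def mse(weights, bias, samples):
    # Mean squared error between prediction scores and quality labels.
    return sum((predict(weights, bias, v) - y) ** 2 for v, y in samples) / len(samples)


def train(samples, lr=0.1, epochs=200):
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for vector, label in samples:
            error = predict(weights, bias, vector) - label
            # Adjusting subunit: gradient step on the squared error.
            weights = [w - lr * error * x for w, x in zip(weights, vector)]
            bias -= lr * error
    return weights, bias


# Hypothetical (feature vector, quality label) pairs.
samples = [([1.0, 0.2], 0.9), ([0.0, 0.05], 0.3), ([1.0, 0.1], 0.7)]
weights, bias = train(samples)
```

Comparing `mse` before and after training gives a quick check that the parameter updates reduce the gap between predictions and labels.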

17. The apparatus of claim 16, wherein the second obtaining subunit comprises:

a preprocessing module configured to preprocess the plurality of document data, the preprocessing including abnormal value processing and missing value processing;

a selection module configured to select first data and second data from the plurality of preprocessed document data, wherein the first data and the second data are different document data;

a determination module configured to determine feature values of respective combined features based on the first data and the second data; and

an obtaining module configured to obtain the document feature vector based on the preprocessed plurality of document data and a feature value of at least one of the combined features.

18. The apparatus of claim 17, wherein the user operation data includes comment content of a user for the target document, and the preprocessing module is further configured to:

perform sentiment classification on the comment content through a text classification model, and acquire a sentiment category of the comment content to determine a feature value of at least one feature dimension in the document feature vector.
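The comment classification step can be sketched as follows. A toy word-list classifier stands in for the text classification model, which the claim does not specify; the word lists, the three categories, and the positive-comment ratio used as the feature value are all hypothetical.

```python
# Placeholder lexicons; a real system would use a trained text classifier.
POSITIVE = {"useful", "clear", "great"}
NEGATIVE = {"wrong", "outdated", "useless"}


def classify_sentiment(comment):
    # Return an assumed category: "positive", "negative", or "neutral".
    words = comment.lower().split()
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def positive_comment_ratio(comments):
    # Feature value for one dimension of the document feature vector.
    if not comments:
        return 0.0
    labels = [classify_sentiment(c) for c in comments]
    return labels.count("positive") / len(labels)


ratio = positive_comment_ratio(["great notes", "useless", "ok"])
```

Mapping categories to a ratio is one way to turn free-text comments into a numeric feature; counts per category would work equally well.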

19. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.

21. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-9.