CN112579729A - Training method and device for document quality evaluation model, electronic equipment and medium - Google Patents

Info

Publication number
CN112579729A
Authority
CN
China
Prior art keywords
document
quality evaluation
evaluation model
training
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011572453.3A
Other languages
Chinese (zh)
Inventor
韩都晓
邵世臣
李永恒
李梦泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu China Co Ltd
Original Assignee
Baidu China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu China Co Ltd filed Critical Baidu China Co Ltd
Priority to CN202011572453.3A priority Critical patent/CN112579729A/en
Publication of CN112579729A publication Critical patent/CN112579729A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a training method for a document quality evaluation model, together with a method, an apparatus, a device, a medium and a product for evaluating document quality, and relates to the fields of deep learning, natural language processing, knowledge graphs and the like. The training method for the document quality evaluation model comprises the following steps: obtaining an initial document quality evaluation model, wherein the initial document quality evaluation model is obtained by training with a plurality of first documents, and each first document has first feature data; obtaining a plurality of second documents, wherein each second document has second feature data; and training the initial document quality evaluation model with the plurality of second documents to update the initial document quality evaluation model.

Description

Training method and device for document quality evaluation model, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to the fields of deep learning, natural language processing, knowledge graphs, and the like, and more particularly to a method for training a document quality evaluation model and to a method, an apparatus, an electronic device, a medium, and a program product for evaluating document quality.
Background
With the rapid expansion of online knowledge content, documents are emerging in endless numbers and their quality is uneven. When a user needs a certain type of document, many documents may meet the user's requirements, and the user generally has to pick the high-quality ones from among them. At present, however, high-quality documents are generally selected by manually judging the quality of each document one by one, which makes document selection inefficient and labor-intensive.
Disclosure of Invention
The present disclosure provides a training method of a document quality evaluation model, a method, an apparatus, an electronic device, a storage medium, and a computer program product for evaluating document quality.
According to an aspect of the present disclosure, there is provided a method for training a document quality evaluation model, including: obtaining an initial document quality evaluation model, wherein the initial document quality evaluation model is obtained by utilizing a plurality of first documents through training, and each first document has first characteristic data; obtaining a plurality of second documents, wherein each second document has second characteristic data; and training the initial document quality evaluation model by using the plurality of second documents to update the initial document quality evaluation model.
According to another aspect of the present disclosure, there is provided a method of evaluating document quality, including: acquiring a document to be evaluated; and processing the document to be evaluated by using a document quality evaluation model to obtain the document quality evaluation aiming at the document to be evaluated. The method for evaluating the document quality further comprises the step of training an initial document quality evaluation model by using the training method of the document quality evaluation model to obtain the document quality evaluation model.
According to another aspect of the present disclosure, there is provided a training apparatus for a document quality evaluation model, including a first obtaining module, a second obtaining module and a first training module. The first obtaining module is used for obtaining an initial document quality evaluation model, wherein the initial document quality evaluation model is obtained by training with a plurality of first documents, and each first document has first feature data. The second obtaining module is used for obtaining a plurality of second documents, wherein each second document has second feature data and a label. The first training module is used for training the initial document quality evaluation model by using the plurality of second documents so as to update the initial document quality evaluation model.
According to another aspect of the present disclosure, there is provided an apparatus for evaluating document quality, including a third obtaining module, a processing module and a second training module. The third obtaining module is used for obtaining a document to be evaluated. The processing module is used for processing the document to be evaluated by using a document quality evaluation model to obtain a document quality evaluation for the document to be evaluated. The second training module is used for training an initial document quality evaluation model by using the training method of the document quality evaluation model, so as to obtain the document quality evaluation model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to: obtaining an initial document quality evaluation model, wherein the initial document quality evaluation model is obtained by utilizing a plurality of first documents through training, and each first document has first characteristic data; obtaining a plurality of second documents, wherein each second document has second characteristic data; and training the initial document quality evaluation model by using the plurality of second documents to update the initial document quality evaluation model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to: acquiring a document to be evaluated; and processing the document to be evaluated by using a document quality evaluation model to obtain the document quality evaluation aiming at the document to be evaluated. The method for evaluating the document quality further comprises the step of training an initial document quality evaluation model by using the training method of the document quality evaluation model to obtain the document quality evaluation model.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform: obtaining an initial document quality evaluation model, wherein the initial document quality evaluation model is obtained by utilizing a plurality of first documents through training, and each first document has first characteristic data; obtaining a plurality of second documents, wherein each second document has second characteristic data; and training the initial document quality evaluation model by using the plurality of second documents to update the initial document quality evaluation model.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform: acquiring a document to be evaluated; and processing the document to be evaluated by using a document quality evaluation model to obtain the document quality evaluation aiming at the document to be evaluated. The method for evaluating the document quality further comprises the step of training an initial document quality evaluation model by using the training method of the document quality evaluation model to obtain the document quality evaluation model.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements: obtaining an initial document quality evaluation model, wherein the initial document quality evaluation model is obtained by utilizing a plurality of first documents through training, and each first document has first characteristic data; obtaining a plurality of second documents, wherein each second document has second characteristic data; and training the initial document quality evaluation model by using the plurality of second documents to update the initial document quality evaluation model.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements: acquiring a document to be evaluated; and processing the document to be evaluated by using a document quality evaluation model to obtain the document quality evaluation aiming at the document to be evaluated. The method for evaluating the document quality further comprises the step of training an initial document quality evaluation model by using the training method of the document quality evaluation model to obtain the document quality evaluation model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an application scenario of a training method of a document quality evaluation model according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of training a document quality assessment model according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a training method of a document quality evaluation model according to an embodiment of the present disclosure;
FIG. 4 schematically shows a schematic diagram of a training method of a document quality evaluation model according to another embodiment of the present disclosure;
FIG. 5 schematically shows a schematic diagram of a training method of a document quality evaluation model according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow diagram of a method of evaluating document quality according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a training apparatus for a document quality evaluation model according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of an apparatus for evaluating the quality of a document according to an embodiment of the present disclosure; and
FIG. 9 is a block diagram of an electronic device for implementing a method for training a document quality evaluation model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
An embodiment of the disclosure provides a training method for a document quality evaluation model, which comprises the following steps. First, an initial document quality evaluation model is obtained, wherein the initial document quality evaluation model is obtained by training with a plurality of first documents, and each first document has first feature data. Then, a plurality of second documents are obtained, wherein each second document has second feature data. Next, the initial document quality evaluation model is trained using the plurality of second documents to update the initial document quality evaluation model.
Fig. 1 schematically illustrates an application scenario of a training method of a document quality evaluation model according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 of an embodiment of the present disclosure includes, for example, a document quality evaluation model 110 to be trained and a trained document quality evaluation model 120.
In embodiments of the present disclosure, the document quality evaluation model may include a classification model, for example, for classifying documents to separate them into high quality documents and low quality documents. The classification model may include a tree model including, but not limited to, a decision tree, a binary tree, and the like.
In the embodiment of the present disclosure, the training sample 111 includes, for example, a plurality of documents for training the document quality evaluation model 110 to be trained. Each document in the training sample 111 has a label that characterizes, for example, the quality of the corresponding document, such as "high quality" or "low quality". Training the document quality evaluation model 110 to be trained with the training sample 111 yields the trained document quality evaluation model 120.
Next, the trained document quality evaluation model 120 may be utilized to evaluate the quality of the document 121 to be evaluated. For example, the document 121 to be evaluated is input into the trained document quality evaluation model 120 to obtain a document quality evaluation result 130, and the document quality evaluation result 130 includes, for example, the probability that the document 121 to be evaluated is a high-quality document or the probability that the document 121 to be evaluated is a low-quality document, so as to realize quality evaluation of the document 121 to be evaluated.
The embodiment of the present disclosure provides a training method of a document quality evaluation model, and the following describes the training method of the document quality evaluation model according to an exemplary embodiment of the present disclosure with reference to fig. 2 to 5 in combination with the application scenario of fig. 1.
FIG. 2 schematically shows a flowchart of a method of training a document quality assessment model according to an embodiment of the present disclosure.
As shown in fig. 2, the training method 200 of the document quality evaluation model according to the embodiment of the present disclosure may include operations S210 to S230, for example.
In operation S210, an initial document quality evaluation model is obtained, where the initial document quality evaluation model is trained by using a plurality of first documents, and each of the first documents has first feature data.
In operation S220, a plurality of second documents, each having second feature data, is obtained.
In operation S230, the initial document quality evaluation model is trained using a plurality of second documents to update the initial document quality evaluation model.
In an embodiment of the present disclosure, the first feature data of the first document mainly includes, for example, features of the document itself, and the first feature data changes little over time. Because the initial document quality evaluation model is trained on the first feature data of the first documents, it focuses mainly on the content of the document itself when evaluating document quality.
In the embodiment of the present disclosure, the second feature data of the second document mainly includes, for example, features other than those of the document itself, and the second feature data changes considerably over time. The initial document quality evaluation model is update-trained with the second documents, so that the updated model pays more attention to the content of the document that changes over time when evaluating document quality.
In the embodiment of the disclosure, after the initial document quality evaluation model is obtained by training with the first documents, the second documents are used to train and update it. When the updated model evaluates the quality of a document, it therefore attends both to the content of the document itself and to the content that changes over time, which improves the accuracy of the model and, in turn, the accuracy of document quality evaluation.
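The two-stage flow of operations S210 to S230 can be illustrated with a minimal sketch. It rests on several assumptions that are not in the disclosure: a tiny logistic-regression scorer stands in for the tree model purely for brevity, each sample is a hypothetical three-slot vector `[bias, stable_feature, time_varying_feature]`, and the data values are invented. The point is only that the second stage starts from the parameters of the first stage rather than from scratch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Probability that a document with feature vector x is high quality."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def train(weights, samples, lr=0.1, epochs=200):
    """Run gradient steps on (features, label) pairs, starting from `weights`."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(w, x)
            # Move each weight in proportion to the prediction error.
            for i, xi in enumerate(x):
                w[i] += lr * error * xi
    return w

# Hypothetical layout: [bias, stable_feature, time_varying_feature].
# First documents carry only stable (first) feature data.
first_docs = [([1.0, 0.9, 0.0], 1), ([1.0, 0.8, 0.0], 1),
              ([1.0, 0.2, 0.0], 0), ([1.0, 0.1, 0.0], 0)]
# Second documents carry time-varying (second) feature data.
second_docs = [([1.0, 0.0, 0.9], 1), ([1.0, 0.0, 0.8], 1),
               ([1.0, 0.0, 0.2], 0), ([1.0, 0.0, 0.1], 0)]

# S210: the initial model is trained on the first documents.
initial_model = train([0.0, 0.0, 0.0], first_docs)
# S220-S230: training continues from the initial model on the second documents.
updated_model = train(initial_model, second_docs)
```

In this toy layout the weight on the stable feature is carried over unchanged into the updated model, while the weight on the time-varying feature is learned only in the second stage.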
FIG. 3 schematically shows a schematic diagram of a training method of a document quality evaluation model according to an embodiment of the present disclosure.
As shown in fig. 3, for each of a plurality of second documents 310, second feature data 320 for each second document is obtained.
Second feature data 320 includes, for example, author credibility feature data 321 of the second document, hotspot feature data 322 of the second document, and graph feature data 323 of the second document.
For the author credibility feature data 321 of the second document, data related to the author may, for example, be crawled from a plurality of websites using crawler technology, and the crawled data is processed to generate the author credibility feature data 321. The author credibility feature data 321 includes, for example, features of a plurality of dimensions, such as the number of times the author's documents are cited, the scores of the author's documents, and the number of times the author's documents are rated.
For the hotspot feature data 322 of the second document, data may be crawled from a plurality of websites, and features extracted, according to the title of the second document or its topic keywords. For example, crawler technology is used to crawl, from a plurality of websites, multi-dimensional features such as the number of times the document title or a document topic keyword is mentioned on each website and the number of times the keyword appears in news or on popular ranking lists, and these multi-dimensional features are used as the hotspot feature data 322. Illustratively, the document topic keywords may be extracted using the TF-IDF (Term Frequency-Inverse Document Frequency) technique, and a search is then performed among a plurality of hot documents on the websites based on the keywords to obtain the hotspot feature data 322 of the second document.
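As an illustration of the keyword step, the following is a minimal TF-IDF sketch using only the Python standard library. The smoothed IDF formula, the whitespace tokenization, the toy corpus, and the helper `hotspot_mentions` (which stands in for the search over crawled hot documents) are all assumptions made for the example; the disclosure does not specify them.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus_tokens, top_k=3):
    """Return the top_k tokens of doc_tokens ranked by TF-IDF against the corpus."""
    n_docs = len(corpus_tokens)
    scores = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for tokens in corpus_tokens if term in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (an assumption)
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

def hotspot_mentions(keywords, hot_titles):
    """Count the hot titles that mention at least one keyword (hypothetical feature)."""
    return sum(
        1 for title in hot_titles
        if any(k in title.lower().split() for k in keywords)
    )

corpus = [
    "the model is trained on the data".split(),
    "the gradient boosting tree model".split(),
    "the weather is sunny".split(),
]
keywords = tfidf_keywords(corpus[1], corpus)
hot_titles = ["Gradient methods explained", "Sunny weather ahead", "Tree planting day"]
mentions = hotspot_mentions(keywords, hot_titles)
```

Terms that occur in every document of the corpus, such as "the", receive the lowest IDF and therefore drop out of the keyword list.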
The graph feature data 323 of the second document characterizes the associations among the second document, the author of the second document, and the users who read the second document. For example, documents, authors, and users are extracted from the relevant data of a plurality of websites as nodes. A user's viewing of a document is taken as an associated edge between the user node and the document node, and this edge may characterize the duration, the number of times, and the like of the user's reading of the document. An associated edge is likewise determined between the author node and the document node, and this edge may characterize the time at which the author created the document and the time at which the author updated it. Graph data is obtained by associating and integrating the data through the plurality of nodes and the associated edges between them, and the graph feature data 323 for each second document can be obtained from the graph data. In one case, the graph feature data 323 may reflect that a certain second document has been viewed by many users, and a document viewed by many users has a high probability of being a high-quality document.
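A minimal sketch of how such graph data might be aggregated into per-document features is given below. The edge records, the field names, and the chosen aggregates (`reader_count`, `total_read_seconds`, `author_doc_count`) are hypothetical; the disclosure only states that users, documents, and authors are nodes joined by associated edges.

```python
from collections import defaultdict

# Hypothetical edge records; in practice these would come from crawled site data.
read_edges = [  # (user, document, seconds_read)
    ("u1", "d1", 120), ("u2", "d1", 300), ("u3", "d1", 45), ("u1", "d2", 10),
]
author_edges = [  # (author, document)
    ("a1", "d1"), ("a1", "d2"),
]

def graph_features(read_edges, author_edges):
    """Aggregate user-document and author-document edges into per-document features."""
    readers = defaultdict(set)
    total_time = defaultdict(int)
    for user, doc, seconds in read_edges:
        readers[doc].add(user)
        total_time[doc] += seconds

    docs_by_author = defaultdict(set)
    for author, doc in author_edges:
        docs_by_author[author].add(doc)

    features = {}
    for author, docs in docs_by_author.items():
        for doc in docs:
            features[doc] = {
                "author": author,
                "reader_count": len(readers[doc]),       # distinct users who read it
                "total_read_seconds": total_time[doc],   # summed reading duration
                "author_doc_count": len(docs),           # documents by the same author
            }
    return features

features = graph_features(read_edges, author_edges)
```

Here `reader_count` corresponds to the case mentioned above, where a document viewed by many users is more likely to be a high-quality document.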
In embodiments of the present disclosure, each second document may further include a label 330, the label 330 characterizing, for example, whether the second document is a high-quality or low-quality document. In an embodiment of the disclosure, the label of the second document is derived from the second feature data. For example, when the second feature data of the second document satisfies a preset condition, the label 330 of the second document is "high quality". For example, when the author credibility feature data 321 of the second document indicates that the author's credibility is greater than a preset credibility, the hotspot feature data 322 of the second document indicates that the second document is a hotspot document, and/or the graph feature data 323 of the second document indicates that the document, the author, and the users of the second document satisfy a certain association condition, it may be determined that the second document is a high-quality document, and thus the label 330 of the second document is "high quality".
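The preset condition can be sketched as a simple rule over the three kinds of second feature data. The field names, the thresholds `min_credibility` and `min_readers`, and the `require_all` switch (which models the "and/or" in the text) are assumptions for illustration only; the disclosure does not give concrete values.

```python
from dataclasses import dataclass

@dataclass
class SecondFeatures:
    author_credibility: float  # hypothetical score in [0, 1]
    is_hotspot: bool           # whether the hotspot features mark the document as hot
    reader_count: int          # hypothetical stand-in for the association condition

def derive_label(f, min_credibility=0.7, min_readers=100, require_all=True):
    """Derive a quality label from second feature data using preset conditions."""
    checks = [
        f.author_credibility > min_credibility,
        f.is_hotspot,
        f.reader_count >= min_readers,
    ]
    # require_all=True models "and"; require_all=False models "or".
    satisfied = all(checks) if require_all else any(checks)
    return "high quality" if satisfied else "low quality"

strong = SecondFeatures(author_credibility=0.9, is_hotspot=True, reader_count=250)
mixed = SecondFeatures(author_credibility=0.9, is_hotspot=False, reader_count=20)
```

With these values, `derive_label(strong)` yields "high quality", while `derive_label(mixed)` yields "low quality" under the strict rule and "high quality" when `require_all=False`.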
Next, the second feature data 320 of the second document is processed using the initial document quality evaluation model 340, resulting in a document quality evaluation result 350 for the second document. Then, based on the document quality evaluation result 350 for the second document and the label 330 of the second document, the model parameters of the initial document quality evaluation model 340 are adjusted to update the initial document quality evaluation model 340. For example, when the document quality evaluation result 350 and the label 330 are inconsistent, the model parameters of the initial document quality evaluation model 340 are adjusted so that the document quality evaluation result 350 and the label 330 become as consistent as possible in subsequent training.
In the embodiment of the disclosure, the label of the second document is obtained from the second feature data, so that the label better reflects the document quality of the second document. Updating the initial document quality evaluation model by training on the labeled second documents therefore improves the accuracy of the model.
FIG. 4 schematically shows a schematic diagram of a training method of a document quality evaluation model according to another embodiment of the present disclosure.
As shown in fig. 4, a plurality of first documents 410 is obtained, each of the plurality of first documents 410 having first feature data 420 and a label 440. The label 440 characterizes, for example, whether the first document is a high-quality or low-quality document.
In an embodiment of the present disclosure, the first feature data 420 includes a document content feature 421 of the first document, a user behavior feature 422 for the first document, and an author feature 423 of the first document.
The document content feature 421 of the first document includes, for example, a multi-dimensional feature including, for example, a title, a number of pages, a document format type, a document content type, a document upload time, and the like of the first document. The document format type is, for example, Word, PDF, PPT, etc. The document content type is, for example, a paper type, a teaching plan type, and the like.
The user behavior features 422 for the first document include, for example, multi-dimensional features such as the download rate, the number of likes, the number of favorites, the number of shares, the number of views, the number of user ratings, and the user rating value.
The author feature 423 includes, for example, multi-dimensional features such as a feature describing how widely the author is followed and a feature describing the quality of the author's documents. The former includes, for example, the number of followers of the author, and the latter includes, for example, the quality scores of the documents uploaded by the author.
In an embodiment of the present disclosure, the label 440 of the first document is derived, for example, from the document attribute information 430 of the first document. For example, each first document is processed to obtain at least one piece of document attribute information 430 of the first document, and the at least one piece of document attribute information is then processed based on a preset rule to obtain the label 440 of the first document, wherein the label 440 represents a document quality evaluation of the first document. For example, the first document is processed by natural language processing techniques to obtain its document attribute information 430, and the label 440 of the first document is then determined based on the document attribute information 430.
Next, the first feature data 420 is processed using the document quality evaluation model 450 to be trained, resulting in a document quality evaluation result 460 for the first document. Model parameters of the document quality evaluation model 450 to be trained are then adjusted based on the document quality evaluation result 460 for the first document and the label 440 of the first document to arrive at the initial document quality evaluation model. For example, when the document quality evaluation result 460 and the label 440 are inconsistent, the model parameters of the document quality evaluation model 450 to be trained are adjusted, so that the document quality evaluation results produced by the resulting initial document quality evaluation model are as consistent with the labels as possible.
In an embodiment of the present disclosure, the document attribute information may include: document title information, title-text relevance information, document readability information, document integrity information, document chart information, document length information, document aesthetic information, document utility information, document cheating information, and the like.
As shown in Table 1, a first document is processed to obtain a plurality of pieces of document attribute information of the first document, and at least one piece of document attribute information is then processed based on a preset rule to obtain the label of the first document. For example, the label of the first document is determined to be "high quality" when at least a portion of the document attribute information meets the quality standard. Illustratively, the label of the first document is determined to be "high quality" when several pieces of document attribute information, such as title quality, topic relevance, readability, and completeness, all meet the high-quality standard.
In the embodiment of the disclosure, the label of the first document is determined from the document attribute information of the first document, so that the label better reflects the document quality of the first document. Obtaining the initial document quality evaluation model by training on the labeled first documents therefore improves the accuracy of the model.
TABLE 1
Document attribute | Quality standard
Title quality | Appropriate title length, complete description, clear meaning, correct grammar and punctuation, etc.
Topic relevance | The body text corresponds to the topic
Readability | Fewer than 5 content errors and no obvious advertisement links
Completeness | Beginning and end are consistent; pictures/tables are accompanied by clear prompts
Chart/table quality | Has accompanying figures; figures are clear, not duplicated, and related to the subject of the text
Document space | Generally 3 pages or more of content; 8 pages or more for PPT documents
Aesthetics | Content is not occluded and the formatting is neat
Practicality | The content has use value for users
Cheating | No cheating behaviors such as content stacking or hidden characters
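The preset rule that turns document attribute information into a tag can be sketched as follows. This is only an illustration: the attribute keys, the choice of which attributes are required, and the "low quality" fallback tag are assumptions, since the disclosure names only the "high quality" tag and gives the rule by example.

```python
from typing import Dict, Iterable

# Hypothetical attribute keys; each boolean records whether the attribute
# met its quality standard from Table 1.
REQUIRED_ATTRIBUTES = ("title_quality", "topic_relevance", "readability", "completeness")


def readability_ok(content_errors: int, has_ad_links: bool) -> bool:
    """Readability standard from Table 1: fewer than 5 content errors, no obvious ad links."""
    return content_errors < 5 and not has_ad_links


def label_document(
    attribute_checks: Dict[str, bool],
    required: Iterable[str] = REQUIRED_ATTRIBUTES,
) -> str:
    """Return "high quality" only if every required attribute meets its standard."""
    if all(attribute_checks.get(name, False) for name in required):
        return "high quality"
    return "low quality"  # assumed fallback tag, not named in the disclosure


checks = {
    "title_quality": True,
    "topic_relevance": True,
    "readability": readability_ok(content_errors=2, has_ad_links=False),
    "completeness": True,
}
tag = label_document(checks)
```

A missing attribute is treated as not meeting its standard, which keeps the rule conservative when attribute extraction fails.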
FIG. 5 schematically shows a schematic diagram of a training method of a document quality evaluation model according to another embodiment of the present disclosure.
As shown in FIG. 5, at least one verification document 510 is obtained, each verification document 510 having characteristic data 511 and a tag 512. The feature data 511 of the verification document 510 is then processed using the initial document quality evaluation model 520, resulting in a document quality evaluation result 530 for the verification document 510.
Next, a target verification document 511' is determined from the at least one verification document 510 based on the document quality evaluation result 530 for the verification document 510 and the tag 512 of the verification document 510, wherein the document quality evaluation result of the target verification document 511' does not match the tag of the target verification document 511'. That is, the initial document quality evaluation model 520 evaluated the quality of the target verification document 511' incorrectly. This in turn indicates that the training samples used to train the initial document quality evaluation model contained few samples similar to the target verification document 511', which results in low accuracy when the initial document quality evaluation model evaluates the target verification document 511'.
Then, based on the target verification document 511', a plurality of training documents 540 are obtained, wherein the similarity between the document attribute information of each training document 540 and the document attribute information of the target verification document 511' satisfies a preset similarity; that is, the training documents 540 and the target verification document 511' are similar documents. The document attribute information may include document title information, document title correlation information, document readability information, document integrity information, document chart information, document space information, document aesthetic information, document practicality information, document cheating information, and the like. After the plurality of training documents 540 are obtained, the initial document quality evaluation model 520 may be updated based on the plurality of training documents 540.
Therefore, the initial document quality evaluation model is verified using the verification documents to identify the sample types that were under-represented in the training samples used to train it, and the model is then updated using a large number of samples of those types, which improves the evaluation accuracy of the initial document quality evaluation model.
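The verification flow of FIG. 5 can be sketched as below. The disclosure only requires that the attribute similarity satisfy "a preset similarity"; the cosine measure, the 0.9 threshold, the 0.5 decision threshold, the `Document` structure, and the function names are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Document:
    doc_id: str
    features: Sequence[float]    # feature data fed to the model
    attributes: Sequence[float]  # assumed numeric encoding of document attribute information
    tag: int                     # 1 = high quality, 0 = otherwise


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_target_verification_documents(
    model: Callable[[Sequence[float]], float],
    verification_docs: List[Document],
    threshold: float = 0.5,
) -> List[Document]:
    """Return verification documents whose evaluation result does not match the tag."""
    return [d for d in verification_docs if int(model(d.features) > threshold) != d.tag]


def collect_similar_training_documents(
    targets: List[Document],
    candidate_pool: List[Document],
    min_similarity: float = 0.9,
) -> List[Document]:
    """Return candidates whose attribute information is similar to any target document."""
    return [
        c for c in candidate_pool
        if any(cosine_similarity(c.attributes, t.attributes) >= min_similarity for t in targets)
    ]


# Hypothetical stand-in for the initial model: the first feature is the probability.
initial_model = lambda features: features[0]
verification = [
    Document("v1", [0.9], [1.0, 0.0], 1),
    Document("v2", [0.8], [0.0, 1.0], 0),  # evaluated as high quality, tagged otherwise
]
pool = [
    Document("p1", [0.3], [0.1, 0.9], 0),
    Document("p2", [0.7], [1.0, 0.1], 1),
]
targets = find_target_verification_documents(initial_model, verification)
training_documents = collect_similar_training_documents(targets, pool)
```

The collected `training_documents` would then be used to update the initial model, for example with the same training routine used to obtain it.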
FIG. 6 schematically shows a flow diagram of a method of evaluating document quality according to an embodiment of the present disclosure.
As shown in fig. 6, the method 600 of evaluating the quality of a document of the embodiment of the present disclosure may include, for example, operations S610 to S630.
In operation S610, an initial document quality evaluation model is trained using a training method of the document quality evaluation model to obtain a document quality evaluation model.
In operation S620, a document to be evaluated is acquired.
In operation S630, the document to be evaluated is processed by using the document quality evaluation model, so as to obtain a document quality evaluation for the document to be evaluated.
The document to be evaluated comprises first characteristic data, for example. The document quality evaluation model is obtained by training an initial document quality evaluation model using the above-mentioned training method, for example.
In the embodiment of the disclosure, the document quality evaluation obtained by processing the document to be evaluated with the document quality evaluation model includes the probability that the document to be evaluated is a high-quality document.
In one example, the probability is converted to a score between 0 and 5 by a piecewise function, such as equation (1):
[Equation (1): piecewise function mapping prob to a score between 0 and 5; formula image not available in this text]
Here, prob is the probability output by the model. If the document is a high-quality document, the probability is greater than 0.5; the quality evaluation score of the document is then greater than 4, and the score approaches 5 as the probability approaches 1. If the probability that a document is a high-quality document is less than or equal to 0.5, the quality evaluation score of the document is less than 3.
After the scores of the documents to be evaluated are obtained, the documents to be evaluated can be arranged in a descending order according to the scores, and the arranged documents to be evaluated are recommended to a user, so that the user can preferentially select the high-quality documents, the efficiency of selecting the high-quality documents by the user is improved, and the labor cost for selecting the high-quality documents is reduced.
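The scoring and ranking steps can be sketched as follows. The mapping below is a hypothetical piecewise function and is not equation (1) of the disclosure, whose exact form is not available here; it was chosen only because it satisfies the stated properties (a score above 4 that approaches 5 when prob exceeds 0.5, and a score below 3 otherwise). The function names are likewise assumptions.

```python
from typing import Dict, List, Tuple


def probability_to_score(prob: float) -> float:
    """Hypothetical piecewise mapping to [0, 5]; NOT the disclosure's equation (1)."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be in [0, 1]")
    if prob > 0.5:
        return 3.0 + 2.0 * prob  # falls in (4, 5], reaching 5 at prob = 1
    return 5.0 * prob            # falls in [0, 2.5], always below 3


def rank_documents(probabilities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score each document and return (doc_id, score) pairs in descending score order."""
    scored = [(doc_id, probability_to_score(p)) for doc_id, p in probabilities.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical model outputs for three documents to be evaluated.
ranking = rank_documents({"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.6})
```

Because both branches are increasing in prob and the upper branch always exceeds the lower one, sorting by score preserves the ordering by probability while presenting users with a 0 to 5 scale.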
FIG. 7 schematically shows a block diagram of a training apparatus of a document quality evaluation model according to an embodiment of the present disclosure.
As shown in fig. 7, the training apparatus 700 of the document quality evaluation model according to the embodiment of the present disclosure includes, for example, a first obtaining module 710, a second obtaining module 720, and a first training module 730.
The first obtaining module 710 may be configured to obtain an initial document quality evaluation model, where the initial document quality evaluation model is trained by using a plurality of first documents, and each of the first documents has first feature data. According to the embodiment of the present disclosure, the first obtaining module 710 may, for example, perform operation S210 described above with reference to fig. 2, which is not described herein again.
The second obtaining module 720 may be configured to obtain a plurality of second documents, wherein each second document has second feature data and a tag. According to the embodiment of the present disclosure, the second obtaining module 720 may, for example, perform operation S220 described above with reference to fig. 2, which is not described herein again.
The first training module 730 may be used to train the initial document quality evaluation model with a plurality of second documents to update the initial document quality evaluation model. According to an embodiment of the present disclosure, the first training module 730 may, for example, perform the operation S230 described above with reference to fig. 2, which is not described herein again.
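One way to picture how the three modules of the training apparatus 700 cooperate is the sketch below. The disclosure describes the modules functionally and does not prescribe any software structure, so the `TrainingApparatus` class, its field types, and the stub callables are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Sample = Tuple[List[float], int]        # (second feature data, tag)
Model = Callable[[List[float]], float]  # maps feature data to a quality probability


@dataclass
class TrainingApparatus:
    first_obtaining_module: Callable[[], Model]                    # operation S210
    second_obtaining_module: Callable[[], List[Sample]]            # operation S220
    first_training_module: Callable[[Model, List[Sample]], Model]  # operation S230

    def run(self) -> Model:
        """Obtain the initial model and second documents, then update the model."""
        initial_model = self.first_obtaining_module()
        second_documents = self.second_obtaining_module()
        return self.first_training_module(initial_model, second_documents)


# Hypothetical stubs standing in for real loading and training logic.
def load_initial_model() -> Model:
    return lambda features: 0.5


def load_second_documents() -> List[Sample]:
    return [([1.0], 1)]


def update_model(model: Model, samples: List[Sample]) -> Model:
    # Toy update: shift the output by a fixed amount per sample.
    return lambda features: model(features) + 0.1 * len(samples)


updated_model = TrainingApparatus(load_initial_model, load_second_documents, update_model).run()
```

Keeping each module as an injected callable lets the obtaining and training steps be replaced independently, which matches the modular description of the apparatus.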
FIG. 8 schematically shows a block diagram of an apparatus for evaluating the quality of a document according to an embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for evaluating the quality of a document according to the embodiment of the present disclosure includes, for example, a second training module 810, a third obtaining module 820, and a processing module 830.
The second training module 810 may be used to train the initial document quality evaluation model to obtain a document quality evaluation model. According to an embodiment of the present disclosure, the second training module 810 may perform, for example, the operation S610 described above with reference to fig. 6, which is not described herein again.
The third obtaining module 820 may be configured to obtain a document to be evaluated. According to the embodiment of the present disclosure, the third obtaining module 820 may perform, for example, the operation S620 described above with reference to fig. 6, which is not described herein again.
The processing module 830 may be configured to process the document to be evaluated by using the document quality evaluation model, so as to obtain a document quality evaluation for the document to be evaluated. According to the embodiment of the present disclosure, the processing module 830 may perform the operation S630 described above with reference to fig. 6, for example, and is not described herein again.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 is a block diagram of an electronic device for implementing a method for training a document quality evaluation model according to an embodiment of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. The electronic device 900 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the training method of the document quality evaluation model. For example, in some embodiments, the training method of the document quality evaluation model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method of the document quality evaluation model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the document quality evaluation model.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
The electronic device may be used to perform a method of evaluating the quality of a document. The electronic device may comprise, for example, a computing unit, a ROM, a RAM, an I/O interface, an input unit, an output unit, a storage unit and a communication unit. The computing unit, the ROM, the RAM, the I/O interface, the input unit, the output unit, the storage unit, and the communication unit in the electronic device have the same or similar functions as the computing unit, the ROM, the RAM, the I/O interface, the input unit, the output unit, the storage unit, and the communication unit of the electronic device shown in fig. 9, for example, and are not described herein again.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A training method for a document quality evaluation model, comprising:
obtaining an initial document quality evaluation model, wherein the initial document quality evaluation model is obtained by training with a plurality of first documents, and each first document has first characteristic data;
obtaining a plurality of second documents, wherein each second document has second characteristic data; and
training the initial document quality evaluation model with the plurality of second documents to update the initial document quality evaluation model.
2. The method of claim 1, wherein each of the second documents further comprises a label, the label of the second document being derived based on the second feature data;
wherein the training the initial document quality evaluation model with the plurality of second documents comprises:
processing second characteristic data of the second document by using the initial document quality evaluation model to obtain a document quality evaluation result for the second document; and
adjusting model parameters of the initial document quality evaluation model based on the document quality evaluation result for the second document and the label of the second document to update the initial document quality evaluation model.
3. The method of claim 1, wherein the second feature data comprises at least one of:
author credibility feature data of the second document;
hotspot characteristic data of the second document; and
graph characteristic data of the second document, wherein the graph characteristic data characterizes an association between the second document, an author of the second document, and a user reading the second document.
4. The method of claim 1, further comprising:
obtaining a plurality of first documents, wherein each first document has a tag;
processing the first characteristic data by using a document quality evaluation model to be trained to obtain a document quality evaluation result for the first document; and
adjusting model parameters of the document quality evaluation model to be trained based on the document quality evaluation result for the first document and the tag of the first document to obtain the initial document quality evaluation model.
5. The method of any of claims 1 to 4, wherein the first feature data comprises at least one of:
a document content characteristic of the first document;
a user behavior feature for the first document;
an author characteristic of the first document, wherein the author characteristic comprises at least one of an author focused feature and an author document quality feature.
6. The method of claim 4, further comprising: acquiring a label of each first document; wherein the acquiring the label of each first document comprises:
processing the first document to obtain at least one piece of document attribute information of the first document; and
processing the at least one piece of document attribute information based on a preset rule to obtain the label of the first document, wherein the label of the first document represents the document quality evaluation of the first document.
7. The method of claim 6, wherein the document attribute information comprises at least one of:
document title information, document title correlation information, document readability information, document integrity information, document chart information, document space information, document aesthetic information, document practicability information, and document cheating information.
8. The method of claim 1 or 4, further comprising:
obtaining at least one verification document, each verification document having characteristic data and a tag;
processing the characteristic data of the verification document by using the initial document quality evaluation model to obtain a document quality evaluation result aiming at the verification document;
determining a target verification document from the at least one verification document based on the document quality evaluation result for the verification document and the tag of the verification document, wherein the document quality evaluation result of the target verification document and the tag of the target verification document are not matched;
acquiring a plurality of training documents based on the target verification document, wherein the similarity between the document attribute information of each training document and the document attribute information of the target verification document meets a preset similarity; and
updating the initial document quality evaluation model based on the plurality of training documents.
9. A method of evaluating the quality of a document, comprising:
acquiring a document to be evaluated; and
processing the document to be evaluated by using a document quality evaluation model to obtain document quality evaluation aiming at the document to be evaluated;
wherein the method of evaluating document quality further comprises training an initial document quality evaluation model using the training method according to any one of claims 1-8 to obtain the document quality evaluation model.
10. A training device for a document quality evaluation model, comprising:
a first obtaining module, configured to obtain an initial document quality evaluation model, wherein the initial document quality evaluation model is obtained by training with a plurality of first documents, and each first document has first characteristic data;
a second obtaining module, configured to obtain a plurality of second documents, wherein each second document has second characteristic data and a label; and
a first training module, configured to train the initial document quality evaluation model by using the plurality of second documents so as to update the initial document quality evaluation model.
11. An apparatus for evaluating the quality of a document, comprising:
a third obtaining module, configured to obtain a document to be evaluated; and
a processing module, configured to process the document to be evaluated by using a document quality evaluation model to obtain a document quality evaluation for the document to be evaluated;
wherein the apparatus for evaluating the quality of a document further comprises a second training module, configured to train an initial document quality evaluation model by using the training method according to any one of claims 1 to 8 to obtain the document quality evaluation model.
12. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of claim 9.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
15. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of claim 9.
16. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to claim 9.
CN202011572453.3A 2020-12-25 2020-12-25 Training method and device for document quality evaluation model, electronic equipment and medium Pending CN112579729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011572453.3A CN112579729A (en) 2020-12-25 2020-12-25 Training method and device for document quality evaluation model, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011572453.3A CN112579729A (en) 2020-12-25 2020-12-25 Training method and device for document quality evaluation model, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN112579729A true CN112579729A (en) 2021-03-30

Family

ID=75140114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011572453.3A Pending CN112579729A (en) 2020-12-25 2020-12-25 Training method and device for document quality evaluation model, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN112579729A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005158010A (en) * 2003-10-31 2005-06-16 Hewlett-Packard Development Co Lp Apparatus, method and program for classification evaluation
CN108537289A (en) * 2018-04-24 2018-09-14 百度在线网络技术(北京)有限公司 Training method, device and the storage medium of data identification model
CN109243618A (en) * 2018-09-12 2019-01-18 腾讯科技(深圳)有限公司 Construction method, disease label construction method and the smart machine of medical model
CN110515904A (en) * 2019-08-13 2019-11-29 北京达佳互联信息技术有限公司 Quality prediction model training method, qualitative forecasting method and the device of media file
CN110569359A (en) * 2019-08-26 2019-12-13 腾讯科技(深圳)有限公司 Recognition model training and application method and device, computing equipment and storage medium
CN110807154A (en) * 2019-11-08 2020-02-18 内蒙古工业大学 Recommendation method and system based on hybrid deep learning model
CN110956018A (en) * 2019-11-22 2020-04-03 腾讯科技(深圳)有限公司 Training method of text processing model, text processing method, text processing device and storage medium
CN111523322A (en) * 2020-04-25 2020-08-11 中信银行股份有限公司 Requirement document quality evaluation model training method and requirement document quality evaluation method
CN111737975A (en) * 2020-05-14 2020-10-02 平安科技(深圳)有限公司 Text connotation quality evaluation method, device, equipment and storage medium
CN111737446A (en) * 2020-06-22 2020-10-02 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for constructing quality evaluation model
CN112016315A (en) * 2020-10-19 2020-12-01 北京易真学思教育科技有限公司 Model training method, text recognition method, model training device, text recognition device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515628A (en) * 2021-05-19 2021-10-19 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium
CN113656671A (en) * 2021-06-16 2021-11-16 北京百度网讯科技有限公司 Model training method, link scoring method, device, equipment, medium and product
CN114492409A (en) * 2022-01-27 2022-05-13 百度在线网络技术(北京)有限公司 Method and device for evaluating file content, electronic equipment and program product
CN114492409B (en) * 2022-01-27 2022-12-20 百度在线网络技术(北京)有限公司 Method and device for evaluating file content, electronic equipment and program product

Similar Documents

Publication Publication Date Title
US8290927B2 (en) Method and apparatus for rating user generated content in search results
US8630972B2 (en) Providing context for web articles
CN104899322A (en) Search engine and implementation method thereof
CN110888990B (en) Text recommendation method, device, equipment and medium
US20130060769A1 (en) System and method for identifying social media interactions
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN108874996B (en) Website classification method and device
US20210272013A1 (en) Concept modeling system
CN111160007B (en) Search method and device based on BERT language model, computer equipment and storage medium
CN112966081A (en) Method, device, equipment and storage medium for processing question and answer information
CN113660541A (en) News video abstract generation method and device
US11379527B2 (en) Sibling search queries
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
CN111199151A (en) Data processing method and data processing device
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
Wei et al. Online education recommendation model based on user behavior data analysis
CN113392218A (en) Training method of text quality evaluation model and method for determining text quality
CN112487263A (en) Information processing method, system, equipment and computer readable storage medium
CN114647739B (en) Entity linking method, device, electronic equipment and storage medium
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN113515932B (en) Method, device, equipment and storage medium for processing question and answer information
CN114329206A (en) Title generation method and device, electronic equipment and computer readable medium
CN112784600A (en) Information sorting method and device, electronic equipment and storage medium
CN114048315A (en) Method and device for determining document tag, electronic equipment and storage medium
CN112528644A (en) Entity mounting method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination