CN112597768B - Text auditing method, device, electronic equipment, storage medium and program product

Info

Publication number: CN112597768B
Application number: CN202011443455.2A
Authority: CN
Inventors: 丁鑫哲; 王倩倩; 刘瑛; 刘凯; 李婷婷
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2022-06-28
Anticipated expiration: 2040-12-08
Also published as: CN112597768A

Abstract

The application discloses a text auditing method, a text auditing apparatus, an electronic device, a storage medium and a program product, and relates to artificial intelligence technologies such as machine learning and natural language processing. The specific implementation scheme is as follows: acquiring a clause to be audited of a text to be audited; recalling, based on the clause to be audited, a plurality of candidate information corresponding to the clause to be audited from a database; acquiring, based on the plurality of candidate information, the candidate information most relevant to the clause to be audited; and auditing the clause to be audited based on the most relevant candidate information. With this technical scheme, each clause to be audited of the text to be audited can be audited automatically, so that the text as a whole is audited without manual auditing, which effectively improves both the accuracy and the efficiency of text auditing.

Description

Text auditing method, device, electronic equipment, storage medium and program product

Technical Field

The present application relates to the field of computer technologies, in particular to artificial intelligence technologies such as machine learning and natural language processing, and more particularly to a text auditing method, apparatus, electronic device, storage medium, and program product.

Background

Every language is complex: over hundreds or even thousands of years of development and evolution, it accumulates an intricate set of grammatical and syntactic rules. Using a language correctly places relatively high demands on its users. Users who lack full knowledge or are careless are likely to misattribute or misuse expressions and embarrass themselves; on relatively important occasions in particular, even a very small language error can have a very adverse effect. For this reason, text auditing, a traditional problem of natural language processing, is particularly important.

In recent years, with the rapid development of the media industry and the daily explosion of information, the demand for manuscript proofreading has increased sharply. In the traditional media industry in particular, important manuscripts must strictly pass three rounds of review and three rounds of proofreading so that serious errors are avoided. Besides traditional media, the number of emerging self-media practitioners is also increasing year by year, and self-media practitioners are even more likely to lack a manual proofreading step. On new media platforms that serve self-media practitioners, important information needs to be strictly checked in light of the current overall environment.

Disclosure of Invention

The application provides a text auditing method, a text auditing device, electronic equipment, a storage medium and a program product.

According to an aspect of the present application, a text auditing method is provided, where the method includes:

acquiring clauses to be audited of the text to be audited;

recalling, based on the clause to be audited, a plurality of candidate information corresponding to the clause to be audited from a database;

acquiring, based on the plurality of candidate information, the candidate information most relevant to the clause to be audited;

and auditing the clauses to be audited based on the most relevant candidate information.

According to another aspect of the present application, there is provided a text auditing apparatus, wherein the apparatus includes:

the clause acquisition module is used for acquiring the clauses to be audited of the text to be audited;

the recall module is used for recalling, based on the clause to be audited, a plurality of candidate information corresponding to the clause to be audited from a database;

the candidate obtaining module is used for obtaining, based on the plurality of candidate information, the candidate information most relevant to the clause to be audited;

and the auditing module is used for auditing the clause to be audited based on the most relevant candidate information.

According to still another aspect of the present application, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.

According to yet another aspect of the present application, there is provided a computer program product, wherein instructions in the computer program product, when executed by a processor, perform the method as described above.

According to the technology of the present application, a clause to be audited of a text to be audited is acquired; a plurality of candidate information corresponding to the clause to be audited is recalled from a database based on the clause to be audited; the candidate information most relevant to the clause to be audited is acquired based on the plurality of candidate information; and the clause to be audited is audited based on the most relevant candidate information. With this technical scheme, each clause to be audited of the text to be audited can be audited automatically, so that the text as a whole is audited without manual auditing, which effectively improves both the accuracy and the efficiency of text auditing.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a schematic diagram according to a first embodiment of the present application;

FIG. 2 is a schematic diagram according to a second embodiment of the present application;

FIG. 3 is a schematic illustration according to a third embodiment of the present application;

FIG. 4 is a schematic illustration according to a fourth embodiment of the present application;

FIG. 5 is a block diagram of an electronic device for implementing a text auditing method according to an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

FIG. 1 is a schematic diagram according to a first embodiment of the present application. As shown in FIG. 1, this embodiment provides a text auditing method, which may specifically include the following steps:

S101, acquiring a clause to be audited of a text to be audited;

S102, recalling, based on the clause to be audited, a plurality of candidate information corresponding to the clause to be audited from a database;

S103, acquiring, based on the plurality of candidate information, the candidate information most relevant to the clause to be audited;

S104, auditing the clause to be audited based on the most relevant candidate information.

The text auditing method of this embodiment is executed by a text auditing apparatus, which may be an electronic entity or a software-integrated application. In use, each clause to be audited in the text to be audited can be audited by the text auditing apparatus.

The text to be audited in this embodiment may be an article including a plurality of sentences, or a paragraph of an article. During auditing, the clauses to be audited are first obtained from the text to be audited. Specifically, the clauses may be obtained in sequence according to the writing conventions of the language in which the text is written. For example, the text to be audited may be split at punctuation marks to obtain each clause to be audited.
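As an illustration only, punctuation-based clause splitting might be sketched as follows. The delimiter set and the function name `split_clauses` are assumptions made for this sketch and are not specified by the application.

```python
import re
from typing import List

# Assumed delimiter set: common Chinese and English clause-ending punctuation.
CLAUSE_DELIMITERS = "，。；！？,.;!?"


def split_clauses(text: str) -> List[str]:
    """Split text into clauses at punctuation marks, preserving order."""
    pattern = "[" + re.escape(CLAUSE_DELIMITERS) + "]"
    parts = re.split(pattern, text)
    # Drop empty fragments produced by consecutive or trailing punctuation.
    return [part.strip() for part in parts if part.strip()]
```

A rule of this kind also splits at decimal points and inside quoted speech, which is one reason a trained clause model, as described in the second embodiment, may be preferred.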

The database of this embodiment may be a single large database, or may be divided into databases for multiple fields or categories, such as a poetry database, a history database, a real-time news database, a laws and regulations database, and the like. Alternatively, according to the requirements of the auditing scenario, it may be divided into a celebrity speech database, a classics database, a factual database, and the like. The celebrity speech database may include the content of each speech of the celebrities whose speeches need to be studied. The classics database may include poetry, history and other classical material. The factual database may include various officially released factual announcements and the like. The database of this embodiment may be updated in real time, for example daily, so that documents stored in, for example, a real-time news database are kept up to date.

The database of this embodiment may store attribute information of each document, such as the author of the document, detailed information about the author, the publication time, and the like. For frequently updated documents, such as laws and regulations, it may further store the content changed at each update, the update time, and the like. It also stores the title of the document, the specific content of the document, and the like.

If the database of this embodiment includes databases for multiple fields or categories, then in step S102, when recalling the plurality of candidate information corresponding to the clause to be audited, all of the databases need to be searched in turn to obtain all candidate information corresponding to the clause to be audited.

Further, in this embodiment, the candidate information most relevant to the clause to be audited is obtained based on the plurality of candidate information, and the clause to be audited is audited based on the most relevant candidate information. For example, during auditing, the most relevant candidate information may be compared in detail with the clause to be audited to detect whether the clause is correct. Optionally, if the clause to be audited is incorrect, the incorrect content may further be identified; further optionally, the incorrect content may be classified and the error category identified. In this way, rich auditing results can be presented to the user, who can conveniently and accurately learn what is wrong and which category the error belongs to.

According to the text auditing method, by adopting the technical scheme, each clause to be audited in the text to be audited is audited in sequence, and therefore the text to be audited is audited.

The text auditing method of the embodiment acquires the clauses to be audited of the text to be audited; recalling a plurality of candidate information corresponding to the clause to be audited from the database based on the clause to be audited; acquiring candidate information most relevant to the clauses to be checked based on the plurality of candidate information; and auditing the clauses to be audited based on the most relevant candidate information. By adopting the technical scheme, the auditing of each clause to be audited of the text to be audited can be automatically carried out, so that the text to be audited can be audited, manual auditing of the text to be audited is avoided, and the accuracy of text auditing and the text auditing efficiency can be effectively improved.

Further optionally, step S102 in the embodiment shown in fig. 1 may specifically include at least one of the following three ways:

The first way: recalling, based on the clause to be audited, a plurality of candidate document information from the database by searching.

For example, the search in this embodiment may be a direct text search; that is, a text search is performed in the database based on the clause to be audited, and a plurality of candidate document information is recalled. The candidate document information recalled here may be identified by attribute information of the document, such as the title plus the author, as long as it uniquely identifies the candidate document.

In addition, optionally, the first way may be implemented by the following steps:

(a1) recalling a plurality of candidate document information for the clause to be audited from the database using Elasticsearch (ES);

Specifically, the ES search may follow ES search practice in the related art, so as to recall a plurality of candidate document information corresponding to the clause to be audited from the database.

(b1) recalling, based on similarity, a plurality of candidate document information for the clause to be audited from the database using a pre-trained semantic representation model;

The semantic representation model of this embodiment may be a knowledge-enhanced semantic representation model (ERNIE) or another semantic representation model, which is not limited herein. Based on the pre-trained semantic representation model, a semantic representation of the clause to be audited can be obtained. Likewise, a semantic representation can be obtained for the information of each document in the database, such as the title of the document or the content of the whole document. For example, taking the case where both the title and the content of each document in the database are semantically represented: in the title dimension, the similarity between the clause to be audited and the title of each document in the database is calculated; then, based on a preset similarity threshold, the document attributes corresponding to the titles whose similarity is greater than the threshold are selected from the database as a plurality of candidate document information. The document attribute may be the title of the document, a combination of the title and the author, or the like, as long as it uniquely identifies the candidate document information.

Similarly, in the content dimension, the similarity between the clause to be audited and the content of each document in the database is calculated; then, based on a preset similarity threshold, the document attributes corresponding to the documents whose content similarity is greater than the threshold are selected from the database as a plurality of candidate document information. The document attribute is as described above and only needs to uniquely identify the candidate document information.

In addition, optionally, the information of the top N documents with the largest similarity may be respectively taken as the corresponding candidate document information based on the title dimension of the document or the content information dimension of the document.

According to the method, a plurality of candidate document information of the clause to be checked can be recalled from the database from the title dimension of the document and the content information dimension of the document and collected together to serve as all candidate document information.

In practical application, a plurality of candidate document information can be obtained only from the title dimension of the document or the content information dimension of the document. Or, a plurality of candidate document information may also be obtained from other dimensions of the document, such as a title + abstract dimension, or a title + author dimension, and the like, which is not limited herein.

(c1) extracting, based on the clause to be audited and each recalled candidate document information, at least one piece of relevant feature information corresponding to each recalled candidate document information;

For each candidate document information recalled as described above, at least one piece of relevant feature information may be extracted based on the clause to be audited and that candidate document information.

For example, the features in the at least one piece of relevant feature information and their meanings are as described in Table 1 below. In Table 1, query denotes the clause to be audited, title denotes the title of a document in the database, and content denotes the content of a document in the database. term denotes a unit obtained by segmenting the title and the query at basic word granularity; title_entry denotes an entity identified in the title, and query_entry denotes an entity identified in the query.

TABLE 1

Based on the features recorded in Table 1 above, the corresponding relevant feature information can be acquired according to the meaning of each feature. It should be noted that the features in Table 1 are defined for the title dimension and the content dimension of a document. In practical applications, relevant features may also be defined for other dimensions by following the way the features in Table 1 are defined, and examples are not given one by one herein.

(d1) Acquiring the relevance of each candidate document information and a clause to be checked by adopting a pre-trained relevance scoring model based on at least one piece of relevant characteristic information corresponding to each candidate document information;

Optionally, in this embodiment, the relevance may again be obtained per information dimension. For example, in the title dimension of the document, at least one of features 1, 3, 5, 6, 7, 8, and 9 above may be used. In the content dimension of the document, at least one of features 2, 4, 10, and 11 above may be used. Alternatively, the relevant feature information need not be split by dimension: every dimension includes all of the relevant features, and any feature that has no specific content in that dimension is represented by a null feature.

In this embodiment, a relevance scoring model may be trained in advance. In use, for each candidate document information, the at least one piece of relevant feature information obtained for it is input into the relevance scoring model, which predicts and outputs the relevance, specifically a numerical value, between the candidate document information and the clause to be audited. The larger the relevance value, the stronger the correlation between the candidate document information and the clause to be audited; the smaller the value, the weaker the correlation.

The relevance scoring model of this embodiment may be a Gradient Boosting Decision Tree (GBDT) model. For training, multiple groups of training data may be collected in advance, each group including the relevant feature information of a candidate document information and a clause to be audited, together with the labeled relevance between them. The relevance scoring model is then trained on these groups of training data so that it learns to score relevance. For details, reference may be made to the training process of related neural network models, which is not repeated herein.

(e1) screening a plurality of candidate document information from all recalled candidate document information based on the relevance between each candidate document information and the clause to be audited and a preset relevance threshold.

The preset relevance threshold in this embodiment may be set according to actual requirements. Specifically, the candidate document information whose relevance is greater than the preset relevance threshold may be selected from all recalled candidate document information as the final plurality of candidate document information.
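The similarity-based recall of step (b1) and the threshold screening of step (e1) share one pattern: score every document, then keep those above a threshold or the top N. A minimal sketch of that pattern follows. It assumes cosine similarity over precomputed vectors, whereas the application does not name a specific similarity measure; the vectors stand in for semantic representations that a model such as ERNIE would produce, and the names `cosine_similarity` and `screen_candidates` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def screen_candidates(
    scores: Dict[str, float],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Keep documents scoring above `threshold` and/or the `top_n` best.

    `scores` maps a unique document attribute (for example title plus
    author) to a similarity or relevance value.
    """
    items = list(scores.items())
    if threshold is not None:
        items = [item for item in items if item[1] > threshold]
    # Sort from the largest score to the smallest.
    items.sort(key=lambda item: item[1], reverse=True)
    if top_n is not None:
        items = items[:top_n]
    return items
```

For step (b1), `scores` would hold the similarity between the clause vector and each title or content vector; for step (e1), it would hold the relevance values output by the relevance scoring model.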

The second way: recalling a plurality of candidate sentence information for the clause to be audited from the database based on a trie structure.

For example, each sentence of each document in the database may be mapped into a trie structure according to the principles of the trie. During recall, each word of the clause to be audited is looked up in the trie in order from front to back until the last word of the clause has been looked up, and the retrieved candidate sentence and its corresponding document information are obtained from the trie. The document information is stored in the attribute information of nodes in the trie.

When searching the trie, skip-word search may be enabled, and limits may be set on the number of skipped words, such as a maximum skip count and a maximum number of skipped words within one sentence. For example, if the first word of the clause to be audited is retrieved in the trie but the second word is not, the second word can be skipped and the third word looked up; if the third word is not retrieved either, the fourth word is looked up, and so on until a subsequent word of the clause to be audited is retrieved. The corresponding candidate sentence is the sentence formed by splicing, in order, the words retrieved in the trie. If the skip-word limits are exceeded during the search, the search ends and no related candidate sentence is retrieved.

Therefore, the trie-based search in this embodiment can also be understood as prefix matching, specifically prefix matching that allows words to be skipped.

In this way, a plurality of candidate sentence information corresponding to the clause to be audited can be recalled from the database. Of course, based on the characteristics of the trie structure, the document information to which each candidate sentence information belongs, such as the title and author of the document, may also be obtained.
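A simplified sketch of skip-word lookup in a trie is given below. It rests on several assumptions that the application does not fix: each character is treated as one "word", a single total skip limit (`max_skips`) replaces the separate limits, the first character must match, and a match counts only if the matched path ends at a stored sentence. The class and method names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Document information for sentences that end at this node.
        self.doc_info: List[str] = []


class SentenceTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, sentence: str, doc_info: str) -> None:
        """Map one sentence into the trie, one character per node."""
        node = self.root
        for ch in sentence:
            node = node.children.setdefault(ch, TrieNode())
        node.doc_info.append(doc_info)

    def search(
        self, clause: str, max_skips: int = 0
    ) -> Optional[Tuple[str, List[str]]]:
        """Walk the clause front to back, skipping characters not found."""
        node = self.root
        matched: List[str] = []
        skips = 0
        for index, ch in enumerate(clause):
            child = node.children.get(ch)
            if child is not None:
                node = child
                matched.append(ch)
            elif index == 0:
                return None  # assumption: the first character must match
            else:
                skips += 1
                if skips > max_skips:
                    return None  # skip limit exceeded, nothing retrieved
        if not node.doc_info:
            return None  # matched path does not end at a stored sentence
        return "".join(matched), list(node.doc_info)
```

This greedy walk only skips characters of the clause that are absent from the stored sentence. It does not handle characters that are present in the stored sentence but missing from the clause, which a fuller implementation would need to consider.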

The third way: recalling a plurality of candidate document information for the clause to be audited from the database based on the simhash algorithm.

Specifically, based on the simhash algorithm, a fingerprint feature may be established for each document in the database. Similarly, based on the simhash algorithm, the fingerprint information of the clause to be audited can be calculated, then the similarity between the fingerprint information of the clause to be audited and the fingerprint characteristics of the documents in the database is calculated, and a plurality of candidate document information which is most similar to the clause to be audited is screened based on the similarity. In the process of screening based on the similarity, all candidate document information with the similarity greater than a preset similarity threshold may be selected, or a plurality of candidate document information with the maximum similarity may be selected according to the sequence of the similarity from large to small, which is not limited herein.
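The following sketch shows one common way to compute a simhash fingerprint and compare two fingerprints. The application does not specify the tokenization, the hash function, the fingerprint width, or the similarity measure; character bigrams, SHA-256, 64 bits, unweighted tokens, and a Hamming-distance-based similarity are all assumptions made for illustration.

```python
import hashlib
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Tokenize text into overlapping character n-grams (assumed tokenizer)."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(tokens: List[str], bits: int = 64) -> int:
    """Compute a simhash fingerprint; `bits` must be a multiple of 8, at most 256."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_similarity(a: int, b: int, bits: int = 64) -> float:
    """Similarity in [0, 1]: 1.0 means identical fingerprints."""
    distance = bin(a ^ b).count("1")
    return 1.0 - distance / bits
```

Document fingerprints can be computed once when the database is built or updated, so that at audit time only the fingerprint of the clause to be audited has to be computed and compared.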

In practical applications, the three ways of recalling candidate information corresponding to the clause to be audited from the database may be used individually, in any combination of two, or all three together. In every case, a plurality of candidate information corresponding to the clause to be audited can be recalled from the database. These recall ways help ensure the accuracy of the recalled candidate information, thereby effectively improving the accuracy of text auditing.

FIG. 2 is a schematic diagram according to a second embodiment of the present application. The text auditing method of this embodiment describes the technical solution of the present application in more detail on the basis of the embodiment shown in FIG. 1. As shown in FIG. 2, the text auditing method of this embodiment may specifically include the following steps:

S201, splitting the text to be audited into clauses using a pre-trained clause model to obtain a plurality of clauses to be audited;

In this embodiment, a clause model may be used to split the text to be audited into clauses, where the clause model is a pre-trained neural network model. For training, a plurality of training texts may be collected, two consecutive training clauses in each training text may be labeled, and the clause model may be trained so that it learns the labeled way of splitting. For example, in this embodiment, two consecutive training clauses may be separated in whatever way is desired, such as at a comma, period, or semicolon. Alternatively, in the training data of this embodiment, what a person says, "… …", is placed within a single training clause, so that the content of the utterance stays entirely within one clause.

During training, the training text of each piece of training data is input into the clause model, which splits the training text into clauses. The clauses predicted by the clause model are compared with the clauses labeled in the training data; if they are inconsistent, the parameters of the clause model are adjusted so that the predicted clauses become consistent with the labeled clauses. The clause model is trained continuously in this way on a plurality of pieces of training data, so that it learns the way of splitting labeled in the training data and can then accurately mark each clause to be audited in the text to be audited.

S202, recalling, based on each clause to be audited, a plurality of candidate document information from the database by searching, as a plurality of candidate information;

S203, recalling, based on the trie structure, a plurality of candidate sentence information for the clause to be audited from the database, as a plurality of candidate information;

S204, recalling, based on the simhash algorithm, a plurality of candidate document information for the clause to be audited from the database, as a plurality of candidate information;

This embodiment takes as an example the case where step S102 includes all three ways, namely steps S202, S203 and S204, so that candidate document information and candidate sentence information are recalled from the database. In practical applications, as described in the above embodiment, the three ways of recalling candidate information may also be used individually or in any combination of two, which is not limited herein.

S205, scoring is carried out on each candidate information in all the obtained candidate information;

specifically, each candidate information may be scored with reference to one or more features of the candidate information. For example, optionally, when implemented, the step may include the following steps:

(a2) acquiring feature information corresponding to each candidate information based on each candidate information and the clause to be checked;

for example, optionally, when the step (a2) is implemented, at least one of the following characteristic information may be acquired:

(a3) acquiring features related to the longest public subsequence based on each candidate information and the clause to be checked;

specifically, the longest common subsequence can be obtained based on each candidate information and clauses to be checked; and then for each candidate message, acquiring the ratio of the number of the skip words in the generation process of the corresponding longest public subsequence to the corresponding longest public subsequence, the ratio of the corresponding longest public subsequence to the length of the clause to be audited, and the ratio of the number of the skip words in the generation process of the corresponding longest public subsequence to the length of the clause to be audited.

For example, for the first manner described in step S202, when the obtained candidate information is candidate document information, the to-be-reviewed clause may be specifically compared with each sentence in the candidate document to obtain a common subsequence of each sentence and the to-be-reviewed clause. It should be noted that the common sub-sequence may skip words and is not completely continuous. For example, if the clause to be reviewed is "ABCDEF" and a sentence is "A1C 2 EF", the corresponding common subsequence is "A, C, EF", and the corresponding length is 4 words. In a similar manner, a common subsequence of each sentence in the candidate document and the clause to be reviewed can be obtained. And then, taking the longest public subsequence, and obtaining a sentence corresponding to the longest public subsequence as a candidate sentence corresponding to the clause to be audited. Then, for each candidate statement, the proportion of the number of the skip words in the generation process of the corresponding longest public subsequence to the corresponding longest public subsequence, the proportion of the corresponding longest public subsequence to the length of the clause to be audited, and the proportion of the number of the skip words in the generation process of the corresponding longest public subsequence to the length of the clause to be audited can be obtained.

For example, for the second manner described in step S203, the obtained candidate information is candidate sentence information, and at this time, each candidate sentence is directly compared with the clause to be checked, so as to obtain the longest common subsequence of each candidate sentence and the clause to be checked. And further acquiring the ratio of the number of the skip words in the generation process of the corresponding longest public subsequence to the corresponding longest public subsequence, the ratio of the corresponding longest public subsequence to the length of the clause to be audited, and the ratio of the number of the skip words in the generation process of the corresponding longest public subsequence to the length of the clause to be audited.

For example, in the third method described in step S204 above, the obtained candidate information is also candidate document information as in the first method. At this time, the manner of obtaining the features related to the longest common subsequence is the same as the first manner described in step S202, and reference may be made to the above related description for details, which is not repeated herein.
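
The features related to the longest common subsequence can be illustrated with a minimal Python sketch. The function names `lcs_matched_positions`, `lcs_features` and `pick_candidate_sentence` are hypothetical, and the rule for counting skipped words is an assumption: a skipped word is taken to be a character of the clause to be reviewed that lies between the first and the last matched character but is not itself matched. The description does not fix this rule, so other counting conventions are equally possible.

```python
def lcs_matched_positions(clause, sentence):
    """Return the indices in `clause` that form one longest common subsequence."""
    n, m = len(clause), len(sentence)
    # dp[i][j] holds the LCS length of clause[i:] and sentence[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if clause[i] == sentence[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover the matched clause positions.
    matched, i, j = [], 0, 0
    while i < n and j < m:
        if clause[i] == sentence[j]:
            matched.append(i)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def lcs_features(clause, sentence):
    """Compute the three ratios described in the text (skip rule is an assumption)."""
    matched = lcs_matched_positions(clause, sentence)
    lcs_len = len(matched)
    if lcs_len == 0:
        return {"lcs_len": 0, "skip_over_lcs": 0.0,
                "lcs_over_clause": 0.0, "skip_over_clause": 0.0}
    # Assumed rule: unmatched clause characters inside the matched span.
    skips = (matched[-1] - matched[0] + 1) - lcs_len
    return {
        "lcs_len": lcs_len,
        "skip_over_lcs": skips / lcs_len,
        "lcs_over_clause": lcs_len / len(clause),
        "skip_over_clause": skips / len(clause),
    }


def pick_candidate_sentence(clause, sentences):
    """Pick the sentence of a candidate document with the longest LCS."""
    return max(sentences, key=lambda s: len(lcs_matched_positions(clause, s)))
```

For the example in the text, `lcs_features("ABCDEF", "A1C 2 EF")` gives an LCS length of 4; under the assumed skip rule, "B" and "D" are the two skipped words, so the three ratios are 2/4, 4/6 and 2/6.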

(b3) Acquiring the similarity between each recalled candidate information and the clause to be reviewed;

For example, for the first manner described in step S202, the similarity between each candidate information and the clause to be reviewed may be calculated in step (d1), with reference to the descriptions of (a1)-(e1) in the above embodiment; this is equivalent to the correlation between each candidate document information screened in step (e1) of the above embodiment and the clause to be reviewed.

For example, when each candidate information obtained in the second manner in step S203 is candidate sentence information, the semantic similarity between each candidate information and the clause to be reviewed may be directly used as the similarity between each candidate information and the clause to be reviewed.

For example, when each candidate information obtained in the third manner in step S204 is candidate document information, the similarity between each candidate information and the clause to be reviewed may be obtained with reference to the related description of the third manner in the foregoing embodiment, which is not repeated herein.

(c3) Acquiring timeliness scores of the candidate information based on the candidate information and the time information of the clause to be audited; and

The timeliness score of this embodiment may be determined based on the time information of the clause to be reviewed and the time information corresponding to each candidate information. The time information of the clause to be checked may be the time information of the text to be checked to which the clause belongs, and the time information corresponding to each candidate information may be time information such as the publication time of the candidate document, or of the candidate document to which the candidate sentence belongs. If the candidate information is a document published on the network, the time information it carries may be precise to the time of publication. If the candidate information is a document issued in another manner, the time information may be precise only to a specific date. The smaller the difference between the time information of the candidate information and that of the clause to be checked, the larger the timeliness score of the candidate information may be set, indicating a higher temporal similarity between the two. Conversely, the larger the difference, the smaller the timeliness score may be set, indicating a lower temporal similarity. For example, the timeliness score may be set to a value between 0 and 1.
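
One possible way to map a time difference to a score between 0 and 1 is sketched below. The exponential decay and the `half_life_days` parameter are assumptions made for illustration; the description only requires that a smaller time difference yields a larger score.

```python
from datetime import datetime


def timeliness_score(clause_time, candidate_time, half_life_days=30.0):
    """Return a score in (0, 1]; 1.0 means both timestamps are identical.

    The decay shape and the half-life are hypothetical choices.
    """
    gap_days = abs((clause_time - candidate_time).total_seconds()) / 86400.0
    # The score halves every `half_life_days` days of difference.
    return 0.5 ** (gap_days / half_life_days)


# Usage: a candidate published 30 days before the text to be audited.
score = timeliness_score(datetime(2020, 12, 8), datetime(2020, 11, 8))
```

With the assumed half-life of 30 days, the candidate in the usage example receives a score of 0.5, and a candidate with the same timestamp as the text to be audited receives 1.0.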

(d3) Obtaining semantic similarity between the candidate sentences in the candidate information and the clauses to be checked.

For example, for the first manner described in step S202 and the third manner described in S204, the candidate information is a candidate document, and a sentence with the longest common subsequence length with the to-be-reviewed clause in the candidate document may be taken as a candidate sentence.

For the second manner described in step S203, the obtained candidate information is the candidate sentence information.

Specifically, a pre-trained semantic representation model may be adopted to respectively obtain semantic representations of the candidate sentences and the clauses to be reviewed, and then semantic similarity between the candidate sentences and the clauses to be reviewed is calculated based on the semantic representations of the candidate sentences and the clauses to be reviewed.
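
The computation of semantic similarity from two semantic representations can be sketched as follows. The description does not name the similarity measure, so cosine similarity is an assumption here. The function `toy_encode` is a hypothetical stand-in that counts character bigrams; it only takes the place of the pre-trained semantic representation model so that the example runs on its own.

```python
import math
from collections import Counter


def toy_encode(text):
    """Stand-in encoder (assumption): character-bigram counts, not a trained model."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity(candidate_sentence, clause, encode=toy_encode):
    """Encode both texts with the same model, then compare the representations."""
    return cosine_similarity(encode(candidate_sentence), encode(clause))
```

In practice the `encode` argument would be replaced by a call into whichever pre-trained semantic representation model is used, returning dense vectors instead of bigram counts.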

For each candidate information, various feature information corresponding to the candidate information may be acquired according to the above (a3) to (d3), and the various feature information may exist alone or may be arbitrarily combined to constitute the feature information corresponding to the candidate information acquired in step (a2).

(b2) Scoring each candidate information based on the feature information corresponding to each candidate information and a pre-trained scoring model.

The feature information corresponding to each candidate information obtained in step (a2) is input into a pre-trained scoring model, which predicts and outputs a scoring result of the candidate information relative to the clause to be checked. The scoring result indicates the correlation between the candidate information and the clause to be checked: the larger the value of the scoring result, the stronger the correlation; conversely, the smaller the value, the weaker the correlation.

The scoring model of this embodiment is the same as the correlation scoring model in the above embodiments, and details can refer to the related description of the above embodiments, which are not described herein again.

By acquiring various kinds of feature information corresponding to the candidate information in the manners of (a3) to (d3), comprehensiveness of the feature information of the acquired candidate information can be effectively ensured, and further, accuracy of scoring the candidate information based on the feature information corresponding to the candidate information can be effectively ensured.
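
How feature information of types (a3) to (d3) might be fed to a scoring model can be sketched as follows. The feature names, the weights and the logistic form are all hypothetical; the description does not disclose the structure of the pre-trained scoring model, and a real model would learn its parameters from data. Features that were not acquired are simply treated as zero, reflecting that the feature information may exist alone or in combination.

```python
import math

# Hypothetical feature names, one per kind of feature information.
FEATURE_ORDER = (
    "skip_over_lcs", "lcs_over_clause", "skip_over_clause",   # (a3)
    "recall_similarity",                                      # (b3)
    "timeliness",                                             # (c3)
    "semantic_similarity",                                    # (d3)
)

# Assumed weights and bias; a trained scoring model would supply these.
ASSUMED_WEIGHTS = {
    "skip_over_lcs": -1.0, "lcs_over_clause": 2.0, "skip_over_clause": -1.0,
    "recall_similarity": 1.0, "timeliness": 0.5, "semantic_similarity": 2.0,
}
ASSUMED_BIAS = -1.0


def score_candidate(features, weights=ASSUMED_WEIGHTS, bias=ASSUMED_BIAS):
    """Return a score in (0, 1); larger means more relevant to the clause."""
    z = bias + sum(weights[name] * features.get(name, 0.0) for name in FEATURE_ORDER)
    return 1.0 / (1.0 + math.exp(-z))


# Usage with two hypothetical candidates.
strong = {"lcs_over_clause": 0.9, "semantic_similarity": 0.9, "timeliness": 1.0}
weak = {"lcs_over_clause": 0.2, "skip_over_lcs": 0.8}
scores = [score_candidate(strong), score_candidate(weak)]
```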

S206, acquiring candidate information most relevant to the clause to be checked from the plurality of candidate information based on the scoring result of each candidate information;

For example, the candidate information with the largest scoring result may be taken as the candidate information most relevant to the clause to be reviewed.

Steps S205-S206 are a specific implementation manner of step S103 in the embodiment shown in fig. 1. The method can effectively ensure the accuracy of the acquired most relevant candidate information, and further can effectively ensure the accuracy of text auditing.
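
The selection in step S206 amounts to taking the candidate with the largest scoring result, as in the following minimal sketch. The function name `most_relevant` is hypothetical, and returning the first candidate in case of a tie is an assumption, since the description does not address ties.

```python
def most_relevant(candidates, scores):
    """Return the candidate whose scoring result is largest (first one on ties)."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and aligned")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]


# Usage with hypothetical candidates and scoring results.
best = most_relevant(["doc A", "doc B", "doc C"], [0.2, 0.9, 0.5])
```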

S207, based on the most relevant candidate information, adopting a pre-trained auditing model to audit the clauses to be audited.

For example, in this embodiment, the most relevant candidate sentence may be obtained based on the most relevant candidate information. The most relevant candidate sentence and the clause to be audited are then input into a pre-trained audit model, which audits the clause to be audited according to the most relevant candidate sentence, and generates and outputs an audit result.

In the first and third manners of obtaining candidate information as described above, when the candidate information is a candidate document, a sentence corresponding to the longest common subsequence in the most relevant candidate documents may be taken as the most relevant candidate sentence. In the second way of obtaining candidate information as described above, when the candidate information is a candidate sentence, the most relevant candidate information obtained is the most relevant candidate sentence.

In this embodiment, when the audit model audits the clause to be audited according to the most relevant candidate sentence and generates the audit result, the audit result may carry the error category, the error content, and the corrected content. For example, if a text error is found, the audit result may directly carry the error category (text error), the error content, and the corrected content. If a fact error is found, the audit result may directly carry the error category (fact error), the error content, and the corrected content.
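
The content carried by an audit result can be represented by a small record type, sketched below. The class name `AuditResult`, the field names and the `is_correct` flag are hypothetical; the sample values in the usage lines are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditResult:
    """Audit result for one clause; the three error fields are set only on error."""
    is_correct: bool
    error_category: Optional[str] = None     # e.g. "text error" or "fact error"
    error_content: Optional[str] = None      # the erroneous part of the clause
    corrected_content: Optional[str] = None  # the correction based on the candidate

    def __post_init__(self):
        error_fields = (self.error_category, self.error_content, self.corrected_content)
        if not self.is_correct and None in error_fields:
            raise ValueError("an incorrect clause needs category, content and correction")


# Usage with invented sample values.
passed = AuditResult(is_correct=True)
failed = AuditResult(
    is_correct=False,
    error_category="fact error",
    error_content="90 degrees",
    corrected_content="100 degrees",
)
```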

The error category of the present embodiment may be predefined. The above text errors and fact errors may further include various types of errors, such as entity name errors, author errors, generation errors, and language recording errors, which are not enumerated one by one herein.

When the auditing model is trained, training data of all error categories may be collected. Each piece of training data includes a training clause and a reference sentence, and is labeled with the error category, the error content in the training clause, and the content corrected based on the reference sentence. During training, the training clause and the reference sentence in each piece of training data are input into the auditing model, which predicts and outputs an auditing result containing the predicted error category, the predicted error content in the training clause, and the predicted content corrected based on the reference sentence. The predicted result is then compared with the labeled result, and if they differ, the parameters of the auditing model are adjusted so that the predicted result tends to be consistent with the labeled result. The auditing model is trained continuously in this manner with all training data of all error categories, so that it learns how to audit the training data, until the predicted result is consistent with the labeled result; the parameters of the auditing model are determined when the training is finished, and the auditing model is thereby determined.
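
The loop of predicting, comparing with the label, and adjusting parameters until both agree can be illustrated with a deliberately tiny stand-in. The sketch below uses a perceptron on toy numeric feature vectors with a binary label (1 for "contains an error", 0 for "correct"). This is an assumption made only to show the training loop; the actual auditing model, its inputs (training clause and reference sentence) and its outputs (error category, error content, corrected content) are far richer and are not disclosed at this level of detail.

```python
def predict(weights, bias, x):
    """Binary prediction of the stand-in model."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


def train_until_consistent(samples, labels, max_epochs=100):
    """Adjust parameters whenever a prediction differs from its label.

    Stops early once every prediction matches the labeled result.
    """
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error != 0:
                # Move the parameters toward the labeled result.
                weights = [w + error * v for w, v in zip(weights, x)]
                bias += error
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


# Toy, linearly separable data (invented for illustration).
samples = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.7)]
labels = [1, 1, 0, 0]
weights, bias = train_until_consistent(samples, labels)
```

On this toy data the loop reaches agreement between predictions and labels well within the epoch limit; for data that is not linearly separable the stand-in would simply stop after `max_epochs`.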

According to the manner of this embodiment, the whole text to be audited is audited by auditing each clause to be audited in the text to be audited.

By adopting various recall modes in the technical scheme, the text auditing method of the embodiment can recall more candidate information more comprehensively and more accurately, and score each candidate information in all the obtained candidate information; and the candidate information most relevant to the clause to be audited is obtained based on the scoring result, so that the accuracy of the obtained most relevant candidate information can be effectively ensured, and the accuracy of text audit and the text audit efficiency can be effectively improved. In addition, in the embodiment, the pre-trained auditing model is also adopted to audit the clauses to be audited, so that the efficiency of text auditing can be further effectively improved.

FIG. 3 is a schematic illustration according to a third embodiment of the present application; as shown in fig. 3, the present embodiment provides a text auditing apparatus 300, including:

a clause acquiring module 301, configured to acquire a clause to be audited of a text to be audited;

a recall module 302, configured to recall, from the database, a plurality of candidate information corresponding to the clause to be reviewed based on the clause to be reviewed;

a candidate obtaining module 303, configured to obtain candidate information most relevant to the clause to be reviewed based on the plurality of candidate information;

and an auditing module 304, configured to audit the clause to be audited based on the most relevant candidate information.

The implementation principle and technical effect of the text auditing apparatus 300 of this embodiment, which performs text auditing by using the above modules, are the same as those of the above related method embodiments; reference may be made to the descriptions of the above method embodiments for details, which are not repeated herein.
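
The cooperation of modules 301 to 304 can be sketched as a small pipeline in which each module is supplied as a callable. The class name `TextAuditor`, the parameter names and the handling of clauses with no recalled candidates (recorded with `None`) are assumptions for illustration only.

```python
class TextAuditor:
    """Wires together stand-ins for modules 301 to 304."""

    def __init__(self, split_clauses, recall, select_most_relevant, audit):
        self.split_clauses = split_clauses                 # clause acquiring module 301
        self.recall = recall                               # recall module 302
        self.select_most_relevant = select_most_relevant   # candidate obtaining module 303
        self.audit = audit                                 # auditing module 304

    def audit_text(self, text):
        """Audit every clause of the text and return (clause, result) pairs."""
        results = []
        for clause in self.split_clauses(text):
            candidates = self.recall(clause)
            if not candidates:
                # Assumed behavior: nothing recalled, so no audit result.
                results.append((clause, None))
                continue
            best = self.select_most_relevant(clause, candidates)
            results.append((clause, self.audit(clause, best)))
        return results


# Usage with toy stand-ins for each module (all invented for illustration).
database = ["Water boils at 100 degrees", "The sky is blue"]
auditor = TextAuditor(
    split_clauses=lambda t: [s.strip() for s in t.split(".") if s.strip()],
    recall=lambda c: [d for d in database if d.split()[0] == c.split()[0]],
    select_most_relevant=lambda c, cands: cands[0],
    audit=lambda c, best: c == best,
)
report = auditor.audit_text("Water boils at 90 degrees. The sky is blue. Cats bark.")
```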

FIG. 4 is a schematic illustration of a fourth embodiment according to the present application; as shown in fig. 4, the text auditing apparatus 300 of the present embodiment further describes the technical solution of the present application in more detail based on the technical solution of the embodiment shown in fig. 3.

As shown in fig. 4, in the text auditing apparatus 300 of this embodiment, the recall module 302 includes at least one of the following:

a search recalling unit 3021, configured to recall, in a search manner, information on a plurality of candidate documents from the database based on the clause to be reviewed;

a tree structure recalling unit 3022, configured to recall, from the database, information of multiple candidate sentences of the clause to be reviewed based on the trie tree structure; and

a hash recall unit 3023, configured to recall, based on a simhash algorithm, information of multiple candidate documents of the clause to be reviewed from the database.
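
A simhash-based recall of the kind performed by the hash recall unit can be sketched as follows. The tokenization into character bigrams, the use of MD5 as the per-token hash, the 64-bit fingerprint and the Hamming-distance threshold `max_distance=3` are all assumptions; the description only states that a simhash algorithm is used.

```python
import hashlib


def simhash(text, ngram=2, bits=64):
    """Compute a simhash fingerprint (bits must be a multiple of 8, at most 128)."""
    counts = [0] * bits
    tokens = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def simhash_recall(clause, documents, max_distance=3):
    """Recall documents whose fingerprint is close to that of the clause."""
    target = simhash(clause)
    return [doc for doc in documents
            if hamming_distance(target, simhash(doc)) <= max_distance]
```

In a real system the fingerprints of the documents in the database would be computed once and indexed, rather than recomputed for every clause as in this sketch.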

Further optionally, the search recall unit 3021 is configured to:

recalling a plurality of candidate document information of the clauses to be audited from the database by means of Elasticsearch;

recalling a plurality of candidate document information of the clause to be audited from the database based on the similarity by adopting a pre-trained semantic representation model;

respectively extracting at least one piece of relevant characteristic information corresponding to the recalled alternative document information based on the clause to be checked and the alternative document information;

acquiring the relevance of each candidate document information and a clause to be checked by adopting a pre-trained relevance scoring model based on at least one piece of relevant characteristic information corresponding to each candidate document information;

and screening a plurality of candidate document information from all recalled candidate document information based on the relevance between each candidate document information and the clause to be checked and a preset relevance threshold value.
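
The flow of the search recall unit, in which documents recalled through two channels are scored for relevance and then screened against a preset threshold, can be sketched as follows. The function name `search_recall`, the callables passed in, the de-duplication step and the rule that a document is kept when its relevance is greater than or equal to the threshold are assumptions; real keyword and semantic recall would query a search engine and a vector index, respectively.

```python
def search_recall(clause, keyword_recall, semantic_recall, relevance, threshold=0.5):
    """Merge two recall channels, score relevance, and screen by threshold."""
    seen, merged = set(), []
    for doc in list(keyword_recall(clause)) + list(semantic_recall(clause)):
        if doc not in seen:          # assumed: drop duplicates across channels
            seen.add(doc)
            merged.append(doc)
    scored = [(doc, relevance(clause, doc)) for doc in merged]
    return [doc for doc, score in scored if score >= threshold]


# Usage with toy stand-ins (document ids and scores invented for illustration).
toy_scores = {"d1": 0.9, "d2": 0.4, "d3": 0.6}
kept = search_recall(
    "some clause",
    keyword_recall=lambda c: ["d1", "d2"],
    semantic_recall=lambda c: ["d2", "d3"],
    relevance=lambda c, d: toy_scores[d],
    threshold=0.5,
)
```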

Further optionally, as shown in fig. 4, in the text auditing apparatus 300 of this embodiment, the candidate obtaining module 303 includes:

a scoring unit 3031 configured to score each candidate information in the multiple candidate information;

a screening unit 3032, configured to obtain candidate information that is most relevant to the clause to be reviewed from the multiple candidate information based on the scoring result of each candidate information.

Further optionally, a scoring unit 3031 is configured to:

acquiring characteristic information corresponding to each candidate information based on each candidate information and the clause to be checked;

and scoring each candidate information based on the characteristic information corresponding to each candidate information and a pre-trained scoring model.

Further optionally, the scoring unit 3031 is configured to perform at least one of:

acquiring features related to the longest common subsequence based on each candidate information and clauses to be checked;

acquiring the similarity between each recalled candidate information and a clause to be checked;

acquiring timeliness scores of the candidate information based on the candidate information and the time information of the clauses to be audited; and

obtaining semantic similarity between the candidate sentences in each candidate information and the clauses to be checked.

Further optionally, a scoring unit 3031 is configured to:

acquiring a longest common subsequence based on each candidate information and clauses to be audited;

and for each candidate information, acquiring the ratio of the number of skipped words in the generation process of the corresponding longest common subsequence to the corresponding longest common subsequence, the ratio of the corresponding longest common subsequence to the length of the clause to be audited, and the ratio of the number of skipped words in the generation process of the corresponding longest common subsequence to the length of the clause to be audited.

Further optionally, the auditing module 304 is configured to:

and based on the most relevant candidate information, adopting a pre-trained auditing model to audit the clauses to be audited.

According to embodiments of the present application, an electronic device, a readable storage medium, and a computer program product are also provided.

Fig. 5 is a block diagram of an electronic device implementing a text auditing method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 5, the electronic device includes: one or more processors 501, memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, if desired. Also, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as an array of servers, a group of blade servers, or a multi-processor system). Fig. 5 takes one processor 501 as an example.

Memory 502 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform a text auditing method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the text auditing methods provided herein.

The memory 502, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules (e.g., related modules shown in fig. 3 and 4) corresponding to the text auditing method in the embodiments of the present application. The processor 501 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 502, namely, implements the text auditing method in the above method embodiments.

The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of an electronic device implementing the text auditing method, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 502 may optionally include memory located remotely from the processor 501, which may be connected via a network to an electronic device implementing the text auditing method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device for implementing the text auditing method may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.

The input device 503, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick or other input device, may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic device implementing the text auditing method. The output devices 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the Internet, and blockchain networks.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the defects of high management difficulty and weak service extensibility in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server that incorporates a blockchain.

According to the technical scheme of the embodiment of the application, the clauses to be audited of the text to be audited are obtained; recalling a plurality of candidate information corresponding to the clause to be audited from the database based on the clause to be audited; acquiring candidate information most relevant to the clauses to be checked based on the plurality of candidate information; and auditing the clauses to be audited based on the most relevant candidate information. By adopting the technical scheme, the method and the device can automatically audit each clause to be audited of the text to be audited, further realize the audit of the text to be audited, avoid the audit of the text to be audited manually, and effectively improve the accuracy of text audit and the efficiency of text audit.

According to the technical scheme of the embodiment of the application, by adopting the multiple recall modes in the technical scheme, more candidate information can be recalled more comprehensively and more accurately, and each candidate information in all the obtained candidate information is scored; and the candidate information most relevant to the clause to be audited is obtained based on the scoring result, so that the accuracy of the obtained most relevant candidate information can be effectively ensured, and the accuracy of text audit and the text audit efficiency can be effectively improved. In addition, in the embodiment, the pre-trained auditing model is also adopted to audit the clauses to be audited, so that the efficiency of text auditing can be further effectively improved.

It should be understood that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments are not intended to limit the scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A text auditing method, wherein the method comprises:

acquiring clauses to be audited of the text to be audited;

recalling a plurality of candidate information corresponding to the clause to be audited from a database based on the clause to be audited;

obtaining candidate information most relevant to the clause to be audited based on the scoring results of the candidate information; the scoring result is obtained based on the feature information corresponding to each candidate information and a pre-trained scoring model; the feature information corresponding to each candidate information includes at least one of the following: features related to the longest common subsequence obtained based on each candidate information and the clause to be audited; similarity between each recalled candidate information and the clause to be audited; the timeliness score of each candidate information, obtained based on each candidate information and the time information of the clause to be audited; and semantic similarity between the candidate sentences in the candidate information and the clauses to be audited;

inputting the most relevant candidate information and the clause to be audited into a pre-trained auditing model, and generating and outputting an auditing result by the auditing model, which comprises: comparing the most relevant candidate information with the clause to be audited to detect whether the clause to be audited is correct; and, if the clause to be audited is incorrect, identifying the erroneous content, classifying the erroneous content of the clause to be audited to identify the error category, and outputting the corrected content.

2. The method of claim 1, wherein recalling a plurality of candidate information corresponding to the clause to be reviewed from a database based on the clause to be reviewed comprises at least one of:

recalling a plurality of candidate document information from the database in a searching mode based on the clause to be audited;

recalling a plurality of candidate sentence information of the clause to be audited from the database based on the trie tree structure; and

and recalling a plurality of candidate document information of the clause to be audited from the database based on a simhash algorithm.

3. The method of claim 2, wherein recalling a plurality of candidate document information from the database by means of search based on the clause to be reviewed comprises:

recalling a plurality of candidate document information of the clause to be audited from the database by means of Elasticsearch;

recalling a plurality of candidate document information of the clause to be audited from the database by adopting a pre-trained semantic representation model based on the similarity;

respectively extracting at least one piece of relevant characteristic information corresponding to each piece of recalled alternative document information based on the clause to be audited and each piece of alternative document information;

based on the at least one piece of relevant characteristic information corresponding to each piece of candidate document information, adopting a pre-trained relevance degree scoring model to obtain the relevance degree of each piece of candidate document information and the clause to be audited;

and screening the candidate document information from all the recalled candidate document information based on the correlation degree of each candidate document information and the clause to be reviewed and a preset correlation degree threshold value.

4. The method according to any one of claims 1 to 3, wherein obtaining candidate information that is most relevant to the clause to be reviewed based on the scoring result of the plurality of candidate information comprises:

scoring each of the plurality of candidate information;

and acquiring candidate information most relevant to the clause to be checked from the plurality of candidate information based on the scoring result of each candidate information.

5. The method of claim 4, wherein scoring each of the plurality of candidate information comprises:

acquiring feature information corresponding to each candidate information based on each candidate information and the clause to be checked;

and scoring each candidate information based on the feature information corresponding to each candidate information and a pre-trained scoring model.

6. The method according to claim 5, wherein obtaining feature information corresponding to each candidate information based on each candidate information and the clause to be audited includes at least one of:

acquiring features related to the longest common subsequence based on each candidate information and the clause to be audited;

obtaining the similarity between each recalled candidate information and the clause to be audited;

acquiring a timeliness score of each candidate information based on each candidate information and the time information of the clause to be audited; and

acquiring the semantic similarity between the candidate sentences in the candidate information and the clause to be audited.
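Two of the features listed above can be sketched as follows. The claims do not define how the timeliness score or the similarity is computed, so both formulas are assumptions chosen for illustration: a reciprocal decay over the date gap, and a character-level Jaccard overlap. The `half_life_days` parameter and the function names are hypothetical.

```python
from datetime import date

def timeliness_score(candidate_date, clause_date, half_life_days=365.0):
    """Score in (0, 1] that shrinks as the two dates move apart.

    The score is 1.0 for identical dates and 0.5 when the gap equals
    half_life_days. The decay shape is an assumption.
    """
    gap_days = abs((clause_date - candidate_date).days)
    return 1.0 / (1.0 + gap_days / half_life_days)

def jaccard_similarity(candidate, clause):
    """Character-set Jaccard overlap, a simple stand-in similarity."""
    a, b = set(candidate), set(clause)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

same_day = timeliness_score(date(2021, 1, 1), date(2021, 1, 1))
one_year = timeliness_score(date(2021, 1, 1), date(2022, 1, 1))
overlap = jaccard_similarity("abc", "abd")
```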

7. The method of claim 6, wherein obtaining features related to the longest common subsequence based on each candidate information and the clause to be audited comprises:

acquiring the longest common subsequence based on each candidate information and the clause to be audited;
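The longest common subsequence of a candidate and a clause can be computed with the standard dynamic-programming method, sketched below. The `lcs_ratio` feature derived from it is a hypothetical example of an LCS-related feature and is not taken from the claims.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))

def lcs_ratio(candidate, clause):
    """Hypothetical feature: LCS length relative to the clause length."""
    if not clause:
        return 0.0
    return len(longest_common_subsequence(candidate, clause)) / len(clause)
```

Because characters are compared one by one, the same function works on Chinese text without prior word segmentation.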

8. A text auditing apparatus, wherein the apparatus comprises:

the clause acquisition module is used for acquiring clauses to be audited of the text to be audited;

the recalling module is used for recalling a plurality of candidate information corresponding to the clause to be audited from a database based on the clause to be audited;

the candidate obtaining module is used for obtaining the candidate information most relevant to the clause to be audited based on scoring results of the plurality of candidate information; the scoring results are obtained based on feature information corresponding to each candidate information and a pre-trained scoring model; the feature information corresponding to each candidate information includes at least one of the following: features related to the longest common subsequence obtained based on each candidate information and the clause to be audited; the similarity between each recalled candidate information and the clause to be audited; a timeliness score of each candidate information obtained based on each candidate information and the time information of the clause to be audited; and the semantic similarity between the candidate sentences in the candidate information and the clause to be audited;

the auditing module is used for inputting the most relevant candidate information and the clause to be audited into a pre-trained auditing model, which generates and outputs an auditing result, including: comparing the most relevant candidate information with the clause to be audited to detect whether the clause to be audited is correct; and, if the clause to be audited is incorrect, identifying the erroneous content, classifying the erroneous content to identify the error category, and outputting the corrected content.
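The cooperation of the four modules can be sketched as a small pipeline. The recall, scoring and auditing steps are passed in as callables because the claims leave the underlying models unspecified. The punctuation-based clause splitter, the function names and the toy stand-ins in the usage lines are all assumptions made for illustration.

```python
import re

def split_clauses(text):
    """Split text into clauses on common sentence punctuation (an assumption)."""
    return [part.strip() for part in re.split(r"[。！？；.!?;]", text) if part.strip()]

def audit_text(text, recall, score, audit):
    """Audit each clause against its highest-scoring recalled candidate.

    recall(clause) returns a list of candidates, score(clause, candidate)
    returns a number, and audit(clause, candidate) returns an auditing result.
    """
    results = []
    for clause in split_clauses(text):
        candidates = recall(clause)
        if not candidates:
            # Nothing recalled, so there is no evidence to audit against.
            results.append({"clause": clause, "evidence": None, "result": None})
            continue
        best = max(candidates, key=lambda cand: score(clause, cand))
        results.append({"clause": clause, "evidence": best,
                        "result": audit(clause, best)})
    return results

# Toy stand-ins for the database and the pre-trained models.
facts = ["Paris is the capital of France", "Lyon is a city"]
word_overlap = lambda clause, cand: len(set(clause.split()) & set(cand.split()))
exact_match = lambda clause, cand: "correct" if clause == cand else "incorrect"

report = audit_text(
    "Paris is the capital of France. Berlin is the capital of France.",
    recall=lambda clause: facts,
    score=word_overlap,
    audit=exact_match,
)
```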

9. The apparatus of claim 8, wherein the recall module comprises at least one of:

the search recalling unit is used for recalling a plurality of candidate document information from the database in a searching mode based on the clause to be audited;

the tree structure recalling unit is used for recalling a plurality of candidate sentence information of the clause to be audited from the database based on a trie tree structure; and

the hash recall unit is used for recalling a plurality of candidate document information of the clause to be audited from the database based on a simhash algorithm.
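The hash recall unit can be illustrated with a minimal simhash sketch. It assumes the clause and the documents are already tokenized, gives every token equal weight, and uses the first 8 bytes of an MD5 digest as the 64-bit token hash. The function names, the in-memory `index` and the `max_distance` threshold are hypothetical.

```python
import hashlib

BITS = 64

def simhash(tokens):
    """Compute a 64-bit simhash fingerprint of a token list."""
    counts = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def recall_by_simhash(clause_tokens, index, max_distance=3):
    """Return ids of documents whose fingerprint is close to the clause's.

    index maps a document id to a precomputed fingerprint.
    """
    target = simhash(clause_tokens)
    return [doc_id for doc_id, fp in index.items()
            if hamming_distance(target, fp) <= max_distance]

index = {"d1": simhash(["text", "auditing", "method"]),
         "d2": simhash(["weather", "forecast", "today"])}
hits = recall_by_simhash(["text", "auditing", "method"], index)
```

A linear scan over the index is used here for brevity; a production system would typically bucket fingerprints so that near matches can be found without comparing against every document.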

10. The apparatus of claim 9, wherein the search recall unit is to:

respectively extracting at least one piece of relevant characteristic information corresponding to each recalled candidate document information based on the clause to be audited and each candidate document information;

11. The apparatus of any of claims 8-10, wherein the candidate acquisition module comprises:

a scoring unit configured to score each of the plurality of candidate information;

and the screening unit is used for acquiring, from the plurality of candidate information, the candidate information most relevant to the clause to be audited based on the scoring result of each candidate information.

12. The apparatus of claim 11, wherein the scoring unit is configured to:

13. The apparatus of claim 12, wherein the scoring unit is configured to perform at least one of:

14. The apparatus of claim 13, wherein the scoring unit is to:

acquiring the longest common subsequence based on each candidate information and the clause to be audited;

15. An electronic device, comprising:

at least one processor; and

a memory storing instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.

17. A computer program product comprising instructions which, when executed by a processor, perform the method of any one of claims 1-7.