CN108052499B - Text error correction method and device based on artificial intelligence and computer readable medium - Google Patents


Info

Publication number
CN108052499B
CN108052499B (application CN201711159880.7A)
Authority
CN
China
Prior art keywords
segment
original
target
frequency
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711159880.7A
Other languages
Chinese (zh)
Other versions
CN108052499A (en)
Inventor
肖求根
詹金波
郑利群
邓卓彬
何径舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201711159880.7A priority Critical patent/CN108052499B/en
Publication of CN108052499A publication Critical patent/CN108052499A/en
Application granted granted Critical
Publication of CN108052499B publication Critical patent/CN108052499B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Abstract

The invention provides a text error correction method and device based on artificial intelligence and a computer readable medium. The method comprises the following steps: acquiring an error-corrected target segment in an error-corrected text and an original segment corresponding to the target segment in the original text; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model; acquiring feedback information of a target result fed back by a user based on the error correction text; performing incremental training on the segment scoring model according to the target segment, the original segment and the feedback information; and performing error correction processing on the subsequent original text based on the trained segment scoring model. According to the technical scheme, when the trained segment scoring model is used for text error correction, the error correction accuracy of the text can be effectively improved.

Description

Text error correction method and device based on artificial intelligence and computer readable medium
[ technical field ]
The invention relates to the technical field of computer application, in particular to a text error correction method and device based on artificial intelligence and a computer readable medium.
[ background of the invention ]
Artificial Intelligence (AI) is a technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems, among others.
With the development of science and technology, human-computer interaction appears in more and more scenarios, which can greatly improve the user experience. For example, in a search scenario, a user inputs a search query, and the search server obtains the corresponding search results from the text of that query and feeds them back to the user. In other scenarios, such as a smart device providing an online consultation or shopping-guide service, the device likewise receives text input by the user and responds based on that text. In all of these scenarios the text input by the user may contain errors, so after the text is acquired it needs to be corrected in order to understand the user's needs more accurately. To correct text effectively, the prior art trains a network model in advance and corrects the text based on the trained network model.
However, in the prior art the network model is fixed once trained; after it has been in use for some time it may no longer correct text accurately, so the accuracy of text error correction deteriorates.
[ summary of the invention ]
The invention provides a text error correction method and device based on artificial intelligence and a computer readable medium, which are used for improving the accuracy of text error correction.
The invention provides a text error correction method based on artificial intelligence, which comprises the following steps:
acquiring an error-corrected target segment in an error-corrected text and an original segment corresponding to the target segment in an original text; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model;
acquiring feedback information of a target result fed back by a user based on the error correction text;
performing incremental training on the segment scoring model according to the target segment, the original segment and the feedback information;
and carrying out error correction processing on the subsequent original text based on the trained segment scoring model.
Further optionally, in the method, performing incremental training on the segment scoring model according to the target segment, the original segment, and the feedback information specifically includes:
acquiring relative feature information between the target segment and the original segment;
determining an ideal score of the target segment according to the feedback information;
and training the segment scoring model according to the relative feature information and the ideal scoring of the target segment.
Further optionally, in the method as described above, the obtaining of the relative feature information between the target segment and the original segment includes at least one of:
acquiring relative quality characteristics between the target segment and the original segment;
acquiring relative historical behavior characteristics between the target segment and the original segment; and
acquiring semantic similarity characteristics between the target segment and the original segment.
Further optionally, in the method, obtaining the relative quality feature between the target segment and the original segment specifically includes:
acquiring the frequency with which the original segment appears in a corpus, and the frequency with which the combination of the original segment and its context segment in the original text appears in the corpus;
acquiring the frequency with which the target segment appears in the corpus, and the frequency with which the combination of the target segment and the context segment appears in the corpus;
obtaining, from these four frequencies, the frequency ratio of the target segment to the original segment in the corpus and the frequency ratio of the target-segment-plus-context combination to the original-segment-plus-context combination in the corpus, and/or the corresponding frequency differences.
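The frequency-based quality features above can be sketched as follows. This is an illustrative assumption about the data layout, not the patent's implementation: `count_freq`, the corpus-as-list-of-strings format, and the smoothing constant `eps` are all hypothetical choices.

```python
def count_freq(corpus, phrase):
    # Occurrences of `phrase` across all corpus sentences (substring count).
    return sum(sentence.count(phrase) for sentence in corpus)

def relative_quality_features(corpus, original, target, context):
    """Frequency ratios and differences between the target segment and the
    original segment, alone and combined with the context segment."""
    f_orig = count_freq(corpus, original)
    f_tgt = count_freq(corpus, target)
    f_orig_ctx = count_freq(corpus, original + context)
    f_tgt_ctx = count_freq(corpus, target + context)
    eps = 1e-9  # smoothing so unseen original segments do not divide by zero
    return {
        "seg_ratio": f_tgt / (f_orig + eps),
        "ctx_ratio": f_tgt_ctx / (f_orig_ctx + eps),
        "seg_diff": f_tgt - f_orig,
        "ctx_diff": f_tgt_ctx - f_orig_ctx,
    }
```

A correction whose target segment is much more frequent than the original segment, both alone and with its context, yields ratios above 1 and positive differences, which is the signal these features feed into the scoring model.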
Further optionally, in the method, obtaining the relative historical behavior characteristics between the target segment and the original segment specifically includes:
acquiring, from a phrase table (PT table), a first modification frequency with which the original segment is modified into the target segment;
acquiring, from the PT table, a second modification frequency with which the combination of the original segment and the context segment is modified into the combination of the target segment and the context segment;
obtaining a frequency ratio and/or a frequency difference from the first and second modification frequencies, wherein the frequency ratio equals the second modification frequency divided by the first modification frequency, and the frequency difference equals the second modification frequency minus the first modification frequency.
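The PT-table behavior features above can be sketched like this, using the ratio and difference exactly as defined (second divided by first, second minus first). The `(source, replacement) -> count` dictionary layout for the PT table is a hypothetical structure for illustration.

```python
def pt_behavior_features(pt_counts, original, target, context):
    """Relative historical-behavior features from a phrase table.
    `pt_counts` maps (source, replacement) pairs to how often historical
    corrections made that modification (assumed data layout)."""
    first = pt_counts.get((original, target), 0)
    second = pt_counts.get((original + context, target + context), 0)
    eps = 1e-9  # guard for segment pairs never modified on their own
    return {
        "mod_ratio": second / (first + eps),  # second / first, as defined above
        "mod_diff": second - first,           # second - first, as defined above
    }
```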
Further optionally, in the method, obtaining the semantic similarity feature between the target segment and the original segment specifically includes:
obtaining the semantic similarity between the target segment and the original segment; and/or
obtaining the semantic similarity between the combination of the target segment and the context segment and the combination of the original segment and the context segment.
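One common way to realize such a semantic similarity feature is cosine similarity over segment vectors. The patent does not fix an encoder, so the `embed` callable below (mapping a segment to a fixed-size vector, e.g. an averaged word embedding) is an assumption.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 for zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_similarity(embed, original, target, context=""):
    # `embed` is an assumed segment-to-vector encoder; passing a context
    # segment compares the two combinations rather than the bare segments.
    return cosine(embed(original + context), embed(target + context))
```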
Further optionally, in the method described above, the obtaining of the relative feature information between the target segment and the original segment further includes at least one of the following:
acquiring proper-noun features of the original segment and the target segment, respectively, according to a preset proper-noun lexicon; and
acquiring pinyin edit distance features between the target segment and the original segment.
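The pinyin edit distance feature can be sketched as a Levenshtein distance over the two segments' pinyin strings. Converting Chinese characters to pinyin is assumed to be handled by an external tool (e.g. a pinyin lexicon) and is not shown here.

```python
def edit_distance(a, b):
    # Classic Levenshtein dynamic program over two pinyin strings,
    # kept to a single rolling row for O(len(b)) memory.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]
```

A small distance between the pinyin of the original and target segments (e.g. "zhang" vs. "zang") suggests a plausible homophone-style typo, which is why this is a useful relative feature.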
Further optionally, in the method described above, determining an ideal score of the target segment according to the feedback information specifically includes:
inferring from the feedback information whether the user accepts replacing the original segment in the error-corrected text with the target segment;
if the user is presumed to accept, setting the ideal score of the target segment to 1; otherwise, if the user is presumed not to accept, setting the ideal score of the target segment to 0.
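The binary ideal score above might be derived from feedback signals like these. The specific feedback fields (`clicked_result`, `approved_edit`) are illustrative assumptions; the patent only says acceptance is presumed from the user's reaction (e.g. clicking corrected search results, or approving a long-text edit).

```python
def infer_ideal_score(feedback):
    """Map user feedback on the error-corrected text to the 0/1 training
    label: a click on the corrected result or an explicit approval counts
    as acceptance; anything else (ignored, re-searched, rejected) does not."""
    accepted = feedback.get("clicked_result", False) or feedback.get("approved_edit", False)
    return 1 if accepted else 0
```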
Further optionally, in the method described above, training the segment scoring model according to the relative feature information and the ideal scoring of the target segment specifically includes:
inputting the relative feature information into the segment scoring model to obtain a predicted score from the segment scoring model;
comparing the predicted score with the ideal score;
if the predicted score is smaller than the ideal score, adjusting the parameters of the segment scoring model so that the predicted score it outputs increases;
if the predicted score is greater than the ideal score, adjusting the parameters of the segment scoring model so that the predicted score it outputs decreases.
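The adjustment rule above is exactly what one step of gradient descent on a logistic output gives: the sign of (ideal − predicted) pushes the output up when the prediction is below the ideal score and down when it is above. The linear-model form below is a simplified stand-in for the patent's segment scoring model, shown only to make the update direction concrete.

```python
import math

def incremental_update(weights, features, ideal, lr=0.1):
    """One logistic-regression-style incremental step on a feature vector.
    err = ideal - predicted is positive when the prediction is too low
    (parameters move so the output increases) and negative when too high."""
    z = sum(w * x for w, x in zip(weights, features))
    pred = 1.0 / (1.0 + math.exp(-z))  # predicted score in (0, 1)
    err = ideal - pred
    new_weights = [w + lr * err * x for w, x in zip(weights, features)]
    return new_weights, pred
```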
The invention provides a text error correction device based on artificial intelligence, which comprises:
the segment information acquisition module is used for acquiring an error-corrected target segment in an error-corrected text and an original segment corresponding to the target segment in an original text; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model;
the feedback information acquisition module is used for acquiring feedback information of a target result fed back by a user based on the error correction text;
the increment training module is used for carrying out increment training on the segment scoring model according to the target segment, the original segment and the feedback information;
and the error correction module is used for carrying out error correction processing on the subsequent original text based on the trained segment scoring model.
Further optionally, in the apparatus described above, the incremental training module specifically includes:
a relative feature information acquiring unit, configured to acquire relative feature information between the target segment and the original segment;
the determining unit is used for determining the ideal scoring of the target segment according to the feedback information;
and the training unit is used for training the segment scoring model according to the relative characteristic information and the ideal scoring of the target segment.
Further optionally, in the apparatus as described above, the relative feature information obtaining unit is configured to perform at least one of the following operations:
acquiring relative quality characteristics between the target segment and the original segment;
acquiring relative historical behavior characteristics between the target segment and the original segment; and
acquiring semantic similarity characteristics between the target segment and the original segment.
Further optionally, in the apparatus as described above, the relative feature information obtaining unit is specifically configured to:
acquiring the frequency with which the original segment appears in a corpus, and the frequency with which the combination of the original segment and its context segment in the original text appears in the corpus;
acquiring the frequency with which the target segment appears in the corpus, and the frequency with which the combination of the target segment and the context segment appears in the corpus;
obtaining, from these four frequencies, the frequency ratio of the target segment to the original segment in the corpus and the frequency ratio of the target-segment-plus-context combination to the original-segment-plus-context combination in the corpus, and/or the corresponding frequency differences.
Further optionally, in the apparatus as described above, the relative feature information obtaining unit is specifically configured to:
acquiring, from a phrase table (PT table), a first modification frequency with which the original segment is modified into the target segment;
acquiring, from the PT table, a second modification frequency with which the combination of the original segment and the context segment is modified into the combination of the target segment and the context segment;
obtaining a frequency ratio and/or a frequency difference from the first and second modification frequencies, wherein the frequency ratio equals the second modification frequency divided by the first modification frequency, and the frequency difference equals the second modification frequency minus the first modification frequency.
Further optionally, in the apparatus as described above, the relative feature information obtaining unit is specifically configured to:
obtaining the semantic similarity between the target segment and the original segment; and/or
obtaining the semantic similarity between the combination of the target segment and the context segment and the combination of the original segment and the context segment.
Further optionally, in the apparatus as described above, the relative feature information obtaining unit is further configured to perform at least one of the following:
acquiring proper-noun features of the original segment and the target segment, respectively, according to a preset proper-noun lexicon; and
acquiring pinyin edit distance features between the target segment and the original segment.
Further optionally, in the apparatus as described above, the determining unit is specifically configured to:
inferring from the feedback information whether the user accepts replacing the original segment in the error-corrected text with the target segment;
if the user is presumed to accept, setting the ideal score of the target segment to 1; otherwise, if the user is presumed not to accept, setting the ideal score of the target segment to 0.
Further optionally, in the apparatus as described above, the training unit is specifically configured to:
inputting the relative feature information into the segment scoring model to obtain a predicted score from the segment scoring model;
comparing the predicted score with the ideal score;
if the predicted score is smaller than the ideal score, adjusting the parameters of the segment scoring model so that the predicted score it outputs increases;
if the predicted score is greater than the ideal score, adjusting the parameters of the segment scoring model so that the predicted score it outputs decreases.
The present invention also provides a computer apparatus, the apparatus comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the artificial-intelligence-based text error correction method described above.
The present invention also provides a computer readable medium having stored thereon a computer program which, when executed by a processor, implements an artificial intelligence based text error correction method as described above.
According to the artificial-intelligence-based text error correction method, device and computer readable medium of the present invention, an error-corrected target segment in the error-corrected text and the original segment corresponding to that target segment in the original text are acquired, the target segment having been selected from a plurality of candidate segments of the original segment, based on a pre-trained segment scoring model, when the original text was corrected; feedback information on the target result fed back to the user based on the error-corrected text is acquired; the segment scoring model is incrementally trained according to the target segment, the original segment and the feedback information; and subsequent original text is corrected based on the trained segment scoring model. Because the segment scoring model is incrementally trained on the target segment, the original segment and the feedback information, its prediction accuracy can be improved, and when the trained model is used for text error correction, the error correction accuracy can be effectively improved. For example, when the technical scheme of the invention is applied to long text editing, it can help improve the content production quality of long texts and the user experience.
[ description of the drawings ]
FIG. 1 is a flowchart of a first embodiment of a text error correction method based on artificial intelligence according to the present invention.
FIG. 2 is a flowchart of a second embodiment of the text error correction method based on artificial intelligence according to the present invention.
FIG. 3 is a flowchart of a first embodiment of a method for correcting long text errors based on artificial intelligence according to the present invention.
Fig. 4 is a schematic diagram of a search interface according to the present embodiment.
Fig. 5 is a flowchart of a second embodiment of the artificial intelligence-based long text error correction method of the present invention.
Fig. 6 is an exemplary diagram of a mapping table of the confusing sound provided in the present embodiment.
Fig. 7 is a schematic diagram of an error correction result of the long text error correction method based on artificial intelligence in this embodiment.
FIG. 8 is a block diagram of a first embodiment of an artificial intelligence based text error correction apparatus according to the present invention.
Fig. 9 is a block diagram of a second embodiment of an artificial intelligence based text correction apparatus according to the present invention.
FIG. 10 is a block diagram of an embodiment of a computer device of the present invention.
Fig. 11 is an exemplary diagram of a computer device provided by the present invention.
[ detailed description of the embodiments ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart of a first embodiment of a text error correction method based on artificial intelligence according to the present invention. As shown in fig. 1, the text error correction method based on artificial intelligence of this embodiment may specifically include the following steps:
100. acquiring an error-corrected target segment in an error-corrected text and an original segment corresponding to the target segment in the original text; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model;
the execution subject of the text error correction method based on artificial intelligence of this embodiment is an artificial intelligence-based text error correction device, which can be an independent electronic entity for performing error correction on a text. The text of this embodiment may be a short text such as a query, or a long text in a text editing system, where the length of the long text is usually greater than that of the query, and the long text may be a longer sentence. That is to say, the text error correction method based on artificial intelligence of the present embodiment can be applied in a search scenario, and can also be applied in various scenarios involving long text editing.
In the artificial-intelligence-based text error correction of this embodiment, the original text needs to be corrected. Specifically, word segmentation is first performed on the original text to obtain a plurality of tokens; any word segmentation strategy of the related art may be used, and it is not limited here. Then a window of a preset size is applied to the original text and slid from front to back to select each original segment. The preset window size in this embodiment may be 1, 2 or 3 word-segmentation tokens, so an original segment may consist of a single token or of a combination of consecutive tokens.
Each original segment in the original text is obtained in the above manner. Then, for each original segment, a plurality of candidate segments that can replace it are obtained: replacement segments corresponding to the original segment may be looked up in a pre-compiled Phrase Table (PT table), and further candidates with the same or similar pronunciation may be recalled based on the pronunciation of the original segment. The segment scoring model then scores each candidate segment, and the target segment that replaces the original segment is chosen from the candidates according to these scores. For example, a short query may contain only one original segment, and the highest-scoring candidate may be used as the target segment. For longer texts containing two or more original segments, the highest-scoring candidate may be taken as the target segment for each original segment; alternatively, for a given original segment, factors such as coherence with the context may be considered and the top-ranked or second-ranked candidate among the top N candidates may be selected as the target segment, which is not limited here. Whichever way the target segment is obtained, the candidate segments must be scored by the segment scoring model. Scoring the candidate segments is therefore a crucial link in text error correction in this embodiment: if the segment scoring model scores the candidates poorly, the accuracy of text error correction will be poor.
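The recall-then-rank step above can be sketched like this. The PT table and homophone lists are represented as plain dictionaries, and `score_fn` stands in for the segment scoring model; all three are assumed structures, not the patent's actual data formats.

```python
def generate_candidates(original, pt_table, homophones):
    """Recall candidate segments from a phrase table and from same- or
    similar-pronunciation lists (both assumed to be dict lookups)."""
    cands = list(pt_table.get(original, [])) + list(homophones.get(original, []))
    return list(dict.fromkeys(cands))  # keep recall order, drop duplicates

def choose_target(original, candidates, score_fn):
    # The candidate the segment scoring model (`score_fn`) ranks highest
    # replaces the original segment; fall back to the original if no recall.
    return max(candidates, key=score_fn, default=original)
```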
In this embodiment, after the error correction text is obtained by correcting the error of the original text by using the method, the corrected target segment in the error correction text and the original segment corresponding to the target segment in the original text can be obtained.
101. Acquiring feedback information of a target result fed back by a user based on the error correction text;
in this embodiment, the form and content of the target result fed back to the user based on the corrected text may differ from scenario to scenario. For example, in a search scenario the target result may be the search results retrieved for the corrected text; in long text editing it may take the form of accepting or rejecting the proposed modification; other scenarios may use other forms, which are not enumerated here. Whatever form the target result takes, the user's feedback on it can be acquired. For example, in a search scenario, after search results are fed back based on the corrected text, a user who agrees with the correction may simply click a result to read it, while a user who disagrees may ignore the results and search again. As another example, in a long text editing scenario, after the original text input by the user has been corrected, a prompt offering "agree" or "disagree" may be shown at each corrected position, and the user may click one of them according to the actual situation at that position. Thus, in any scenario, feedback information on the target result fed back to the user based on the corrected text can be acquired.
102. Performing incremental training on the segment scoring model according to the target segment, the original segment and the feedback information;
the incremental training of this embodiment may be an online learning process, that is, after each error correction, the segment scoring model is directly learned online according to the error correction result, so as to improve the prediction accuracy of the segment scoring model.
Alternatively, the incremental training of this embodiment may be performed offline: at regular intervals, all error correction data from that period are collected and used to incrementally train the segment scoring model, so as to improve its prediction accuracy.
In the incremental training process of this embodiment, incremental training needs to be performed on the segment scoring model according to the target segment, the original segment, and the feedback information.
103. And performing error correction processing on the subsequent original text based on the trained segment scoring model.
When subsequent original text is corrected based on the incrementally trained segment scoring model, the accuracy is higher.
In practical application, a pure GBRank model structure cannot be incrementally trained. In this embodiment, to allow the segment scoring model to be improved by incremental training, a logistic regression function is applied on top of the GBRank model. For example, during training, the GBRank model is first trained to obtain the tree model, and then, on that basis, the same training data is used together with logistic regression to obtain the segment scoring model of this embodiment.
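The tree-plus-logistic hybrid described above might look like the sketch below. The frozen tree ensemble is represented by an arbitrary callable `tree_score_fn` (an assumption; the patent does not expose the GBRank internals), and only the logistic layer's two parameters are updated during incremental training.

```python
import math

class TreeThenLogistic:
    """A frozen, pre-trained tree ensemble produces a raw score; a small
    logistic layer maps it to (0, 1) and is the only part trained online."""

    def __init__(self, tree_score_fn, w=1.0, b=0.0, lr=0.1):
        self.tree = tree_score_fn  # fixed GBRank-style scorer (assumed)
        self.w, self.b, self.lr = w, b, lr

    def predict(self, features):
        z = self.w * self.tree(features) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, ideal):
        # Gradient step on the logistic layer only; the trees stay fixed.
        pred = self.predict(features)
        grad = ideal - pred
        self.w += self.lr * grad * self.tree(features)
        self.b += self.lr * grad
        return pred
```

This split is one plausible reading of why logistic regression enables incremental training here: updating two scalar parameters per feedback event is cheap and stable, whereas re-growing gradient-boosted trees online is not.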
In the text error correction method based on artificial intelligence, an error-corrected target segment in an error-corrected text and an original segment corresponding to the target segment in the original text are obtained; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model; acquiring feedback information of a target result fed back by a user based on the error correction text; performing incremental training on the segment scoring model according to the target segment, the original segment and the feedback information; and performing error correction processing on the subsequent original text based on the trained segment scoring model. According to the technical scheme of the embodiment, the segment scoring model is subjected to incremental training according to the target segment, the original segment and the feedback information, so that the prediction accuracy of the segment scoring model can be improved, and when the trained segment scoring model is used for text error correction, the error correction accuracy of the text can be effectively improved. For example, when the technical scheme of the embodiment is applied to long text editing, the content production quality of the long text can be assisted to be improved, and the user experience is improved.
FIG. 2 is a flowchart of a second embodiment of the text error correction method based on artificial intelligence according to the present invention. As shown in fig. 2, the text error correction method based on artificial intelligence of this embodiment further introduces the technical solution of the present invention in more detail based on the technical solution of the embodiment shown in fig. 1. As shown in fig. 2, the text error correction method based on artificial intelligence of this embodiment may specifically include the following steps:
200. acquiring an error-corrected target segment in an error-corrected text and an original segment corresponding to the target segment in the original text; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model;
201. acquiring feedback information of a target result fed back by a user based on the error correction text;
for the implementation of step 200 and step 201, reference may be made to step 100 and step 101 in the embodiment shown in fig. 1, which is not described herein again.
202. Acquiring relative feature information between the target segment and the original segment;
for example, step 202 may specifically include at least one of the following:
firstly, acquiring relative quality characteristics between a target fragment and an original fragment;
the step may specifically include the steps of:
(a1) Acquiring the frequency of the original segment in the corpus and the frequencies of the combinations of the original segment and its context segments in the original text appearing together in the corpus;
this step (a1) is a specific way of obtaining the quality features of the original segment. Since the error correction text has already been acquired in this embodiment, the application field of this embodiment can be determined. Specifically, the quality features of the original segment are obtained from the corpus of the application domain.
The context segment of the original segment in this embodiment is a segment immediately before or after the original segment in the original text. For example, when the original segment includes one participle, the corresponding context segment may include one or two participles located before the participle and one or two participles located after it. If the original segment includes two participles, the corresponding context segment may include one participle located before the original segment and one participle located after it in the original text. If the original segment includes three participles, the corresponding context segment may include only one participle located before the original segment and one participle located after it in the original text. Alternatively, considering that a segment including more participles appears in the original text with a smaller probability, this embodiment may further provide that if the original segment already includes three or more participles, no context segment is taken for it. That is, when a context segment of an original segment needs to be taken, there are three combinations: the original segment plus the following segment, the preceding segment plus the original segment, and the preceding segment plus the original segment plus the following segment. When the quality features of the original segment are obtained, the respective frequencies of the original segment and of each of these three combinations in the corpus need to be acquired.
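The enumeration of the segment and its context combinations can be sketched as follows (a hypothetical helper; the `<s>`/`</s>` markers for segments at the text boundary are an assumed representation of the preset sentence-head and sentence-tail features):

```python
def context_combinations(tokens, start, length):
    """Return the strings whose corpus frequencies make up a segment's
    quality features: the segment itself plus, when the segment has
    fewer than 3 tokens, the three context combinations (segment +
    following, preceding + segment, preceding + segment + following).
    `tokens` is the tokenized text; the segment spans
    tokens[start:start + length]."""
    seg = "".join(tokens[start:start + length])
    combos = [seg]
    if length >= 3:  # longer segments take no context
        return combos
    before = tokens[start - 1] if start > 0 else "<s>"            # sentence-head marker
    after_i = start + length
    after = tokens[after_i] if after_i < len(tokens) else "</s>"  # sentence-tail marker
    combos.append(seg + after)           # segment + following
    combos.append(before + seg)          # preceding + segment
    combos.append(before + seg + after)  # preceding + segment + following
    return combos
```

Looking up each returned string in a corpus frequency table then yields the quality features described above.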
In addition, optionally, when the original segment does not need to take the context segment, the quality feature of the corresponding original segment at this time may only include the frequency of occurrence of the original segment in the corpus.
(b1) Acquiring the frequency of the target segment in the corpus and the frequency of the combination of the target segment and the context segment in the corpus;
correspondingly, the step (b1) is a manner of obtaining the quality characteristics of the target segment, and the specific obtaining manner is the same as that of the step (a1), and is not repeated here.
In addition, considering the alignment of data, the target segment is a replacement segment of the original segment and has the same property as the original segment, and if the original segment does not take the context segment in step (a1), the target segment in step (b1) also correspondingly does not take the context segment. When the context segment needs to be taken, and the original segment is the beginning or end of the original text, the corresponding empty context segment can be represented by setting a preset beginning or end feature of the sentence, so as to ensure the alignment of the data.
(c1) According to the frequency of the original segment in the corpus, the frequencies of the combinations of the original segment and its context segments in the corpus, the frequency of the target segment in the corpus, and the frequencies of the combinations of the target segment and its context segments in the corpus, obtaining: the frequency ratio of the target segment to the original segment in the corpus and the frequency ratios of the combinations of the target segment and the context segments to the combinations of the original segment and the context segments in the corpus; and/or the frequency difference between the target segment and the original segment in the corpus and the frequency differences between the combinations of the target segment and the context segments and the combinations of the original segment and the context segments in the corpus.
This step (c1) is a specific way of obtaining the relative quality features between the target segment and the original segment. Specifically, the frequency ratios and/or frequency differences obtained above reflect how well the target segment fuses with the context segments: if the target segment appears in the corpus more frequently than the original segment, but the combination of the target segment and the context segments appears less frequently than the combination of the original segment and the context segments, the compatibility of the target segment with the context segments is poor, and the target segment is not suitable for replacing the original segment; and vice versa.
Similarly, if the frequency difference between the target segment and the original segment in the corpus is small, i.e. their probabilities differ little, but the frequency of the combination of the target segment and the context segments in the corpus is far greater than that of the combination of the original segment and the context segments, this indicates that the combination of the target segment and the context segments is used much more often in the corpus; the target segment can then be considered highly compatible with the context segments and may be used to replace the original segment, and vice versa.
In addition, if the original segment does not need a context segment, the corresponding relative quality features only include the frequency ratio of the target segment to the original segment in the corpus and/or the frequency difference between the target segment and the original segment in the corpus, obtained according to the frequency of the original segment and the frequency of the target segment in the corpus. Compared with the case where context segments are obtained, the feature content obtained this way is not rich enough; therefore, in this embodiment, the context segments are preferably obtained.
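A minimal sketch of the relative quality features, assuming the corpus is summarized as a frequency dictionary (all names here are illustrative, not from the patent):

```python
def relative_quality_features(corpus_freq, original, target,
                              orig_ctx=None, tgt_ctx=None):
    """Frequency ratio and difference between the target and original
    segment, and, when context combinations are available, between the
    target+context and original+context combinations. `corpus_freq`
    maps a segment string to its corpus frequency."""
    eps = 1e-9  # guard against division by zero for unseen segments
    f_o = corpus_freq.get(original, 0)
    f_t = corpus_freq.get(target, 0)
    feats = {
        "seg_ratio": f_t / (f_o + eps),
        "seg_diff": f_t - f_o,
    }
    if orig_ctx is not None and tgt_ctx is not None:
        f_oc = corpus_freq.get(orig_ctx, 0)
        f_tc = corpus_freq.get(tgt_ctx, 0)
        feats["ctx_ratio"] = f_tc / (f_oc + eps)
        feats["ctx_diff"] = f_tc - f_oc
    return feats
```

A large `seg_ratio` paired with a small `ctx_ratio` is exactly the poor-compatibility case described above.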
Secondly, acquiring relative historical behavior characteristics between the target segment and the original segment;
the step may specifically include the steps of:
(a2) acquiring a first modification frequency of modifying an original fragment into a target fragment in a PT table;
(b2) acquiring a second modification frequency of modifying the combination of the original fragment and the context fragment into the combination of the target fragment and the context fragment in the PT table;
(c2) Obtaining a frequency ratio and/or a frequency difference according to the first modification frequency and the second modification frequency, wherein the frequency ratio equals the second modification frequency divided by the first modification frequency, and the frequency difference equals the second modification frequency minus the first modification frequency.
In addition, it should be noted that if the original segment includes three word segments and no context segment is taken, steps (a2)-(c2) cannot be used to obtain the relative historical behavior feature between the target segment and the original segment; in that case the relative historical behavior feature may be directly set to null or to a preset feature symbol. Of course, since taking the context segment makes the feature content richer, in this embodiment the context segment is preferably taken, and the above steps (a2)-(c2) are used to acquire the relative historical behavior feature between the target segment and the original segment.
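Steps (a2)-(c2) can be sketched as follows, assuming the PT table is represented as a mapping from (original, replacement) pairs to modification frequencies (an illustrative representation, not the patent's storage format):

```python
def relative_history_features(pt_table, original, target,
                              orig_ctx=None, tgt_ctx=None):
    """Steps (a2)-(c2): first modification frequency (segment alone),
    second modification frequency (segment with context), and the
    ratio/difference derived from them. Returns None when no context
    is available, matching the null fallback described above.
    `pt_table` maps (original, replacement) pairs to frequencies."""
    if orig_ctx is None or tgt_ctx is None:
        return None  # feature set to null / preset symbol
    first = pt_table.get((original, target), 0)    # (a2)
    second = pt_table.get((orig_ctx, tgt_ctx), 0)  # (b2)
    return {                                       # (c2)
        "freq_ratio": second / (first + 1e-9),
        "freq_diff": second - first,
    }
```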
And thirdly, acquiring semantic similarity characteristics between the target fragment and the original fragment.
Similarly, the obtaining of the semantic similarity feature between the target segment and the original segment in this embodiment may include: obtaining semantic similarity between a target fragment and an original fragment; and/or obtaining semantic similarity of the combination of the target segment and the context segment and the combination of the original segment and the context segment.
In this embodiment, a preset dictionary may be used to obtain the word vector of the target segment and the word vector of the original segment, and then the cosine distance between the two word vectors is calculated as the semantic similarity between the target segment and the original segment. Correspondingly, if the number of participles included in the original segment is three or more, the semantic similarity between the target segment and the original segment is taken as the semantic similarity feature between them. If the number of participles included in the original segment is less than three, the context segments of the original segment need to be obtained, and the semantic similarity between each combination of the target segment and the context segments and the corresponding combination of the original segment and the context segments needs to be acquired: the word vector of each combination of the target segment and the context segments and the word vector of the corresponding combination of the original segment and the context segments are obtained, and the cosine distance between them is calculated as the semantic similarity feature of those combinations. Correspondingly, the combinations of the original segment and the context segments include three combinations: the original segment plus the following segment, the preceding segment plus the original segment, and the preceding segment plus the original segment plus the following segment.
Correspondingly, the semantic similarity feature between the target segment and the original segment is formed by splicing: the semantic similarity between the target segment and the original segment; the semantic similarity between the combination of the target segment and the preceding segment and the combination of the original segment and the preceding segment; the semantic similarity between the combination of the target segment and the following segment and the combination of the original segment and the following segment; and the semantic similarity between the combination of the preceding segment, the target segment and the following segment and the combination of the preceding segment, the original segment and the following segment.
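The cosine computation and the splicing of similarities can be sketched as follows (hypothetical helpers; `embed` stands in for the preset dictionary of word vectors, and `combos` for the context-combination pairs described above):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two word vectors, used as the
    semantic similarity between two segments (or combinations)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    na = math.sqrt(sum(a * a for a in vec_a))
    nb = math.sqrt(sum(b * b for b in vec_b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_similarity_features(embed, original, target, combos=()):
    """`embed` maps a segment string to its word vector. `combos`
    holds (original_combo, target_combo) string pairs; their
    similarities are spliced after the segment-level similarity."""
    feats = [cosine_similarity(embed[original], embed[target])]
    feats += [cosine_similarity(embed[o], embed[t]) for o, t in combos]
    return feats
```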
For feature richness and accuracy of the segment scoring model, in this embodiment the relative feature information preferably includes the relative quality feature, the relative historical behavior feature, and the semantic similarity feature at the same time. To further enrich the relative feature information, acquiring the relative feature information between the target segment and the original segment in this embodiment may further include at least one of the following: acquiring the proper noun features of the original segment and the target segment respectively according to a preset proper noun library; and acquiring the pinyin edit distance feature between the target segment and the original segment.
Specifically, the proper noun feature of the target segment identifies whether the target segment belongs to a proper noun. For example, whether the target segment belongs to a proper noun is determined according to the proper noun library; if so, the corresponding proper noun feature is 1, otherwise it is 0. Correspondingly, if the target segment is a proper noun, the probability that it replaces the original segment is higher; if not, the probability is lower. Similarly, the proper noun feature of the original segment may also be set according to the proper noun library, which is not described here again. Moreover, it should be noted that, in practical applications, the probability that both the original segment and the target segment are proper nouns is very small.
In addition, the pinyin edit distance between the target segment and the original segment is specifically the number of letters in the pinyin that need to be adjusted to edit the pronunciation of the target segment into the pronunciation of the original segment. Correspondingly, the greater the pinyin edit distance between the target segment and the original segment, the smaller the probability that the target segment replaces the original segment; the smaller the pinyin edit distance, the higher the probability that the original segment is replaced by the target segment.
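The pinyin edit distance is, in effect, a Levenshtein distance over the two segments' pinyin strings; the conversion from Chinese characters to pinyin is assumed to happen elsewhere (for example via a lookup table). A minimal sketch:

```python
def pinyin_edit_distance(py_a, py_b):
    """Levenshtein distance between two pinyin strings: the number of
    single-letter insertions, deletions, or substitutions needed to
    turn one pronunciation into the other."""
    m, n = len(py_a), len(py_b)
    dp = list(range(n + 1))  # one-row dynamic-programming table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (py_a[i - 1] != py_b[j - 1]))  # substitution
            prev = cur
    return dp[n]
```

A distance of 0 means the two segments are homophones, the strongest signal that a replacement is pronunciation-preserving.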
203. Determining an ideal score of the target segment according to the feedback information;
Referring to the description of step 101, the feedback information of the user can be acquired regardless of the form in which the target result is fed back to the user based on the error-corrected text, and the feedback information is ultimately embodied as whether or not the user agrees with the error-corrected text. Therefore, in this embodiment, whether the user accepts the replacement of the original segment with the target segment in the error-corrected text may be presumed from the feedback information; if the user is presumed to accept it, the replacement of the original segment with the target segment is considered correct, and the ideal score of the target segment is set to 1; otherwise, if the user is presumed not to accept it, the replacement is considered incorrect, and the ideal score of the target segment is set to 0.
204. Training a segment scoring model according to the acquired relative feature information and the ideal scoring of the target segment;
step 202-step 204 of this embodiment are a specific implementation manner of the step 102 "performing incremental training on the segment scoring model according to the target segment, the original segment and the feedback information" in the embodiment shown in fig. 1.
The training of this embodiment is incremental training. A similar online training may be performed once after each error correction, or all text error correction data within a period may be gathered at certain time intervals for offline training. Whichever mode is adopted, the currently trained segment scoring model is relearned so as to improve the accuracy of its subsequent predictions. During training, all the acquired relative feature information may be input into the segment scoring model to obtain its prediction score, and the relation between the prediction score and the ideal score is determined: if the prediction score is less than the ideal score, the parameters of the segment scoring model are adjusted so that the prediction score it outputs changes in the increasing direction; if the prediction score is greater than the ideal score, the parameters are adjusted so that the prediction score changes in the decreasing direction. The adjustment of this embodiment is only a single fine adjustment, as long as the prediction score output by the segment scoring model can be made to increase or decrease accordingly.
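The single fine adjustment described above can be sketched as one gradient step on a logistic scoring layer (a simplified, hypothetical stand-in for the segment scoring model's trainable parameters; names are illustrative):

```python
import math

def fine_tune_step(weights, features, ideal_score, lr=0.01):
    """One fine adjustment: when the predicted score is below the
    ideal score the update increases it, when above it decreases it.
    Returns the adjusted weight vector."""
    z = sum(w * f for w, f in zip(weights, features))
    predicted = 1.0 / (1.0 + math.exp(-z))  # predicted score in (0, 1)
    # Log-loss gradient: (predicted - ideal) * feature; subtracting it
    # moves the predicted score toward the ideal score.
    return [w - lr * (predicted - ideal_score) * f
            for w, f in zip(weights, features)]
```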
In addition, optionally, in this embodiment, instead of inputting all the acquired relative feature information into the segment scoring model again to obtain its prediction score, the score that the segment scoring model gave the target segment during error correction may be reused directly.
205. And performing error correction processing on the subsequent original text based on the trained segment scoring model.
By adopting the above technical scheme, the text error correction method based on artificial intelligence incrementally trains the segment scoring model according to the target segment, the original segment and the feedback information, which improves the prediction accuracy of the segment scoring model; when the trained segment scoring model is used for text error correction, the error correction accuracy of the text can be effectively improved. For example, when the technical scheme of this embodiment is applied to long text editing, it can help improve the content production quality of long texts and improve the user experience.
The text error correction method based on artificial intelligence in the embodiments shown in fig. 1 and fig. 2 can be applied not only to error correction processing of short texts such as query search, but also to error correction processing of long texts. The following embodiment describes a scenario of long text error correction applied in the technical solution of the present embodiment.
FIG. 3 is a flowchart of a first embodiment of a method for correcting long text errors based on artificial intelligence according to the present invention. As shown in fig. 3, the method for correcting a long text based on artificial intelligence in this embodiment may specifically include the following steps:
300. When original segments that are not proper nouns exist in the long text, performing PT segment recall on the original segments needing error correction according to a PT table preset for the field of the long text, and obtaining a candidate segment set of the original segment, wherein the candidate segment set includes a plurality of candidate segments;
the long text of this embodiment may be various kinds of long text information with a length longer than a general query length, which is edited by the user, and may be, for example, a summary of an article, or a sentence in an article, and the like. By adopting the technical scheme of the embodiment, long text error correction can be performed on each sentence in one article, so that error correction of the whole article is realized.
Similarly, in this embodiment, when the long text is corrected, word segmentation processing needs to be performed on the long text first to obtain a plurality of word segments. The word segmentation strategy may follow the related art and is not limited here. The original segment of this embodiment may be formed by a single word segment or by a combination of consecutive word segments; for details, refer to the above embodiments, which are not repeated here. After the plurality of original segments in the long text are obtained, whether each original segment is a proper noun is judged. For example, whether all original segments in the long text belong to proper nouns may be judged according to a preset proper noun library; if all of them belong to proper nouns, it is determined that no original segment needing error correction exists in the long text; otherwise, if an original segment that does not belong to a proper noun exists, it is determined that an original segment needing error correction exists in the long text. The proper noun library of this embodiment may be a database generated in advance by collecting statistics on data in the field of the long text and extracting the proper nouns, so that it contains all the proper nouns of that field.
After the judgment, if original segments that are not proper nouns exist in the long text, PT segment recall is performed on the original segments needing error correction according to the PT table preset for the field of the long text, and the plurality of recalled candidate segments are collected into a candidate segment set.
In this embodiment, before the step 300, a PT table in the field of long articles may also be preset, and for example, at least one of the following modes may be specifically included:
Firstly, obtaining the change frequency from an original segment to a replacement segment according to big data statistics of the behavior of users actively modifying their search words in the field of long texts, and storing the original segment, the replacement segment and the change frequency from the original segment to the replacement segment into the PT table;
For example, if a user successively inputs a misspelled "Qinghua university" and then the corrected "Qinghua university", a change "misspelled Qinghua -> correct Qinghua" and a change "misspelled Qinghua university -> correct Qinghua university" can be collected. During the input process, if the user finds that the previous input was wrong, the user will actively modify the search word to the correct one; from this behavior it can be known that the later, modified search word is the correct one. For example, through statistics over a preset time period, it may be found that the change frequency of "misspelled Qinghua -> correct Qinghua" is 100 and the change frequency of "misspelled Qinghua university -> correct Qinghua university" is 70.
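Accumulating such change frequencies can be sketched as follows (a hypothetical helper; the PT table is represented as a counter over (original, replacement) pairs):

```python
from collections import Counter

def build_pt_table(modifications):
    """Accumulate (original segment -> replacement segment) change
    frequencies from observed user modification behavior, e.g. a user
    re-entering a corrected query. `modifications` is an iterable of
    (original, replacement) pairs; the result maps each pair to its
    change frequency."""
    pt = Counter()
    for original, replacement in modifications:
        pt[(original, replacement)] += 1
    return pt
```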
Secondly, obtaining the change frequency from the original segment to the replacement segment according to the segment alignment mapping between the search words input by users in the long text field and the titles of the search results returned by the search server, and storing the original segment, the replacement segment and the change frequency into the PT table. For example, fig. 4 is a schematic view of a search interface according to this embodiment. As shown in fig. 4, the search word input by the user at a certain time is a misspelled "Qinghua university", but the search results of the search server include both the misspelled and the correct "Qinghua university". Thus, for each search result whose title includes the correct "Qinghua university", a change "misspelled Qinghua university -> correct Qinghua university" can be recorded once; for each search result whose title includes the misspelled form, the reverse change can be recorded once. If 30 results are searched in total, of which 28 titles use the correct "Qinghua university" and 2 titles use the misspelled form, the change frequency of "misspelled Qinghua university -> correct Qinghua university" is considered to be 28 and the reverse change frequency is 2.
Thirdly, obtaining the change frequency from the original segment to the replacement segment according to the alignment mapping, in user feedback data, between the search words input by users in the long text field and the active error corrections of the search server, and storing the original segment, the replacement segment and the change frequency into the PT table. Unlike the second case above, in this case the replacement segment needs to be determined based on the feedback of the user. For example, the search word input by the user at a certain time is a misspelled "Qinghua university", and the search results of the search server include both the misspelled and the correct "Qinghua university"; if the user clicks on a search result whose title includes the correct "Qinghua university", a change "misspelled Qinghua university -> correct Qinghua university" is counted once; if the user clicks on a search result whose title includes the misspelled form, the reverse change is counted once.
In the manners of the above embodiments, the PT table of this embodiment may be collected and counted over a preset time period. The PT table may be generated by any one of the three manners described above, or by combining any two or all three of them. According to the above embodiments, the PT table of this embodiment records a plurality of groups of original segments, replacement segments and corresponding change frequencies; for example, each group of data may be stored in the form "original segment -> replacement segment, change frequency". The same original segment may correspond to a plurality of replacement segments, and the change frequency corresponding to each replacement segment may differ. When PT segment recall is performed on an original segment needing error correction according to the PT table, all replacement segments corresponding to the original segment may be obtained from the PT table, together with the change frequency corresponding to each replacement segment; then the TOP n replacement segments with the largest change frequencies are taken from the plurality of replacement segments as the candidate segments corresponding to the original segment, and the plurality of candidate segments form the candidate segment set.
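The TOP-n recall step can be sketched as follows (illustrative names, assuming the PT table is stored as a mapping from (original, replacement) pairs to change frequencies):

```python
def pt_recall(pt_table, original, top_n=5):
    """Recall the TOP-n replacement segments for `original` from the
    PT table, ordered by descending change frequency; the result is
    the candidate segment set for the original segment."""
    candidates = [(repl, freq) for (orig, repl), freq in pt_table.items()
                  if orig == original]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [repl for repl, _ in candidates[:top_n]]
```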
301. Respectively scoring each candidate segment in the candidate segment set by utilizing a pre-trained segment scoring model;
in this embodiment, a segment scoring model may be trained in advance, and is used to score each candidate segment in the candidate segment set. In this embodiment, for the same original segment, the probability of using the candidate segment with a high score to correct the original segment in the long text is higher than the probability of using the candidate segment with a low score to correct the original segment in the long text. However, when correcting a long text, it is necessary to consider factors such as the smoothness of the original segment and the context, and therefore, the original segment is not necessarily replaced with the candidate segment with the highest score in the finally obtained corrected text. The fragment scoring model of the present embodiment may adopt a GBRank network model.
For example, the step 301 may specifically include the following steps:
(a3) acquiring the quality characteristics of the original segment in the field of the long text and the quality characteristics of each candidate segment in the candidate segment set in the field of the long text;
for example, the obtaining of the quality characteristics of the original segment in the field of long texts may specifically include: the frequency of the original segment appearing in the corpus of the long text field and the frequency of the original segment and the combination of the context segment appearing together in the corpus of the long text field are obtained.
Correspondingly, the obtaining of the quality characteristics of each candidate segment in the candidate segment set in the field of the long text specifically includes: and acquiring the frequency of the occurrence of each candidate segment in the candidate segment set in the corpus and the frequency of the occurrence of the combination of each candidate segment and the context segment in the corpus.
In this embodiment, the context segment of the original segment is a segment located immediately before or after the original segment in the long text; for details, refer to the related description of the embodiment shown in fig. 2, which is not repeated here. Alternatively, considering that a segment including more word segments appears in the long text with a smaller probability, this embodiment may further provide that if the original segment already includes three or more word segments, no context segment is taken for it. When the context segment of the original segment needs to be taken, obtaining the quality features of the original segment requires acquiring the respective frequencies, in the corpus, of the original segment, the original segment plus the following segment, the preceding segment plus the original segment, and the preceding segment plus the original segment plus the following segment. Correspondingly, the quality features of each candidate segment are obtained in the same manner, which is not repeated here.
(b3) Acquiring the relative quality characteristics of each candidate fragment and the original fragment according to the quality characteristics of the original fragment in the field of the long text and the quality characteristics of each candidate fragment in the field of the long text;
For example, the step (b3) may specifically include: according to the frequency of the original segment in the corpus, the frequencies of the combinations of the original segment and its context segments in the corpus, the frequency of each candidate segment in the corpus, and the frequencies of the combinations of each candidate segment and its context segments in the corpus, obtaining: the frequency ratio of each candidate segment to the original segment in the corpus and the frequency ratios of the combinations of each candidate segment and the context segments to the combinations of the original segment and the context segments in the corpus; and/or the frequency difference between each candidate segment and the original segment in the corpus and the frequency differences between the combinations of each candidate segment and the context segments and the combinations of the original segment and the context segments in the corpus.
Specifically, by obtaining the frequency ratio of each candidate segment to the original segment in the corpus and the frequency ratio of each candidate segment and context segment combination to the original segment and context segment combination in the corpus, and/or the frequency difference of the candidate segments and the original segment in the corpus and the frequency difference of the combination of the candidate segments and the context segments and the combination of the original segment and the context segments in the corpus can embody the fusion of the candidate segments and the context segments, if the candidate segment and the original segment occur in the corpus more frequently, and the combination of the candidate segment and the context segment and the combination of the original segment and the context segment occur in the corpus less frequently, the compatibility of the candidate segment with the context segment is poor, and the original segment is not suitable to be replaced. And vice versa.
Similarly, if the frequency difference between the candidate segment and the original segment in the corpus is small, i.e. their occurrence probabilities differ little, but the combination of the candidate segment and the context segments appears in the corpus far more frequently than the combination of the original segment and the context segments, this indicates that the candidate-plus-context combination is used much more often in the corpus than the original-plus-context combination; the candidate segment can then be considered strongly compatible with the context segments and may be used to replace the original segment, and vice versa.
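The frequency-ratio and frequency-difference features described above can be sketched as follows. This is a minimal illustration, assuming the corpus n-gram counts are held in a `Counter` and that a segment is combined with its context by simple string concatenation; all names are hypothetical, not from the patent.

```python
from collections import Counter

def relative_quality_features(corpus_ngrams: Counter, original: str,
                              candidate: str, prev_ctx: str, next_ctx: str):
    """Relative quality features of a candidate segment vs. the original:
    frequency ratios and differences, alone and combined with context."""
    eps = 1e-9  # smoothing so an unseen original segment does not divide by zero
    f_orig = corpus_ngrams[original]
    f_cand = corpus_ngrams[candidate]
    f_orig_ctx = corpus_ngrams[prev_ctx + original + next_ctx]
    f_cand_ctx = corpus_ngrams[prev_ctx + candidate + next_ctx]
    return {
        "seg_ratio": f_cand / (f_orig + eps),
        "seg_diff": f_cand - f_orig,
        "ctx_ratio": f_cand_ctx / (f_orig_ctx + eps),
        "ctx_diff": f_cand_ctx - f_orig_ctx,
    }
```

A candidate whose `seg_ratio` is high but whose `ctx_ratio` is low appears often on its own yet fuses poorly with the context, matching the analysis above.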
It should be noted that, if the original segment already includes 3 or more word segments, the context segments may not be taken; in this case, only the frequency of each candidate segment appearing in the corpus and the frequency of the original segment appearing in the corpus are used, and the ratio and/or the difference between the two frequencies is obtained as the relative quality feature of each candidate segment and the original segment. Compared with the scheme above that also obtains the context segments, the feature content obtained this way is not as rich; therefore, in this embodiment, it is preferable to obtain the context segments.
In addition, it should be noted that when a context segment is required but the original segment is at the head or the tail of a sentence of the long text, the corresponding empty context segment can be represented by a preset sentence-head feature or a preset sentence-tail feature, so as to ensure the alignment of the data.
(c3) Obtaining relative historical behavior characteristics of replacing the original segment with each candidate segment;
Since the PT table records historical modification information, the historical behavior features of this embodiment may be features related to the modification frequencies in the PT table. For example, the step (c3) may specifically include the following steps:
(a4) acquiring a first modification frequency of modifying an original fragment into each candidate fragment in a PT table;
(b4) acquiring a second modification frequency of modifying the combination of the original fragment and the context fragment in the PT table into the combination of each candidate fragment and the context fragment;
(c4) and obtaining a frequency ratio and/or a frequency difference according to the first modification frequency and the second modification frequency, wherein the frequency ratio is equal to the second modification frequency divided by the first modification frequency, and the frequency difference is equal to the second modification frequency minus the first modification frequency.
In addition, it should be noted that, if the original segment includes 3 or more word segments and the context segments are not taken, the relative historical behavior feature may be set to null or to a preset feature symbol.
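Steps (a4)-(c4) can be sketched as a small helper. This assumes the PT table's modification counts are exposed as a plain dictionary keyed by (source, replacement) pairs, which is an illustrative data structure, not the patent's actual storage format.

```python
def pt_behavior_features(pt_counts, original, candidate,
                         prev_ctx="", next_ctx=""):
    """Relative historical behavior features from the PT table.
    first  = times the original segment was modified into the candidate;
    second = times the context+original combination was modified into
             the context+candidate combination.
    Per step (c4): ratio = second / first, difference = second - first."""
    eps = 1e-9  # avoid division by zero when the pair was never modified
    first = pt_counts.get((original, candidate), 0)
    second = pt_counts.get((prev_ctx + original + next_ctx,
                            prev_ctx + candidate + next_ctx), 0)
    return {"freq_ratio": second / (first + eps),
            "freq_diff": second - first}
```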
(d3) Acquiring semantic similarity characteristics of each candidate segment and the original segment;
In this embodiment, a preset dictionary may be adopted to obtain the word vector of each candidate segment and the word vector of the original segment, and then the cosine distance between the word vector of each candidate segment and the word vector of the original segment is calculated as the semantic similarity between the candidate segment and the original segment. Correspondingly, if the number of word segments included in the original segment is 3 or more in this embodiment, the semantic similarity between each candidate segment and the original segment is taken directly as the semantic similarity feature between each candidate segment and the original segment. If the number of word segments included in the original segment is less than 3, the context segments of the original segment need to be obtained, and at this time the semantic similarity between the combination of each candidate segment and the context segments and the combination of the original segment and the context segments needs to be obtained. Similarly, the word vector of the combination of each candidate segment and the context segments and the word vector of the combination of the original segment and the context segments are obtained, and then the cosine distance between the word vectors is calculated as the semantic similarity feature of the combination of the candidate segment and the context segments and the combination of the original segment and the context segments. Correspondingly, the combination of the original segment and the context segments includes three combinations: the preceding segment plus the original segment, the original segment plus the following segment, and the preceding segment plus the original segment plus the following segment.
Correspondingly, the semantic similarity feature between each candidate segment and the original segment is formed by splicing: the semantic similarity of the candidate segment and the original segment; the semantic similarity of the combination of the preceding segment plus the candidate segment and the combination of the preceding segment plus the original segment; the semantic similarity of the combination of the candidate segment plus the following segment and the combination of the original segment plus the following segment; and the semantic similarity of the combination of the preceding segment plus the candidate segment plus the following segment and the combination of the preceding segment plus the original segment plus the following segment.
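The four spliced cosine similarities can be sketched as below, assuming a hypothetical `vec` lookup that maps a segment string to its word vector from the preset dictionary (the patent does not specify how combined segments are vectorized; simple string concatenation before lookup is an assumption here).

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def semantic_similarity_features(vec, original, candidate,
                                 prev_ctx, next_ctx):
    """Splice the four similarities: segment alone, with the preceding
    segment, with the following segment, and with both context segments."""
    pairs = [
        (candidate, original),
        (prev_ctx + candidate, prev_ctx + original),
        (candidate + next_ctx, original + next_ctx),
        (prev_ctx + candidate + next_ctx, prev_ctx + original + next_ctx),
    ]
    return [cosine_similarity(vec(a), vec(b)) for a, b in pairs]
```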
In addition, the above-mentioned acquisition of the relative quality feature, the relative historical behavior feature and the semantic similarity feature of each candidate segment and the original segment may also refer to the above-mentioned acquisition of the relative quality feature, the relative historical behavior feature and the semantic similarity feature of the target segment and the original segment shown in fig. 2, respectively.
(e3) And respectively obtaining the scoring of each candidate segment according to the relative quality characteristics of each candidate segment and the original segment, the relative historical behavior characteristics of each candidate segment and the original segment, the semantic similarity characteristics of each candidate segment and the original segment and a segment scoring model.
Then, the relative quality features of each candidate segment and the original segment, the relative historical behavior features of each candidate segment and the original segment, and the semantic similarity features of each candidate segment and the original segment, obtained in the above steps, are input into the pre-trained segment scoring model, and the segment scoring model predicts the score of each candidate segment.
For example, when the segment scoring model is trained, training original segments and training replacement segments forming positive and negative examples can be collected. If the training replacement segment is a correct replacement of the training original segment, the corresponding score is 1 and the training data is a positive example; otherwise, if the replacement is wrong, the corresponding score is 0 and the training data is a negative example. The proportion of positive to negative examples in the training data is greater than 1, and may be, for example, 5:1 or 4:1. Before training, initial values are set for the parameters of the segment scoring model in advance; the training data are then input in sequence, and if the score predicted by the segment scoring model is inconsistent with the known score, the parameters of the segment scoring model are adjusted so that the predicted result becomes consistent with the known result. In this manner, the segment scoring model is continuously trained with tens of millions of pieces of training data until the results predicted by the segment scoring model are consistent with the known results; the parameters of the segment scoring model are thereby determined, the segment scoring model is determined, and its training is complete. The more training data is used during training, the more accurate the trained segment scoring model is, and the more accurate the scores it subsequently predicts for candidate segments are. According to the above manner, the predicted score may be between 0 and 1. In practical applications, the segment scoring model may also be set to output scores in other value ranges, such as 0-100; the principle is similar and is not described again here.
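The training loop described above — predict a score in (0, 1), compare it with the known 0/1 label, and adjust the parameters on mismatch — can be illustrated with a tiny logistic-regression stand-in for the segment scoring model. The patent does not specify the model family, so this is only a sketch under that assumption.

```python
import math

def train_segment_scorer(samples, epochs=200, lr=0.1):
    """Tiny logistic-regression stand-in for the segment scoring model.
    Each sample is (feature_vector, label): label 1 for a correct
    replacement (positive example), 0 for a wrong one (negative)."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted score in (0, 1)
            g = p - y                        # log-loss gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    def score(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return score
```

The returned `score` function plays the role of the trained model: its output is naturally between 0 and 1, matching the score range described in the text.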
Further optionally, before scoring each candidate segment, the following steps may be further included: acquiring the proper noun feature of each candidate segment according to a preset proper noun library; and/or obtaining the pinyin edit distance feature of each candidate segment and the original segment.
Specifically, the proper noun feature of each candidate segment is used to identify whether the candidate segment belongs to a proper noun. For example, whether a candidate segment belongs to a proper noun is determined according to the proper noun library; if so, the corresponding proper noun feature is 1, otherwise it is 0. Correspondingly, if the candidate segment is a proper noun, the score output by the segment scoring model for the candidate segment is higher; if not, the corresponding output score is lower. In addition, the pinyin edit distance between the candidate segment and the original segment is specifically the number of letters in the pinyin that need to be adjusted to edit the pronunciation of the candidate segment into the pronunciation of the original segment. Correspondingly, the greater the pinyin edit distance between the candidate segment and the original segment, the smaller the probability of replacing the original segment by the candidate segment, and the score output by the segment scoring model for the candidate segment can be smaller; the smaller the pinyin edit distance between the candidate segment and the original segment, the higher the probability that the original segment is replaced by the candidate segment, and the score output by the segment scoring model for the candidate segment can be larger.
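The two optional features can be computed as follows. The edit distance here is a standard Levenshtein distance over pinyin strings, which is one plausible reading of "the number of letters in the pinyin that need to be adjusted"; the function names are illustrative.

```python
def pinyin_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two pinyin strings (single-row DP)."""
    n = len(b)
    dp = list(range(n + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def proper_noun_feature(segment: str, proper_nouns: set) -> int:
    """1 if the candidate segment is in the proper noun library, else 0."""
    return 1 if segment in proper_nouns else 0
```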
Based on the above principle, correspondingly, the step (e3) may specifically include: respectively obtaining the score of each candidate segment according to the relative quality features of each candidate segment and the original segment, the relative historical behavior features of each candidate segment and the original segment, and the semantic similarity features of each candidate segment and the original segment, in combination with the proper noun feature of each candidate segment and the pinyin edit distance feature of each candidate segment and the original segment, using the segment scoring model. Correspondingly, when the segment scoring model is trained, the proper noun feature of the training replacement segment and the pinyin edit distance feature of the training original segment and the training replacement segment in the training data also need to be obtained, and the segment scoring model is trained in combination with the previous features.
302. And acquiring a target segment corresponding to each original segment from the candidate segment set of each original segment of the long text needing error correction in a decoding mode according to the score of each candidate segment, thereby obtaining the corrected text of the long text.
Finally, based on the score of each candidate segment, the target segment of each original segment is acquired from the candidate segment set of each original segment needing error correction, so as to obtain the corrected text of the long text. For example, the candidate segment with the highest score may be directly taken as the target segment. Alternatively, the candidate segment that combines best with the context in the long text may be taken as the target segment in the corrected text, or other means may be used to obtain the corrected text.
For example, after segment recall is performed on the different original segments in a long text, each original segment obtains a plurality of candidate segment results, so that the different original segments correspond to a plurality of possible candidate segment combinations, forming a segment candidate network. For example, if a certain long text includes original segments A, B and C, the candidate segments corresponding to original segment A are 1, 2 and 3; the candidate segments corresponding to original segment B are 4, 5 and 6; and the candidate segments corresponding to original segment C are 7, 8 and 9. In this case, the candidate segments of each original segment may be used to replace that original segment; that is, candidate segment 1 may be combined with candidate segment 4, 5 or 6, candidate segment 2 may be combined with candidate segment 4, 5 or 6, and candidate segment 3 may be combined with candidate segment 4, 5 or 6, each such combination in turn being combined with candidate segment 7, 8 or 9, forming the segment candidate network. A decoding algorithm may then be used to obtain the optimal candidate segment corresponding to each original segment from the segment candidate network, so as to obtain the optimal corrected text. For example, the decoding algorithm may include, without limitation, the viterbi algorithm (viterbi), beam search (beam search), or greedy search (greedy search).
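A beam-search decode over such a segment candidate network can be sketched as below. The standalone score `seg_score` and the adjacent-compatibility score `pair_score` are hypothetical scoring functions standing in for the segment scoring model and a transition score between neighboring choices; setting `beam_width=1` degenerates to greedy search.

```python
def beam_search_decode(candidates, seg_score, pair_score, beam_width=2):
    """Decode the segment candidate network: pick one candidate per
    original segment so that the summed standalone + adjacency scores
    are maximal within the beam.  candidates holds one list of options
    per original segment, e.g. [["1","2","3"], ["4","5","6"], ...]."""
    beams = [([], 0.0)]  # (partial path, accumulated score)
    for options in candidates:
        extended = []
        for path, s in beams:
            for c in options:
                bonus = pair_score(path[-1], c) if path else 0.0
                extended.append((path + [c], s + seg_score(c) + bonus))
        extended.sort(key=lambda t: t[1], reverse=True)
        beams = extended[:beam_width]  # keep only the best partial paths
    return beams[0][0]
```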
Or, for example, the step 302 may specifically include the following steps: for each original fragment, acquiring at least two preselected fragments corresponding to the original fragment from the candidate fragment set according to the score of each candidate fragment in the candidate fragment set; and acquiring a target segment corresponding to each original segment from at least two pre-selected segments corresponding to each original segment of the long text needing error correction in a decoding mode, so as to obtain a corrected text of the long text.
Specifically, if the number of candidate segments corresponding to each original segment is large, at least two candidate segments with higher scores may be taken as preselected segments in descending order of score, and then the target segment corresponding to each original segment is acquired from the at least two preselected segments corresponding to each original segment of the long text needing error correction in a decoding manner, so as to obtain the corrected text of the long text.
The artificial intelligence-based long text error correction method can correct erroneous segments in the long text and effectively improves the editing quality of the long text. The technical scheme of this embodiment is proposed for the long text error correction scenario, is applicable to error correction behaviors in text scenarios, can quickly and effectively produce error correction results with high error correction efficiency, and can be used to assist in improving the content production quality of long texts and improving the user experience.
Fig. 5 is a flowchart of a second embodiment of the artificial intelligence-based long text error correction method of the present invention. As shown in fig. 5, the method for correcting error of long text based on artificial intelligence in this embodiment further adds an Edit Distance (ED) fragment recall to the original fragment requiring error correction based on the technical solution of the embodiment shown in fig. 3, and introduces the technical solution of the present invention in detail. As shown in fig. 5, the method for correcting a long text based on artificial intelligence in this embodiment may specifically include the following steps:
400. judging whether all original segments in the long text belong to proper nouns according to the proper noun library; if all belong to proper nouns, go to step 401; otherwise, go to step 402;
401. determining that the original segments included in the long text are all proper nouns, in which case the long text does not need error correction, and ending;
402. determining that original fragments which do not belong to the proper nouns exist in the long text, and determining that the original fragments of the non-proper nouns in the long text need to be corrected; step 403 is executed;
403. according to a PT table preset in the field of long texts, performing PT fragment recall on an original fragment needing error correction to obtain a candidate fragment set of the original fragment, wherein the candidate fragment set comprises a plurality of candidate fragments; step 404 is executed;
the details of the implementation of steps 400-403 can refer to the description of the embodiment shown in fig. 3, and are not repeated herein.
404. Acquiring the frequency of the original segment appearing in a corpus corresponding to the field of long texts, the frequency of the combination of the original segment and the context segment appearing in the corpus, the frequency of the original segment changing in a PT table, the frequency of the combination of the original segment and the context segment changing in the PT table and the semantic similarity of the original segment and the context segment; step 405 is executed;
Similarly, for the combination of the original segment and the context segments in this embodiment, reference can be made to the related description of the embodiment shown in fig. 1, which is not repeated here. The frequency with which the original segment appears in the corpus corresponding to the field of the long text can be obtained by counting its occurrences in the corpus. The frequency with which the original segment is changed in the PT table may be the total number of times the original segment is replaced in the PT table by any segment other than itself, for example the total number of times "qinghua" is replaced by all replacement segments other than "qinghua". Likewise, the frequency with which the combination of the original segment and the context segments is changed in the PT table may be the total number of times that combination is replaced in the PT table by any segment other than itself, such as the total number of times "Qinghua university" is replaced by all replacement segments other than "Qinghua university".
The semantic similarity between the original segment and the context segments in this embodiment may specifically be obtained by obtaining the word vector of the original segment and the word vector of the context segments, and calculating the cosine similarity between the two word vectors. Here, the word vector of the context segments is the word vector of the combination of the preceding segment plus the following segment. Alternatively, in this embodiment, the semantic similarity between the original segment and all other segments in the long text except the original segment may also be used in place of the semantic similarity between the original segment and the context segments, forming a new alternative scheme.
405. Obtaining the confidence coefficient of the original fragment according to the frequency of the original fragment appearing in the corpus corresponding to the field of the long text, the frequency of the combination of the original fragment and the context fragment appearing in the corpus, the change frequency of the original fragment in the PT table, the change frequency of the combination of the original fragment and the context fragment in the PT table, the semantic similarity of the original fragment and the context fragment and a preset language compliance grading model; go to step 406;
for example, the step 405 in this embodiment specifically includes the following two implementation manners:
in a first implementation manner, the confidence score model is used to determine the confidence, which may specifically include the following steps:
(a5) predicting the smoothness of the original fragments according to the frequency of the original fragments appearing in the corpus corresponding to the field of the long text, the frequency of the combination of the original fragments and the context fragments in the long text appearing in the corpus and a language smoothness grading model;
the language popularity grading model of the embodiment is used for grading the popularity of the original segment in the long text. The language popularity grading model can predict the popularity of the original segment according to the frequency of the original segment appearing in the corpus corresponding to the field of the long text and the frequency of the combination of the original segment and the context segment in the long text appearing in the corpus. For example, the compliance score may be between 0 and 1, and it may be defined that a larger value is more compliant and a smaller value is less compliant. Alternatively, other numerical ranges may be used to indicate general order, such as 0-100.
The language smoothness scoring model of this embodiment may also be obtained by training in advance. For example, a plurality of pieces of training data are collected in advance, each corresponding to one training long text and including the frequency with which the training original segments in the training long text appear in the corpus, the frequency with which the combinations of the training original segments and the training context segments in the training long text appear in the corpus, and the known smoothness of the training original segments. The acquired training data may include positive training data with a known smoothness of 1 and negative training data with a known smoothness of 0. The proportion of positive to negative examples may be greater than 1, for example preferably 5:1 or 4:1. Before training, initial values are set for the parameters of the language smoothness scoring model. During training, the training data are input into the language smoothness scoring model in sequence; the model predicts the smoothness for the training data, it is judged whether the predicted smoothness is consistent with the known smoothness, and if not, the parameters of the language smoothness scoring model are adjusted so that the predicted smoothness becomes consistent with the known smoothness. In this manner, the language smoothness scoring model is continuously trained with tens of millions of pieces of training data until the predicted smoothness is consistent with the known smoothness; the parameters of the language smoothness scoring model are thereby determined, the model is determined, and its training is complete.
(b5) Obtaining the confidence coefficient of the original fragment according to the smoothness of the original fragment, the changing frequency of the original fragment in a PT table, the changing frequency of the combination of the original fragment and the context fragment in the PT table and the semantic similarity of the original fragment and the context fragment in combination with a pre-trained confidence coefficient scoring model;
Similarly, in this embodiment, a confidence scoring model is trained in advance and used to obtain the confidence of the original segment. In this embodiment, the confidence may be set to be between 0 and 1, where a larger confidence value indicates a higher confidence and a smaller value indicates a lower confidence. In practice, the confidence may also be set within other value ranges, such as 0-100. When the model is used, the smoothness of the original segment, the frequency with which the original segment is changed in the PT table, the frequency with which the combination of the original segment and the context segments is changed in the PT table, and the semantic similarity between the original segment and the context segments are input into the trained confidence scoring model, and the confidence scoring model outputs the confidence of the original segment.
Similarly, the confidence scoring model of this embodiment may also be obtained by pre-training. For example, a plurality of pieces of training data are collected in advance, each including the smoothness of the training original segment, the frequency with which the training original segment is changed in the PT table, the frequency with which the combination of the training original segment and the training context segments is changed in the PT table, the semantic similarity between the training original segment and the training context segments, and the known confidence corresponding to each training original segment; each parameter is obtained in the same manner as described in the above embodiments. The acquired training data may include positive training data with a known confidence of 1 and negative training data with a known confidence of 0. The proportion of positive to negative examples may be greater than 1, for example preferably 5:1 or 4:1. Before training, initial values are set for the parameters of the confidence scoring model. During training, the training data are input into the confidence scoring model in sequence; the model predicts the confidence for the training data, it is judged whether the predicted confidence is consistent with the known confidence, and if not, the parameters of the confidence scoring model are adjusted so that the predicted confidence becomes consistent with the known confidence.
By adopting the mode, tens of millions of training data are used for continuously training the confidence coefficient scoring model until the predicted confidence coefficient is consistent with the known confidence coefficient, the parameters of the confidence coefficient scoring model are determined, the confidence coefficient scoring model is determined, and the training of the confidence coefficient scoring model is finished.
In addition, in the training and prediction of all models involved in this embodiment, the feature data input into the models may be normalized in advance, and the normalization processing mode is not limited.
In a second implementation manner, the determining the confidence by using a threshold may specifically include the following steps:
(a6) predicting the smoothness of the original fragments according to the frequency of the original fragments appearing in the corpus corresponding to the field of the long text, the frequency of the combination of the original fragments and the context fragments in the long text appearing in the corpus and a language smoothness grading model;
the implementation manner of step (a6) is the same as that of step (a5), and reference may be made to the description of step (a5) for details, which are not repeated herein.
(b6) Respectively judging whether the smoothness of the original segment is greater than a preset smoothness threshold, whether the frequency with which the original segment is changed in the PT table and the frequency with which the combination of the original segment and the context segments is changed in the PT table are both greater than a preset frequency threshold, and whether the semantic similarity between the original segment and the context segments is greater than a preset similarity threshold; if all are, setting the confidence of the original segment to be greater than a preset confidence threshold; otherwise, setting the confidence of the original segment to be less than or equal to the preset confidence threshold.
In this embodiment, corresponding thresholds, namely a smoothness threshold, a frequency threshold and a similarity threshold, are preset respectively for the smoothness of the original segment, the frequency with which the original segment is changed in the PT table, the frequency with which the combination of the original segment and the context segments is changed in the PT table, and the semantic similarity between the original segment and the context segments. It is then judged whether each parameter is greater than its corresponding threshold. If every parameter is greater than its corresponding threshold, the confidence is considered high; the confidence is set to be greater than the preset confidence threshold, and it can be determined that the original segment does not need ED recall. Otherwise, if even one parameter is not greater than its corresponding threshold, the confidence is considered low; the confidence can be set to be less than the preset confidence threshold, and it can be determined that the original segment needs ED recall. The confidence threshold of this embodiment may be preset to an appropriate value according to practical experience.
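The threshold test of step (b6) reduces to a conjunction of comparisons. The threshold values below are illustrative placeholders, not values from the patent:

```python
def confidence_above_threshold(smoothness, seg_change_freq, ctx_change_freq,
                               semantic_sim, smooth_th=0.5, freq_th=10,
                               sim_th=0.6):
    """Step (b6): the confidence is set above the preset confidence
    threshold only when every parameter exceeds its own threshold
    (all threshold defaults here are hypothetical placeholders)."""
    return (smoothness > smooth_th
            and seg_change_freq > freq_th
            and ctx_change_freq > freq_th
            and semantic_sim > sim_th)
```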
406. Judging whether the confidence of the original segment is greater than a preset confidence threshold; if not, go to step 407; otherwise, determining that the original segment does not need ED segment recall, and step 408 is executed;
407. determining that the original segment needs ED segment recall; according to the pronunciation of the original segment, performing ED segment recall on the original segment by using the corpus in the field of the long text and/or the input prompt information provided for the original segment by a pinyin input method, and adding the recalled candidate segments to the candidate segment set; step 408 is executed;
The ED recall of this embodiment recalls candidate segments by a method of mixed initial-and-final double deletion applied to the phonetic notation string, i.e. the pinyin, of the original segment. The recalled candidate segments can come from the corpus: according to the pinyin of the original segment, the high-frequency part is selected by mixed initial-and-final deletion, phonetic notation is performed, and an inverted index is built through the pinyin. For example, for "china", the phonetic notation is "zhonghua"; in order to enlarge the recall, initials and finals are partially deleted to obtain index keys, and the correspondingly generated key-value pairs may be { "zhonghua", "zhhua", "onghua", "zhongua", "zhongh" } -> { "china" }. The corresponding candidate segments are then recalled from the corpus according to "zhonghua", "zhhua", "onghua", "zhongua" and "zhongh". Among these, "zhonghua" very easily recalls the corresponding candidate segment because it is the complete pinyin, while "zhhua", "onghua", "zhongua" and "zhongh" can recall the candidate segments corresponding to the pinyin by supplementing the missing initials or finals. Therefore, the candidate segments recalled by ED have the same or similar pronunciation as the original segment.
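The mixed initial-and-final deletion keys for building the inverted index can be generated as below, assuming the pinyin has already been segmented into its initial/final pieces (the segmentation step itself is not shown and the function name is illustrative):

```python
def deletion_index_keys(pieces):
    """Generate inverted-index keys from a pinyin split into initial and
    final pieces: keep the full pinyin plus every single-piece deletion."""
    keys = {"".join(pieces)}                       # the complete pinyin
    for i in range(len(pieces)):
        keys.add("".join(pieces[:i] + pieces[i + 1:]))  # drop one piece
    return keys
```

For "zhonghua" split as `["zh", "ong", "h", "ua"]`, this yields the key set {"zhonghua", "zhhua", "onghua", "zhongua", "zhongh"}.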
In addition, the candidate segments recalled by ED in this embodiment may also come from the recall results of a pinyin input method, specifically from the input prompt information provided for the original segment by the pinyin input method. According to users' common typing habits, recall is performed using the full pinyin, the initials, or mixed forms of the current word, e.g. "zhonghua", "zhongh" and "zhhua", to obtain the candidate word list of the pinyin input method. In practical applications, confusable sounds can be introduced to enlarge the recall results. For example, fig. 6 is an exemplary diagram of a mapping table of confusable sounds provided in this embodiment. As shown in fig. 6, some confusable sounds are provided. When candidate segments are recalled according to the pinyin input method, the retrieval results can be expanded by referring to the confusable sounds shown in fig. 6.
408. Score each candidate segment in the candidate segment set using the pre-trained segment scoring model; execute step 409;
409. According to the score of each candidate segment in the candidate segment set, acquire at least two preselected segments corresponding to the original segment from the candidate segment set; execute step 410;
410. Acquire, in a decoding manner, the target segment corresponding to each original segment from the at least two preselected segments corresponding to each original segment of the long text to be corrected, so as to obtain the corrected text of the long text; execute step 411;
the specific implementation of steps 408-410 may refer to the related description of the embodiment shown in fig. 3, and is not repeated here.
411. Perform error-correction intervention on the corrected segments in the corrected text, determine the final corrected text, and end.
For example, in this embodiment, performing error-correction intervention on a corrected segment in the corrected text specifically includes at least one of the following:
judging whether the corrected target segment in the corrected text and the corresponding original segment hit an error-correction pair in a preset blacklist; if so, restoring the target segment to the original segment; and
judging whether the corrected target segment in the corrected text and the corresponding original segment are synonyms; if so, restoring the target segment to the original segment.
The blacklist in this embodiment may be collected from error-correction pairs that previously proved to be mis-corrections. For example, after an original segment is corrected into a target segment, the user may restore the target segment back to the original segment, from which it can be determined that the correction was wrong. At this point the target segment and the original segment can be collected as an error-correction pair, and in practical applications several such pairs form the blacklist. The corrected segments in the corrected text are then intervened against the blacklist: for example, it is detected whether the corrected target segment and the original segment form an error-correction pair in the blacklist; if so, the target segment is restored to the original segment; otherwise, the corrected text is retained.
In addition, long-text error correction primarily corrects erroneous information and should not rewrite synonyms. In this embodiment, a synonym table storing each term segment and its corresponding synonym segments may also be maintained in advance. It is then detected, according to the synonym table, whether the corrected target segment and the corresponding original segment are synonyms; if so, the target segment is restored to the original segment; otherwise, the corrected text is retained.
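The two intervention checks above can be sketched together. The blacklist pair and synonym entries below are invented examples, not data from the patent.

```python
# Sketch of error-correction intervention: restore the original segment when
# the (original, target) pair hits the blacklist or the two are synonyms.
BLACKLIST = {("color", "colour")}                  # invented mis-correction pair
SYNONYMS = {"fast": {"quick"}, "quick": {"fast"}}  # invented synonym table

def intervene(original, target):
    """Return the segment that should remain after intervention."""
    if (original, target) in BLACKLIST:
        return original   # previously mis-corrected pair: restore
    if target in SYNONYMS.get(original, ()):
        return original   # synonym replacement is not an error: restore
    return target         # otherwise keep the correction

print(intervene("fast", "quick"), intervene("teh", "the"))  # fast the
```

In other words, only corrections that survive both checks remain in the final corrected text.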
Fig. 7 is a schematic diagram of an error correction result of the artificial-intelligence-based long text error correction method of this embodiment. For example, using the method of this embodiment, the long text "the teacher's trunk is fast and good" is corrected, and the resulting corrected text is "the teacher's trunk is fast and good"; it can be seen that the technical scheme of this embodiment corrects long text with high quality.
The artificial-intelligence-based long text error correction method of this embodiment can correct erroneous segments in long text and effectively improves long-text editing quality. The technical scheme of this embodiment is proposed for the long-text error-correction scenario, is applicable to error-correction behavior in text scenarios, can quickly and effectively produce error-correction results with high efficiency, and can assist in improving the content production quality of long text and the user experience. In addition, the technical scheme of this embodiment can further apply replacement intervention to erroneous segments, further optimizing the error-correction result.
The embodiments shown in figs. 3 and 5 are long-text error-correction scenarios to which the text error correction scheme of the present invention is applied. In practical applications, the embodiments shown in figs. 3 and 5 may be used after the embodiments shown in figs. 1 and 2: incremental training is performed on the segment scoring model according to the feedback information on the corrected text together with the target segments and original segments in the corrected text, so as to further improve the accuracy of the segment scoring model's predicted scores.
FIG. 8 is a block diagram of a first embodiment of an artificial intelligence based text error correction apparatus according to the present invention. As shown in fig. 8, the artificial intelligence based text error correction apparatus of this embodiment may specifically include:
the fragment information obtaining module 10 is configured to obtain an error-corrected target fragment in the error-corrected text and an original fragment corresponding to the target fragment in the original text; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model;
the feedback information acquiring module 11 is used for acquiring feedback information of a target result fed back by a user based on the error correction text;
the incremental training module 12 is configured to perform incremental training on the segment scoring model according to the target segment and the original segment acquired by the segment information acquiring module 10 and the feedback information acquired by the feedback information acquiring module 11;
the error correction module 13 is configured to perform error correction processing on the subsequent original text based on the segment scoring model trained by the incremental training module 12.
The implementation principle and technical effect of the text error correction device based on artificial intelligence implemented by using the above modules are the same as those of the related method embodiments, and reference may be made to the description of the related method embodiments in detail, which is not repeated herein.
Fig. 9 is a block diagram of a second embodiment of an artificial intelligence based text correction apparatus according to the present invention. As shown in fig. 9, the text error correction apparatus based on artificial intelligence of this embodiment may further include the following technical solutions on the basis of the technical solution of the embodiment shown in fig. 8.
As shown in fig. 9, in the artificial intelligence based text error correction apparatus of this embodiment, the incremental training module 12 specifically includes:
the relative feature information acquiring unit 121 is configured to acquire relative feature information between the target segment and the original segment acquired by the segment information acquiring module 10;
the determining unit 122 is configured to determine an ideal score of the target segment according to the feedback information acquired by the feedback information acquiring module 11;
the training unit 123 is configured to train a segment scoring model according to the relative feature information acquired by the relative feature information acquiring unit 121 and the ideal scoring of the target segment determined by the determining unit 122.
Further optionally, in the artificial intelligence based text error correction apparatus according to this embodiment, the relative feature information obtaining unit 121 is configured to perform at least one of the following operations:
acquiring relative quality characteristics between a target fragment and an original fragment acquired by the fragment information acquisition module 10;
acquiring relative historical behavior characteristics between a target fragment and an original fragment acquired by the fragment information acquisition module 10; and
the semantic similarity feature between the target segment and the original segment acquired by the segment information acquiring module 10 is acquired.
Further optionally, the relative feature information obtaining unit 121 is specifically configured to:
acquiring the frequency of the original segment in the corpus and the frequency of the combination of the original segment and the context segment in the original text appearing in the corpus together, wherein the frequency is acquired by the segment information acquisition module 10;
acquiring the frequency of the target segment in the corpus and the frequency of the combination of the target segment and the context segment in the corpus, which are acquired by the segment information acquisition module 10;
according to the frequency of the original segment appearing in the corpus, the frequency of the combination of the original segment and the context segment appearing together in the corpus, the frequency of the target segment appearing in the corpus, and the frequency of the combination of the target segment and the context segment appearing in the corpus, obtain: the frequency ratio of the target segment to the original segment appearing in the corpus and the frequency ratio of the combination of the target segment and the context segment to the combination of the original segment and the context segment appearing in the corpus; and/or the frequency difference of the target segment relative to the original segment in the corpus and the frequency difference of the combination of the target segment and the context segment relative to the combination of the original segment and the context segment in the corpus.
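The relative quality features just enumerated reduce to a few ratios and differences over corpus counts. The counts below are invented, and the `eps` smoothing term is an assumption to avoid division by zero.

```python
# Relative quality features between a target and an original segment:
# frequency ratios and differences over (invented) corpus counts.
def relative_quality(orig_freq, orig_ctx_freq, tgt_freq, tgt_ctx_freq, eps=1e-9):
    return {
        "segment_ratio": tgt_freq / (orig_freq + eps),
        "context_ratio": tgt_ctx_freq / (orig_ctx_freq + eps),
        "segment_diff": tgt_freq - orig_freq,
        "context_diff": tgt_ctx_freq - orig_ctx_freq,
    }

# A target segment far more frequent than the original suggests a good fix.
feats = relative_quality(orig_freq=3, orig_ctx_freq=1,
                         tgt_freq=300, tgt_ctx_freq=120)
```

Large ratios or positive differences indicate that the target segment (alone and with its context) is more common in the corpus than the original, which the scoring model can use as evidence for the correction.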
Further optionally, the relative feature information obtaining unit 121 is specifically configured to:
acquiring the first modification frequency, in the PT (phrase substitution) table, of modifying the original segment acquired by the segment information acquiring module 10 into the target segment acquired by the segment information acquiring module 10;
acquiring the second modification frequency, in the PT table, of modifying the combination of the original segment and the context segment acquired by the segment information acquiring module 10 into the combination of the target segment and the context segment;
and obtaining a frequency ratio and/or a frequency difference according to the first modification frequency and the second modification frequency, wherein the frequency ratio is equal to the second modification frequency divided by the first modification frequency, and the frequency difference is equal to the second modification frequency minus the first modification frequency.
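The relative historical-behavior features follow the same ratio/difference pattern over PT-table counts; the counts here are invented for illustration.

```python
# Relative historical behavior from the PT (phrase substitution) table:
# how often the original alone vs. the original-with-context was changed
# into the target (counts are invented).
def relative_history(first_freq, second_freq, eps=1e-9):
    ratio = second_freq / (first_freq + eps)  # second divided by first
    diff = second_freq - first_freq           # second minus first
    return ratio, diff

ratio, diff = relative_history(first_freq=40, second_freq=10)
```

A ratio well below 1 here would mean the substitution was historically made for the segment alone far more often than for the segment in this context, which weakens the case for applying it.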
Further optionally, the relative feature information obtaining unit 121 is specifically configured to:
acquiring semantic similarity between a target fragment acquired by the fragment information acquisition module 10 and an original fragment; and/or
The semantic similarity between the combination of the target segment and the context segment acquired by the segment information acquiring module 10 and the combination of the original segment and the context segment is acquired.
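Cosine similarity over embedding vectors is one common way to realize this semantic-similarity feature; the vectors below are toy values, and the patent does not specify a particular embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Identical embeddings score 1.0; orthogonal embeddings score 0.0.
print(cosine([1.0, 0.0], [1.0, 0.0]), cosine([1.0, 0.0], [0.0, 1.0]))
```

The same function can score either the segment pair or the combination-with-context pair, matching the two options described above.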
Further optionally, the relative feature information obtaining unit 121 is further specifically configured to execute at least one of the following:
respectively acquiring proper noun characteristics of an original fragment and a target fragment according to a preset proper noun library; and
and acquiring the pinyin editing distance characteristics of the target segment and the original segment.
Further optionally, in the artificial intelligence based text error correction apparatus of this embodiment, the determining unit 122 is specifically configured to:
inferring, according to the feedback information acquired by the feedback information acquiring module 11, whether the user accepts replacing the original segment in the corrected text with the target segment;
if the user is inferred to accept, setting the ideal score of the target segment to 1; otherwise, if the user is inferred not to accept, setting the ideal score of the target segment to 0.
Further optionally, in the artificial intelligence based text error correction apparatus of this embodiment, the training unit 123 is specifically configured to:
inputting the relative characteristic information into the segment scoring model to obtain a prediction score of the segment scoring model;
obtaining the size relation between the prediction scoring and the ideal scoring;
if the predicted score is less than the ideal score, adjusting parameters of the segment scoring model so that the predicted score output by the segment scoring model moves in an increasing direction;
if the predicted score is greater than the ideal score, adjusting parameters of the segment scoring model so that the predicted score output by the segment scoring model moves in a decreasing direction.
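The update rule of the training unit can be sketched with a linear scoring model. The linear form and learning rate are assumptions for illustration; the patent does not fix the model family.

```python
# One incremental update step: move the model's predicted score toward the
# ideal score (1 if the user accepted the correction, 0 otherwise).
def update(weights, features, ideal, lr=0.1):
    predicted = sum(w * f for w, f in zip(weights, features))
    if predicted < ideal:
        weights = [w + lr * f for w, f in zip(weights, features)]  # score up
    elif predicted > ideal:
        weights = [w - lr * f for w, f in zip(weights, features)]  # score down
    return weights

w = update([0.0, 0.0], [1.0, 2.0], ideal=1.0)
# After the step the prediction moves from 0.0 toward the ideal score of 1.0.
```

Repeating this step over the feedback-derived (features, ideal score) pairs is one way incremental training could gradually correct the model's predicted scores.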
The implementation principle and technical effect of the text error correction device based on artificial intelligence implemented by using the above modules are the same as those of the related method embodiments, and reference may be made to the description of the related method embodiments in detail, which is not repeated herein.
FIG. 10 is a block diagram of an embodiment of a computer device of the present invention. As shown in fig. 10, the computer device of this embodiment includes: one or more processors 30, and a memory 40, the memory 40 being configured to store one or more programs which, when executed by the one or more processors 30, cause the one or more processors 30 to implement the artificial-intelligence-based text error correction method of the embodiments shown in figs. 1-7 above. The embodiment shown in fig. 10 is exemplified as including a plurality of processors 30.
For example, fig. 11 is an exemplary diagram of a computer device provided by the present invention. FIG. 11 illustrates a block diagram of an exemplary computer device 12a suitable for use in implementing embodiments of the present invention. The computer device 12a shown in fig. 11 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
As shown in FIG. 11, computer device 12a is in the form of a general purpose computing device. The components of computer device 12a may include, but are not limited to: one or more processors 16a, a system memory 28a, and a bus 18a that connects the various system components (including the system memory 28a and the processors 16 a).
Bus 18a represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 12a typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12a and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28a may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30a and/or cache memory 32 a. Computer device 12a may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34a may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 11, and commonly referred to as a "hard drive"). Although not shown in FIG. 11, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18a by one or more data media interfaces. System memory 28a may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of the various embodiments of the invention described above in fig. 1-9.
A program/utility 40a having a set (at least one) of program modules 42a may be stored, for example, in system memory 28a, such program modules 42a including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may include an implementation of a network environment. Program modules 42a generally perform the functions and/or methodologies described above in connection with the various embodiments of fig. 1-9 of the present invention.
Computer device 12a may also communicate with one or more external devices 14a (e.g., keyboard, pointing device, display 24a, etc.), with one or more devices that enable a user to interact with computer device 12a, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12a to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22 a. Also, computer device 12a may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) through network adapter 20 a. As shown, network adapter 20a communicates with the other modules of computer device 12a via bus 18 a. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12a, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16a executes programs stored in the system memory 28a to perform various functional applications and data processing, such as implementing the artificial intelligence based text error correction method shown in the above embodiments.
The present invention also provides a computer-readable medium on which a computer program is stored, which when executed by a processor implements the artificial intelligence based text error correction method as shown in the above embodiments.
The computer-readable media of this embodiment may include RAM30a, and/or cache memory 32a, and/or storage system 34a in system memory 28a in the embodiment illustrated in fig. 11 described above.
With the development of technology, the propagation path of computer programs is no longer limited to tangible media, and the computer programs can be directly downloaded from a network or acquired by other methods. Accordingly, the computer-readable medium in the present embodiment may include not only tangible media but also intangible media.
The computer-readable medium of the present embodiments may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (20)

1. A text error correction method based on artificial intelligence, which is characterized by comprising the following steps:
acquiring an error-corrected target segment in an error-corrected text and an original segment corresponding to the target segment in an original text; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model;
acquiring feedback information of a target result fed back by a user based on the error correction text;
performing incremental training on the segment scoring model according to the target segment, the original segment and the feedback information;
and carrying out error correction processing on the subsequent original text based on the trained segment scoring model.
2. The method according to claim 1, wherein performing incremental training on the segment scoring model according to the target segment, the original segment, and the feedback information specifically comprises:
acquiring relative characteristic information between the target fragment and the original fragment;
determining an ideal score of the target segment according to the feedback information;
and training the segment scoring model according to the relative feature information and the ideal scoring of the target segment.
3. The method of claim 2, wherein obtaining relative feature information between the target segment and the original segment comprises at least one of:
acquiring relative quality characteristics between the target segment and the original segment;
acquiring relative historical behavior characteristics between the target segment and the original segment; and
and acquiring semantic similarity characteristics between the target fragment and the original fragment.
4. The method according to claim 3, wherein obtaining the relative quality features between the target segment and the original segment specifically comprises:
acquiring the frequency of the original fragments appearing in a corpus and the frequency of the original fragments appearing together with the combination of the context fragments in the original text in the corpus;
acquiring the frequency of the target segment in the corpus and the frequency of the combination of the target segment and the context segment in the corpus;
obtaining a frequency ratio of the target segments to the original segments to appear in the corpus and a frequency ratio of the combinations of the target segments and the context segments to appear in the corpus according to a frequency of the original segments to appear in the corpus, a frequency of the combinations of the original segments and the context segments to appear together in the corpus, a frequency of the target segments to appear in the corpus, and a frequency of the combinations of the target segments and the context segments to appear in the corpus, and/or the frequency difference of the target segment and the original segment in the corpus and the frequency difference of the combination of the target segment and the context segment and the combination of the original segment and the context segment in the corpus.
5. The method according to claim 4, wherein obtaining the relative historical behavior characteristics between the target segment and the original segment specifically comprises:
acquiring a first modification frequency of modifying the original segment into the target segment in the phrase substitution table;
acquiring a second modification frequency of the combination of the original segment and the context segment in the phrase substitution table to the combination of the target segment and the context segment;
and obtaining a frequency ratio and/or a frequency difference according to the first modification frequency and the second modification frequency, wherein the frequency ratio is equal to the second modification frequency divided by the first modification frequency, and the frequency difference is equal to the second modification frequency minus the first modification frequency.
6. The method according to claim 4, wherein obtaining semantic similarity features between the target segment and the original segment specifically comprises:
obtaining semantic similarity between the target fragment and the original fragment; and/or
And acquiring the semantic similarity of the combination of the target segment and the context segment and the combination of the original segment and the context segment.
7. The method according to any one of claims 3-6, wherein obtaining relative feature information between the target segment and the original segment further comprises at least one of;
respectively acquiring the special noun characteristics of the original fragment and the target fragment according to a preset special noun library; and
and acquiring the pinyin editing distance characteristics of the target segment and the original segment.
8. The method according to claim 2, wherein determining the ideal score of the target segment according to the feedback information specifically comprises:
presume whether the user accepts to adopt the target segment to replace the original segment in the error correction text or not according to the feedback information;
if the user acceptance is presumed, setting the ideal score of the target segment to 1; otherwise, if the user is presumed not to accept, the ideal score for the target segment is set to 0.
9. The method according to claim 2, wherein training the segment scoring model according to the relative feature information and the ideal score of the target segment comprises:
inputting the relative characteristic information into the segment scoring model to obtain a prediction score of the segment scoring model;
obtaining the magnitude relation between the prediction scoring and the ideal scoring;
if the predicted score is smaller than the ideal score, adjusting parameters of the segment scoring model to enable the predicted score output by the segment scoring model to change towards an increasing direction;
if the predicted score is greater than the ideal score, adjusting parameters of the segment scoring model so that the predicted score output by the segment scoring model changes towards a decreasing direction.
10. An artificial intelligence based text correction apparatus, the apparatus comprising:
the segment information acquisition module is used for acquiring an error-corrected target segment in an error-corrected text and an original segment corresponding to the target segment in an original text; the target segment is selected from a plurality of candidate segments of the original segment when the original text is subjected to error correction processing based on a pre-trained segment scoring model;
the feedback information acquisition module is used for acquiring feedback information of a target result fed back by a user based on the error correction text;
the increment training module is used for carrying out increment training on the segment scoring model according to the target segment, the original segment and the feedback information;
and the error correction module is used for carrying out error correction processing on the subsequent original text based on the trained segment scoring model.
11. The apparatus of claim 10, wherein the incremental training module specifically comprises:
a relative feature information acquiring unit, configured to acquire relative feature information between the target segment and the original segment;
a determining unit, configured to determine the ideal score of the target segment according to the feedback information; and
a training unit, configured to train the segment scoring model according to the relative feature information and the ideal score of the target segment.
12. The apparatus of claim 11, wherein the relative feature information obtaining unit is configured to perform at least one of:
acquiring relative quality characteristics between the target segment and the original segment;
acquiring relative historical behavior characteristics between the target segment and the original segment; and
acquiring semantic similarity characteristics between the target segment and the original segment.
13. The apparatus according to claim 12, wherein the relative feature information obtaining unit is specifically configured to:
acquiring the frequency with which the original segment appears in a corpus and the frequency with which the combination of the original segment and a context segment in the original text appears in the corpus;
acquiring the frequency with which the target segment appears in the corpus and the frequency with which the combination of the target segment and the context segment appears in the corpus; and
obtaining, according to the four frequencies acquired above, a frequency ratio of the target segment to the original segment appearing in the corpus and a frequency ratio of the combination of the target segment and the context segment to the combination of the original segment and the context segment appearing in the corpus, and/or a frequency difference between the target segment and the original segment in the corpus and a frequency difference between the combination of the target segment and the context segment and the combination of the original segment and the context segment in the corpus.
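The relative quality features of claim 13 reduce to four corpus counts and their ratios and differences. A minimal sketch, assuming segment counts come from an n-gram table (a plain `Counter` here, purely illustrative, as are the example words and counts):

```python
from collections import Counter

def relative_quality_features(corpus_counts: Counter,
                              original: str, target: str, context: str):
    """Ratio and difference features between target and original segments,
    both standalone and combined with the context segment."""
    f_orig = corpus_counts[original]
    f_orig_ctx = corpus_counts[f"{original} {context}"]
    f_tgt = corpus_counts[target]
    f_tgt_ctx = corpus_counts[f"{target} {context}"]
    return {
        "segment_ratio": f_tgt / f_orig if f_orig else 0.0,
        "segment_diff": f_tgt - f_orig,
        "combo_ratio": f_tgt_ctx / f_orig_ctx if f_orig_ctx else 0.0,
        "combo_diff": f_tgt_ctx - f_orig_ctx,
    }

counts = Counter({"their": 900, "there": 400,
                  "their house": 120, "there house": 3})
feats = relative_quality_features(counts, original="there", target="their",
                                  context="house")
# the corrected segment is more frequent, and far more frequent in context
assert feats["segment_ratio"] > 1 and feats["combo_ratio"] > 1
```

A ratio above 1 (or a positive difference) suggests the candidate is the more plausible segment, especially when the context-combined ratio agrees.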
14. The apparatus according to claim 13, wherein the relative feature information obtaining unit is specifically configured to:
acquiring a first modification frequency of modifying the original segment into the target segment in a phrase substitution table;
acquiring a second modification frequency of modifying the combination of the original segment and the context segment into the combination of the target segment and the context segment in the phrase substitution table; and
obtaining a frequency ratio and/or a frequency difference according to the first modification frequency and the second modification frequency, wherein the frequency ratio equals the second modification frequency divided by the first modification frequency, and the frequency difference equals the second modification frequency minus the first modification frequency.
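The arithmetic of claim 14 is fully specified: the ratio is the second frequency divided by the first, and the difference is the second minus the first. Sketched directly, with a dict standing in for the phrase substitution table (the table keys and counts are illustrative):

```python
# Hypothetical phrase substitution table: (source phrase, replacement) -> count
substitution_table = {
    ("there", "their"): 500,             # first modification frequency
    ("there house", "their house"): 80,  # second modification frequency
}

first = substitution_table[("there", "their")]
second = substitution_table[("there house", "their house")]

frequency_ratio = second / first       # claim 14: second divided by first
frequency_difference = second - first  # claim 14: second minus first
assert frequency_ratio == 0.16 and frequency_difference == -420
```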
15. The apparatus according to claim 13, wherein the relative feature information obtaining unit is specifically configured to:
obtaining a semantic similarity between the target segment and the original segment; and/or
obtaining a semantic similarity between the combination of the target segment and the context segment and the combination of the original segment and the context segment.
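The semantic similarity of claim 15 is commonly computed as cosine similarity between segment embeddings, though the patent does not fix the embedding method. A minimal sketch with hand-made vectors (the vectors are illustrative stand-ins for real segment embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

original_vec = [0.9, 0.1, 0.3]   # embedding of the original segment
target_vec = [0.8, 0.2, 0.35]    # embedding of the target segment
similarity = cosine_similarity(original_vec, target_vec)
assert 0.9 < similarity <= 1.0   # near-synonymous segments score close to 1
```

The same function applies unchanged to the combined segment-plus-context embeddings the claim also mentions.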
16. The apparatus according to any one of claims 12 to 15, wherein the relative feature information obtaining unit is further configured to perform at least one of the following:
acquiring proper noun features of the original segment and the target segment respectively according to a preset proper noun library; and
acquiring pinyin edit distance features between the target segment and the original segment.
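The pinyin edit distance feature of claim 16 compares the romanizations of the two segments, which catches Chinese homophone errors that character-level edit distance misses. A sketch, with a toy pinyin table standing in for a real pinyin converter (a production system would use a full conversion library):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via a two-row dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Toy pinyin table covering only the example characters (illustrative).
PINYIN = {"试": "shi", "世": "shi", "纪": "ji", "界": "jie"}

def pinyin_edit_distance(original: str, target: str) -> int:
    to_pinyin = lambda s: "".join(PINYIN.get(ch, ch) for ch in s)
    return levenshtein(to_pinyin(original), to_pinyin(target))

# A homophone error: 试纪 -> 世纪 differs by one character but has
# identical pinyin, so the pinyin distance is 0.
assert pinyin_edit_distance("试纪", "世纪") == 0
```

A low pinyin distance between original and target segments signals that the original text likely contains a phonetic typing error, which supports accepting the correction.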
17. The apparatus according to claim 11, wherein the determining unit is specifically configured to:
presuming, according to the feedback information, whether the user accepts replacing the original segment with the target segment in the error-corrected text; and
if the user is presumed to accept, setting the ideal score of the target segment to 1; if the user is presumed not to accept, setting the ideal score of the target segment to 0.
18. The apparatus according to claim 11, wherein the training unit is specifically configured to:
inputting the relative feature information into the segment scoring model to obtain a prediction score from the segment scoring model;
comparing the prediction score with the ideal score;
if the prediction score is smaller than the ideal score, adjusting parameters of the segment scoring model so that the prediction score output by the segment scoring model increases;
if the prediction score is greater than the ideal score, adjusting parameters of the segment scoring model so that the prediction score output by the segment scoring model decreases.
19. A computer device, the device comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-9.
20. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN201711159880.7A 2017-11-20 2017-11-20 Text error correction method and device based on artificial intelligence and computer readable medium Active CN108052499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711159880.7A CN108052499B (en) 2017-11-20 2017-11-20 Text error correction method and device based on artificial intelligence and computer readable medium


Publications (2)

Publication Number Publication Date
CN108052499A CN108052499A (en) 2018-05-18
CN108052499B true CN108052499B (en) 2021-06-11

Family

ID=62118964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711159880.7A Active CN108052499B (en) 2017-11-20 2017-11-20 Text error correction method and device based on artificial intelligence and computer readable medium

Country Status (1)

Country Link
CN (1) CN108052499B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831212B (en) * 2018-06-28 2020-10-23 深圳语易教育科技有限公司 Auxiliary device and method for oral teaching
CN109032375B (en) * 2018-06-29 2022-07-19 北京百度网讯科技有限公司 Candidate text sorting method, device, equipment and storage medium
CN109766538B (en) * 2018-11-21 2023-12-15 北京捷通华声科技股份有限公司 Text error correction method and device, electronic equipment and storage medium
CN111339755A (en) * 2018-11-30 2020-06-26 中国移动通信集团浙江有限公司 Automatic error correction method and device for office data
CN109376362A (en) * 2018-11-30 2019-02-22 武汉斗鱼网络科技有限公司 A kind of the determination method and relevant device of corrected text
CN110399607B (en) * 2019-06-04 2023-04-07 深思考人工智能机器人科技(北京)有限公司 Pinyin-based dialog system text error correction system and method
CN112733529B (en) * 2019-10-28 2023-09-29 阿里巴巴集团控股有限公司 Text error correction method and device
CN111160013B (en) * 2019-12-30 2023-11-24 北京百度网讯科技有限公司 Text error correction method and device
CN111832288B (en) * 2020-07-27 2023-09-29 网易有道信息技术(北京)有限公司 Text correction method and device, electronic equipment and storage medium
CN112541342B (en) * 2020-12-08 2022-07-22 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and storage medium
CN113159035B (en) * 2021-05-10 2022-06-07 北京世纪好未来教育科技有限公司 Image processing method, device, equipment and storage medium
CN114328798B (en) * 2021-11-09 2024-02-23 腾讯科技(深圳)有限公司 Processing method, device, equipment, storage medium and program product for searching text

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003083858A1 (en) * 2002-03-28 2003-10-09 Koninklijke Philips Electronics N.V. Time domain watermarking of multimedia signals
EP1593049A1 (en) * 2003-02-11 2005-11-09 Telstra Corporation Limited System for predicting speech recognition accuracy and development for a dialog system
CN101866336A (en) * 2009-04-14 2010-10-20 华为技术有限公司 Methods, devices and systems for obtaining evaluation unit and establishing syntactic path dictionary
CN104915264A (en) * 2015-05-29 2015-09-16 北京搜狗科技发展有限公司 Input error-correction method and device
CN105068661A (en) * 2015-09-07 2015-11-18 百度在线网络技术(北京)有限公司 Man-machine interaction method and system based on artificial intelligence
CN105374356A (en) * 2014-08-29 2016-03-02 株式会社理光 Speech recognition method, speech assessment method, speech recognition system, and speech assessment system
CN106528597A (en) * 2016-09-23 2017-03-22 百度在线网络技术(北京)有限公司 POI (Point Of Interest) labeling method and device
CN106598939A (en) * 2016-10-21 2017-04-26 北京三快在线科技有限公司 Method and device for text error correction, server and storage medium
CN107133209A (en) * 2017-03-29 2017-09-05 北京百度网讯科技有限公司 Comment generation method and device, equipment and computer-readable recording medium based on artificial intelligence

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3899290B2 (en) * 2002-06-10 2007-03-28 富士通株式会社 Sender identification method, program, apparatus and recording medium
CN107239446B (en) * 2017-05-27 2019-12-03 中国矿业大学 A kind of intelligence relationship extracting method based on neural network Yu attention mechanism
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Vector representation of non-standard spelling using dynamic time warping and a denoising autoencoder"; Mehdi Ben Lazreg; 2017 IEEE Congress on Evolutionary Computation; pp. 1-4 *

Also Published As

Publication number Publication date
CN108052499A (en) 2018-05-18

Similar Documents

Publication Publication Date Title
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
CN108091328B (en) Speech recognition error correction method and device based on artificial intelligence and readable medium
CN110489760B (en) Text automatic correction method and device based on deep neural network
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
CN106534548B (en) Voice error correction method and device
CN109840287A (en) A kind of cross-module state information retrieval method neural network based and device
CN106202153A (en) The spelling error correction method of a kind of ES search engine and system
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN108304375A (en) A kind of information identifying method and its equipment, storage medium, terminal
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113948066B (en) Error correction method, system, storage medium and device for real-time translation text
CN115357719B (en) Power audit text classification method and device based on improved BERT model
Chen et al. A study of language modeling for Chinese spelling check
CN112347241A (en) Abstract extraction method, device, equipment and storage medium
CN112447172B (en) Quality improvement method and device for voice recognition text
CN115034218A (en) Chinese grammar error diagnosis method based on multi-stage training and editing level voting
CN112257456A (en) Text editing technology-based training method and device for text generation model
CN113705207A (en) Grammar error recognition method and device
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
Naowarat et al. Reducing spelling inconsistencies in code-switching ASR using contextualized CTC loss
CN116562240A (en) Text generation method, computer device and computer storage medium
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
CN113128224B (en) Chinese error correction method, device, equipment and readable storage medium
CN114330375A (en) Term translation method and system based on fixed paradigm
CN114757203A (en) Chinese sentence simplification method and system based on contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant