CN110929514A - Text proofreading method and device, computer readable storage medium and electronic equipment - Google Patents

Text proofreading method and device, computer readable storage medium and electronic equipment

Info

Publication number
CN110929514A
Authority
CN
China
Prior art keywords
error
word
words
error correction
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911144534.0A
Other languages
Chinese (zh)
Other versions
CN110929514B (en)
Inventor
苏海波
苏萌
刘译璟
姚震
檀玉飞
黄伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baifendian Information Science & Technology Co Ltd
Original Assignee
Beijing Baifendian Information Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baifendian Information Science & Technology Co Ltd filed Critical Beijing Baifendian Information Science & Technology Co Ltd
Priority to CN201911144534.0A priority Critical patent/CN110929514B/en
Publication of CN110929514A publication Critical patent/CN110929514A/en
Application granted granted Critical
Publication of CN110929514B publication Critical patent/CN110929514B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure relates to a text proofreading method and device, a computer-readable storage medium and an electronic device. The method comprises the following steps: determining error correction information of each sentence in the text to be proofread, wherein the error correction information comprises error words and at least one error-correction word corresponding to each error word; for each error word, respectively determining a first co-occurrence frequency and a second co-occurrence frequency of the error word with its preceding and following words in a preset corpus; acquiring semantic features for each error-correction word corresponding to the error word; and judging whether the error-correction word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features. Judging the correctness of the error-correction words can improve the accuracy of text proofreading. Moreover, when judging this correctness, the collocation of the preceding and following words and the contextual semantic features are considered together, which ensures the precision of the judgment and further improves the proofreading accuracy. In addition, the proofreading work becomes intelligent and automated, which relieves the pressure of manual proofreading, improves working efficiency and reduces labor cost.

Description

Text proofreading method and device, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text proofreading method and apparatus, a computer-readable storage medium, and an electronic device.
Background
In text processing, input, editing and typesetting are already handled by fairly mature computer application systems, but the intermediate step of text proofreading still depends mainly on manual work, and has become a bottleneck that restricts development and working efficiency in fields such as news, publishing and office printing. Manual text proofreading is time-consuming and labor-intensive, and its accuracy is difficult to guarantee.
To address these problems, the current mainstream approach adopts an N-gram model to detect errors in a text and give error correction suggestions, but this method only considers the collocation of preceding and following words, so its text proofreading accuracy is low.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a text proofreading method, a text proofreading apparatus, a computer-readable storage medium, and an electronic device.
In order to achieve the above object, according to a first aspect of embodiments of the present disclosure, there is provided a text proofing method, including:
determining error correction information of each sentence in a text to be corrected, wherein the error correction information comprises error words and at least one error correction word corresponding to the error words;
for each error word, respectively determining a first co-occurrence frequency of the error word and its preceding word in a preset corpus and a second co-occurrence frequency of the error word and its following word in the preset corpus;
for each error-correction word corresponding to the error word, acquiring semantic features of the error word and the error-correction word;
and judging whether the error-correcting words are correct or not at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
Optionally, the determining whether the error correction word is correct according to at least the first co-occurrence frequency, the second co-occurrence frequency, and the semantic feature includes:
and inputting at least the first co-occurrence frequency, the second co-occurrence frequency and the semantic features into a preset xgboost model to judge whether the error-correcting word is correct.
Optionally, the method further comprises:
marking words consisting of a single character among the error words and the error-correction words as 1, and marking words consisting of multiple characters among the error words and the error-correction words as 0;
the determining whether the error-correcting word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features includes:
and judging whether the error-correction word is correct according to the marks of the error word and the error-correction word, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
Optionally, the obtaining semantic features of the error word and the error correction word includes:
replacing the error words in the initial sentence to which the error words belong with the error correction words to obtain error correction sentences;
acquiring, through a Bert model, a first vector A = (A1, A2, …, Am) corresponding to the initial sentence and a second vector B = (B1, B2, …, Bn) corresponding to the error-correction sentence, respectively, wherein m and n are the number of characters contained in the initial sentence and in the error-correction sentence, respectively, Ai is a first score for characterizing the reasonableness of the ith character appearing in the initial sentence, i = 1, 2, …, m, and Bj is a second score for characterizing the reasonableness of the jth character appearing in the error-correction sentence, j = 1, 2, …, n;
and determining a first difference value between the average value of each second score in the second vector and the average value of each first score in the first vector as the semantic features of the error word and the error correction word.
Optionally, the obtaining semantic features of the error word and the error correction word includes:
replacing the error words in the initial sentence to which the error words belong with the error correction words to obtain error correction sentences;
acquiring, through a Bert model, a first vector A = (A1, A2, …, Am) corresponding to the initial sentence and a second vector B = (B1, B2, …, Bn) corresponding to the error-correction sentence, respectively, wherein m and n are the number of characters contained in the initial sentence and in the error-correction sentence, respectively, Ai is a first score for characterizing the reasonableness of the ith character appearing in the initial sentence, i = 1, 2, …, m, and Bj is a second score for characterizing the reasonableness of the jth character appearing in the error-correction sentence, j = 1, 2, …, n;
transforming the first vector and the second vector in turn through each preset transformation function of a plurality of preset transformation functions to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector;
for the third vector and the fourth vector obtained through each preset transformation function, respectively calculating a second difference between the average of the third scores in the third vector and the average of the fourth scores in the fourth vector;
determining a plurality of the second differences as semantic features of the error word and the error correction word.
Optionally, the method further comprises:
marking a maximum value of the plurality of second difference values as 1, and marking second difference values except for the maximum value of the plurality of second difference values as 0;
the determining whether the error-correcting word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features includes:
and judging whether the error-correcting word is correct or not according to the mark of the second difference value, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
Optionally, before the step of determining error correction information of each sentence in the text to be collated, the method further includes:
preprocessing the text to be corrected to obtain a new text to be corrected;
the determining of the error correction information of each sentence in the text to be corrected includes:
and determining the error correction information of each sentence in the new text to be corrected.
According to a second aspect of the embodiments of the present disclosure, there is provided a text proofing apparatus, the apparatus including:
a first determining module, configured to determine error correction information of each sentence in a text to be proofread, wherein the error correction information comprises error words and at least one error-correction word corresponding to the error words;
a second determining module, configured to determine, for each error word determined by the first determining module, a first frequency of co-occurrence of the error word and a preceding word of the error word in a preset corpus and a second frequency of co-occurrence of the error word and a following word of the error word in the preset corpus, respectively;
an obtaining module, configured to acquire, for each error-correction word corresponding to the error word determined by the first determining module, semantic features of the error word and the error-correction word;
and the judging module is used for judging whether the error-correcting word is correct or not at least according to the first co-occurrence frequency and the second co-occurrence frequency determined by the second determining module and the semantic features acquired by the acquiring module.
According to a third aspect of embodiments of the present disclosure, there is provided a computer readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps of the method provided by the first aspect of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
In the technical scheme, first, the error words existing in each sentence of the text to be proofread and at least one error-correction word corresponding to each error word are determined; then, for each error word, the first co-occurrence frequency and the second co-occurrence frequency of the error word with its preceding and following words are respectively determined, and for each error-correction word corresponding to the error word, the corresponding semantic features are acquired; finally, the correctness of the error-correction word is judged at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features. Because the correctness of the error-correction words is further judged after the error words and the corresponding error-correction words are obtained, the accuracy of text proofreading can be improved. Moreover, when judging this correctness, the collocation of the preceding and following words is considered together with the contextual semantic features of the words, which ensures the precision of the judgment and further improves the proofreading accuracy. In addition, this text proofreading method makes the proofreading work intelligent and automated, relieving the pressure of manual proofreading, improving working efficiency and reducing labor cost.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 illustrates a flow diagram of a method of text proofing according to an exemplary embodiment.
FIG. 2A is a flow diagram illustrating a method of obtaining semantic features in accordance with an exemplary embodiment.
FIG. 2B is a flow diagram illustrating a method of obtaining semantic features in accordance with another exemplary embodiment.
FIG. 3 illustrates a flow diagram of a method of text proofing according to another exemplary embodiment.
FIG. 4 illustrates a flow diagram of a method of text proofing according to another exemplary embodiment.
FIG. 5 illustrates a flow diagram of a method of text proofing according to another exemplary embodiment.
FIG. 6 illustrates a block diagram of a text proofing apparatus according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
FIG. 1 illustrates a flow diagram of a method of text proofing according to an exemplary embodiment. As shown in fig. 1, the above method may include the following steps 101 to 104.
In step 101, error correction information of each sentence in the text to be corrected is determined, wherein the error correction information includes an error word and at least one error correction word corresponding to the error word.
In the present disclosure, the error correction information may include zero, one or more error words, and each error word may correspond to one or more error-correction words (i.e., error correction suggestions). For example, for the sentence "Today the qi day is really good.", the error correction information is: the error word "qi day" and its corresponding error-correction word "weather".
Also, in the present disclosure, the error correction information of each sentence in the text to be proofread may be determined in various ways. In one embodiment, the error correction information may be obtained through an N-gram model. Specifically, this can be achieved as follows: (1) after the text to be proofread is obtained, word segmentation, part-of-speech tagging and other operations are performed on the text; (2) based on a large corpus, an N-gram model is used to locate error words, detecting positions that may be erroneous; (3) the possibly erroneous positions are further checked with a part-of-speech N-gram method, and a position found to be unreasonable is judged to be an error and defined as an error word; (4) error correction processing is performed on the error words and corresponding error-correction words are provided. In this way, the error correction information of each sentence in the text to be proofread can be acquired.
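As an illustration of this detection idea, the following is a minimal sketch, assuming a pre-computed table of bigram counts from the preset corpus; the function name, threshold and toy data are illustrative, not taken from the patent:

```python
from collections import Counter

def detect_suspects(words, bigram_counts, threshold=2):
    """Flag word positions whose (preceding word, word) bigram is rare in the corpus."""
    suspects = []
    for i in range(1, len(words)):
        # A bigram that (almost) never occurs in the corpus marks a possibly erroneous position.
        if bigram_counts[(words[i - 1], words[i])] < threshold:
            suspects.append(i)
    return suspects

bigram_counts = Counter({("today", "weather"): 120,
                         ("weather", "really"): 95,
                         ("really", "good"): 80})
print(detect_suspects(["today", "qi-day", "really", "good"], bigram_counts))  # [1, 2]
```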
In another embodiment, the error correction information may be obtained by reverse error correction (i.e., character matching). Specifically, after the text to be proofread is obtained, word segmentation may be performed on it; then, each segmented token is matched against every word in a preset lexicon to obtain a set of matching degrees. If the maximum of these matching degrees is smaller than a preset matching-degree threshold, the token is determined to be an error word, and the lexicon words corresponding to the one or more highest matching degrees (sorted from large to small) are determined to be the error-correction words corresponding to that error word. In this way, the error correction information of each sentence in the text to be proofread can be acquired.
In addition, the preset lexicon may be a database composed of words; it may be generated from a large corpus or taken from an existing lexicon, which is not specifically limited in the present disclosure. The preset matching-degree threshold may be a value set by the user or a default empirical value, which is likewise not specifically limited in this disclosure.
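A minimal sketch of this reverse error correction, assuming the matching degree is a character-level similarity ratio (the patent does not fix a particular measure); the names and threshold are illustrative:

```python
import difflib

def reverse_correct(token, lexicon, match_threshold=0.95, top_k=2):
    """Match a segmented token against every lexicon word; if even the best match
    falls below the threshold, treat the token as an error word and return the
    top-ranked lexicon words as error-correction candidates."""
    scored = sorted(
        ((difflib.SequenceMatcher(None, token, word).ratio(), word) for word in lexicon),
        reverse=True,
    )
    if scored[0][0] < match_threshold:
        return [word for _, word in scored[:top_k]]  # error-correction words, best first
    return []  # the token matched well enough, so it is not an error word

lexicon = ["weather", "today", "really"]
print(reverse_correct("weathr", lexicon)[0])  # 'weather'
```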
In another embodiment, the error correction information may be obtained both through the N-gram model and through reverse error correction, after which the two sets of error correction information are deduplicated and merged, and the merged result is used as the error correction information of each sentence in the text to be proofread. This improves the coverage of error-word detection and thus the accuracy of text proofreading.
In step 102, for each error word, a first co-occurrence frequency of the error word and its preceding word in a preset corpus and a second co-occurrence frequency of the error word and its following word in the preset corpus are respectively determined.
In the present disclosure, the preset corpus may be a database composed of text sentences. Further, so that the determined first and second co-occurrence frequencies better fit the text to be proofread, the preset corpus is preferably composed of text sentences from the field to which the text to be proofread belongs. The embodiments of the present application do not limit how the preset corpus is generated.
The preceding word of an error word is the word located immediately before the error word in the initial sentence to which it belongs, and the following word is the word located immediately after it. For example, for the sentence "Today the qi day is really good.", the error correction information is: the error word "qi day" and the corresponding error-correction word "weather", where "today" is the preceding word of the error word "qi day" and "really" is its following word.
The first and second co-occurrence frequencies may be determined in various ways. In one embodiment, they may be determined by direct statistics: for each error word, count the number of times the error word and its preceding word appear adjacently in the preset corpus (the first co-occurrence frequency), and the number of times the error word and its following word appear adjacently in the preset corpus (the second co-occurrence frequency).
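A minimal counting sketch, under the assumption that the preset corpus is available as tokenized sentences (names and data are illustrative):

```python
def cooccurrence_counts(error_word, prev_word, next_word, corpus_sentences):
    """Count how often (preceding word, error word) and (error word, following word)
    appear as adjacent pairs in the preset corpus."""
    first = second = 0
    for sent in corpus_sentences:
        for a, b in zip(sent, sent[1:]):
            if (a, b) == (prev_word, error_word):
                first += 1   # first co-occurrence frequency
            if (a, b) == (error_word, next_word):
                second += 1  # second co-occurrence frequency
    return first, second

corpus = [["today", "weather", "really", "good"], ["weather", "really", "bad"]]
print(cooccurrence_counts("weather", "today", "really", corpus))  # (1, 2)
```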
In another embodiment, the first co-occurrence frequency and the second co-occurrence frequency can be determined by an N-gram model. Since the specific manner of determining the first and second co-occurrence frequencies through the N-gram model is well known to those skilled in the art, it is not described in detail in this disclosure.
In step 103, for each error correction word corresponding to the error word, semantic features of the error word and the error correction word are obtained.
In this disclosure, after the error correction information is obtained in step 101, corresponding semantic features may be obtained for each error correction word corresponding to each error word in the error correction information. In particular, the semantic features described above may be obtained in a number of ways. In one embodiment, this may be achieved through steps 1031 to 1033 shown in fig. 2A.
In step 1031, the error words in the initial sentence to which the error words belong are replaced with error correction words to obtain an error-corrected sentence.
Illustratively, for the above sentence "Today the qi day is really good.", the error correction information is: the error word "qi day" and its corresponding error-correction word "weather", and the initial sentence to which the error word "qi day" belongs is "Today the qi day is really good.". Thus, replacing the error word "qi day" in the initial sentence with the error-correction word "weather" yields the error-correction sentence "Today the weather is really good.".
In step 1032, a first vector corresponding to the initial sentence and a second vector corresponding to the error-corrected sentence are obtained through the Bert model, respectively.
In the present disclosure, the first vector A = (A1, A2, …, Am) and the second vector B = (B1, B2, …, Bn), where m and n are the number of characters (including punctuation marks) contained in the initial sentence and in the error-correction sentence, respectively; Ai is a first score characterizing the reasonableness of the ith character appearing in the initial sentence, i = 1, 2, …, m, and Bj is a second score characterizing the reasonableness of the jth character appearing in the error-correction sentence, j = 1, 2, …, n. The number of characters in the error word and in the corresponding error-correction word may be the same or different, so m and n may be equal or unequal.
Bert (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations, and the model can be downloaded and used free of charge. The model can be used to extract high-quality linguistic features from the sentences of the text to be proofread. In the present disclosure, the first vector A = (A1, A2, …, Am) corresponding to the initial sentence and the second vector B = (B1, B2, …, Bn) corresponding to the error-correction sentence may be obtained through the Bert model, respectively.
In step 1033, a first difference between the average of each second score in the second vector and the average of each first score in the first vector is determined as the semantic feature of the error word and the error correction word.
In this disclosure, the size of the first difference may reflect the quality of the corresponding error-correcting word, where when the first difference is greater than 0, it indicates that the corresponding error-correcting word is relatively good, and if the first difference is less than or equal to 0, it indicates that the corresponding error-correcting word is relatively poor.
After the first vector A = (A1, A2, …, Am) corresponding to the initial sentence to which the error word belongs and the second vector B = (B1, B2, …, Bn) corresponding to the error-correction sentence are obtained in step 1032, the average of the first scores, (A1 + A2 + … + Am)/m, and the average of the second scores, (B1 + B2 + … + Bn)/n, can be calculated respectively; then, the first difference between the average of the second scores and the average of the first scores, that is, (B1 + B2 + … + Bn)/n − (A1 + A2 + … + Am)/m, is determined as the semantic feature of the error word and the error-correction word.
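As a sketch of this first-difference computation, assume the per-character reasonableness scores have already been produced by the Bert model (the toy numbers below are invented for illustration):

```python
def first_difference(initial_scores, corrected_scores):
    """Mean reasonableness score of the error-correction sentence minus that of
    the initial sentence: (1/n) * sum(B_j) - (1/m) * sum(A_i)."""
    mean_a = sum(initial_scores) / len(initial_scores)
    mean_b = sum(corrected_scores) / len(corrected_scores)
    return mean_b - mean_a

A = [0.91, 0.12, 0.08, 0.88, 0.95, 0.90]  # initial sentence containing the error word
B = [0.92, 0.89, 0.93, 0.90, 0.96, 0.91]  # error-correction sentence
print(first_difference(A, B) > 0)  # True: the correction reads as more reasonable
```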
In another embodiment, the method can be implemented by steps 1031, 1032, 1034, 1035 and 1036 shown in fig. 2B.
In step 1031, the error words in the initial sentence to which the error words belong are replaced with error correction words to obtain an error-corrected sentence.
In step 1032, a first vector corresponding to the initial sentence and a second vector corresponding to the error-corrected sentence are obtained through the Bert model, respectively.
In step 1034, the first vector and the second vector are transformed sequentially through each of a plurality of preset transformation functions, so as to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector.
In this disclosure, the preset transformation functions are used to transform the first vector and the second vector to obtain the third vectors and the fourth vectors. The plurality of preset transformation functions may be preset by the user or may be defaults (for example, the ith preset transformation function may be Yi(X) = X + Ci, where X is the first vector or the second vector, Yi(X) is the corresponding third vector or fourth vector, and Ci is the constant vector corresponding to the ith preset transformation function, the constant vectors Ci of different preset transformation functions being different); this is not specifically limited in the present disclosure.
After the first vectors of the initial sentences to which the error words belong and the second vectors corresponding to the corresponding error correction sentences are obtained in step 1032, each first vector can be transformed by using a plurality of preset transformation functions, so that a plurality of third vectors are obtained; meanwhile, each second vector may be transformed by using the plurality of preset transformation functions, so as to obtain a plurality of fourth vectors.
In step 1035, for each of the third and fourth vectors transformed by the predetermined transformation function, a second difference between the average of the third scores in the third vector and the average of the fourth scores in the fourth vector is calculated.
In this disclosure, after the plurality of third vectors corresponding to the initial sentence to which the error word belongs and the plurality of fourth vectors corresponding to the corresponding error-corrected sentence are obtained through the step 1034, a second difference value between an average value of each third score in the third vector and an average value of each fourth score in the fourth vector may be calculated for each of the third vectors and the fourth vectors obtained by transforming each preset transformation function. Thereby, a plurality of second difference values can be obtained.
It should be noted that the number of the preset transformation functions may be set by a user, or may be a default (for example, 6), and is not specifically limited in this disclosure.
In step 1036, a plurality of second differences are determined as semantic features of the error words and the error correction words.
After the second differences are obtained in step 1035, the second differences can be determined as semantic features of the error word and the error correction word.
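A sketch of steps 1034 to 1036, under the assumption that the preset transformation functions are arbitrary element-wise callables (the disclosure's own example is the additive form Yi(X) = X + Ci; the functions chosen below are merely illustrative):

```python
import math

def second_differences(initial_scores, corrected_scores, transforms):
    """For each preset transformation, transform both score vectors, then take the
    difference between the mean of the fourth (transformed corrected) vector and
    the mean of the third (transformed initial) vector."""
    diffs = []
    for f in transforms:
        third = [f(x) for x in initial_scores]    # third vector
        fourth = [f(x) for x in corrected_scores]  # fourth vector
        diffs.append(sum(fourth) / len(fourth) - sum(third) / len(third))
    return diffs

transforms = [lambda x: x, math.exp, lambda x: x ** 2]  # illustrative choices
print(second_differences([0.91, 0.12, 0.88], [0.92, 0.89, 0.93], transforms))
```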
Returning to fig. 1, in step 104, it is determined whether the error-correcting word is correct according to at least the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features.
In this disclosure, for each error word, after the first co-occurrence frequency of the error word and its preceding word in the preset corpus and the second co-occurrence frequency of the error word and its following word in the preset corpus are obtained in step 102, and the corresponding semantic features are obtained in step 103, these may be input into a preset xgboost (eXtreme Gradient Boosting) model to judge whether the error-correction word corresponding to the error word is correct. The xgboost model here is a binary classification model whose output is 0 or 1, where 0 characterizes that the error-correction word is incorrect and 1 characterizes that it is correct.
In addition, in the present disclosure, the construction of the xgboost model may be performed based on the manually collated text. Firstly, in the same manner as the step 102, for each sample error word obtained by manual proofreading, obtaining a third co-occurrence frequency of the sample error word and a previous word thereof in a preset corpus and a fourth co-occurrence frequency of the sample error word and a next word thereof in the preset corpus; meanwhile, corresponding reference semantic features are obtained in the same manner as in the step 103; and then, inputting at least the third co-occurrence frequency, the fourth co-occurrence frequency and the reference semantic features into an initial xgboost model for training to obtain the preset xgboost model. The specific construction manner of the xgboost model is well known to those skilled in the art, and therefore, will not be described in detail in the present disclosure.
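A hedged sketch of training and using such a model with the xgboost Python package (the feature layout, hyper-parameters and toy data are assumptions, not specified by the patent):

```python
import numpy as np
from xgboost import XGBClassifier

# Each row: [first co-occurrence frequency, second co-occurrence frequency,
# semantic feature]; labels come from manually proofread samples
# (1 = the error-correction word is correct, 0 = incorrect).
X_train = np.array([[120.0, 95.0, 0.31], [2.0, 1.0, -0.12], [80.0, 60.0, 0.22]])
y_train = np.array([1, 0, 1])

model = XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)

# Judge a new (error word, error-correction word) pair from its features.
print(model.predict(np.array([[110.0, 90.0, 0.28]])))  # e.g. [1]
```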
In addition, the xgboost model may be optimized, for example, the model may be optimized by modifying the model parameters according to the training and testing effects.
In one embodiment, whether the error-correcting word is correct or not can be determined according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features. Specifically, the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features may be input into the preset xgboost model, so as to determine whether the error-correcting word corresponding to the error word is correct according to the output of the preset xgboost model. That is, when the output of the preset xgboost model is 0, it indicates that the error-correcting word corresponding to the error word is incorrect; and when the output of the preset xgboost model is 1, indicating that the error-correcting word corresponding to the error word is correct.
In another embodiment, in order to further improve the accuracy of text proofreading, when determining whether the error-correcting word is correct, the number information of the characters included in the error-correcting word and the error-correcting word may be considered in addition to the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features. Specifically, as shown in fig. 3, the method may further include the following step 105.
In step 105, words consisting of a single character among the error words and the error-correction words are marked as 1, and words consisting of multiple characters are marked as 0.
After the error correction information is obtained in step 101, it may be determined whether each error word or error-correction word is a single-character word; if so, it is marked as 1, otherwise it is marked as 0. Then, whether the error-correction word is correct can be determined jointly according to the marks of the error word and the error-correction word, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features. Specifically, these marks together with the first co-occurrence frequency, the second co-occurrence frequency and the semantic features may be input into the preset xgboost model, and whether the error-correction word corresponding to the error word is correct is determined according to the model's output.
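A tiny sketch of this character-count flag and how it might extend the feature vector (the feature ordering is an assumption):

```python
def char_count_flag(word):
    """1 for a single-character word, 0 for a multi-character word (step 105)."""
    return 1 if len(word) == 1 else 0

first_freq, second_freq, semantic_feature = 120, 95, 0.31
features = [first_freq, second_freq, semantic_feature,
            char_count_flag("气"), char_count_flag("天气")]  # error word, correction
print(features)  # [120, 95, 0.31, 1, 0]
```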
It should be noted that, the step 105 may be executed before the step 102 or the step 103, may be executed after the step 102 or the step 103, may be executed simultaneously with the step 102 or the step 103, and is not particularly limited in this disclosure.
In another embodiment, in order to further improve the accuracy of text proofreading, when determining whether the error-correcting word is correct, the second difference information may be considered in addition to the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features. Specifically, as shown in fig. 4, the method may further include the following step 106.
In step 106, the maximum value of the plurality of second difference values is marked as 1, and the second difference values except the maximum value of the plurality of second difference values are marked as 0.
After the plurality of second difference values are obtained in step 103 (i.e., step 1035), the maximum value of the plurality of second difference values may be marked as 1, and the second difference values except the maximum value of the plurality of second difference values may be marked as 0. In this way, whether the error-correcting word is correct or not can be determined jointly according to the mark of each second difference value, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features. Specifically, the flag, the first co-occurrence frequency, the second co-occurrence frequency, and the semantic feature of each second difference may be input into the preset xgboost model, so as to determine whether the error-correcting word corresponding to the error word is correct according to the output of the preset xgboost model.
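A one-function sketch of this marking step (step 106):

```python
def mark_max(second_diffs):
    """Mark the largest second difference as 1 and all others as 0."""
    best = max(range(len(second_diffs)), key=lambda i: second_diffs[i])
    return [1 if i == best else 0 for i in range(len(second_diffs))]

print(mark_max([0.03, 0.11, 0.05]))  # [0, 1, 0]
```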
It should be noted that step 103 may be executed before step 102, may be executed after step 102, may be executed simultaneously with step 102, and is not particularly limited in this disclosure.
In the technical scheme, firstly, error words existing in each sentence in the text to be corrected and at least one error correction word corresponding to each error word are determined; then, respectively determining a first co-occurrence frequency and a second co-occurrence frequency of the error word and a preceding word and a following word thereof for each error word, and simultaneously acquiring corresponding semantic features for each error correction word corresponding to the error word; and finally, judging the correctness of the error-correcting words at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features. After the error words and the corresponding error correction words are obtained, the correctness of the error correction words is further judged, and therefore the accuracy of text proofreading can be improved. In addition, when the correctness of the error-correcting words is judged, the matching problem of the front words and the rear words is considered, and the context semantic features of the words are combined, so that the correctness judgment precision of the error-correcting words can be ensured, and the accuracy of text proofreading is further improved. In addition, the text proofreading method enables proofreading work to be intelligent and automatic, reduces pressure of manual proofreading, improves working efficiency and reduces labor cost.
In addition, in order to further improve the accuracy of text proofreading, before error correction information is obtained according to the text to be proofread, preprocessing can be performed on the text to be proofread. Specifically, as shown in fig. 5, before the step 101, the method may further include the following step 107.
In step 107, the text to be collated is preprocessed to obtain a new text to be collated.
In the present disclosure, this preprocessing may include filtering out illegal characters (e.g., spaces, empty lines, etc.). After the text to be proofread is preprocessed, a new text to be proofread is obtained, and error correction information is then acquired based on this new text, that is, the above step 101 is performed.
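A minimal preprocessing sketch, assuming "illegal characters" means spaces and empty lines as in the example above (the rule set and sample text are illustrative):

```python
import re

def preprocess(text):
    """Strip spaces and drop empty lines from the text to be proofread (step 107)."""
    lines = (re.sub(r"[ \t]+", "", line) for line in text.splitlines())
    return "\n".join(line for line in lines if line)

print(preprocess("第一段 文本。\n\n第二段 文本。\n"))  # two cleaned, non-empty lines
```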
FIG. 6 is a block diagram illustrating a text proofing apparatus according to an example embodiment. Referring to fig. 6, the apparatus 600 may include: a first determining module 601, configured to determine error correction information of each sentence in a text to be proofread, where the error correction information includes an error word and at least one error-correction word corresponding to the error word; a second determining module 602, configured to determine, for each error word determined by the first determining module 601, a first co-occurrence frequency of the error word and its preceding word in a preset corpus and a second co-occurrence frequency of the error word and its following word in the preset corpus, respectively; an obtaining module 603, configured to acquire, for each error-correction word corresponding to the error word determined by the first determining module 601, semantic features of the error word and the error-correction word; and a judging module 604, configured to judge whether the error-correction word is correct at least according to the first co-occurrence frequency and the second co-occurrence frequency determined by the second determining module 602 and the semantic features acquired by the obtaining module 603.
Optionally, the judging module 604 is configured to input at least the first co-occurrence frequency, the second co-occurrence frequency and the semantic features into a preset xgboost model to judge whether the error-correction word is correct.
Optionally, the apparatus 600 may further include: a first marking module, configured to mark words consisting of a single character among the error words and the error-correction words as 1, and words consisting of multiple characters among the error words and the error-correction words as 0;
the judging module 604 is configured to judge whether the error-correction word is correct according to the marks of the error words and the error-correction words obtained by the first marking module, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
Optionally, the obtaining module 603 includes: a replacing submodule, configured to replace the error word in the initial sentence to which the error word belongs with the error-correction word to obtain an error-correction sentence; a vector obtaining submodule, configured to obtain, through a Bert model, a first vector A = (A1, A2, …, Am) corresponding to the initial sentence and a second vector B = (B1, B2, …, Bn) corresponding to the error-correction sentence, respectively, where m and n are the number of characters contained in the initial sentence and in the error-correction sentence, respectively, Ai is a first score characterizing the reasonableness of the ith character appearing in the initial sentence, i = 1, 2, …, m, and Bj is a second score characterizing the reasonableness of the jth character appearing in the error-correction sentence, j = 1, 2, …, n; and a first determining submodule, configured to determine a first difference between the average of the second scores in the second vector and the average of the first scores in the first vector as the semantic feature of the error word and the error-correction word.
Optionally, the obtaining module 603 includes: a replacing submodule, configured to replace the error word in the initial sentence to which the error word belongs with the error-correction word to obtain an error-correction sentence; a vector obtaining submodule, configured to obtain, through a Bert model, a first vector A = (A1, A2, …, Am) corresponding to the initial sentence and a second vector B = (B1, B2, …, Bn) corresponding to the error-correction sentence, respectively, where m and n are the number of characters contained in the initial sentence and in the error-correction sentence, respectively, Ai is a first score characterizing the reasonableness of the ith character appearing in the initial sentence, i = 1, 2, …, m, and Bj is a second score characterizing the reasonableness of the jth character appearing in the error-correction sentence, j = 1, 2, …, n; a transformation submodule, configured to transform the first vector and the second vector in turn through each preset transformation function of a plurality of preset transformation functions, to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector; a calculation submodule, configured to calculate, for the third vector and the fourth vector obtained through each preset transformation function, a second difference between the average of the third scores in the third vector and the average of the fourth scores in the fourth vector; and a second determining submodule, configured to determine the plurality of second differences as the semantic features of the error word and the error-correction word.
Optionally, the apparatus 600 may further include: a second marking module, configured to mark a maximum value of the second difference values obtained by the calculation sub-module as 1, and mark a second difference value, excluding the maximum value, of the second difference values as 0;
the judging module 604 is configured to judge whether the error-correction word is correct according to the marks of the second differences obtained by the second marking module, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
Optionally, the apparatus 600 may further include: the preprocessing module is used for preprocessing the text to be corrected before the first determining module determines the error correction information of each sentence in the text to be corrected, so as to obtain a new text to be corrected;
and the first determining module is used for determining the error correction information of each sentence in the new text to be corrected.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-mentioned text proofing method provided by the present disclosure.
Fig. 7 is a block diagram illustrating an electronic device 700 in accordance with an example embodiment. As shown in fig. 7, the electronic device 700 may include: a processor 701 and a memory 702. The electronic device 700 may also include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.
The processor 701 is configured to control the overall operation of the electronic device 700 to complete all or part of the steps of the text proofreading method. The memory 702 is used to store various types of data to support operation on the electronic device 700, such as instructions for any application or method operating on the electronic device 700 and application-related data such as contact data, sent and received messages, pictures, audio, video, and the like. The memory 702 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 703 may include a screen and an audio component, where the screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may further be stored in the memory 702 or sent through the communication component 705. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, such as a keyboard, a mouse or buttons, where the buttons may be virtual or physical. The communication component 705 is used for wired or wireless communication between the electronic device 700 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G, 4G, 5G, NB-IoT, eMTC, etc., or a combination of one or more of them, which is not limited here; accordingly, the communication component 705 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic Device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the text proofing method described above.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the text proofing method described above is also provided. For example, the computer readable storage medium may be the memory 702 described above that includes program instructions executable by the processor 701 of the electronic device 700 to perform the text proofing method described above.
Fig. 8 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment. For example, the electronic device 800 may be provided as a server. Referring to fig. 8, an electronic device 800 includes a processor 822, which may be one or more in number, and a memory 832 for storing computer programs executable by the processor 822. The computer programs stored in memory 832 may include one or more modules that each correspond to a set of instructions. Further, the processor 822 may be configured to execute the computer program to perform the text proofing method described above.
Additionally, the electronic device 800 may also include a power component 826 and a communication component 850; the power component 826 may be configured to perform power management of the electronic device 800, and the communication component 850 may be configured to enable communication, e.g., wired or wireless communication, of the electronic device 800. The electronic device 800 may also include input/output (I/O) interfaces 858. The electronic device 800 may operate based on an operating system stored in the memory 832, such as Windows Server™, Mac OS X™, Unix™, Linux™, and the like.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the text proofing method described above is also provided. For example, the computer readable storage medium may be the memory 832 including program instructions described above that are executable by the processor 822 of the electronic device 800 to perform the text proofing method described above.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the text proofing method described above when executed by the programmable apparatus.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (10)

1. A method of text proofing, the method comprising:
determining error correction information of each sentence in a text to be corrected, wherein the error correction information comprises error words and at least one error correction word corresponding to the error words;
for each error word, respectively determining a first co-occurrence frequency of the error word and its preceding word in a preset corpus and a second co-occurrence frequency of the error word and its following word in the preset corpus;
for each error-correction word corresponding to the error word, acquiring semantic features of the error word and the error-correction word;
and judging whether the error-correcting words are correct or not at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
2. The method according to claim 1, wherein said determining whether the error correction word is correct according to at least the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature comprises:
and inputting at least the first co-occurrence frequency, the second co-occurrence frequency and the semantic features into a preset xgboost model to judge whether the error-correcting word is correct.
3. The method of claim 1, further comprising:
marking words consisting of a single character among the error words and the error-correction words as 1, and marking words consisting of multiple characters among the error words and the error-correction words as 0;
the determining whether the error-correcting word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features includes:
and judging whether the error-correction word is correct according to the marks of the error word and the error-correction word, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
4. The method according to any one of claims 1 to 3, wherein the obtaining semantic features of the error word and the error correction word comprises:
replacing the error words in the initial sentence to which the error words belong with the error correction words to obtain error correction sentences;
acquiring, through a Bert model, a first vector A = (A1, A2, …, Am) corresponding to the initial sentence and a second vector B = (B1, B2, …, Bn) corresponding to the error-correction sentence, respectively, wherein m and n are the number of characters contained in the initial sentence and in the error-correction sentence, respectively, Ai is a first score for characterizing the reasonableness of the ith character appearing in the initial sentence, i = 1, 2, …, m, and Bj is a second score for characterizing the reasonableness of the jth character appearing in the error-correction sentence, j = 1, 2, …, n;
and determining a first difference value between the average value of each second score in the second vector and the average value of each first score in the first vector as the semantic features of the error word and the error correction word.
5. The method according to any one of claims 1 to 3, wherein the obtaining semantic features of the error word and the error correction word comprises:
replacing the error words in the initial sentence to which the error words belong with the error correction words to obtain error correction sentences;
acquiring, through a Bert model, a first vector A = (A1, A2, …, Am) corresponding to the initial sentence and a second vector B = (B1, B2, …, Bn) corresponding to the error-correction sentence, respectively, wherein m and n are the number of characters contained in the initial sentence and in the error-correction sentence, respectively, Ai is a first score for characterizing the reasonableness of the ith character appearing in the initial sentence, i = 1, 2, …, m, and Bj is a second score for characterizing the reasonableness of the jth character appearing in the error-correction sentence, j = 1, 2, …, n;
transforming the first vector and the second vector in turn through each preset transformation function of a plurality of preset transformation functions to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector;
for the third vector and the fourth vector obtained through each preset transformation function, respectively calculating a second difference between the average of the third scores in the third vector and the average of the fourth scores in the fourth vector;
determining a plurality of the second differences as semantic features of the error word and the error correction word.
6. The method of claim 5, further comprising:
marking a maximum value of the plurality of second difference values as 1, and marking second difference values except for the maximum value of the plurality of second difference values as 0;
the determining whether the error-correcting word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features includes:
and judging whether the error-correcting word is correct or not according to the mark of the second difference value, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
7. The method according to any one of claims 1 to 3, characterized in that, before the determining of the error correction information of each sentence in the text to be corrected, the method further comprises:
preprocessing the text to be corrected to obtain a new text to be corrected;
the determining of the error correction information of each sentence in the text to be corrected includes:
and determining the error correction information of each sentence in the new text to be corrected.
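Claim 7 does not enumerate the preprocessing steps; one plausible minimal pass (an assumption, not the patented method) is Unicode normalization plus whitespace cleanup:

```python
import re
import unicodedata

def preprocess(text):
    """One plausible preprocessing pass: unify full-width/half-width forms
    and collapse stray whitespace so downstream sentence splitting and
    error detection see clean input."""
    text = unicodedata.normalize("NFKC", text)   # e.g. full-width -> half-width
    return re.sub(r"[ \t]+", " ", text).strip()
```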
8. A text proofreading apparatus, characterized in that the apparatus comprises:
a first determining module, configured to determine error correction information of each sentence in a text to be corrected, wherein the error correction information comprises an error word and at least one error correction word corresponding to the error word;
a second determining module, configured to determine, for each error word determined by the first determining module, a first frequency of co-occurrence of the error word and a preceding word of the error word in a preset corpus and a second frequency of co-occurrence of the error word and a following word of the error word in the preset corpus, respectively;
an obtaining module, configured to obtain, for each error correction word corresponding to the error word determined by the first determining module, a semantic feature of the error correction word in a corresponding sentence;
and a judging module, configured to judge whether the error correction word is correct at least according to the first co-occurrence frequency and the second co-occurrence frequency determined by the second determining module and the semantic features acquired by the obtaining module.
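Read as software, claim 8's four modules map onto a simple interface. The skeleton below is illustrative; the names and signatures are assumptions, and the bodies are placeholders because the claim defines responsibilities rather than algorithms:

```python
class TextProofreadingApparatus:
    """Illustrative skeleton mirroring the four modules of claim 8."""

    def determine_error_correction_info(self, text):
        """First determining module: per-sentence error words plus their
        candidate error correction words."""
        raise NotImplementedError

    def cooccurrence_frequencies(self, error_word, sentence):
        """Second determining module: first and second co-occurrence
        frequencies against the preceding and following words."""
        raise NotImplementedError

    def semantic_features(self, error_word, candidate, sentence):
        """Obtaining module: semantic features of the candidate in context."""
        raise NotImplementedError

    def judge(self, frequencies, features):
        """Judging module: decide whether the candidate is correct."""
        raise NotImplementedError
```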
9. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 7.
CN201911144534.0A 2019-11-20 2019-11-20 Text collation method, text collation apparatus, computer-readable storage medium, and electronic device Active CN110929514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911144534.0A CN110929514B (en) 2019-11-20 2019-11-20 Text collation method, text collation apparatus, computer-readable storage medium, and electronic device

Publications (2)

Publication Number Publication Date
CN110929514A true CN110929514A (en) 2020-03-27
CN110929514B CN110929514B (en) 2023-06-27

Family

ID=69851442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911144534.0A Active CN110929514B (en) 2019-11-20 2019-11-20 Text collation method, text collation apparatus, computer-readable storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN110929514B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6618697B1 (en) * 1999-05-14 2003-09-09 Justsystem Corporation Method for rule-based correction of spelling and grammar errors
JP2011227876A (en) * 2010-03-31 2011-11-10 Fujitsu Ltd Calibration support device and calibration support program
US20150161096A1 (en) * 2012-08-23 2015-06-11 Sk Telecom Co., Ltd. Method for detecting grammatical errors, error detection device for same and computer-readable recording medium having method recorded thereon
KR20150017507A (en) * 2013-08-07 2015-02-17 부산대학교 산학협력단 Method and system for Context-sensitive Spelling Correction Rules using Korean WordNet
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
CN105550173A (en) * 2016-02-06 2016-05-04 北京京东尚科信息技术有限公司 Text correction method and device
US20170293604A1 (en) * 2016-04-11 2017-10-12 Zhuhai Kingsoft Office Software Co., Ltd Methods and apparatus for spell checking
CN109948144A (en) * 2019-01-29 2019-06-28 汕头大学 A method for intelligent processing of teacher talk based on classroom teaching situations
CN110210029A (en) * 2019-05-30 2019-09-06 浙江远传信息技术股份有限公司 Speech text error correction method, system, equipment and medium based on vertical field
CN110348020A (en) * 2019-07-17 2019-10-18 杭州嘉云数据科技有限公司 English word spelling error correction method, apparatus, device, and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YE-CHAN AHN: "Natural Language Contents Evaluation System for Detecting Fake News using Deep Learning", 2019 16th International Joint Conference on Computer Science and Software Engineering (JCSSE) *
李蓉: "A spelling proofreading system for Chinese text output by OCR" (in Chinese), 《中文信息学报》 (Journal of Chinese Information Processing), no. 05, 15 September 2009 (2009-09-15) *
陶永才: "Research on Chinese word collocation feature extraction and text proofreading" (in Chinese), 《小型微型计算机系统》 (Journal of Chinese Computer Systems), 30 November 2018 (2018-11-30) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241061A (en) * 2021-05-17 2021-08-10 北京字跳网络技术有限公司 Method and device for processing voice recognition result, electronic equipment and storage medium
CN113241061B (en) * 2021-05-17 2023-03-10 北京字跳网络技术有限公司 Method and device for processing voice recognition result, electronic equipment and storage medium
CN114611524A (en) * 2022-02-08 2022-06-10 马上消费金融股份有限公司 Text error correction method and device, electronic equipment and storage medium
CN114611524B (en) * 2022-02-08 2023-11-17 马上消费金融股份有限公司 Text error correction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110929514B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
US9069753B2 (en) Determining proximity measurements indicating respective intended inputs
CN102156551B (en) Method and system for correcting error of word input
CN112016310A (en) Text error correction method, system, device and readable storage medium
CN111523306A (en) Text error correction method, device and system
CN112149406A (en) Chinese text error correction method and system
CN111310440B (en) Text error correction method, device and system
CN106503231B (en) Search method and device based on artificial intelligence
CN111428474A (en) Language model-based error correction method, device, equipment and storage medium
KR20190133624A (en) A method and system for context sensitive spelling error correction using realtime candidate generation
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
CN111767717B (en) Grammar error correction method, device and equipment for Indonesia and storage medium
CN111460793A (en) Error correction method, device, equipment and storage medium
CN115438650B (en) Contract text error correction method, system, equipment and medium fusing multi-source characteristics
CN110826301B (en) Punctuation mark adding method, punctuation mark adding system, mobile terminal and storage medium
CN110929514A (en) Text proofreading method and device, computer readable storage medium and electronic equipment
CN111401012A (en) Text error correction method, electronic device and computer readable storage medium
CN113225612B (en) Subtitle generating method, device, computer readable storage medium and electronic equipment
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
CN113255329A (en) English text spelling error correction method and device, storage medium and electronic equipment
CN112949290A (en) Text error correction method and device and communication equipment
CN109670040B (en) Writing assistance method and device, storage medium and computer equipment
Hladek et al. Unsupervised spelling correction for Slovak
CN114492396A (en) Text error correction method for automobile proper nouns and readable storage medium
CN112530406A (en) Voice synthesis method, voice synthesis device and intelligent equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No.27, 1st floor, building 14, Haidian Road, zhongjiancai District, Beijing 100096

Applicant after: Beijing PERCENT Technology Group Co.,Ltd.

Address before: No.27, 1st floor, building 14, Haidian Road, zhongjiancai District, Beijing 100096

Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant