CN110929514B - Text collation method, text collation apparatus, computer-readable storage medium, and electronic device

Info

Publication number
CN110929514B
Authority
CN
China
Prior art keywords
word
error
error correction
vector
occurrence frequency
Prior art date
Legal status
Active
Application number
CN201911144534.0A
Other languages
Chinese (zh)
Other versions
CN110929514A
Inventor
苏海波
苏萌
刘译璟
姚震
檀玉飞
黄伟
Current Assignee
Beijing Percent Technology Group Co ltd
Original Assignee
Beijing Percent Technology Group Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Percent Technology Group Co ltd
Priority to CN201911144534.0A
Publication of CN110929514A
Application granted
Publication of CN110929514B
Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Document Processing Apparatus
  • Machine Translation

Abstract

The present disclosure relates to a text collation method, apparatus, computer-readable storage medium, and electronic device. The method comprises the following steps: determining error correction information of each sentence in the text to be collated, wherein the error correction information comprises an error word and at least one error correction word corresponding to the error word; for each error word, determining a first co-occurrence frequency of the error word with its preceding word and a second co-occurrence frequency of the error word with its following word in a preset corpus; for each error correction word corresponding to the error word, acquiring semantic features; and judging whether the error correction word is correct according to at least the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features. Because the correctness of the error correction word is further judged, the accuracy of text collation can be improved. When determining the correctness of an error correction word, the collocation of the preceding and following words and the contextual semantic features are considered together, which ensures the accuracy of that determination and further improves the accuracy of text collation. In addition, the collation work is made intelligent and automatic, which relieves the pressure of manual proofreading, improves working efficiency, and reduces labor cost.

Description

Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a text collation method, apparatus, computer readable storage medium, and electronic device.
Background
In text processing, quite mature computer application systems exist for input, editing, and typesetting, but the intermediate link of text proofreading still depends mainly on manual processing, and this link has become a bottleneck that restricts the development of the whole industry and affects working efficiency in fields such as news, publishing, and office documentation. Manual text proofreading is time-consuming and labor-intensive, and its accuracy is difficult to guarantee.
In view of these problems, at the present stage an N-gram model is mainly adopted to detect errors in a text and give error correction suggestions, but this method only considers the collocation of preceding and following words, so the accuracy of text collation is low.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a text collation method, apparatus, computer-readable storage medium and electronic device.
To achieve the above object, according to a first aspect of embodiments of the present disclosure, there is provided a text collation method, the method comprising:
determining error correction information of each sentence in a text to be corrected, wherein the error correction information comprises error words and at least one error correction word corresponding to the error words;
For each error word, respectively determining a first co-occurrence frequency of the error word and a preceding word of the error word in a preset corpus and a second co-occurrence frequency of the error word and a following word of the error word in the preset corpus;
for each error correction word corresponding to the error word, acquiring semantic features of the error word and the error correction word;
and judging whether the error correction word is correct or not at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature.
Optionally, the determining whether the error correction word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature includes:
at least inputting the first co-occurrence frequency, the second co-occurrence frequency and the semantic features into a preset xgboost model to judge whether the error correction words are correct or not.
Optionally, the method further comprises:
marking, among the error word and the error correction word, each word consisting of a single character as 1 and each word consisting of multiple characters as 0;
the determining whether the error correction word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature comprises:
And judging whether the error correction word is correct or not according to the mark of the error word, the mark of the error correction word, the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature.
Optionally, the acquiring the semantic features of the error word and the error correction word includes:
replacing the error word in the initial sentence to which the error word belongs with the error correction word to obtain an error correction sentence;
respectively acquiring, through a Bert model, a first vector A = (A_1, A_2, …, A_m) corresponding to the initial sentence and a second vector B = (B_1, B_2, …, B_n) corresponding to the error correction sentence, wherein m and n are respectively the number of characters contained in the initial sentence and the number of characters contained in the error correction sentence, A_i (i = 1, 2, …, m) is a first score characterizing the rationality of the i-th character of the initial sentence appearing in the initial sentence, and B_j (j = 1, 2, …, n) is a second score characterizing the rationality of the j-th character of the error correction sentence appearing in the error correction sentence;
and determining a first difference value between the average value of each second score in the second vector and the average value of each first score in the first vector as the semantic features of the error word and the error correction word.
Optionally, the acquiring the semantic features of the error word and the error correction word includes:
replacing the error word in the initial sentence to which the error word belongs with the error correction word to obtain an error correction sentence;
respectively acquiring, through a Bert model, a first vector A = (A_1, A_2, …, A_m) corresponding to the initial sentence and a second vector B = (B_1, B_2, …, B_n) corresponding to the error correction sentence, wherein m and n are respectively the number of characters contained in the initial sentence and the number of characters contained in the error correction sentence, A_i (i = 1, 2, …, m) is a first score characterizing the rationality of the i-th character of the initial sentence appearing in the initial sentence, and B_j (j = 1, 2, …, n) is a second score characterizing the rationality of the j-th character of the error correction sentence appearing in the error correction sentence;
transforming the first vector and the second vector through each preset transformation function in a plurality of preset transformation functions in sequence to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector;
respectively calculating a second difference value of the average value of each third score in the third vector and the average value of each fourth score in the fourth vector aiming at the third vector and the fourth vector obtained by the transformation of each preset transformation function;
And determining a plurality of second difference values as semantic features of the error word and the error correction word.
Optionally, the method further comprises:
marking a maximum value of the plurality of second difference values as 1, and marking a second difference value except for the maximum value of the plurality of second difference values as 0;
the determining whether the error correction word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature comprises:
and judging whether the error correction word is correct or not according to the mark of the second difference value, the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature.
Optionally, before the step of determining the error correction information of each sentence in the text to be collated, the method further includes:
preprocessing the text to be checked to obtain a new text to be checked;
the determining the error correction information of each sentence in the text to be checked comprises the following steps:
and determining error correction information of each sentence in the new text to be checked.
According to a second aspect of embodiments of the present disclosure, there is provided a text collation apparatus, the apparatus comprising:
the first determining module is used for determining error correction information of each sentence in the text to be corrected, wherein the error correction information comprises error words and at least one error correction word corresponding to the error words;
The second determining module is used for respectively determining a first co-occurrence frequency of the error word and a preceding word of the error word in a preset corpus and a second co-occurrence frequency of the error word and a following word of the error word in the preset corpus for each error word determined by the first determining module;
the acquisition module is used for acquiring, for each error correction word corresponding to the error word determined by the first determining module, the semantic features of the error correction word in the corresponding sentence;
and the judging module is used for judging whether the error correction word is correct according to at least the first co-occurrence frequency and the second co-occurrence frequency determined by the second determining module and the semantic features acquired by the acquisition module.
According to a third aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method provided by the first aspect of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
In the above technical scheme, first, the error words in each sentence of the text to be collated and at least one error correction word corresponding to each error word are determined; then, for each error word, the first co-occurrence frequency and the second co-occurrence frequency of the error word with its preceding word and its following word can be respectively determined, and meanwhile, for each error correction word corresponding to the error word, the corresponding semantic features are obtained; finally, the correctness of the error correction word is judged according to at least the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features. After the error word and the corresponding error correction word are obtained, the correctness of the error correction word is further judged, so that the accuracy of text collation can be improved. In addition, when the correctness of the error correction word is determined, the collocation of the preceding and following words is considered and the contextual semantic features of the words are combined, which ensures the accuracy of the correctness determination and further improves the accuracy of text collation. Moreover, the text proofreading method makes the proofreading work intelligent and automatic, relieves the pressure of manual proofreading, improves working efficiency, and reduces labor cost.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification; they illustrate the disclosure and together with the description serve to explain it, without limiting the disclosure. In the drawings:
FIG. 1 illustrates a flow chart of a text collation method according to an exemplary embodiment.
FIG. 2A is a flowchart illustrating a method of acquiring semantic features according to an example embodiment.
FIG. 2B is a flowchart illustrating a method of acquiring semantic features according to another exemplary embodiment.
FIG. 3 illustrates a flow chart of a text collation method according to another exemplary embodiment.
FIG. 4 illustrates a flow chart of a text collation method according to another exemplary embodiment.
FIG. 5 illustrates a flow chart of a text collation method according to another exemplary embodiment.
FIG. 6 is a block diagram illustrating a text collation apparatus according to an example embodiment.
FIG. 7 is a block diagram of an electronic device, according to an example embodiment.
FIG. 8 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the disclosure, are not intended to limit the disclosure.
FIG. 1 illustrates a flow chart of a text collation method according to an exemplary embodiment. As shown in fig. 1, the above method may include the following steps 101 to 104.
In step 101, error correction information of each sentence in the text to be checked is determined, wherein the error correction information includes an error word and at least one error correction word corresponding to the error word.
In the present disclosure, the above error correction information may include zero, one, or a plurality of error words, and each error word may correspond to one or a plurality of error correction words (i.e., error correction suggestions). For example, for the sentence "今天气天真好。" (intended as "The weather today is really good.", with the characters of the word for "weather" transposed), the error correction information is: the error word "气天" and its corresponding error correction word "天气" ("weather").
Also, in the present disclosure, the error correction information of each sentence in the text to be collated may be determined in various ways. In one embodiment, the error correction information may be obtained through an N-gram model, for example as follows (a code sketch is given after this paragraph): (1) after the text to be collated is obtained, word segmentation, part-of-speech tagging, and other operations may be performed on the text; (2) an N-gram model built over a large corpus is used to locate error words, i.e., to detect the positions where errors are likely to occur; (3) the positions that are possibly erroneous are further checked by a part-of-speech N-gram method, and if a position is found to be unreasonable, it is determined to be erroneous and the word there is defined as an error word; (4) error correction processing is performed on the error words and the corresponding error correction words are given. In this way, the error correction information of each sentence in the text to be collated can be obtained.
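By way of illustration of step (2), the following minimal sketch flags positions whose bigram collocations with both neighbors are rare in the corpus; the counts, the threshold, and the function names are illustrative assumptions, as the patent does not fix them:

```python
from collections import Counter

def detect_suspect_positions(words, bigram_counts: Counter, threshold: int = 5):
    """Flag word positions whose collocations with both neighbors are rare
    in the corpus; these become candidate error words for further checks."""
    suspects = []
    for i, w in enumerate(words):
        left_ok = i == 0 or bigram_counts[(words[i - 1], w)] >= threshold
        right_ok = i == len(words) - 1 or bigram_counts[(w, words[i + 1])] >= threshold
        if not left_ok and not right_ok:
            suspects.append(i)
    return suspects

counts = Counter({("今天", "天气"): 120, ("天气", "真"): 80, ("真", "好"): 150})
suspects = detect_suspect_positions(["今天", "气天", "真", "好"], counts)  # -> [1]
```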
In another embodiment, the error correction information may be obtained by reverse error correction (i.e., character matching). Specifically, after the text to be collated is obtained, word segmentation may be performed on it; then, each word is matched against every word in a preset word library to obtain a plurality of matching degrees. If the maximum of these matching degrees is smaller than a preset matching-degree threshold, the word is determined to be an error word, and the words in the preset word library corresponding to the one or more highest matching degrees (with the matching degrees sorted from large to small) are determined to be the error correction words corresponding to that error word. In this way, the error correction information of each sentence in the text to be collated can be obtained.
The preset word library may be a database of words; it may be generated from a huge corpus or may be an existing word library, and is not specifically limited in this disclosure. The preset matching-degree threshold may be a value set by the user or a default empirical value, which is likewise not specifically limited in this disclosure. A sketch of this matching procedure follows.
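The following is a minimal sketch of the reverse error correction described above, assuming difflib's similarity ratio as the otherwise unspecified matching degree; the threshold and candidate count are illustrative:

```python
import difflib

def reverse_correct(word, lexicon, match_threshold=0.8, top_k=3):
    # Match the word against every entry in the preset word library and
    # sort the matching degrees from large to small.
    scored = sorted(
        ((difflib.SequenceMatcher(None, word, entry).ratio(), entry)
         for entry in lexicon),
        reverse=True,
    )
    if not scored or scored[0][0] >= match_threshold:
        return None  # the word is not judged to be an error word
    # Otherwise return the top-ranked entries as error correction words.
    return [entry for _, entry in scored[:top_k]]

lexicon = ["天气", "今天", "真好", "天真"]
candidates = reverse_correct("气天", lexicon)  # error word -> correction candidates
```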
In yet another embodiment, error correction information can be obtained both through the N-gram model and through reverse error correction; the two sets of error correction information are then deduplicated and merged, and the merged result is used as the error correction information of each sentence in the text to be collated. This improves the completeness of error word detection and thereby the accuracy of text collation.
In step 102, for each wrong word, a first co-occurrence frequency of the wrong word and a preceding word of the wrong word in a preset corpus and a second co-occurrence frequency of the wrong word and a following word of the wrong word in the preset corpus are respectively determined.
In the present disclosure, the preset corpus may be a database composed of text sentences. Further, so that the first co-occurrence frequency and the second co-occurrence frequency determined by the technical scheme of the embodiments of the present application better fit the text to be collated, the preset corpus is preferably composed of text sentences from the field to which the text to be collated belongs. It should be noted that the embodiments of the present application do not limit the way the preset corpus is generated.
The preceding word of an error word is the word located immediately before the error word in the initial sentence to which it belongs, and the following word is the word located immediately after it. For example, for the sentence "今天气天真好。" above, the error correction information is: the error word "气天" and the corresponding error correction word "天气", where "今天" ("today") is the preceding word of the error word "气天" and "真" ("really") is its following word.
The first co-occurrence frequency and the second co-occurrence frequency may be determined in various ways. In one embodiment, they may be determined by mathematical statistics: for each error word, the number of co-occurrences of the error word and its preceding word in the preset corpus (i.e., the first co-occurrence frequency) and the number of co-occurrences of the error word and its following word in the preset corpus (i.e., the second co-occurrence frequency) are counted separately.
In another embodiment, the first co-occurrence frequency and the second co-occurrence frequency may be determined through an N-gram model. Since the specific manner of determining them through an N-gram model is well known to those skilled in the art, it is not detailed in this disclosure. A sketch of the statistical counting variant follows.
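A minimal sketch of the mathematical-statistics variant above, assuming the preset corpus is pre-segmented into token lists; the names and the toy corpus are illustrative:

```python
def adjacent_cooccurrence(first, second, corpus_sentences):
    # Count how often `first` is immediately followed by `second`
    # across the pre-segmented sentences of the preset corpus.
    count = 0
    for tokens in corpus_sentences:
        count += sum(1 for a, b in zip(tokens, tokens[1:])
                     if a == first and b == second)
    return count

corpus = [["今天", "天气", "真", "好"], ["今天", "天气", "不错"]]  # toy corpus
# First co-occurrence frequency: preceding word followed by the error word.
first_freq = adjacent_cooccurrence("今天", "气天", corpus)
# Second co-occurrence frequency: error word followed by its following word.
second_freq = adjacent_cooccurrence("气天", "真", corpus)
```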
In step 103, for each error correction word corresponding to the error word, the semantic features of the error word and the error correction word are obtained.
In the present disclosure, after the error correction information is obtained through the above step 101, corresponding semantic features may be obtained for each error correction word corresponding to each error word in the error correction information. In particular, the above semantic features may be obtained in a variety of ways. In one embodiment, this may be achieved by steps 1031-1033 shown in FIG. 2A.
In step 1031, the error word in the initial sentence to which the error word belongs is replaced with the error correction word, so as to obtain an error correction sentence.
Illustratively, for the sentence "今天气天真好。", the error correction information is: the error word "气天" and its corresponding error correction word "天气", and the initial sentence to which the error word "气天" belongs is "今天气天真好。". Thus, the error word "气天" in the initial sentence is replaced by the error correction word "天气", giving the error correction sentence "今天天气真好。" ("The weather today is really good.").
In step 1032, a first vector corresponding to the initial sentence and a second vector corresponding to the error correction sentence are obtained by the Bert model, respectively.
In the present disclosure, the first vector is A = (A_1, A_2, …, A_m) and the second vector is B = (B_1, B_2, …, B_n), where m and n are respectively the number of characters (including punctuation marks) contained in the initial sentence and in the error correction sentence, A_i (i = 1, 2, …, m) is a first score characterizing the rationality of the i-th character of the initial sentence appearing in that initial sentence, and B_j (j = 1, 2, …, n) is a second score characterizing the rationality of the j-th character of the error correction sentence appearing in that error correction sentence. Note that the number of characters contained in an error word may be the same as or different from the number of characters contained in its corresponding error correction word, and likewise m and n may be equal or unequal.
Bert (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations, available as a freely downloadable model. The model can be used to extract high-quality language features from the sentences of the text to be collated. In the present disclosure, the first vector A = (A_1, A_2, …, A_m) corresponding to the initial sentence and the second vector B = (B_1, B_2, …, B_n) corresponding to the error correction sentence can each be obtained through the Bert model.
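The patent does not state how the per-character rationality scores are produced; a common realization is a masked-language-model pseudo-log-likelihood, sketched below with the Hugging Face transformers package (the model choice and function names are assumptions):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def rationality_scores(sentence: str) -> list:
    """Score each character by the log-probability Bert assigns to it
    when that position is masked, yielding the vector A or B."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    scores = []
    for pos in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        scores.append(torch.log_softmax(logits, dim=-1)[input_ids[pos]].item())
    return scores

A = rationality_scores("今天气天真好。")  # first vector, initial sentence
B = rationality_scores("今天天气真好。")  # second vector, error correction sentence
```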
In step 1033, a first difference between the average of the second scores in the second vector and the average of the first scores in the first vector is determined as the semantic feature of the error word and the error correction word.
In the present disclosure, the magnitude of the first difference may reflect the quality of the corresponding error correction word: when the first difference is greater than 0, the corresponding error correction word is relatively good; when it is less than or equal to 0, the corresponding error correction word is relatively poor.
After the first vector A = (A_1, A_2, …, A_m) corresponding to the initial sentence to which the error word belongs and the second vector B = (B_1, B_2, …, B_n) corresponding to the corresponding error correction sentence are obtained through step 1032, the average of the first scores A_i, namely (1/m)·(A_1 + A_2 + … + A_m), and the average of the second scores B_j, namely (1/n)·(B_1 + B_2 + … + B_n), can be calculated respectively; thereafter, the first difference between the average of the second scores and the average of the first scores, namely (1/n)·(B_1 + B_2 + … + B_n) − (1/m)·(A_1 + A_2 + … + A_m), is determined as the semantic feature of the error word and the error correction word.
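Given score vectors A and B as above, the first difference reduces to a one-liner (toy values shown; names are illustrative):

```python
def first_difference(A, B):
    # Mean of the second scores minus mean of the first scores; a value
    # greater than 0 suggests the error correction sentence reads more
    # naturally than the initial sentence.
    return sum(B) / len(B) - sum(A) / len(A)

A = [-2.1, -8.5, -7.9, -1.3, -0.8]  # toy first vector (initial sentence)
B = [-2.0, -1.1, -0.9, -1.2, -0.7]  # toy second vector (error correction sentence)
semantic_feature = first_difference(A, B)  # > 0 here: the correction looks better
```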
In another embodiment, this may be achieved through steps 1031, 1032, 1034, 1035, and 1036 shown in FIG. 2B.
In step 1031, the error word in the initial sentence to which the error word belongs is replaced with the error correction word, so as to obtain an error correction sentence.
In step 1032, a first vector corresponding to the initial sentence and a second vector corresponding to the error correction sentence are obtained by the Bert model, respectively.
In step 1034, the first vector and the second vector are transformed by each preset transformation function of the plurality of preset transformation functions in turn, so as to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector.
In the present disclosure, the above preset transformation functions may be used to transform the first vector and the second vector to obtain the corresponding third vectors and fourth vectors. The plurality of preset transformation functions may be preset by the user or defaulted (for example, the i-th preset transformation function of the plurality may be Y(X) = X + C_i, where X is the first vector or the second vector, Y(X) is the corresponding third vector or fourth vector, and C_i is the constant vector corresponding to the i-th preset transformation function, the constant vectors C_i of the respective preset transformation functions being different); this is not specifically limited in the present disclosure.
After the first vector of the initial sentence to which each error word belongs and the second vector corresponding to the corresponding error correction sentence are obtained through the step 1032, each first vector may be transformed by using a plurality of preset transformation functions, so as to obtain a plurality of third vectors; meanwhile, for each second vector, the second vector may be transformed by using the plurality of preset transformation functions, so as to obtain a plurality of fourth vectors.
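A minimal sketch of this step follows. The patent's example family is the additive shift Y(X) = X + C_i, but a uniform shift cancels when the two means are subtracted, so this sketch substitutes a scaling family Y(X) = c·X purely so that the second differences come out distinct; the constants and the count of six functions are assumptions:

```python
A = [-2.1, -8.5, -7.9, -1.3, -0.8]  # toy first vector (initial sentence scores)
B = [-2.0, -1.1, -0.9, -1.2, -0.7]  # toy second vector (error correction scores)

def make_scale_transforms(constants):
    # Illustrative preset transformation functions Y(X) = c * X, one per
    # constant; a scalar c stands in for the constant vector C_i.
    return [lambda X, c=c: [c * x for x in X] for c in constants]

transforms = make_scale_transforms([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])  # six functions

def second_differences(A, B, transforms):
    # For each preset transformation function, transform both vectors into
    # the third and fourth vectors and take the difference of their means
    # (same sign convention as the first difference, an assumption here).
    diffs = []
    for f in transforms:
        A_t, B_t = f(A), f(B)  # third vector, fourth vector
        diffs.append(sum(B_t) / len(B_t) - sum(A_t) / len(A_t))
    return diffs

semantic_features = second_differences(A, B, transforms)  # plurality of second differences
```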
In step 1035, for each of the third vector and the fourth vector obtained by transforming the preset transformation function, a second difference value between the average value of the third scores in the third vector and the average value of the fourth scores in the fourth vector is calculated.
In the present disclosure, after obtaining a plurality of third vectors corresponding to the initial sentence to which the error word belongs and a plurality of fourth vectors corresponding to the corresponding error correction sentence through the step 1034, the second difference between the average value of the third scores in the third vectors and the average value of the fourth scores in the fourth vectors may be calculated for the third vectors and the fourth vectors obtained by transforming each preset transformation function. Thus, a plurality of second differences may be obtained.
Note that the number of preset transformation functions may be set by the user or take a default value (for example, 6); it is not specifically limited in the present disclosure.
In step 1036, a plurality of second differences are determined as semantic features of the error word and the error correction word.
After the plurality of second differences are obtained through step 1035, the plurality of second differences may be determined as semantic features of the error word and the error correction word.
Returning to fig. 1, in step 104, it is determined whether the error correction word is correct based at least on the first co-occurrence frequency, the second co-occurrence frequency, and the semantic feature.
In the present disclosure, for each error word, after the first co-occurrence frequency of the error word with its preceding word in the preset corpus and the second co-occurrence frequency of the error word with its following word in the preset corpus are obtained through step 102, and the corresponding semantic features are obtained through step 103, they may be input into a preset xgboost (eXtreme Gradient Boosting) model to judge whether the error correction word corresponding to the error word is correct. The xgboost model is a classification model whose output is 0 or 1, where 0 may represent that the error correction word is wrong and 1 may represent that the error correction word is correct.
In addition, in the present disclosure, the xgboost model may be constructed on the basis of manually proofread text. First, in the same manner as in step 102, for each sample error word obtained by manual proofreading, a third co-occurrence frequency of the sample error word with its preceding word in the preset corpus and a fourth co-occurrence frequency of the sample error word with its following word in the preset corpus are obtained; meanwhile, the corresponding reference semantic features are acquired in the same manner as in step 103; then, at least the third co-occurrence frequency, the fourth co-occurrence frequency, and the reference semantic features are used as training samples and input into an initial xgboost model for training, thereby obtaining the preset xgboost model. Since the specific construction of an xgboost model is well known to those skilled in the art, it is not detailed in this disclosure.
In addition, the xgboost model can be optimized, for example, the model can be optimized by modifying model parameters according to training and testing effects.
In one embodiment, whether the error correction word is correct may be judged according to the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features. Specifically, the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features may be input into the preset xgboost model, and whether the error correction word corresponding to the error word is correct is determined according to the model's output: when the output is 0, the error correction word corresponding to the error word is wrong; when the output is 1, it is correct.
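A minimal sketch of training and applying such a classifier with the xgboost Python package; the feature order, toy training data, and hyperparameters are assumptions, since the patent fixes only which signals are fed in:

```python
import numpy as np
import xgboost as xgb

def build_features(first_freq, second_freq, semantic_features):
    # One feature row per (error word, error correction word) pair; further
    # features such as the marks described below can be appended.
    return np.asarray([first_freq, second_freq, *semantic_features], dtype=float)

# Training on manually proofread samples: label 1 = correction correct, 0 = wrong.
X_train = np.stack([build_features(3, 0, [0.42]), build_features(0, 0, [-0.10])])
y_train = np.array([1, 0])
clf = xgb.XGBClassifier(n_estimators=200, max_depth=4)  # illustrative parameters
clf.fit(X_train, y_train)

verdict = clf.predict(build_features(first_freq=5, second_freq=2,
                                     semantic_features=[0.3]).reshape(1, -1))[0]
```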
In another embodiment, in order to further improve the accuracy of text collation, when judging whether the error correction word is correct, the character-count information of the error word and of the error correction word may be considered in addition to the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features. Specifically, as shown in FIG. 3, the above method may further include the following step 105.
In step 105, each of the error word and the error correction word that consists of a single character is marked as 1, and each that consists of multiple characters is marked as 0.
After the error correction information is obtained through step 101, it may be determined whether each error word or error correction word consists of a single character; if so, it is marked as 1, otherwise as 0. In this way, whether the error correction word is correct can be judged jointly from the mark of the error word, the mark of the error correction word, the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features. Specifically, these marks may be input, together with the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features, into the preset xgboost model, and whether the error correction word corresponding to the error word is correct is determined according to the model's output.
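A one-line sketch of this character-count mark (the function name is illustrative):

```python
def char_count_mark(word: str) -> int:
    # 1 if the word is a single character, 0 if it has multiple characters.
    return 1 if len(word) == 1 else 0

marks = char_count_mark("气天"), char_count_mark("好")  # -> (0, 1)
```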
Note that step 105 may be performed before step 102 or step 103, after step 102 or step 103, or simultaneously with step 102 or step 103; this is not specifically limited in the present disclosure.
In still another embodiment, in order to further improve the accuracy of text proofreading, the second difference information may be considered in addition to the first co-occurrence frequency, the second co-occurrence frequency, and the semantic feature when determining whether the error correction word is correct. Specifically, as shown in fig. 4, the above method may further include the following step 106.
In step 106, the maximum value of the plurality of second difference values is marked as 1, and the second difference values other than the maximum value of the plurality of second difference values are marked as 0.
After the plurality of second differences are obtained through the above step 103 (i.e., step 1035), a maximum value of the plurality of second differences may be marked as 1, and second differences other than the above maximum value of the plurality of second differences may be marked as 0. Thus, it is possible to collectively determine whether or not the error correction word is correct based on the flag of each second difference value, the first co-occurrence frequency, the second co-occurrence frequency, and the semantic feature. Specifically, the marks of the second differences, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features may be input into the preset xgboost model, so as to determine whether the error correction word corresponding to the error word is correct according to the output of the preset xgboost model.
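A minimal sketch of this marking (names and toy values are illustrative):

```python
def mark_max(second_diffs):
    # Mark the maximum of the second differences as 1 and the rest as 0.
    m = max(second_diffs)
    return [1 if d == m else 0 for d in second_diffs]

diff_marks = mark_max([0.30, 0.60, 0.90, 1.20, 1.50, 1.80])  # -> [0, 0, 0, 0, 0, 1]
```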
Step 106 may be performed before step 102, after step 102, or simultaneously with step 102; this is not specifically limited in the present disclosure.
In the above technical scheme, first, the error words in each sentence of the text to be collated and at least one error correction word corresponding to each error word are determined; then, for each error word, the first co-occurrence frequency and the second co-occurrence frequency of the error word with its preceding word and its following word can be respectively determined, and meanwhile, for each error correction word corresponding to the error word, the corresponding semantic features are obtained; finally, the correctness of the error correction word is judged according to at least the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features. After the error word and the corresponding error correction word are obtained, the correctness of the error correction word is further judged, so that the accuracy of text collation can be improved. In addition, when the correctness of the error correction word is determined, the collocation of the preceding and following words is considered and the contextual semantic features of the words are combined, which ensures the accuracy of the correctness determination and further improves the accuracy of text collation. Moreover, the text proofreading method makes the proofreading work intelligent and automatic, relieves the pressure of manual proofreading, improves working efficiency, and reduces labor cost.
In addition, in order to further improve the accuracy of text proofreading, the text to be proofread may be preprocessed before error correction information is obtained according to the text. Specifically, as shown in fig. 5, the method may further include the following step 107 before the step 101.
In step 107, the text to be checked is preprocessed, and a new text to be checked is obtained.
In the present disclosure, the preprocessing may include filtering out illegal characters (e.g., spaces, blank lines, etc.). After preprocessing the text to be checked, a new text to be checked may be obtained, and then error correction information may be obtained based on the new text to be checked, that is, the above step 101 is performed.
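A minimal sketch of such preprocessing; the exact character classes treated as illegal are assumptions, since the patent names only spaces and blank lines as examples:

```python
import re

def preprocess(text: str) -> str:
    # Remove characters treated as illegal before error detection.
    text = re.sub(r"[ \t\u3000]+", "", text)  # spaces, tabs, full-width spaces
    text = re.sub(r"\n\s*\n+", "\n", text)    # blank lines
    return text

new_text = preprocess("今天气天真好。\n\n  下雨了。")  # -> "今天气天真好。\n下雨了。"
```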
FIG. 6 is a block diagram illustrating a text collation apparatus according to an example embodiment. Referring to fig. 6, the apparatus 600 may include: a first determining module 601, configured to determine error correction information of each sentence in the text to be collated, where the error correction information includes an error word and at least one error correction word corresponding to the error word; a second determining module 602, configured to determine, for each error word determined by the first determining module 601, a first co-occurrence frequency of the error word with its preceding word in a preset corpus and a second co-occurrence frequency of the error word with its following word in the preset corpus; an obtaining module 603, configured to obtain, for each error correction word corresponding to the error word determined by the first determining module 601, the semantic features of the error correction word in the corresponding sentence; and a judging module 604, configured to judge whether the error correction word is correct according to at least the first co-occurrence frequency and the second co-occurrence frequency determined by the second determining module 602 and the semantic features acquired by the obtaining module 603.
Optionally, the judging module 604 is configured to input at least the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features into a preset xgboost model to judge whether the error correction word is correct.
Optionally, the apparatus 600 may further include: the first marking module is used for marking the word belonging to the single word in the error word and the error correction word as 1, and marking the word belonging to the multi-word in the error word and the error correction word as 0;
the judging module 604 is configured to: judge whether the error correction word is correct according to the mark of the error word and the mark of the error correction word obtained by the first marking module, the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features.
Optionally, the obtaining module 603 includes: a replacing sub-module, configured to replace the error word in the initial sentence to which the error word belongs with the error correction word, so as to obtain an error correction sentence; a vector obtaining sub-module, configured to obtain, through a Bert model, a first vector A = (A_1, A_2, …, A_m) corresponding to the initial sentence and a second vector B = (B_1, B_2, …, B_n) corresponding to the error correction sentence, where m and n are respectively the number of characters contained in the initial sentence and the number of characters contained in the error correction sentence, A_i (i = 1, 2, …, m) is a first score characterizing the rationality of the i-th character of the initial sentence appearing in the initial sentence, and B_j (j = 1, 2, …, n) is a second score characterizing the rationality of the j-th character of the error correction sentence appearing in the error correction sentence; and a first determining sub-module, configured to determine a first difference between the average of the second scores in the second vector and the average of the first scores in the first vector as the semantic features of the error word and the error correction word.
Optionally, the obtaining module 603 includes: a replacing sub-module, configured to replace the error word in the initial sentence to which the error word belongs with the error correction word, so as to obtain an error correction sentence; a vector obtaining sub-module, configured to obtain, through a Bert model, a first vector A = (A_1, A_2, …, A_m) corresponding to the initial sentence and a second vector B = (B_1, B_2, …, B_n) corresponding to the error correction sentence, where m and n are respectively the number of characters contained in the initial sentence and the number of characters contained in the error correction sentence, A_i (i = 1, 2, …, m) is a first score characterizing the rationality of the i-th character of the initial sentence appearing in the initial sentence, and B_j (j = 1, 2, …, n) is a second score characterizing the rationality of the j-th character of the error correction sentence appearing in the error correction sentence; a transformation sub-module, configured to transform the first vector and the second vector through each preset transformation function of a plurality of preset transformation functions in turn, so as to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector; a calculation sub-module, configured to calculate, for the third vector and the fourth vector obtained through each preset transformation function, a second difference between the average of the third scores in the third vector and the average of the fourth scores in the fourth vector; and a second determining sub-module, configured to determine the plurality of second differences as the semantic features of the error word and the error correction word.
Optionally, the apparatus 600 may further include: the second marking module is used for marking the maximum value of the second differences obtained by the calculating submodule as 1, and marking the second differences except for the maximum value of the second differences as 0;
the judging module 604 is configured to: judge whether the error correction word is correct according to the marks of the second differences obtained by the second marking module, the first co-occurrence frequency, the second co-occurrence frequency, and the semantic features.
Optionally, the apparatus 600 may further include: the preprocessing module is used for preprocessing the text to be checked before the first determining module determines the error correction information of each sentence in the text to be checked, so as to obtain a new text to be checked;
the first determining module is used for determining error correction information of each sentence in the new text to be checked.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
The present disclosure also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described text collation method provided by the present disclosure.
Fig. 7 is a block diagram of an electronic device 700, according to an example embodiment. As shown in fig. 7, the electronic device 700 may include: a processor 701, a memory 702. The electronic device 700 may also include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.
Wherein the processor 701 is configured to control the overall operation of the electronic device 700 to perform all or part of the steps of the text collation method described above. The memory 702 is used to store various types of data to support operation on the electronic device 700; such data may include, for example, instructions for any application or method operating on the electronic device 700, as well as application-related data, such as contact data, messages sent and received, pictures, audio, video, and so forth. The memory 702 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 703 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 702 or transmitted through the communication component 705. The audio component further comprises at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, such as a keyboard, mouse, or buttons; these buttons may be virtual buttons or physical buttons. The communication component 705 is used for wired or wireless communication between the electronic device 700 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or others, or a combination of one or more of them, which is not limited herein. The corresponding communication component 705 may thus include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic device 700 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the text collation method described above.
In another exemplary embodiment, a computer readable storage medium is also provided, comprising program instructions which, when executed by a processor, implement the steps of the text collation method described above. For example, the computer readable storage medium may be the memory 702 including program instructions described above that are executable by the processor 701 of the electronic device 700 to perform the text collation method described above.
Fig. 8 is a block diagram of an electronic device 800, according to an example embodiment. For example, the electronic device 800 may be provided as a server. Referring to fig. 8, the electronic device 800 includes a processor 822, which may be one or more in number, and a memory 832 for storing computer programs executable by the processor 822. The computer program stored in memory 832 may include one or more modules each corresponding to a set of instructions. Further, the processor 822 may be configured to execute the computer program to perform the text collation method described above.
In addition, the electronic device 800 may further include a power supply component 826 and a communication component 850; the power supply component 826 may be configured to perform power management of the electronic device 800, and the communication component 850 may be configured to enable wired or wireless communication of the electronic device 800. The electronic device 800 may also include an input/output (I/O) interface 858. The electronic device 800 may operate based on an operating system stored in the memory 832, such as Windows Server™, Mac OS X™, Unix™, Linux™, and the like.
In another exemplary embodiment, a computer readable storage medium is also provided, comprising program instructions which, when executed by a processor, implement the steps of the text collation method described above. For example, the computer readable storage medium may be the memory 832 including program instructions described above that are executable by the processor 822 of the electronic device 800 to perform the text collation method described above.
In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described text collation method when executed by the programmable apparatus.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present disclosure within the scope of the technical concept of the present disclosure, and all the simple modifications belong to the protection scope of the present disclosure.
In addition, the specific features described in the above embodiments may be combined in any suitable manner without contradiction. The various possible combinations are not described further in this disclosure in order to avoid unnecessary repetition.
Moreover, any combination of the various embodiments of the present disclosure is possible as long as it does not depart from the spirit of the present disclosure, and such combinations should likewise be regarded as content disclosed by the present disclosure.

Claims (8)

1. A method of text collation, the method comprising:
determining error correction information of each sentence in a text to be corrected, wherein the error correction information comprises error words and at least one error correction word corresponding to the error words;
for each error word, respectively determining a first co-occurrence frequency of the error word and a preceding word of the error word in a preset corpus and a second co-occurrence frequency of the error word and a following word of the error word in the preset corpus;
For each error correction word corresponding to the error word, acquiring semantic features of the error word and the error correction word;
judging whether the error correction word is correct or not at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic features;
the obtaining the semantic features of the error word and the error correction word includes:
replacing the error word in the initial sentence to which the error word belongs with the error correction word to obtain an error correction sentence;
respectively acquiring, through a Bert model, a first vector A = (A_1, A_2, …, A_m) corresponding to the initial sentence and a second vector B = (B_1, B_2, …, B_n) corresponding to the error correction sentence, wherein m and n are respectively the number of characters contained in the initial sentence and the number of characters contained in the error correction sentence, A_i (i = 1, 2, …, m) is a first score characterizing the rationality of the i-th character of the initial sentence appearing in the initial sentence, and B_j (j = 1, 2, …, n) is a second score characterizing the rationality of the j-th character of the error correction sentence appearing in the error correction sentence;
one of the following:
determining a first difference value between the average value of each second score in the second vector and the average value of each first score in the first vector as the semantic features of the error word and the error correction word;
Transforming the first vector and the second vector through each preset transformation function in a plurality of preset transformation functions in sequence to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector; respectively calculating a second difference value of the average value of each third score in the third vector and the average value of each fourth score in the fourth vector aiming at the third vector and the fourth vector obtained by the transformation of each preset transformation function; and determining a plurality of second difference values as semantic features of the error word and the error correction word.
2. The method of claim 1, wherein the determining whether the error correction word is correct based at least on the first co-occurrence frequency, the second co-occurrence frequency, and the semantic feature comprises:
at least inputting the first co-occurrence frequency, the second co-occurrence frequency and the semantic features into a preset xgboost model to judge whether the error correction words are correct or not.
3. The method according to claim 1, wherein the method further comprises:
marking, among the error word and the error correction word, each word consisting of a single character as 1 and each word consisting of multiple characters as 0;
The determining whether the error correction word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature comprises:
and judging whether the error correction word is correct or not according to the mark of the error word, the mark of the error correction word, the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature.
4. The method according to claim 1, wherein in case the step of obtaining semantic features of the error word and the error correction word comprises the step of determining a plurality of the second differences as the semantic features of the error word and the error correction word, the method further comprises:
marking a maximum value of the plurality of second difference values as 1, and marking a second difference value except for the maximum value of the plurality of second difference values as 0;
the determining whether the error correction word is correct at least according to the first co-occurrence frequency, the second co-occurrence frequency and the semantic feature comprises:
and judging whether the error correction word is correct or not according to the marks of the second difference values, the first co-occurrence frequency, the second co-occurrence frequency and the semantic features.
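A sketch of the claim-4 marking; treating the resulting marks as extra feature columns appended to the model input is an assumption about the feature layout.

def max_difference_marks(second_differences):
    # Claim 4: the largest second difference value is marked 1, all others 0.
    largest = max(second_differences)
    return [1 if d == largest else 0 for d in second_differences]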
5. A method according to any one of claims 1-3, characterized in that before the step of determining error correction information for each sentence in the text to be collated, the method further comprises:
preprocessing the text to be collated to obtain a new text to be collated;
the determining the error correction information of each sentence in the text to be collated comprises the following steps:
and determining error correction information of each sentence in the new text to be collated.
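Claim 5 leaves the preprocessing unspecified, so the pass below is purely illustrative and not the patented procedure: normalize full-width characters, drop stray whitespace, and split the text into sentences before error correction.

import re
import unicodedata

def preprocess(text):
    text = unicodedata.normalize("NFKC", text)   # full-width -> half-width
    text = re.sub(r"\s+", "", text)              # remove stray whitespace
    # Split on Chinese end-of-sentence punctuation, keeping the marks.
    return [s for s in re.split(r"(?<=[。！？])", text) if s]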
6. A text collation apparatus, the apparatus comprising:
the first determining module is used for determining error correction information of each sentence in the text to be collated, wherein the error correction information comprises error words and at least one error correction word corresponding to the error words;
the second determining module is used for determining, for each error word determined by the first determining module, a first co-occurrence frequency of the error word and a preceding word of the error word in a preset corpus and a second co-occurrence frequency of the error word and a following word of the error word in the preset corpus respectively;
the acquisition module is used for acquiring, for each error correction word corresponding to the error word determined by the first determining module, the semantic features of the error correction word in the corresponding sentence;
the judging module is used for judging whether the error correction word is correct or not at least according to the first co-occurrence frequency and the second co-occurrence frequency determined by the second determining module and the semantic features acquired by the acquisition module;
wherein, the acquisition module includes:
a replacing sub-module, configured to replace the error word in the initial sentence to which the error word belongs with the error correction word, so as to obtain an error correction sentence;
a vector obtaining submodule, configured to respectively acquire, through a Bert model, a first vector A = (A₁, A₂, …, Aₘ) corresponding to the initial sentence and a second vector B = (B₁, B₂, …, Bₙ) corresponding to the error correction sentence, wherein m and n are respectively the number of characters contained in the initial sentence and the number of characters contained in the error correction sentence, Aᵢ is a first score for characterizing the rationality of the occurrence of the i-th character in the initial sentence, i = 1, 2, …, m, and Bⱼ is a second score for characterizing the rationality of the occurrence of the j-th character in the error correction sentence, j = 1, 2, …, n;
one of the following:
a first determining submodule, configured to determine a first difference value between the average value of the second scores in the second vector and the average value of the first scores in the first vector as the semantic features of the error word and the error correction word;
the transformation submodule is used for transforming the first vector and the second vector sequentially through each preset transformation function of a plurality of preset transformation functions to obtain a plurality of third vectors corresponding to the first vector and a plurality of fourth vectors corresponding to the second vector; the calculation submodule is used for calculating, for the third vector and the fourth vector obtained through each preset transformation function, a second difference value between the average value of the third scores in the third vector and the average value of the fourth scores in the fourth vector respectively; and the second determining submodule is used for determining the plurality of second difference values as the semantic features of the error word and the error correction word.
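The claims likewise do not define how the preset corpus yields the two co-occurrence frequencies; a simple bigram-counting sketch under that assumption follows, where the corpus format and word segmentation are themselves assumptions.

from collections import Counter

def build_bigram_counts(segmented_sentences):
    # Count adjacent-word (bigram) pairs over pre-segmented sentences.
    counts = Counter()
    for words in segmented_sentences:
        counts.update(zip(words, words[1:]))
    return counts

def cooccurrence_frequencies(bigrams, preceding, error_word, following):
    first = bigrams[(preceding, error_word)]     # error word with its preceding word
    second = bigrams[(error_word, following)]    # error word with its following word
    return first, second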
7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1-5.
8. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any one of claims 1-5.
CN201911144534.0A 2019-11-20 2019-11-20 Text collation method, text collation apparatus, computer-readable storage medium, and electronic device Active CN110929514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911144534.0A CN110929514B (en) 2019-11-20 2019-11-20 Text collation method, text collation apparatus, computer-readable storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911144534.0A CN110929514B (en) 2019-11-20 2019-11-20 Text collation method, text collation apparatus, computer-readable storage medium, and electronic device

Publications (2)

Publication Number Publication Date
CN110929514A CN110929514A (en) 2020-03-27
CN110929514B true CN110929514B (en) 2023-06-27

Family

ID=69851442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911144534.0A Active CN110929514B (en) 2019-11-20 2019-11-20 Text collation method, text collation apparatus, computer-readable storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN110929514B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241061B (en) * 2021-05-17 2023-03-10 北京字跳网络技术有限公司 Method and device for processing voice recognition result, electronic equipment and storage medium
CN114611524B (en) * 2022-02-08 2023-11-17 马上消费金融股份有限公司 Text error correction method and device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5673265B2 (en) * 2010-03-31 2015-02-18 富士通株式会社 Calibration support apparatus and calibration support program
KR101636902B1 (en) * 2012-08-23 2016-07-06 에스케이텔레콤 주식회사 Method for detecting a grammatical error and apparatus thereof
KR101500617B1 (en) * 2013-08-07 2015-03-10 부산대학교 산학협력단 Method and system for Context-sensitive Spelling Correction Rules using Korean WordNet
CN105550173A (en) * 2016-02-06 2016-05-04 北京京东尚科信息技术有限公司 Text correction method and device
CN107291683A (en) * 2016-04-11 2017-10-24 珠海金山办公软件有限公司 A kind of spell checking methods and device
CN109948144B (en) * 2019-01-29 2022-12-06 汕头大学 Teacher utterance intelligent processing method based on classroom teaching situation
CN110210029B (en) * 2019-05-30 2020-06-19 浙江远传信息技术股份有限公司 Method, system, device and medium for correcting error of voice text based on vertical field
CN110348020A (en) * 2019-07-17 2019-10-18 杭州嘉云数据科技有限公司 A kind of English- word spelling error correction method, device, equipment and readable storage medium storing program for executing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6618697B1 (en) * 1999-05-14 2003-09-09 Justsystem Corporation Method for rule-based correction of spelling and grammar errors
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Natural Language Contents Evaluation System for Detecting Fake News using Deep Learning; Ye-Chan Ahn; 2019 16th International Joint Conference on Computer Science and Software Engineering (JCSSE); full text *
A spelling proofreading system for Chinese text output by OCR; Li Rong; Journal of Chinese Information Processing; 2009-09-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN110929514A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN106534548B (en) Voice error correction method and device
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN111753531A (en) Text error correction method and device based on artificial intelligence, computer equipment and storage medium
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
CN113948066B (en) Error correction method, system, storage medium and device for real-time translation text
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
CN111274785A (en) Text error correction method, device, equipment and medium
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
CN115438650B (en) Contract text error correction method, system, equipment and medium fusing multi-source characteristics
CN110826301B (en) Punctuation mark adding method, punctuation mark adding system, mobile terminal and storage medium
US20110229036A1 (en) Method and apparatus for text and error profiling of historical documents
CN112447172A (en) Method and device for improving quality of voice recognition text
CN110442843B (en) Character replacement method, system, computer device and computer readable storage medium
CN113255329A (en) English text spelling error correction method and device, storage medium and electronic equipment
CN112632956A (en) Text matching method, device, terminal and storage medium
CN112559725A (en) Text matching method, device, terminal and storage medium
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN111651961A (en) Voice-based input method and device
CN112733517B (en) Method for checking requirement template conformity, electronic equipment and storage medium
CN114510925A (en) Chinese text error correction method, system, terminal equipment and storage medium
CN114528824A (en) Text error correction method and device, electronic equipment and storage medium
CN114707489B (en) Method and device for acquiring annotation data set, electronic equipment and storage medium
CN111401011B (en) Information processing method and device and electronic equipment
CN113268977B (en) Text error correction method and device based on language model, terminal equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No.27, 1st floor, building 14, Haidian Road, zhongjiancai District, Beijing 100096

Applicant after: Beijing PERCENT Technology Group Co.,Ltd.

Address before: No.27, 1st floor, building 14, Haidian Road, zhongjiancai District, Beijing 100096

Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant