CN117217206A

CN117217206A - Text error correction method, apparatus, electronic device, storage medium, and program product

Info

Publication number: CN117217206A
Application number: CN202310547840.9A
Authority: CN
Inventors: 谢贵才; 张伟; 黄泽谦
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2023-05-15
Filing date: 2023-05-15
Publication date: 2023-12-12

Abstract

The embodiments of the present application disclose a text error correction method, apparatus, electronic device, storage medium, and program product, applicable to natural language processing scenarios. The method extracts coding features and glyph features of a plurality of candidate confusion words; determines the coding similarity between the coding features of any two candidate confusion words and the glyph similarity between their glyph features; weights the coding similarity and the glyph similarity to obtain the total similarity of the two candidate confusion words; determines a confusion word set from the plurality of candidate confusion words according to the total similarity; and corrects the text to be corrected by means of the confusion word set. Because the weighting fuses the coding similarity and the glyph similarity, the similarity of candidate confusion words is measured on both coding and glyph, so that a more comprehensive and accurate confusion word set is constructed and the accuracy of text error correction using that set is improved.

Description

Text error correction method, apparatus, electronic device, storage medium, and program product

Technical Field

The present application relates to the field of computer technology, and in particular, to a text error correction method, apparatus, electronic device, storage medium, and program product.

Background

Text error correction is a technology for correcting erroneous text so that it conforms to language norms and syntactic logic, reducing comprehension difficulties and misunderstanding. It is widely applied wherever text quality must be ensured, for example in academic papers, news reports, e-mail communication, and social networks. In natural language processing scenarios, text error correction is also a prerequisite for tasks such as text generation and machine translation, and can improve the effectiveness and accuracy of the algorithms.

However, existing text error correction methods generally identify and correct text directly based on context. Because they are affected by context noise, the accuracy of text error correction is low.

Disclosure of Invention

The embodiment of the application provides a text error correction method, a device, electronic equipment, a storage medium and a program product, which can improve the accuracy of text error correction.

The embodiment of the application provides a text error correction method, which comprises the following steps: extracting coding features and glyph features of a plurality of candidate confusion words; determining the coding similarity corresponding to the coding features of any two candidate confusion words, and determining the glyph similarity corresponding to the glyph features of the two candidate confusion words; weighting the coding similarity and the glyph similarity to obtain the total similarity of the two candidate confusion words; determining a confusion word set from the plurality of candidate confusion words according to the total similarity; and correcting the text to be corrected through the confusion word set.

The embodiment of the application also provides a text error correction apparatus, which comprises: a feature extraction unit, configured to extract coding features and glyph features of a plurality of candidate confusion words; a similarity determining unit, configured to determine the coding similarity corresponding to the coding features of any two candidate confusion words, and determine the glyph similarity corresponding to the glyph features of the two candidate confusion words; a weighting unit, configured to weight the coding similarity and the glyph similarity to obtain the total similarity of the two candidate confusion words; a word set determining unit, configured to determine a confusion word set from the plurality of candidate confusion words according to the total similarity; and an error correction unit, configured to correct the text to be corrected through the confusion word set.

The embodiment of the application also provides electronic equipment, which comprises a processor and a memory, wherein the memory stores a plurality of instructions; the processor loads instructions from the memory to perform steps in any of the text error correction methods provided by the embodiments of the present application.

The embodiment of the application also provides a computer readable storage medium, which stores a plurality of instructions, the instructions are suitable for being loaded by a processor to execute the steps in any text error correction method provided by the embodiment of the application.

The embodiments of the present application also provide a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps in any of the text error correction methods provided by the embodiments of the present application.

The embodiment of the application can extract coding features and glyph features of a plurality of candidate confusion words; determine the coding similarity corresponding to the coding features of any two candidate confusion words, and determine the glyph similarity corresponding to the glyph features of the two candidate confusion words; weight the coding similarity and the glyph similarity to obtain the total similarity of the two candidate confusion words; determine a confusion word set from the plurality of candidate confusion words according to the total similarity; and correct the text to be corrected through the confusion word set.

In the present application, based on the multidimensional features of the candidate confusion words in terms of coding and glyph, the weighting process takes the influence of the different features into account together and fuses the coding similarity and the glyph similarity to measure the similarity of candidate confusion words. The constructed confusion word set therefore contains confusion words that are similar to one another in both coding and glyph, so the set is more comprehensive and accurate. In application, the confusion word set can be invoked directly to correct the text to be corrected: Chinese characters in the text are matched accurately against the set, which improves error correction efficiency. In addition, because the set contains confusion words similar in coding and in glyph, errors arising from various causes, such as spelling errors and writing errors, can be identified, which improves the accuracy of text error correction.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1a is a schematic view of a scenario of a text error correction method according to an embodiment of the present application;

FIG. 1b is a schematic flow chart of a text error correction method according to an embodiment of the present application;

FIG. 2a is a flow chart of a text error correction method according to another embodiment of the present application;

FIG. 2b is a schematic diagram of the multi-granularity features of the candidate confusion word pair <腾,滕> provided by an embodiment of the present application;

FIG. 2c is a schematic diagram of processing retrieved text according to an embodiment of the present application;

FIG. 2d is a schematic flow chart of error correction for text to be corrected according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a text error correction apparatus according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

The embodiment of the application provides a text error correction method, a text error correction device, electronic equipment, a storage medium and a program product.

The text error correction device can be integrated in an electronic device, and the electronic device can be a terminal, a server and other devices. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer (Personal Computer, PC) or the like; the server may be a single server or a server cluster composed of a plurality of servers.

In some embodiments, the text error correction apparatus may also be integrated in a plurality of electronic devices, for example, the text error correction apparatus may be integrated in a plurality of servers, and the text error correction method of the present application is implemented by the plurality of servers.

In some embodiments, the server may also be implemented in the form of a terminal.

For example, referring to FIG. 1a, the text error correction method may be implemented by a server. The server may extract coding features and glyph features of a plurality of candidate confusion words; determine the coding similarity corresponding to the coding features of any two candidate confusion words, and determine the glyph similarity corresponding to the glyph features of the two candidate confusion words; weight the coding similarity and the glyph similarity to obtain the total similarity of the two candidate confusion words; and determine a confusion word set from the plurality of candidate confusion words according to the total similarity. The server can then obtain the text to be corrected from the terminal and correct it through the confusion word set to obtain the corrected text (the error correction result). The server may also send the corrected text to the terminal so that the terminal displays it.

A detailed description follows. The numbering of the following embodiments is not intended to limit their preferred order. It will be appreciated that, in the specific embodiments of the present application, user-related data such as the text to be corrected and question text involve text entered by a user. When the embodiments are applied to a specific product or technology, user permission or consent is required, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering by robots, knowledge graphs, and the like.

With the research and progress of artificial intelligence technology, it is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, smart customer service, the Internet of Vehicles, and intelligent transportation. It is believed that, as technology develops, artificial intelligence will be applied in more fields and will become increasingly important.

In this embodiment, a text error correction method related to artificial intelligence is provided. As shown in FIG. 1b, the specific flow of the text error correction method may be as follows:

110. Extract the coding features and glyph features of a plurality of candidate confusion words.

Confusion words are Chinese characters that are similar in shape or in pronunciation and are therefore easily confused in writing or reading. Confusion words are common in Chinese, such as "say" and "will", "be", and so on. Candidate confusion words are the possible confusion words determined in the text error correction process. For example, the candidate confusion words can be the confusion words in an existing confusion word set, the Chinese characters in a text corpus, or any Chinese characters such as the 2,500 common characters and 1,000 secondary common characters recorded in a list of commonly used modern Chinese characters.

In some implementations, the Chinese characters in a text corpus can be used as candidate confusion words, where the text corpus is a text data set of the target scene collected and organized according to a particular method. The text corpus can include language materials used in the target scene, such as characters, words, idioms, phrases, short sentences, long sentences, poems, and articles. The target scene is the application scene of the text error correction method in the embodiment of the application.

A coding feature is a feature obtained by encoding a Chinese character (candidate confusion word) according to a coding scheme for Chinese characters. Coding schemes for Chinese characters include, but are not limited to, existing schemes such as stroke coding, pinyin coding, and four-corner coding.

A glyph feature is a feature of the written form of a Chinese character (candidate confusion word). For example, the glyph features may include, but are not limited to, the overall glyph of the character, its composition structure (including single-component characters, left-right structures, up-down structures, and enclosing structures), its components or radicals, and the like.

In some implementations, the coding features are obtained by converting candidate confusion words according to a specified coding rule. A specified coding rule is a way of encoding based on the characteristics of Chinese characters and includes, but is not limited to, at least one of stroke coding, pinyin coding, and four-corner coding. Stroke coding splits each Chinese character into its strokes and represents each stroke by a number; for example, "正" has 5 strokes and "副" has 11 strokes. Pinyin coding converts a Chinese character into a Latin-letter spelling according to its pronunciation; for example, "北京" is converted into "Běijīng". Four-corner coding derives four digits from the stroke shapes at the four corners of a Chinese character (upper left, upper right, lower left, and lower right) and uses the four digits to represent the character.

In some implementations, the coding features include at least one of a stroke sequence feature, a pinyin sequence feature, and a four-corner coding sequence feature.

In practical applications, strokes are the minimum-granularity constituent elements of a Chinese character, and the complete stroke sequence (the strokes and their order) can serve as the most basic representation of a character. Pinyin is the phonetic notation system for the pronunciation of Chinese characters: it converts each character into phonetic symbols composed of Latin letters that mark its syllable and tone. Similar or even identical pinyin or tones across different characters are the main cause of confusion between near-homophones. Four-corner coding, like pinyin indexing and radical indexing, is a Chinese character lookup method. Unlike the stroke sequence, four-corner coding retains single-stroke shapes such as the horizontal, vertical, left-falling, and right-falling strokes, and also includes compound stroke units formed from several strokes, such as the dot-horizontal shape (for example "亠") and the cross shape (for example "十" and "乂"). Any Chinese character can be represented by a coding sequence of length 5 (the four-corner coding sequence feature), in which the first four digits correspond to the stroke shapes at the four corner positions of the character and the last digit is an auxiliary digit used to distinguish characters that share the same four digits.
Therefore, the stroke sequence feature represents the candidate confusion word at the minimum granularity, the pinyin sequence feature represents it in terms of syllable and tone, and the four-corner coding sequence feature represents it at a medium granularity. Combining one or more of these sequence features into the coding features covers multiple granularities and dimensions of the candidate confusion word, so the coding features express the characteristics of the Chinese character more accurately and the accuracy of text error correction is improved.

In some embodiments, a coding sequence based on the constituent features of a Chinese character can be obtained according to a specified coding rule such as stroke coding, pinyin coding, or four-corner coding, and then converted into a standard coding feature that expresses the characteristics of the character, so that an accurate confusion word set is constructed and the accuracy of text error correction is improved. Specifically, the coding features are obtained by:

determining a coding sequence of the candidate confusion word according to a specified coding rule, wherein the specified coding rule comprises at least one of stroke coding, pinyin coding, and four-corner coding;

converting the coding sequence into the coding features.

The stroke sequence feature, the pinyin sequence feature, and the four-corner coding sequence feature are obtained by converting the candidate confusion word according to stroke coding, pinyin coding, and four-corner coding, respectively.
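The two steps above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the lookup tables, the rule registry `CODING_RULES`, and the function `extract_coding_features` are hypothetical names introduced here, and the only table entries are the values quoted for "腾" in this description ("teng2" and "79227"). A stroke table is omitted because it would depend on the chosen stroke classification.

```python
from typing import Callable, Dict, Sequence

# Hypothetical lookup tables; a real system would load full dictionaries.
# The entries for "腾" are the values quoted in this description.
PINYIN_TABLE: Dict[str, str] = {"腾": "teng2"}
FOUR_CORNER_TABLE: Dict[str, str] = {"腾": "79227"}

# Each specified coding rule maps a character to its coding sequence.
CODING_RULES: Dict[str, Callable[[str], str]] = {
    "pinyin": lambda ch: PINYIN_TABLE[ch],
    "four_corner": lambda ch: FOUR_CORNER_TABLE[ch],
}


def extract_coding_features(
    ch: str, rules: Sequence[str] = ("pinyin", "four_corner")
) -> Dict[str, str]:
    """Step 1: determine the coding sequence under each specified rule.
    Step 2: convert it into a coding feature (kept here as a string)."""
    return {name: str(CODING_RULES[name](ch)) for name in rules}


features = extract_coding_features("腾")
print(features)
```

Keeping one entry per rule makes it easy to use only a subset of the rules, which matches the "at least one of" wording above.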

Stroke classifications include basic methods, such as the "Eight Principles of Yong" (8 classes in total) and the "sheaf character" method (5 classes in total), and detailed methods, such as the Zhang Jingxian standard strokes with 32 classes in the Chinese character bending-stroke specification for the GB13000.1 character set issued by the Ministry of Education. At least one of these stroke classifications can be used as the stroke coding to obtain the stroke sequence feature. In some implementations, for stroke coding, a candidate confusion word may be split into the corresponding stroke sequence feature based on the derived strokes of Chinese characters. For example, to characterize the similarity of Chinese characters more accurately at the finest granularity, the candidate confusion word may be split into the corresponding stroke sequence feature based on the 25 derived strokes shown in Table 1 below.

TABLE 1

Two phonetic notations are commonly used for Chinese characters: Hanyu Pinyin and Zhuyin (phonetic notation symbols). Either can be used as the pinyin coding to convert a candidate confusion word into the corresponding pinyin sequence feature. In some implementations, for pinyin coding, candidate confusion words may be converted into the corresponding pinyin sequence features based on Hanyu Pinyin.

Similar or even identical pinyin or tones across different Chinese characters are a main cause of confusion between near-homophones. In some implementations, for pinyin coding, the pinyin and the tone of a candidate confusion word may be combined into the corresponding pinyin sequence feature. For example, the pinyin of the Chinese character "腾" is "teng" pronounced in the second tone, so the pinyin sequence feature of "腾" can be expressed as "teng2".
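As a minimal sketch of combining pinyin and tone into one sequence feature, the following converts a tone-marked syllable into the "letters plus tone digit" form. The function name `pinyin_sequence_feature` is hypothetical, and treating a syllable without a tone mark as having no digit is an assumption made here, not something the description specifies.

```python
import unicodedata

# Combining diacritics used for the four Mandarin tones in Hanyu Pinyin:
# macron (1st), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def pinyin_sequence_feature(syllable: str) -> str:
    """Turn a tone-marked syllable such as 'téng' into 'teng2'."""
    tone = ""
    letters = []
    # NFD separates base letters from their combining tone marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # Recompose so that letters such as 'ü' stay a single character.
    return unicodedata.normalize("NFC", "".join(letters)) + tone


print(pinyin_sequence_feature("téng"))
```

Appending the tone as a digit keeps the feature a plain string, so two syllables that differ only in tone differ in exactly one character.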

In some embodiments, for four-corner coding, the candidate confusion word can be split into the corresponding four-corner coding sequence feature based on the stroke shapes used in the four-corner coding of Chinese characters. For example, to describe the confusability of Chinese characters more accurately at a medium granularity, the candidate confusion word can be split into the corresponding four-corner coding sequence feature based on the 10 stroke shapes shown in Table 2 below.

TABLE 2

In some implementations, to facilitate further processing of the stroke sequence feature, pinyin sequence feature, and four-corner coding sequence feature of a candidate confusion word, these features may be represented as strings. Taking the Chinese character "腾" as an example: according to the derived strokes in Table 1 and the writing order of "腾", the character can be split into its stroke sequence, which is then converted into a stroke sequence feature in string form: "\u4e3f\u4c\4e00\u4e00\u4e36\u4e0e\u4; e00\u4e00\u4e36\u31cf2c4d\u3169\u4e00"; according to the pinyin of "腾", teng2, it can be converted into a pinyin sequence feature in string form: "teng2"; and according to the stroke shapes in Table 2, "腾" can be split into 79227, which is then converted into a four-corner coding sequence feature in string form: "79227".

Confusion between similar-looking Chinese characters is related to their component structure. Taking the Chinese characters {腾,滕,腠} as an example, the left side of all three characters is "月", and the right side has an up-down structure. The right sides of <腾,滕> share the same top component "龹", so this pair is more confusable than <腾,腠>. Therefore, in some embodiments, the components of a candidate confusion word can be used as its glyph feature, so that the confusable characteristics in its glyph structure are represented more accurately and the accuracy of text error correction is improved. Specifically, the glyph features include a radical feature that is obtained by:

splitting the glyph structure of the candidate confusion word to obtain at least one component;

obtaining the radical feature of the candidate confusion word from the at least one component.

The glyph structure is the composition form of a Chinese character and includes elements such as the strokes, radicals, and structure of the character.

A component is a part of a Chinese character that carries the same or a related meaning. It is usually located on the left, right, top, or bottom of the character and hints at the meaning of the character. The radical feature is a feature that expresses the components of a Chinese character; for example, it may include, but is not limited to, at least one of the component itself, its stroke count, its structure, and its stroke order.

In practical applications, a candidate confusion word can be split according to the structural forms of Chinese characters, such as the up-down structure, the left-right structure, and the fully enclosing structure, to obtain the components that form it. For example, the 14-type Chinese character splitting method shown in Table 3 below can be adopted, and the candidate confusion word is split into its components according to its glyph structure. For example, the Chinese character "腾" has a left-right structure and can be split into "月" and its right-hand part; the right-hand part has an up-down structure and can be further split into "龹" and "马". "腾" is thus split into "月", "龹", and "马", which cannot be split any further. These components serve as the radical feature of the candidate confusion word.

TABLE 3

In some implementations, to facilitate further processing of the radical feature of a candidate confusion word, the radical feature may be represented as a string. For example, after "腾" is split into the non-splittable components "月", "龹", and "马", these components can be converted into the corresponding character strings and then concatenated in the writing order of the components of "腾" to obtain the radical feature in string form: "\u6708\u9f79\u99ac".
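The splitting and string conversion can be sketched as below. The nested-tuple representation, the names `leaf_components` and `radical_feature`, and the assumption that writing order equals a left-to-right depth-first traversal are all illustrative choices made here, not part of the application. The exact escape codes produced depend on which character forms are stored in the decomposition data.

```python
from typing import List, Tuple, Union

# A decomposition is either a non-splittable component (a str) or a tuple
# (structure, part, part, ...) whose parts are again decompositions.
Decomposition = Union[str, Tuple]

# Hypothetical entry for "腾": a left-right structure whose right-hand part
# is an up-down structure. Components: 月 (U+6708), 龹 (U+9F79), 马 (U+9A6C).
TENG: Decomposition = ("left-right", "\u6708", ("up-down", "\u9f79", "\u9a6c"))


def leaf_components(node: Decomposition) -> List[str]:
    """Split recursively until only non-splittable components remain."""
    if isinstance(node, str):
        return [node]
    _structure, *parts = node
    leaves: List[str] = []
    for part in parts:
        leaves.extend(leaf_components(part))
    return leaves


def radical_feature(node: Decomposition) -> str:
    """Concatenate the components as \\uXXXX escapes, in traversal order."""
    return "".join(f"\\u{ord(c):04x}" for c in leaf_components(node))


print(leaf_components(TENG))
print(radical_feature(TENG))
```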

In some embodiments, features of the overall glyph of a candidate confusion word can be extracted from its display image. This adds visual, image-modality features of the candidate confusion word and distinguishes confusable Chinese characters along the dimension of how the characters are displayed, so that the accuracy of text error correction is improved on the basis of multimodal features. Specifically, the glyph features include a display feature that is obtained by:

converting the display image of the candidate confusion word into a grayscale image;

taking the image features of the grayscale image as the display feature of the candidate confusion word.

The display image of a candidate confusion word is an image in which the candidate confusion word is displayed. For example, the display image may include, but is not limited to, a display image of a standard Chinese character or a display image of a non-standard Chinese character. A display image of a standard Chinese character may be a Chinese character image produced by printing technology and is affected by factors such as typeface and font size. A display image of a non-standard Chinese character may be an image of a real handwritten Chinese character obtained by scanning, digital photography, or the like, and is affected by factors such as the thickness, direction, and writing speed of the strokes. The display feature is a feature extracted from the display image of the candidate confusion word, that is, the image feature corresponding to that display image.

For example, a display image of the candidate confusion word may be acquired and converted into a grayscale image, which can be understood as a matrix whose element values range from 0 to 255. Features are then extracted from the grayscale image (that is, matrix operations are applied to the matrix) using methods such as the mean method (averaging the matrix), the maximum method, and the minimum method, to obtain the image features of the grayscale image, which are the display feature of the candidate confusion word. Extracting the image features of the grayscale image in this way captures characteristics of the Chinese character such as its shape and texture and represents the different parts of the character, so that confusable Chinese characters can be distinguished more accurately.
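A minimal sketch of this step is given below, using plain Python lists in place of an image library. The luma weights (0.299, 0.587, 0.114) are the common ITU-R BT.601 coefficients and are an assumption made here, since the description does not say how the grayscale conversion is performed; the function names are likewise hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def to_grayscale(image: Sequence[Sequence[Pixel]]) -> List[List[int]]:
    """Convert an RGB image into a matrix of gray values in 0..255.
    Assumption: ITU-R BT.601 luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def display_feature(gray: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Mean, maximum, and minimum of the grayscale matrix."""
    values = [v for row in gray for v in row]
    return {
        "mean": sum(values) / len(values),
        "max": max(values),
        "min": min(values),
    }


# A tiny 2x2 "display image": black ink on a white background.
image = [
    [(0, 0, 0), (255, 255, 255)],
    [(255, 255, 255), (0, 0, 0)],
]
gray = to_grayscale(image)
print(gray, display_feature(gray))
```

Whole-matrix statistics are very coarse; computing the same statistics per row, per column, or per tile would retain more of the character's shape, at the cost of a longer feature vector.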

In some embodiments, threshold-based binarization may be used to convert the grayscale image of the candidate confusion word into a binary image, and a connectivity-based segmentation algorithm is then used to segment the character region, so as to extract the contour feature of the candidate confusion word. The gray values or parameter values corresponding to the contour feature are used as the display feature of the candidate confusion word.
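The binarization and connectivity-based segmentation can be sketched as follows. The threshold of 128, the choice of 4-connectivity, the dark-ink-on-light-background convention, and the function names are assumptions made for illustration; the description does not fix any of them.

```python
from collections import deque
from typing import List, Sequence, Tuple


def binarize(gray: Sequence[Sequence[int]], threshold: int = 128) -> List[List[int]]:
    """Mark dark pixels (ink) as 1 and light pixels (background) as 0."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def connected_regions(binary: Sequence[Sequence[int]]) -> List[List[Tuple[int, int]]]:
    """Group ink pixels into 4-connected regions with a breadth-first search."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            region = []
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


gray = [
    [0, 255, 0],
    [0, 255, 255],
    [255, 255, 0],
]
binary = binarize(gray)
regions = connected_regions(binary)
print(binary, regions)
```

Each region is a list of pixel coordinates, from which contour parameters such as a bounding box or boundary pixels could be derived.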

120. Determine the coding similarity corresponding to the coding features of any two candidate confusion words, and determine the glyph similarity corresponding to the glyph features of the two candidate confusion words.

The coding similarity is the similarity calculated from the coding features of the candidate confusion words. The glyph similarity is the similarity calculated from the glyph features of the candidate confusion words. For example, the similarity of the coding features or glyph features of any two candidate confusion words may be calculated based on the edit distance, the Euclidean distance, or the cosine similarity to obtain the coding similarity or glyph similarity.
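Two of the measures named above can be sketched as follows: edit distance for string-form features and cosine similarity for numeric feature vectors. Normalizing the edit distance by the longer string length to obtain a similarity in [0, 1] is an assumption made here; the description only names the distance itself.

```python
import math
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(edit_similarity("teng2", "deng1"))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Edit distance suits the string-form stroke, pinyin, and four-corner features, while cosine similarity suits numeric features such as those extracted from a grayscale image.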

In some embodiments, the similarity information of any two candidate confusion words can be converted, based on a similarity condition, into a more intuitive first similarity value or second similarity value, which facilitates subsequent data processing and improves the accuracy of text error correction. Specifically, the glyph similarity includes a radical similarity, which is obtained by:

if the radical features of any two candidate confusion words satisfy the similarity condition, taking the first similarity value as the radical similarity;

if the radical features of any two candidate confusion words do not satisfy the similarity condition, taking the second similarity value as the radical similarity.

The similarity condition is a condition for judging whether the radical features of the candidate confusion words are similar. Satisfying the similarity condition means that the radical features are similar, and not satisfying it means that they are not similar. For example, the similarity condition may include, but is not limited to, whether the components themselves, their stroke counts, their structures, their stroke orders, and the like are similar. In practical applications, the similarity condition may be expressed as a similarity threshold.

The first similarity value is a value indicating that the radical features of any two candidate confusion words are similar, and the second similarity value is a value indicating that they are dissimilar. In practical applications, the first and second similarity values can be specified according to the application scene and actual requirements. For example, the first similarity value can be 1 and the second similarity value can be 0, so that the similarity of the radical features of any two candidate confusion words is binarized and the similarity information is converted into a more intuitive form that facilitates subsequent data processing.

For example, the similarity value of the component features of any two candidate mixed words, such as the number of strokes, the structure or the stroke sequence, can be calculated, for example, the difference value of the number of strokes of the component features of the two candidate mixed words is calculated, the smaller the difference value is, the higher the similarity value is, or the component features are classified into the types of 'left-right structure', 'upper-lower structure', and the like, then the structure types of the two component features are compared, the similarity value is increased, or the writing sequence of the strokes of the two component features is compared based on the writing sequence of the strokes, and the more similar the similarity value is. Thus, if the similarity value is equal to or greater than the similarity threshold value, the similarity condition is satisfied, the radical similarity is 1, and if the similarity value is less than the similarity threshold value, the similarity condition is not satisfied, and the radical similarity is 0.

In some embodiments, the radical feature of any two candidate confusion words meeting the similarity condition means that the radical feature of any one radical of any two candidate confusion words meeting the similarity condition, and the radical feature of any two candidate confusion words not meeting the similarity condition means that the radical feature of all radicals of any two candidate confusion words not meeting the similarity condition.

In some embodiments, the similarity condition may be that the radicals corresponding to the radical features are identical. For example, comparing the radicals of the Chinese characters <腾,滕> shows that the stroke number, structure and stroke order of the right-hand "" character head are the same, so <腾,滕> are considered to have the same "" character head on the right-hand side. The pair <腾,滕> therefore satisfies the similarity condition, and the radical similarity of <腾,滕> is 1.
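The binarized radical similarity described above can be sketched as follows. This is a minimal illustration that assumes the similarity condition is "at least one identical radical"; the function name `radical_similarity` and the placeholder radical labels (`c1`, `c2`, ...) are hypothetical and not taken from the application.

```python
def radical_similarity(radicals_a, radicals_b, first_value=1, second_value=0):
    """Return first_value if the two characters share at least one identical
    radical (assumed similarity condition), otherwise second_value."""
    shared = set(radicals_a) & set(radicals_b)
    return first_value if shared else second_value


# Placeholder decompositions; real radical data would come from a
# character decomposition resource, which is not specified here.
print(radical_similarity(["c1", "c2"], ["c1", "c3"]))
print(radical_similarity(["c1", "c2"], ["c3", "c4"]))
```

Using a set intersection reflects the rule that a single matching radical is enough for the pair to meet the condition.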

In some embodiments, the font similarity includes a display similarity, which is a similarity based on the overall glyph of the displayed Chinese character and allows confusable Chinese characters to be distinguished in terms of their overall glyph. The display similarity includes at least one of a first display similarity, a second display similarity and a third display similarity, which are similarities of the display features calculated by different methods.

In some embodiments, the first display similarity may be an overall similarity calculated directly from the image features, so that the overall glyph similarity can be calculated quickly. Specifically, the first display similarity is obtained by calculating the similarity of the image features of any two candidate confusion words. For example, the first display similarity may be obtained by computing a measure such as the Euclidean distance, cosine similarity or Manhattan distance between the image features (gray value matrices) of the gray maps of the display images of any two candidate confusion words.
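As an illustration of the first display similarity, the sketch below computes the cosine similarity between two gray value matrices given as nested lists. The function name and the tiny matrices are assumptions made for the example; rendering characters to gray maps is outside its scope.

```python
import math


def first_display_similarity(gray_a, gray_b):
    """Cosine similarity between two equally sized gray value matrices."""
    flat_a = [v for row in gray_a for v in row]
    flat_b = [v for row in gray_b for v in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero image has no direction to compare
    return dot / (norm_a * norm_b)


image = [[0, 255], [255, 0]]
print(first_display_similarity(image, image))
```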

In some embodiments, the second display similarity may be a local similarity calculated from the image features, so that the local similarity of the image may be calculated. Specifically, the second display similarity is obtained by:

dividing a gray scale image of a display image into a plurality of image blocks;

calculating the characteristic mean value and the characteristic variance of each image block to obtain a characteristic vector of each image block, wherein the characteristic vector comprises the characteristic mean value and the characteristic variance of the image block;

forming a picture vector of the display image by the feature vectors of all the image blocks;

and calculating the similarity of the picture vectors of the display images of any two candidate confusion words to obtain a second display similarity.

For example, the gray map of the display image may be sampled with a sliding window, each sample corresponding to one image block, and the feature mean and feature variance of the image block covered by the sliding window are calculated, for example by computing a feature mean a and a feature variance b over the rows of the gray value matrix. The feature vector of image block 1 is then (a1, b1). The feature vectors of all image blocks are concatenated to obtain the feature vector of the display image (that is, the picture vector), and the similarity of the picture vectors of the display images of any two candidate confusion words is calculated by the Euclidean distance, cosine similarity, Manhattan distance or the like to obtain the second display similarity.
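The block-based procedure can be sketched as below. The sketch assumes non-overlapping square windows and converts the Euclidean distance into a similarity with 1 / (1 + distance); both choices, as well as the function names, are assumptions rather than details given in the application.

```python
import math


def block_features(gray, size):
    """Picture vector: (mean, variance) of every size x size image block."""
    vector = []
    for top in range(0, len(gray) - size + 1, size):
        for left in range(0, len(gray[0]) - size + 1, size):
            pixels = [gray[r][c]
                      for r in range(top, top + size)
                      for c in range(left, left + size)]
            mean = sum(pixels) / len(pixels)
            variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
            vector.extend([mean, variance])
    return vector


def second_display_similarity(gray_a, gray_b, size=2):
    """Assumed mapping from Euclidean distance to a similarity in (0, 1]."""
    vec_a = block_features(gray_a, size)
    vec_b = block_features(gray_b, size)
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))
    return 1.0 / (1.0 + distance)


gray = [[0, 0, 10, 10], [0, 0, 10, 10]]
print(block_features(gray, 2))
```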

In some embodiments, the third display similarity may be a similarity calculated by a perceptual hash algorithm, where the perceptual hash algorithm only needs to calculate a hash value, so that the method has high calculation efficiency, has a certain robustness to common deformations such as rotation, scaling, translation, and the like of the image, and can ensure that a relatively accurate similarity comparison result can be maintained under the deformation within a certain range. Specifically, the third display similarity is obtained by:

calculating, according to the display features of the display image, the gray mean value of the gray map of the display image;

comparing the gray value and the gray average value of the gray map aiming at the gray map of the display image corresponding to any candidate confusion word to obtain a perception hash value sequence of the display image;

and comparing the perceived hash value sequences of the display images of any two candidate confusion words to obtain a third display similarity.

For example, the display feature may be the gray value matrix of the gray map of the display image. For the display image corresponding to any candidate confusion word, the average of the gray values of all pixels in the gray map (that is, the gray mean value) can be calculated, and each pixel in the gray map is compared with the gray mean value: a pixel whose value is greater than or equal to the mean is marked as 1, and a pixel whose value is less than the mean is marked as 0. The comparison results of all pixels in the gray map are combined into a binary string in pixel order to obtain the perceptual hash value sequence of the display image corresponding to the gray map. For example, if the gray values of two pixels are 10 and 200 and the mean is 100, the comparison results of the two pixels are 0 and 1 respectively, and the combined hash value sequence is 01. For the perceptual hash value sequences of the display images of any two candidate confusion words, the number of positions at which the two sequences hold the same value can be counted: each position where the two sequences agree counts as 1, and the final count is used as the third display similarity.
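The hashing and comparison steps above can be sketched as follows. The function names are hypothetical, and the input is assumed to be a gray value matrix given as nested lists.

```python
def perceptual_hash(gray):
    """Binary string: 1 where a pixel is >= the gray mean value, else 0."""
    pixels = [v for row in gray for v in row]
    mean = sum(pixels) / len(pixels)
    return "".join("1" if v >= mean else "0" for v in pixels)


def third_display_similarity(hash_a, hash_b):
    """Count the positions at which the two hash sequences agree."""
    return sum(1 for a, b in zip(hash_a, hash_b) if a == b)


hash_a = perceptual_hash([[10, 200]])
hash_b = perceptual_hash([[200, 200]])
print(hash_a, hash_b, third_display_similarity(hash_a, hash_b))
```

The raw count grows with the image size, so dividing it by the sequence length would be one way to keep it comparable with the other similarities; the text above does not state whether this is done.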

In some embodiments, the display similarity may be obtained by weighting the first display similarity, the second display similarity, and the third display similarity, where the weight value corresponding to the first display similarity is greater than the weight value corresponding to the second display similarity and the weight value corresponding to the third display similarity.

In some embodiments, the encoded similarity may be evaluated quantitatively based on the encoded length of the encoded feature of the candidate confusion word to convert the similarity information into a more intuitive form for subsequent data processing. Specifically, the coding similarity is obtained by:

acquiring a first coding length of a first coding feature and a second coding length of a second coding feature, wherein the first coding feature and the second coding feature are coding features of any two candidate confusion words;

and according to the first coding length and the second coding length, calculating to obtain the coding similarity of any two candidate confusion words.

The coding length refers to the length of a coding feature, the first coding length and the second coding length are respectively the length of a first coding feature and the length of a second coding feature, the first coding feature and the second coding feature are respectively the coding features of a first candidate confusion word and a second candidate confusion word, and the first candidate confusion word and the second candidate confusion word are any two candidate confusion words. For example, when the coding feature is expressed in the form of a character string, the coding length is a character string length, and when the coding feature is expressed in the form of a vector, the coding length is a vector length.

In some embodiments, coding features with shorter coding lengths are matched more easily than those with longer coding lengths, so a feature similarity calculated from the coding features alone may be biased by the coding length. To reduce the influence of the coding length on the similarity, the coding similarity may be calculated by combining the feature similarity of the coding features with the coding length, where the feature similarity characterizes how similar the coding features are, and taking the coding length into account makes the resulting coding similarity more accurate. The feature similarity of the coding features may include, but is not limited to, at least one of similarities in the form of a coding length difference, an edit distance, a maximum common coding feature, and the like.

In some embodiments, the closer the coding lengths of two coding features are, the smaller the difference in their number of sequence elements, so the difference in coding length can be used to represent how similar the coding features are. The coding similarity of the candidate confusion words can thus be calculated quickly, which improves the efficiency of constructing the confusion word set. Specifically, the coding similarity includes a feature length similarity, and calculating the coding similarity of any two candidate confusion words according to the first coding length and the second coding length includes:

determining a code length difference between the first code length and the second code length, and determining a maximum code length value from the first code length and the second code length;

and calculating the feature length similarity of any two candidate confusion words according to the code length difference value and the maximum code length value.

The feature length similarity refers to the similarity of the coding feature lengths.

For example, the feature length similarity (SimScore_LEN) can be calculated by the following formula:

Where len(s₁) represents the first code length, len(s₂) represents the second code length, len(s₁) − len(s₂) represents the difference between the first code length and the second code length, abs() takes the absolute value, so that abs(len(s₁) − len(s₂)) is the code length difference, and max(len(s₁), len(s₂)) is the larger of the first code length and the second code length (that is, the maximum code length value).
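A possible computation of the feature length similarity is sketched below. The text names only the code length difference and the maximum code length value as inputs, so the specific combination used here, one minus their ratio, is an assumption chosen so that equal lengths give 1; it should not be read as the application's exact formula.

```python
def feature_length_similarity(s1, s2):
    """Assumed form: 1 - abs(len(s1) - len(s2)) / max(len(s1), len(s2))."""
    max_length = max(len(s1), len(s2))
    if max_length == 0:
        return 1.0  # two empty codings are treated as identical in length
    return 1.0 - abs(len(s1) - len(s2)) / max_length


print(feature_length_similarity("abcd", "ab"))
```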

In some embodiments, the degree of similarity of the coding features can be represented by an edit distance. The edit distance is widely applicable and flexible, can effectively handle character strings of different types and lengths, and offers a degree of fault tolerance, which improves the completeness of the constructed confusion word set and the accuracy of text error correction. Specifically, the coding similarity includes a distance similarity, and calculating the coding similarity of any two candidate confusion words according to the first coding length and the second coding length includes:

determining an edit distance for converting the first coding feature into the second coding feature;

and calculating the distance similarity of any two candidate confusion words according to the editing distance, the first coding length and the second coding length.

The edit distance refers to the minimum number of operations required to convert one code feature into another code feature, and can be used to measure the similarity (degree of difference) between two code features.

For example, the distance similarity may be obtained by performing a target operation on the edit distance, the first code length and the second code length, where the target operation may include, but is not limited to, at least one of a basic operation, that is, an elementary mathematical operation such as addition, subtraction, multiplication or division, and a higher-level operation, that is, a more complex and abstract mathematical operation such as calculus. Since each edit operation takes a sequence element as its operand, the edit distance can represent fine-grained information between sequences.

In some embodiments, the edit distance may be divided by a target coding length, where the target coding length is the larger of the first coding length and the second coding length, to obtain the distance similarity of any two candidate confusion words. Dividing the edit distance by the target coding length yields a normalized distance similarity value, so that the calculated distance similarity is more accurate. For example, the distance similarity (SimScore_ED) can be calculated by the following formula:

Where ED_distance represents the edit distance.
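The normalization by the target coding length can be sketched with a standard Levenshtein edit distance. Taking one minus the normalized distance, so that a larger value means more similar, is an assumption of this sketch; the function names are hypothetical.

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]
        for j, b in enumerate(s2, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_similarity(s1, s2):
    """Assumed form: 1 - ED_distance / max(len(s1), len(s2))."""
    target_length = max(len(s1), len(s2))
    if target_length == 0:
        return 1.0
    return 1.0 - edit_distance(s1, s2) / target_length


print(edit_distance("kitten", "sitting"), distance_similarity("abcd", "abcf"))
```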

In some embodiments, the similarity of the coding features can be represented by using the largest common coding feature, so that the similarity of the local coding of the coding features in the serialization dimension can be mined, the integrity of the constructed confusion word set can be improved, and the accuracy of text error correction can be improved. Specifically, the coding similarity includes a common feature similarity, and according to a first coding length and a second coding length, the coding similarity of any two candidate confusion words is calculated, including:

determining a maximum common coding feature of the first coding feature and the second coding feature;

and calculating the common feature similarity of any two candidate confusion words according to the maximum common coding feature, the first coding length and the second coding length.

Wherein the maximum common coding feature refers to the length of the largest common substring and/or sub-sequence in the first coding feature and the second coding feature.

For example, the length of the largest common substring and/or sub-sequence can be determined by comparing the first coding feature and the second coding feature, so that the common feature similarity can be obtained by performing target operation on the length of the largest common substring and/or sub-sequence, the first coding length and the second coding length. The maximum common coding feature may characterize the aggregate intersection result of two sequences, representing the same substring/subsequence of the two sequences, the common substring and subsequence indicating how many sequence elements are the same for the two sequences, so the index represents the local similarity of the two sequences.

In some embodiments, the maximum common coding feature may be divided by the first coding length and/or the second coding length to obtain the common feature similarity of any two candidate confusion words. Combining the coding length of the coding features in this way yields a normalized common feature similarity, so that the calculated common feature similarity is more accurate. For example, the common feature similarity (SimScore_LCS) can be calculated by the following formula:

Where LCS represents the largest common substring or subsequence, and len(LCS) represents the maximum common coding feature.
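The two variants of the maximum common coding feature can be sketched with standard dynamic programming. Normalizing by the larger of the two coding lengths is an assumption of this sketch, since the text allows division by the first and/or the second coding length; the function names are hypothetical.

```python
def longest_common_subsequence(s1, s2):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    previous = [0] * (len(s2) + 1)
    for a in s1:
        current = [0]
        for j, b in enumerate(s2, start=1):
            if a == b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def longest_common_substring(s1, s2):
    """Length of the longest common contiguous substring."""
    best = 0
    previous = [0] * (len(s2) + 1)
    for a in s1:
        current = [0]
        for j, b in enumerate(s2, start=1):
            length = previous[j - 1] + 1 if a == b else 0
            current.append(length)
            best = max(best, length)
        previous = current
    return best


def common_feature_similarity(lcs_length, s1, s2):
    """Assumed normalization: len(LCS) / max(len(s1), len(s2))."""
    max_length = max(len(s1), len(s2))
    return lcs_length / max_length if max_length else 1.0


print(longest_common_subsequence("abcde", "ace"),
      longest_common_substring("abcde", "abxde"))
```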

130. Weighting the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words.

The weighting processing refers to giving different weights to the coding similarity and the font similarity, and performing weighted average, weighted summation or other weighting operations on the coding similarity and the font similarity to distinguish the contribution degree of the coding similarity and the font similarity to the similarity degree of the candidate confusion words.

Wherein the overall similarity may characterize the degree of confusion of any two candidate confusion words.

Therefore, the text error correction method of the embodiment of the application is based on the multi-dimensional features of the candidate confusion words in terms of coding and font, comprehensively considers the influence of the different features through the weighting process, and fuses the coding similarity and the font similarity to measure how similar the candidate confusion words are in coding and font. The constructed confusion word set thus contains confusion words that are similar in both coding and font, so that a more comprehensive and accurate confusion word set is constructed.

In some embodiments, the coding similarity and the font similarity can be weighted for multiple times based on different weights, so that the influence of different indexes is comprehensively considered, the similarity degree of the candidate confusion words is more comprehensively evaluated, inaccuracy caused by relying on single characteristics is avoided, the accuracy of a constructed confusion word set is improved, and the accuracy of text error correction is improved. Specifically, the weighting processing is performed on the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words, which comprises the following steps:

according to the first weight, carrying out weighting treatment on the coding similarity of any two candidate confusion words to obtain the weighted coding similarity of any two candidate confusion words;

and according to the second weight, carrying out weighted processing on the weighted coding similarity and the font similarity of any two candidate confusion words to obtain the total similarity of any two candidate confusion words.

The first weight and the second weight may be weights assigned according to application scenes and actual needs, or may be weights determined by methods such as statistics, probability theory or automatic weight learning based on machine learning. The first weight and the second weight may be the same or different.

For example, for any two candidate confusion words, the weighted coding similarity can be obtained by multiplying the first weight by the coding similarity, and then the weighted coding similarity and the font similarity are weighted and summed or weighted and averaged by the second weight to obtain the total similarity.

In practical applications, when the coding features include multiple coding features, the coding similarity corresponding to the combined features may be calculated after the multiple coding features are combined. However, in some embodiments, in order to better, multi-granularity and multi-dimensionally compare the similarity of the coding features of the candidate confusion words, the corresponding coding similarity may be calculated for each coding feature, and then the coding similarity and the font similarity corresponding to the multiple coding features are fused to calculate the total similarity.

For example, the coding features include stroke sequence features, pinyin sequence features and four-corner coding sequence features, and the coding similarity corresponding to the stroke sequence features of any two candidate confusion words, the coding similarity corresponding to the pinyin sequence features of any two candidate confusion words, and the coding similarity corresponding to the four-corner coding sequence features of any two candidate confusion words may be calculated respectively. The coding similarity comprises the corresponding feature length similarity (SimScore_LEN), distance similarity (SimScore_ED) and common feature similarity (SimScore_LCS). For any coding feature, the feature length similarity, the distance similarity and the common feature similarity corresponding to the coding feature can be weighted and summed with the first weight to obtain the weighted coding similarity (SimScore) corresponding to the coding feature. Taking the stroke sequence feature as an example, the weighted coding similarity (SimScore) corresponding to the stroke sequence features of any two candidate confusion words can be calculated by the following formula:

Where the common feature similarity comprises a maximum common subsequence similarity and a maximum common substring similarity. SimScore_ED, SimScore_LCS and the two remaining terms in the above formula denote the similarities corresponding to the stroke sequence features, and 0.2, 0.5, 0.1 and 0.2 are their respective first weights. It should be noted that the first weights corresponding to the stroke sequence features, the pinyin sequence features and the four-corner coding sequence features may differ, while the calculation follows the same method as the above formula.

For another example, after obtaining the weighted coding similarity corresponding to the stroke sequence features, the weighted coding similarity corresponding to the pinyin sequence features and the weighted coding similarity corresponding to the four-corner coding sequence features of any two candidate confusion words, the total similarity of the any two candidate confusion words may be obtained by calculating the following formula:

SimScore = 0.35 × SimScore_stroke + 0.25 × SimScore_quadrangle + 0.10 × SimScore_radical + 0.30 × SimScore_img + 1.0 × SimScore_pinyin;

Where SimScore_stroke, SimScore_quadrangle, SimScore_radical, SimScore_img and SimScore_pinyin respectively represent the weighted coding similarity corresponding to the stroke sequence features, the weighted coding similarity corresponding to the four-corner coding sequence features, the radical similarity, the display similarity, and the weighted coding similarity corresponding to the pinyin sequence features. The values 0.35, 0.25, 0.10, 0.30 and 1.0 are the second weights corresponding to the weighted coding similarity of the stroke sequence features, the weighted coding similarity of the four-corner coding sequence features, the radical similarity, the display similarity, and the weighted coding similarity of the pinyin sequence features, respectively.
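The weighted sum in the formula above can be written out directly. The weights are those stated in the formula; the dictionary keys and the input scores in the usage line are hypothetical values chosen only to exercise the function.

```python
SECOND_WEIGHTS = {
    "stroke": 0.35,      # weighted coding similarity, stroke sequence
    "quadrangle": 0.25,  # weighted coding similarity, four-corner coding
    "radical": 0.10,     # radical similarity
    "img": 0.30,         # display similarity
    "pinyin": 1.0,       # weighted coding similarity, pinyin sequence
}


def total_similarity(scores, weights=SECOND_WEIGHTS):
    """Weighted sum of the component similarities of one candidate pair."""
    return sum(weight * scores[name] for name, weight in weights.items())


# Hypothetical component scores for one pair of candidate confusion words.
example = {"stroke": 0.8, "quadrangle": 0.6, "radical": 1.0,
           "img": 0.9, "pinyin": 1.0}
print(round(total_similarity(example), 4))
```

Because the weights sum to 2.0 rather than 1.0, the total similarity can exceed 1.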

Taking the candidate confusion word pair <Teng, Teng> as an example, the weighted coding similarities, the radical similarity and the display similarity shown in Table 4 below can be calculated, and the total similarity in Table 4 can be obtained by the weighted summation of the weighted coding similarities, the radical similarity and the display similarity in Table 4.

Table 4

In some embodiments, to improve the efficiency of automatic weight learning, the first weight and the second weight may be adjusted by the confusion level. Specifically, the text error correction method further comprises the following steps:

weighting the coding similarity and the font similarity according to an initial first weight and an initial second weight to obtain the initial total similarity of any two candidate confusion words;

determining a candidate word confusion level corresponding to the initial total similarity;

obtaining the marking grade corresponding to any two candidate confusion words;

and adjusting the initial first weight and the initial second weight according to the candidate word confusion level and the labeling level to obtain the first weight and the second weight.

The initial first weight and the initial second weight refer to an initial weight value specified according to an application scene and actual requirements.

The confusion level refers to a level representing the similarity degree between Chinese characters. For example, the confusion hierarchy may include a first level confusion, a second level confusion, and a third level confusion, where the first level confusion is the lowest level of similarity and the third level confusion is the highest level of similarity, and the first level confusion, the second level confusion, and the third level confusion correspond to the first similarity threshold interval, the second similarity threshold interval, and the third similarity threshold interval, respectively.

For example, an initial first weight and an initial second weight may be preset. Taking the candidate confusion words <Teng, Teng> as an example, a manually labeled confusion level (that is, the labeling level) of <Teng, Teng> may be obtained. The coding similarity and the font similarity of <Teng, Teng> are weighted to obtain the initial total similarity of <Teng, Teng>, and the corresponding confusion level (that is, the candidate word confusion level) is determined according to the similarity threshold interval in which the value of the initial total similarity falls. The labeling level and the candidate word confusion level are then compared. If they are inconsistent, the initial first weight and the initial second weight are adjusted, the initial total similarity of <Teng, Teng> is recalculated with the adjusted weights, the candidate word confusion level of <Teng, Teng> is determined again, and the two levels are compared again. This adjustment is repeated until the labeling level is consistent with the candidate word confusion level, and the adjusted initial first weight and the adjusted initial second weight obtained in the last adjustment are used as the first weight and the second weight respectively.

In some embodiments, to improve the efficiency of the adjustment, the initial first weight and the initial second weight may be adjusted by a grid search algorithm so as to obtain the first weight and the second weight quickly and accurately. The grid search algorithm selects the best parameter combination by traversing the given parameter combinations of the initial first weight and the initial second weight.
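The grid search over weights can be sketched as below. To stay short, the sketch makes several simplifying assumptions that are not stated in the application: it searches one weight for the coding similarity and one for the font similarity, maps the total similarity to three confusion levels with the hypothetical thresholds 0.5 and 1.0, and scores each weight combination by how many labeled pairs it classifies consistently.

```python
from itertools import product


def confusion_level(total, thresholds=(0.5, 1.0)):
    """Map a total similarity to level 1, 2 or 3 (thresholds are assumed)."""
    return 1 + sum(total >= t for t in thresholds)


def grid_search(samples, grid):
    """samples: list of (coding_similarity, font_similarity, labeled_level).
    Returns the first weight pair with the most consistent levels."""
    best_weights, best_hits = None, -1
    for w_coding, w_font in product(grid, repeat=2):
        hits = sum(
            confusion_level(w_coding * coding + w_font * font) == level
            for coding, font, level in samples
        )
        if hits > best_hits:
            best_weights, best_hits = (w_coding, w_font), hits
    return best_weights, best_hits


# Hypothetical labeled pairs: (coding similarity, font similarity, level).
samples = [(0.9, 0.9, 3), (0.1, 0.1, 1)]
print(grid_search(samples, [0.25, 0.5, 0.75, 1.0]))
```

Exhaustive traversal is feasible here because the grid is small; a finer grid or more weights would grow the search space multiplicatively.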

140. A set of confusion words is determined from the plurality of candidate confusion words based on the overall similarity.

The confusion word set refers to a set of Chinese characters formed by similar candidate confusion words. In some embodiments, the confusion word set comprises at least one confusion word pair, and each confusion word pair comprises at least two similar candidate confusion words; for example, the confusion word set may comprise a first confusion word pair "Teng, striae", a second confusion word pair "say", and a third confusion word pair.

For example, any two candidate confusion words whose total similarity, obtained from the coding similarity and the font similarity, is greater than a preset similarity threshold may be taken as similar target confusion words, where the preset similarity threshold may be set according to the application scenario or actual requirements. If the total similarity between a confusion word and each of several other confusion words is greater than the preset similarity threshold, all of these confusion words are considered similar target confusion words. For instance, if the total similarity 1.7896 of "Teng" and "Teng" among the confusion words "Teng", "striae" is greater than the preset similarity threshold 0.65, then "Teng" and "Teng" are considered similar target confusion words and can be added to the confusion word set as the confusion word pair <Teng, Teng>.
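Selecting confusion word pairs by threshold can be sketched as follows. The threshold 0.65 and the score 1.7896 come from the example above; the candidate labels `A`, `B`, `C`, the other scores and the function name are hypothetical.

```python
from itertools import combinations


def build_confusion_pairs(candidates, similarity, threshold=0.65):
    """Keep every pair whose total similarity exceeds the preset threshold."""
    return [(a, b) for a, b in combinations(candidates, 2)
            if similarity(a, b) > threshold]


# Hypothetical precomputed total similarities for three candidates.
scores = {
    frozenset(("A", "B")): 1.7896,
    frozenset(("A", "C")): 0.40,
    frozenset(("B", "C")): 0.65,
}
pairs = build_confusion_pairs(["A", "B", "C"],
                              lambda a, b: scores[frozenset((a, b))])
print(pairs)
```

The comparison is strict, so a pair whose total similarity equals the threshold is not added.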

150. Correcting the text to be corrected by means of the confusion word set.

The text to be corrected refers to the text to be recognized and corrected. In some embodiments, the text to be corrected may be language material in a target scene, which may be various scenes suitable for text correction, such as, but not limited to, application scenes including instant messaging, knowledge questions and answers, text retrieval, and the like.

For example, taking an instant messaging application as the target scene, the confusion word set of the embodiment of the application can be stored in an error correction module of the instant messaging application. When a user inputs a short text in the instant messaging application, and provided that the user has separately granted permission, the confusion word set in the error correction module can be called to identify wrongly written words in the text to be corrected and to correct them, and the corrected text is then displayed in the instant messaging application. It should be noted that the text error correction method of the embodiment of the application may be implemented in a server or in a terminal.

It can be understood that when the text error correction method of the embodiment of the application is applied, the confusion word set can be called directly to correct the text to be corrected, and the confusion word set can be used to match the Chinese characters in the text to be corrected accurately, which improves error correction efficiency. In addition, since the confusion word set contains confusion words that are similar in coding and font, errors arising from various causes, such as spelling errors and writing errors, can be identified, which improves the accuracy of text error correction.

In some embodiments, the text to be corrected can be screened based on its text semantics to obtain the word to be corrected, which reduces the number of words to be processed and improves the efficiency of text error correction. Specifically, correcting the text to be corrected by means of the confusion word set includes:

determining a word to be corrected from the text to be corrected;

determining a target confusion word pair corresponding to the word to be corrected from the confusion word set;

and correcting the error correction words in the text to be corrected through the target confusion word pairs.

For example, a text to be corrected is used for 'traveling in love', the text can be converted into a text sequence, the text sequence is input into a natural language processing (Natural Language Processing, NLP) model such as a BERT model for text semantic analysis, a to-be-corrected word 'traveling' is determined, a target confusion word pair containing 'traveling' is obtained, a pair of lovers, donkey and a whirl 'are respectively replaced by the confusion word in the target confusion word pair, text semantic analysis is respectively carried out on each replaced text to determine the matching degree of the replaced confusion word and the to-be-corrected text, and a' pair 'with the highest matching degree is taken as a target confusion word, so that the' traveling in love 'is corrected into a lover' in love based on the target confusion word.

In some embodiments, before the word to be corrected is determined, word segmentation may be performed on the text to be corrected, so that the word to be corrected is determined based on a first word segmentation result. Word segmentation may also be performed on the text after replacement with the target confusion words, so that the word to be corrected is corrected based on a second word segmentation result. For example, the text to be corrected "love travel in love" may be segmented into "in/love/middle/emotion/travel" (i.e., the first word segmentation result), and a dictionary query on the segmentation result finds that "travel" is a wrong word (i.e., the word to be corrected). The "travel" in the text to be corrected can then be replaced by each confusion word in the target confusion word pairs to obtain three replaced texts, containing "lover", "donkey" and "rotation" respectively. Word segmentation is performed on each of the three replaced texts (i.e., the second word segmentation result), and the dictionary query finds no wrong word in the text containing "lover" but finds the wrong words "donkey" and "rotation" in the other two texts. The correct error correction result for the text to be corrected can therefore be determined to be "lovers in love".
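The dictionary-based variant described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the forward-maximum-matching segmenter, the function names, the tiny lexicon, and the Chinese strings are all hypothetical stand-ins chosen for the sketch and are not quoted from the application.

```python
from typing import Dict, List, Set

# Hypothetical toy resources (assumptions made for this sketch only).
LEXICON: Set[str] = {"热恋", "中", "的", "情侣"}
CONFUSION_SET: Dict[str, List[str]] = {"旅": ["侣", "驴", "旋"]}


def segment(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching; unmatched characters become one-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def unknown_tokens(tokens: List[str], lexicon: Set[str]) -> List[str]:
    """Tokens that the dictionary query does not recognize (suspected wrong words)."""
    return [tok for tok in tokens if tok not in lexicon]


def correct(text: str, lexicon: Set[str], confusion: Dict[str, List[str]]) -> str:
    """Replace suspect characters with confusion words; keep the text with fewest unknown tokens."""
    suspects = unknown_tokens(segment(text, lexicon), lexicon)  # first segmentation result
    best, best_unknown = text, len(suspects)
    for i, ch in enumerate(text):
        if not any(ch in tok for tok in suspects):
            continue
        for candidate in confusion.get(ch, []):
            replaced = text[:i] + candidate + text[i + 1:]
            # second segmentation result, computed on the replaced text
            n_unknown = len(unknown_tokens(segment(replaced, lexicon), lexicon))
            if n_unknown < best_unknown:
                best, best_unknown = replaced, n_unknown
    return best


print(correct("热恋中的情旅", LEXICON, CONFUSION_SET))
```

In a semantic-analysis variant, the count of unknown tokens would be replaced by a matching score from a language model such as BERT; the replacement loop stays the same.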

The text error correction scheme provided by the embodiment of the application can be applied to various application scenes. For example, in an instant messaging scene, the scheme extracts the coding features and font features of a plurality of candidate confusion words; determines the coding similarity corresponding to the coding features of any two candidate confusion words and the font similarity corresponding to the font features of any two candidate confusion words; weights the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words; determines a confusion word set from the plurality of candidate confusion words according to the total similarity; and corrects the text to be corrected by the confusion word set.

As can be seen from the foregoing, the embodiment of the present application is based on multi-dimensional features of the candidate confusion words in terms of coding and font. Through the weighting process, the influence of the different features is integrated, and the coding similarity and the font similarity are fused to measure how similar the candidate confusion words are in coding and font, so that the constructed confusion word set contains a plurality of confusion words that are similar in both coding and font and is therefore more comprehensive and accurate. When the method is applied, the confusion word set can be called directly to correct the text to be corrected, and the Chinese characters in the text to be corrected can be matched accurately against the confusion word set, which improves error correction efficiency. In addition, because the confusion word set contains a plurality of confusion words that are similar in coding and font, errors arising from various causes, such as spelling errors and writing errors, can be identified, which improves the accuracy of text error correction.

The method described in the above embodiments will be described in further detail below.

In this embodiment, a method according to an embodiment of the present application will be described in detail by taking an application to text retrieval as an example.

As shown in fig. 2a, a text error correction method is applied to a server, and the text error correction method specifically comprises the following steps:

210. Extract the coding features and the font features of a plurality of candidate confusion words.

The coding features comprise stroke sequence features, pinyin sequence features and four-corner coding sequence features. The glyph features include radical features and display features.

Chinese text error correction aims to identify and detect user input that may contain errors, and to modify and correct the detected errors by various methods so as to provide a better user experience. The error types that text error correction focuses on differ between business scenes. For example, a pinyin input method focuses on near-pronunciation errors; handwriting input methods and scenes such as OCR focus on font errors; and a search engine focuses on all error types, including near-pronunciation, near-shape, disordered characters, missing words, redundant words, and grammar errors. For error types such as near-pronunciation and near-shape, most methods rely on confusion set resources.

According to statistics, 76% of incorrect Chinese characters are associated with pronunciation similarity, 46% are associated with font similarity, and 29% are associated with both. Existing confusion set mining schemes are likewise designed from the perspective of similar pronunciation and similar shape of Chinese characters, but most of them consider only one-sided features when using pronunciation and shape features, and are therefore not suitable for judging confusable characters in all scenes. For example, the three Chinese characters {direction, what, rics} are similar in shape, yet when only a single kind of feature (such as the radical structure or the strokes) is considered, the confusability among the three characters is found to be inconsistent: (1) considering radical and glyph features: the Chinese characters <what, rics> both have a left-right structure and both have the single-person radical, whereas the character "direction" differs from them in both structure and components. Judged by this feature, the confusability of <what, rics> is clearly greater than that of <what, direction>; (2) considering stroke edit distance: the edit distance of <what, rics> is 3 and the edit distance of <what, direction> is 2, so in this case the confusability of <what, direction> is greater than that of <what, rics>. Therefore, when a pronunciation-and-shape confusion set is mined using features of only a single level, not all scenes can be handled well, and the mined confusion set is not comprehensive enough. In addition, existing open-source confusion sets suffer from incomplete resources and one-sided confusion information, so model training or further mining based on these open-source sets is prone to errors.

The text error correction method provided by the embodiment of the application adopts a multi-granularity computing scheme for the selection and measurement of pronunciation and shape coding features. Unlike existing schemes, which adopt features of only a single granularity such as components, structure, or stroke coding, it can better mine confusion sets of different levels for different scenes. In the data acquisition stage, features of several granularities are introduced at the same time, including fine-grained stroke sequences (i.e., stroke sequence features), four-corner codes (i.e., four-corner coding sequence features), and component glyph features (i.e., radical features). To simulate visual confusion information, gray images of Chinese characters (display images) are also introduced as coarse-grained glyph features (i.e., display features). Similarly, to characterize near-pronunciation confusion, pinyin sequences and tone features (i.e., pinyin sequence features) are introduced. For example, taking the candidate confusion words <Teng, teng> as an example, multi-granularity features such as the stroke sequence features, pinyin sequence features, four-corner coding sequence features, and radical features of <Teng, teng> are introduced, as shown in fig. 2b.

For the display features, the embodiment of the application introduces multi-modal picture data as the coarse-grained overall glyph feature of Chinese characters. Data is collected from existing font libraries or from Chinese handwriting recognition corpora, a third-party tool library (such as pygame) is used to perform grayscale processing on the Chinese character images, all Chinese character images are converted into grayscale pixel values (i.e., gray images), and the image features of the gray images are used as the display features.

220. Determine the coding similarity corresponding to the coding features of any two candidate confusion words, and determine the font similarity corresponding to the font features of any two candidate confusion words.

The multi-granularity features of Chinese character confusion comprise five kinds of features of different granularities: stroke sequence features, pinyin sequence features, four-corner coding sequence features, radical features, and display features. To obtain information representations at different scales for the features of different granularities, the features are divided into two types for calculation: sequence features (i.e., coding features), which comprise the stroke sequence features, pinyin sequence features, and four-corner coding sequence features; and non-sequence features (i.e., font features), which comprise the radical features and display features. Taking the candidate confusion words <Teng, teng> as an example, multiple calculation indices can be adopted for the multi-granularity features, and information at different scales can be obtained correspondingly under the different granularity features of <Teng, teng>.

For the sequence features, the distance similarity, the common feature similarity, and the feature length similarity can be calculated based on, respectively, the edit distance of <Teng, teng>, the maximum common substring/maximum common subsequence of the sequences, and the sequence length difference. The edit distance is the minimum number of operations required to convert the sequence of one Chinese character into the sequence of another, and represents the degree of difference between the two sequences. Because each edit operation takes a sequence element as its operand, the edit distance can represent fine-grained information between the sequences. The maximum common substring/maximum common subsequence is the intersection of the two Chinese character sequences and represents the substring/subsequence that the two sequences share. The common substring and common subsequence indicate how many sequence elements the two sequences have in common, so this index represents the local similarity of the two sequences. As for the sequence length difference, the stroke sequence and pinyin sequence features of two characters may have unequal lengths during calculation. Therefore, in addition to the above indices, a similarity based on the sequence length difference can be calculated to compare the difference in the number of sequence elements.
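The three kinds of sequence indices described above can be sketched as follows. This is a minimal illustration: the dynamic-programming routines are standard, but the normalization by the longer sequence length, the function names, and the sample pinyin strings are assumptions made for the sketch, since this passage does not fix a normalization.

```python
from typing import Dict, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def longest_common_subsequence(a: Sequence, b: Sequence) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_common_substring(a: Sequence, b: Sequence) -> int:
    """Length of the longest contiguous run shared by both sequences."""
    best, prev = 0, [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            run = prev[j - 1] + 1 if x == y else 0
            cur.append(run)
            best = max(best, run)
        prev = cur
    return best


def sequence_similarities(a: Sequence, b: Sequence) -> Dict[str, float]:
    """Assumed normalization: every index is scaled by the longer sequence length."""
    longest = max(len(a), len(b)) or 1
    return {
        "distance": 1 - edit_distance(a, b) / longest,
        "common_substring": longest_common_substring(a, b) / longest,
        "common_subsequence": longest_common_subsequence(a, b) / longest,
        "length": 1 - abs(len(a) - len(b)) / longest,
    }


# Hypothetical pinyin-with-tone sequences for two characters.
print(sequence_similarities("zhang1", "zang1"))
```

The same functions apply unchanged to stroke sequences and four-corner codes, since they only require indexable sequences of comparable elements.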

For the non-sequence features, the radical features of <Teng, teng> can be used directly as a supplementary similarity score. For example, when two Chinese characters such as <Teng, teng> have the same component structure and the same radical, the similarity of their radical features is directly set to 1. For the gray image similarity calculation, because grayscale processing has been performed, a Chinese character image (gray image) can be regarded as a matrix of pixel values, and a first display similarity, a second display similarity, and a third display similarity can be calculated based on the gray images, thereby obtaining the similarity of the overall glyph.
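The non-sequence similarities can be sketched as follows. This passage does not name the three display similarity measures, so the sketch substitutes two common stand-ins (cosine similarity and a mean-absolute-difference score over the pixel matrices); these measures, the score of 0 for non-matching radicals, and all names and sample matrices are assumptions made for illustration.

```python
import math
from typing import List, Tuple

GrayImage = List[List[int]]  # matrix of gray pixel values in 0..255


def radical_similarity(a: Tuple[str, str], b: Tuple[str, str]) -> float:
    """1.0 when (structure, radical) match exactly; 0.0 otherwise (assumed)."""
    return 1.0 if a == b else 0.0


def _flatten(img_a: GrayImage, img_b: GrayImage) -> Tuple[List[int], List[int]]:
    va = [p for row in img_a for p in row]
    vb = [p for row in img_b for p in row]
    if len(va) != len(vb):
        raise ValueError("gray images must have the same size")
    return va, vb


def cosine_similarity(img_a: GrayImage, img_b: GrayImage) -> float:
    """Cosine of the angle between the two flattened pixel vectors."""
    va, vb = _flatten(img_a, img_b)
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(va, vb)) / (norm_a * norm_b)


def pixel_agreement(img_a: GrayImage, img_b: GrayImage, max_value: int = 255) -> float:
    """One minus the mean absolute pixel difference, scaled to [0, 1]."""
    va, vb = _flatten(img_a, img_b)
    return 1 - sum(abs(x - y) for x, y in zip(va, vb)) / (max_value * len(va))


# Hypothetical 2x2 gray images standing in for rendered characters.
IMG_A: GrayImage = [[0, 255], [255, 0]]
IMG_B: GrayImage = [[0, 255], [255, 255]]
print(radical_similarity(("left-right", "person"), ("left-right", "person")))
print(cosine_similarity(IMG_A, IMG_B), pixel_agreement(IMG_A, IMG_B))
```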

230. Weight the coding similarity and the font similarity through the first weight and the second weight to obtain the total similarity of any two candidate confusion words.

In the multi-granularity feature fusion stage, feature fusion is performed by weighted average calculation. For different scenes, confusion feature representations of different granularities can be calculated simply by adjusting the feature parameters of the different granularities. Specifically, after the similarity results at the different levels of the multi-granularity features are obtained, the confusion degree index of the Chinese characters (i.e., the total similarity) is calculated in two steps. Step one is multi-level information fusion: the similarity information is grouped by granularity feature, and the similarity information of the different levels under each granularity of <Teng, teng> is fused by linear weighting. Taking the stroke sequence as an example, the edit distance similarity, the maximum common substring similarity, the maximum common subsequence similarity, and the sequence length difference similarity of the stroke sequence features are obtained respectively, and the similarity score of the stroke sequence features (i.e., the weighted coding similarity) is obtained by calculation with the first weight. In step two, the features of different granularities of <Teng, teng> are fused by linear weighting based on the second weight to obtain the similarity score of the multi-granularity feature fusion (i.e., the total similarity).

Finally, the similarity score of the multi-granularity feature fusion of <Teng, teng> can be used as the degree of confusion between the Chinese character pair <Teng, teng>, and confusion set mining is completed based on this similarity score. In addition, the fusion weights of the features of different granularities can be adjusted directly to obtain degrees of confusion with different emphases. For example, a handwriting input method focuses on glyph confusion features, so the weight of the pinyin sequence features can be set to 0 for the calculation.
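The two-step fusion can be sketched as follows. This is a minimal illustration under stated assumptions: all similarity values and weight values below are hypothetical, the same first weights are shared by every sequence feature for brevity, and the fusion is written as a plain linear weighted sum whose weights need not sum to one.

```python
from typing import Dict, Mapping


def linear_fusion(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Linear weighted sum; a name missing from the weights counts as weight 0."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


def total_similarity(
    index_scores: Mapping[str, Mapping[str, float]],
    first_weights: Mapping[str, float],
    glyph_scores: Mapping[str, float],
    second_weights: Mapping[str, float],
) -> float:
    # Step one: fuse the per-index similarities inside each sequence feature.
    feature_scores: Dict[str, float] = {
        feature: linear_fusion(scores, first_weights)
        for feature, scores in index_scores.items()
    }
    # Non-sequence (glyph) features already carry one score each.
    feature_scores.update(glyph_scores)
    # Step two: fuse the per-feature scores into the total similarity.
    return linear_fusion(feature_scores, second_weights)


# Hypothetical similarity values for one character pair.
INDEX_SCORES = {
    "stroke": {"distance": 0.8, "common_substring": 0.6, "common_subsequence": 0.8, "length": 1.0},
    "pinyin": {"distance": 1.0, "common_substring": 1.0, "common_subsequence": 1.0, "length": 1.0},
    "four_corner": {"distance": 0.6, "common_substring": 0.4, "common_subsequence": 0.6, "length": 1.0},
}
GLYPH_SCORES = {"radical": 1.0, "display": 0.9}
FIRST_WEIGHTS = {"distance": 0.25, "common_substring": 0.25, "common_subsequence": 0.25, "length": 0.25}

GENERAL = {"stroke": 0.2, "pinyin": 0.2, "four_corner": 0.2, "radical": 0.2, "display": 0.2}
# Handwriting scene: the pinyin sequence feature weight is set to 0.
HANDWRITING = {"stroke": 0.25, "pinyin": 0.0, "four_corner": 0.25, "radical": 0.25, "display": 0.25}

print(total_similarity(INDEX_SCORES, FIRST_WEIGHTS, GLYPH_SCORES, GENERAL))
print(total_similarity(INDEX_SCORES, FIRST_WEIGHTS, GLYPH_SCORES, HANDWRITING))
```

Keeping the weights in plain mappings means a scene-specific emphasis is only a change of configuration, which matches the stated goal of adjusting feature parameters without changing the calculation.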

240. Determine a confusion word set from the plurality of candidate confusion words based on the total similarity.

After the total similarity of every two candidate confusion words among the plurality of candidate confusion words has been calculated, any two candidate confusion words whose total similarity is greater than a preset similarity threshold can be taken as similar target confusion words, formed into a confusion word pair, and added to the confusion word set. To facilitate comparison, thresholds may be set in the embodiment of the present application for both the weighted coding similarity and the total similarity. For example, a second similarity threshold, such as 0.65, may be set for the weighted coding similarity corresponding to the stroke sequence features or the pinyin sequence features, and a first similarity threshold of 1.5 may be set for the total similarity. That is, when the weighted coding similarity corresponding to the stroke sequence features or the pinyin sequence features of two Chinese characters is greater than 0.65, or their total similarity is greater than 1.5, the two Chinese characters may be considered confusable (i.e., target confusion words). For example, it may first be detected whether the total similarity is greater than the first similarity threshold; if yes, the two Chinese characters are considered confusable; if no, it is detected whether the weighted coding similarity of the stroke sequence features or the pinyin sequence features is greater than the second similarity threshold; if yes, the two Chinese characters are considered confusable, and if no, the two Chinese characters are considered not confusable.
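The threshold check described above can be sketched as follows. The two threshold values are the examples given in the text; the `PairScores` record, the function names, and the placeholder characters and scores are hypothetical and serve only to make the sketch self-contained.

```python
from typing import Dict, FrozenSet, NamedTuple, Set, Tuple

TOTAL_THRESHOLD = 1.5     # first similarity threshold (example value from the text)
FEATURE_THRESHOLD = 0.65  # second similarity threshold (example value from the text)


class PairScores(NamedTuple):
    total: float   # total similarity after multi-granularity fusion
    stroke: float  # weighted coding similarity of the stroke sequence features
    pinyin: float  # weighted coding similarity of the pinyin sequence features


def is_confusable(scores: PairScores) -> bool:
    """Check the total similarity first, then fall back to the per-feature scores."""
    if scores.total > TOTAL_THRESHOLD:
        return True
    return scores.stroke > FEATURE_THRESHOLD or scores.pinyin > FEATURE_THRESHOLD


def build_confusion_set(scored: Dict[Tuple[str, str], PairScores]) -> Set[FrozenSet[str]]:
    """Keep every candidate pair that passes the check as an unordered confusion word pair."""
    return {frozenset(pair) for pair, scores in scored.items() if is_confusable(scores)}


# Hypothetical candidates; "A".."D" stand in for Chinese characters.
SCORED = {
    ("A", "B"): PairScores(total=1.8, stroke=0.4, pinyin=0.3),  # passes on total
    ("A", "C"): PairScores(total=1.2, stroke=0.7, pinyin=0.1),  # passes on stroke
    ("B", "D"): PairScores(total=0.9, stroke=0.5, pinyin=0.6),  # rejected
}
print(build_confusion_set(SCORED))
```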

Errors in user input are of various types, and the emphasis on error types differs between scenes. For example, a pinyin input method focuses more on near-pronunciation errors, handwriting input methods and OCR focus more on font errors, and a search engine focuses on all error types, including near-pronunciation, near-shape, disordered characters, missing words, redundant words, and grammar errors. For errors such as near-pronunciation and near-shape errors, once the wrong words are identified, error correction relies mainly on confusion set resources, so a rich, complete, and accurate confusion set is very important for error correction.

250. Acquire the text to be corrected from the terminal.

After a user inputs a question text through a text retrieval application installed on the terminal, the terminal can, with the user's separate permission, send the question text to the server as the text to be corrected. Alternatively, the terminal can first perform recognition on the question text (such as word segmentation) and send the question text to the server as the text to be corrected only if a word to be corrected is found.

260. Correct the text to be corrected by the confusion word set.

In the text search scenario, as shown in the schematic diagram of search text processing in fig. 2c, the server may perform recognition processing (i.e., query understanding) on the search text (Query) input by the user, such as word segmentation, rewriting, error correction, intention recognition, component analysis, synonyms, word weight, compactness, and unnecessary stay. One or more strategies (such as personalization, graded fine ranking, coarse ranking, and index recall) are then used to recall and rank based on the recognition and understanding results, and a retrieval answer is output to the terminal. Error correction aims to identify and detect user input that may contain errors (the text to be corrected) and to modify and correct the detected errors by various methods; the corrected result mainly acts on recall.

The process of correcting the text to be corrected may be integrated in the error correction model of the error correction module of the server. After the server receives the text to be corrected, the error correction model can be called to identify the wrong words, and error correction is carried out by replacement based on the fragment and word confusion sets; during error correction ranking, candidates are filtered and ranked by means of the confusion similarity features, which reduces the overhead of the ranking model. As shown in fig. 2d, taking the text to be corrected (Query) "love travel in love" as an example, the server may perform word segmentation and error detection on the text to be corrected to determine the word to be corrected, call the confusion word set (word confusion set) to replace the word to be corrected, and determine the error correction result "lovers in love" from the plurality of replaced texts through word segmentation or semantic analysis. After the error correction result is obtained, the server can perform intention recognition on the error correction result, recall and rank based on the recognition result, and output a retrieval answer to the terminal.

In order to evaluate the beneficial effects of the text error correction method of the embodiment of the application, Chinese text error correction data sets are selected for effect evaluation, and the open-source confusion set in the pycorrector tool (a Chinese text error correction tool) is selected for comparison with the confusion set obtained by the embodiment of the application, thereby demonstrating that the text error correction method of the embodiment of the application expands mining capability. The details are as follows:

(I) Evaluation data sets: the data of the Chinese text error correction task SIGHAN (International Chinese word segmentation evaluation) and the Track 1 data of CCL2022-CLTC (Chinese learner text correction task) are selected for evaluation. The SIGHAN data set is publicly available traditional Chinese text error correction task data and comprises three versions: SIGHAN 2013, SIGHAN 2014, and SIGHAN 2015. For the evaluation, the three versions are converted from traditional to simplified Chinese and then merged. The CCL2022-CLTC data comes from Track 1 of the Chinese learner text error correction task, the Chinese spelling check task, and provides spelling error data including near-pronunciation and near-shape errors.

The present evaluation aims to assess the effect of confusion set mining, so confusion set pairs are first extracted from the data sets described above. For example, when the two sentences of a sentence pair provided in the data are of equal length, such as "against …" and "against …", the wrong-character pair at the position where the two sentences differ can be extracted from the sentence pair as a confusion set pair. After this processing, SIGHAN contains 4205 confusion set pairs and CCL2022-CLTC contains 85 confusion set pairs.
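The extraction of confusion set pairs from equal-length sentence pairs can be sketched as follows. The function name and the ASCII sample sentences are hypothetical placeholders; the sketch assumes each sentence pair is given as (erroneous sentence, corrected sentence) and skips pairs of unequal length, which cannot be aligned character by character.

```python
from typing import Iterable, Set, Tuple


def extract_confusion_pairs(sentence_pairs: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Collect (wrong character, correct character) pairs from aligned sentences."""
    pairs: Set[Tuple[str, str]] = set()
    for wrong, right in sentence_pairs:
        if len(wrong) != len(right):
            continue  # unequal lengths: positions do not line up, so skip
        for w, r in zip(wrong, right):
            if w != r:
                pairs.add((w, r))
    return pairs


# Placeholder data; real input would be Chinese sentence pairs from the data sets.
SAMPLE = [("abcd", "abed"), ("xyz", "xyz"), ("short", "longer"), ("abcd", "abed")]
print(extract_confusion_pairs(SAMPLE))
```

Because the result is a set, a wrong-character pair that occurs in several sentences is counted once.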

(II) Evaluation scheme: because mainstream schemes do not provide specific implementation details (including the seed confusion set used, the mining results, and the like), the text error correction method of the embodiment of the application is compared with the near-pronunciation and near-shape dictionaries in the Chinese text error correction tool pycorrector.

(III) Evaluation results: the confusion set matching rate is used as the evaluation index. That is, a confusion set pair is judged to be matched when the pair appears in the near-pronunciation dictionary or the near-shape dictionary, or when its confusion score meets the threshold condition. Finally, the proportion of confusion set pairs in the evaluation data that pass a scheme's judgment condition is counted. The results are shown in Table 5 below:

TABLE 5

Scheme	SIGHAN	CCL2022-CLTC
pycorrector	14.41%	22.35%
Embodiments of the application	82.59%	87.06%

As can be seen from Table 5, the confusion set obtained by the text error correction method of the embodiment of the present application matches 82.59% and 87.06% of the confusion set pairs in the two evaluation data sets, respectively, whereas the confusion set of the existing pycorrector tool (a Chinese text error correction tool) matches only 14.41% and 22.35%, respectively. Compared with the existing Chinese text error correction tool, the text error correction method provided by the embodiment of the application can therefore construct a more comprehensive and accurate confusion set.
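The confusion set matching rate used in this evaluation can be sketched as follows. This is an illustrative sketch only: the matcher functions, the treatment of dictionary entries as unordered pairs, and the placeholder data are assumptions, and the sketch does not reproduce the figures in Table 5.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def matching_rate(pairs: Iterable[Pair], is_matched: Callable[[Pair], bool]) -> float:
    """Share of evaluation pairs that pass a scheme's judgment condition."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if is_matched(pair)) / len(pairs)


def dictionary_matcher(*dictionaries: Set[Pair]) -> Callable[[Pair], bool]:
    """Matched when the pair appears, in either order, in any given dictionary."""
    def matched(pair: Pair) -> bool:
        a, b = pair
        return any((a, b) in d or (b, a) in d for d in dictionaries)
    return matched


def threshold_matcher(scores: Dict[Pair, float], threshold: float) -> Callable[[Pair], bool]:
    """Matched when the pair's confusion score exceeds the threshold."""
    return lambda pair: scores.get(pair, 0.0) > threshold


# Placeholder evaluation pairs and resources (hypothetical values).
EVAL_PAIRS = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
NEAR_SOUND = {("b", "a")}
NEAR_SHAPE = {("c", "d")}
SCORES = {("a", "b"): 1.9, ("c", "d"): 1.6, ("e", "f"): 1.7, ("g", "h"): 0.4}

print(matching_rate(EVAL_PAIRS, dictionary_matcher(NEAR_SOUND, NEAR_SHAPE)))
print(matching_rate(EVAL_PAIRS, threshold_matcher(SCORES, 1.5)))
```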

From the above, in the data acquisition stage, the embodiment of the application introduces multi-granularity, multi-modal features; in the pronunciation-and-shape feature calculation stage, multiple calculation indices are adopted for the multi-granularity features, and information at different scales is obtained correspondingly under the different granularity features; in the multi-granularity feature fusion stage, feature fusion is carried out by weighted average calculation, and for different scenes the confusion feature representations of different granularities can be calculated simply by adjusting the feature parameters of the different granularities. Therefore, the embodiment of the application can construct a more comprehensive and accurate confusion set based on the multi-granularity features, which improves the accuracy of text error correction.

In order to better implement the method, the embodiment of the application also provides a text error correction device which can be integrated in electronic equipment, wherein the electronic equipment can be a terminal, a server and the like. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.

For example, in this embodiment, the text error correction apparatus of the embodiment of the present application will be described in detail, taking the case in which the apparatus is integrated in a server as an example.

For example, as shown in fig. 3, the text error correction apparatus may include a feature extraction unit 310, a similarity determination unit 320, a weighting unit 330, a word set determination unit 340, and an error correction unit 350, as follows:

(I) Feature extraction unit 310

For extracting coding features and glyph features of a plurality of candidate confusion words.

In some embodiments, the coding features include at least one of stroke sequence features, pinyin sequence features, and four-corner coding sequence features, and the feature extraction unit 310 may specifically be configured to:

The code sequence is converted into code features.

In some implementations, the glyph features include radical features, and the feature extraction unit 310 may be specifically configured to:

In some implementations, the glyph features include display features, and the feature extraction unit 310 may be specifically configured to:

(II) Similarity determination unit 320

For determining the coding similarity corresponding to the coding features of any two candidate confusion words and determining the font similarity corresponding to the font features of any two candidate confusion words.

In some embodiments, the glyph similarity includes radical similarity, and the similarity determination unit 320 may be specifically configured to:

In some embodiments, the similarity determination unit 320 may specifically be configured to:

In some embodiments, the coding similarity includes a distance similarity, and the calculating to obtain the coding similarity of any two candidate confusion words according to the first coding length and the second coding length includes:

In some embodiments, the coding similarity includes a common feature similarity, and the calculating to obtain the coding similarity of any two candidate confusion words according to the first coding length and the second coding length includes:

In some embodiments, the coding similarity includes a feature length similarity, and the calculating to obtain the coding similarity of any two candidate confusion words according to the first coding length and the second coding length includes:

(III) Weighting unit 330

For weighting the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words.

In some embodiments, the weighting unit 330 may be specifically configured to:

In some embodiments, the weighting unit 330 may also be configured to:

obtaining the marking grade corresponding to any two candidate confusion words;

(IV) Word set determination unit 340

For determining a set of confusion words from a plurality of candidate confusion words based on the overall similarity.

(V) Error correction unit 350

For correcting the text to be corrected by the set of confusing words.

In a specific implementation, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as one entity or several entities. For the specific implementation of each unit, reference may be made to the foregoing method embodiments, which are not described herein again.

As can be seen from the above, the text error correction apparatus of the present embodiment includes a feature extraction unit, a similarity determination unit, a weighting unit, a word set determination unit, and an error correction unit. The feature extraction unit is used for extracting the coding features and font features of a plurality of candidate confusion words; the similarity determination unit is used for determining the coding similarity corresponding to the coding features of any two candidate confusion words and determining the font similarity corresponding to the font features of any two candidate confusion words; the weighting unit is used for weighting the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words; the word set determination unit is used for determining a confusion word set from the plurality of candidate confusion words according to the total similarity; and the error correction unit is used for correcting the text to be corrected through the confusion word set.

Therefore, through the weighting process, the embodiment of the application fuses the coding similarity and the font similarity to measure how similar the candidate confusion words are in coding and font, so that a more comprehensive and accurate confusion word set is constructed and the accuracy of text error correction using the confusion word set is improved.

The embodiment of the application also provides electronic equipment which can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like; the server may be a single server, a server cluster composed of a plurality of servers, or the like.

In this embodiment, a detailed description is given taking the case in which the electronic device is a server as an example. Fig. 4 shows a schematic structural diagram of the server according to the embodiment of the present application. Specifically:

The server may include components such as a processor 410 having one or more processing cores, a memory 420 comprising one or more computer-readable storage media, a power supply 430, an input module 440, and a communication module 450. Those skilled in the art will appreciate that the server structure shown in fig. 4 does not limit the server, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:

The processor 410 is a control center of the server, connects various parts of the entire server using various interfaces and lines, performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 420, and calling data stored in the memory 420. In some embodiments, processor 410 may include one or more processing cores; in some embodiments, processor 410 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.

The memory 420 may be used to store software programs and modules, and the processor 410 performs various functional applications and data processing by executing the software programs and modules stored in the memory 420. The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the server, and the like. In addition, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 420 may also include a memory controller to provide the processor 410 with access to the memory 420.

The server also includes a power supply 430 that provides power to the various components, and in some embodiments, the power supply 430 may be logically connected to the processor 410 via a power management system, such that charge, discharge, and power consumption management functions are performed by the power management system. Power supply 430 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The server may also include an input module 440, which input module 440 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

The server may also include a communication module 450, and in some embodiments the communication module 450 may include a wireless module, through which the server may wirelessly transmit over short distances, thereby providing wireless broadband internet access to the user. For example, the communication module 450 may be used to assist a user in e-mail, browsing web pages, accessing streaming media, and the like.

Although not shown, the server may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 410 in the server loads executable files corresponding to the processes of one or more application programs into the memory 420 according to the following instructions, and the processor 410 executes the application programs stored in the memory 420, so as to implement various functions as follows:

Extracting coding features and font features of a plurality of candidate confusion words; determining the coding similarity corresponding to the coding features of any two candidate confusion words, and determining the font similarity corresponding to the font features of any two candidate confusion words; weighting the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words; determining a confusion word set from the plurality of candidate confusion words according to the total similarity; and correcting the text to be corrected by means of the confusion word set.

The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
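
As a non-limiting illustration, the operations above can be sketched in Python as follows. The function names, the single scalar `weight`, the `threshold` value, and the idea of listing substitution candidates per character are all assumptions made for this sketch; the two similarity functions are passed in as parameters because their concrete form is described in the previous embodiments.

```python
from itertools import combinations


def build_confusion_set(candidates, coding_sim, font_sim, weight=0.5, threshold=0.6):
    """Pair up candidate words whose weighted total similarity reaches a threshold.

    `weight` and `threshold` are hypothetical parameters chosen for illustration.
    """
    confusion = {word: set() for word in candidates}
    for a, b in combinations(candidates, 2):
        # Weighted fusion of the coding similarity and the font similarity.
        total = weight * coding_sim(a, b) + (1 - weight) * font_sim(a, b)
        if total >= threshold:
            confusion[a].add(b)
            confusion[b].add(a)
    return confusion


def correction_candidates(text, confusion):
    """List (position, original, replacement) substitutions suggested by the set."""
    return [
        (i, ch, alt)
        for i, ch in enumerate(text)
        for alt in sorted(confusion.get(ch, ()))
    ]
```

In this sketch, the substitutions returned by `correction_candidates` would still need to be ranked, for example by a language model, before a correction is chosen; that ranking step is outside the scope of the example.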

As can be seen from the above, in the embodiment of the application, the coding similarity and the font similarity are fused through weighting to measure the similarity of the candidate confusion words in both coding and font, so that a more comprehensive and accurate confusion word set is constructed, thereby improving the accuracy of text error correction using the confusion word set.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.

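To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the text error correction methods provided by embodiments of the present application. For example, the instructions may perform the steps of:

Extracting coding features and font features of a plurality of candidate confusion words; determining the coding similarity corresponding to the coding features of any two candidate confusion words, and determining the font similarity corresponding to the font features of any two candidate confusion words; weighting the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words; determining a confusion word set from the plurality of candidate confusion words according to the total similarity; and correcting the text to be corrected by means of the confusion word set.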

The storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.

According to one aspect of the present application, there is provided a computer program product or computer program comprising computer programs/instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer program/instructions from the computer-readable storage medium, and the processor executes the computer program/instructions to cause the electronic device to perform the methods provided in the various alternative implementations provided in the above-described embodiments.

Because the instructions stored in the storage medium can execute the steps in any text error correction method provided by the embodiments of the present application, they can achieve the beneficial effects of any such method; for details, refer to the previous embodiments, which are not repeated here.

The foregoing has described in detail the text error correction method, apparatus, electronic device, storage medium and program product provided by the embodiments of the present application. Specific examples have been used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to aid understanding of the method and core idea of the present application. Meanwhile, those skilled in the art may, in light of the ideas of the present application, make changes to the specific implementations and the application scope. In summary, the content of this description should not be construed as limiting the present application.

Claims

1. A text error correction method, comprising:

extracting coding features and font features of a plurality of candidate confusion words;

determining the coding similarity corresponding to the coding features of any two candidate confusion words, and determining the font similarity corresponding to the font features of any two candidate confusion words;

weighting the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words;

determining a confusion word set from the plurality of candidate confusion words according to the total similarity;

and correcting the text to be corrected through the confusion word set.

2. The text error correction method of claim 1, wherein the coding features include at least one of stroke sequence features, pinyin sequence features, and four-corner code sequence features, the coding features being obtained by:

determining a coding sequence of the candidate confusion word according to a specified coding rule, wherein the specified coding rule comprises at least one of stroke coding, pinyin coding and four-corner coding;

and converting the coding sequence into the coding features.

3. The text error correction method of claim 1, wherein the font features include component features, the component features being obtained by:

obtaining the component features of the candidate confusion words according to the components of the candidate confusion words.

4. The text error correction method of claim 3, wherein the font similarity includes component similarity, the component similarity being obtained by:

if the component features of any two candidate confusion words meet a similarity condition, taking a first similarity value as the component similarity;

and if the component features of the any two candidate confusion words do not meet the similarity condition, taking a second similarity value as the component similarity.
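
As a non-limiting illustration, the two-valued component similarity can be sketched in Python as follows. The similarity condition used here (the two words share at least one component) and the default values 1.0 and 0.0 are assumptions for this sketch; the claim leaves the condition and both values open.

```python
def component_similarity(components_a, components_b, first_value=1.0, second_value=0.0):
    """Return first_value when the similarity condition holds, else second_value.

    Assumed condition: the two component lists have at least one element in common.
    """
    meets_condition = bool(set(components_a) & set(components_b))
    return first_value if meets_condition else second_value
```

For example, if 清 is split into the components 氵 and 青, and 晴 into 日 and 青, the shared component 青 satisfies the assumed condition and the first similarity value is returned.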

5. The text error correction method of claim 1, wherein the font features include display features, the display features being obtained by:

taking image features of a grayscale image of the candidate confusion word as the display features of the candidate confusion word.

6. The text error correction method of claim 1, wherein the coding similarity is obtained by:

acquiring a first coding length of a first coding feature and a second coding length of a second coding feature, wherein the first coding feature and the second coding feature are the coding features of the arbitrary two candidate confusion words;

and calculating the coding similarity of any two candidate confusion words according to the first coding length and the second coding length.

7. The text error correction method of claim 6, wherein the coding similarity includes a distance similarity, and the calculating the coding similarity of the arbitrary two candidate confusion words according to the first coding length and the second coding length includes:

determining an edit distance that converts the first encoding feature to the second encoding feature;

and calculating the distance similarity of the arbitrary two candidate confusion words according to the editing distance, the first coding length and the second coding length.
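
As a non-limiting illustration, a distance similarity of this kind can be sketched in Python as follows. The edit distance is the standard Levenshtein distance; the normalization `1 - distance / max(length)` is an assumption for this sketch, since the claim only states that the edit distance and the two coding lengths are used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_similarity(first_code: str, second_code: str) -> float:
    """Assumed formula: 1 - edit_distance / max(first length, second length)."""
    longest = max(len(first_code), len(second_code))
    if longest == 0:
        return 1.0  # two empty codes are treated as identical
    return 1.0 - edit_distance(first_code, second_code) / longest
```

Under this assumed formula, the pinyin codes `zhang` and `zhan` differ by one deletion over a maximum length of five, which gives a similarity of 0.8.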

8. The text error correction method of claim 6, wherein the coding similarity includes a common feature similarity, and the calculating the coding similarity of the arbitrary two candidate confusion words according to the first coding length and the second coding length includes:

determining a maximum common encoding feature of the first encoding feature and the second encoding feature;

and calculating the common feature similarity of the arbitrary two candidate confusion words according to the maximum common coding feature, the first coding length and the second coding length.
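
As a non-limiting illustration, a common feature similarity can be sketched in Python as follows. The sketch assumes that the maximum common coding feature is the longest common subsequence of the two code strings and that the similarity is `2 * LCS / (first length + second length)`; both choices are assumptions, since the claim names only the inputs.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two code strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def common_feature_similarity(first_code: str, second_code: str) -> float:
    """Assumed formula: 2 * LCS length / (first length + second length)."""
    total = len(first_code) + len(second_code)
    if total == 0:
        return 1.0  # two empty codes are treated as identical
    return 2.0 * lcs_length(first_code, second_code) / total
```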

9. The text error correction method of claim 6, wherein the coding similarity includes a feature length similarity, and the calculating the coding similarity of the arbitrary two candidate confusion words according to the first coding length and the second coding length includes:

calculating the feature length similarity of any two candidate confusion words according to a coding length difference between the first coding length and the second coding length and a maximum coding length among the first coding length and the second coding length.
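
As a non-limiting illustration, a feature length similarity can be sketched in Python as follows. The formula `1 - |difference| / maximum` is an assumption for this sketch; the claim states only that the length difference and the maximum length are used.

```python
def feature_length_similarity(first_code: str, second_code: str) -> float:
    """Assumed formula: 1 - |len difference| / max(len)."""
    first_len, second_len = len(first_code), len(second_code)
    longest = max(first_len, second_len)
    if longest == 0:
        return 1.0  # two empty codes are treated as identical
    return 1.0 - abs(first_len - second_len) / longest
```

This measure looks only at lengths, so two codes of equal length score 1.0 even when their contents differ, which is why it would normally be combined with the other coding similarities.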

10. The text error correction method of any one of claims 1 to 9, wherein the weighting the coding similarity and the font similarity to obtain the total similarity of any two candidate confusion words comprises:

weighting the coding similarity of the any two candidate confusion words according to a first weight to obtain a weighted coding similarity of the any two candidate confusion words;

and according to a second weight, carrying out weighted processing on the weighted coding similarity and the font similarity of the arbitrary two candidate confusion words to obtain the total similarity of the arbitrary two candidate confusion words.
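
As a non-limiting illustration, the two-stage weighting can be sketched in Python as follows. The sketch assumes that the first weight is a list with one entry per coding similarity (for example distance, common feature and feature length similarity) and that the second weight is a scalar in [0, 1] used as a convex combination; these shapes, and the function name, are assumptions.

```python
def total_similarity(coding_sims, first_weights, font_sim, second_weight):
    """Fuse several coding similarities, then fuse the result with the font similarity."""
    if len(coding_sims) != len(first_weights):
        raise ValueError("one first weight is required per coding similarity")
    # Stage 1: weighted coding similarity from the individual coding similarities.
    weighted_coding = sum(w * s for w, s in zip(first_weights, coding_sims))
    # Stage 2: combine the weighted coding similarity with the font similarity.
    return second_weight * weighted_coding + (1.0 - second_weight) * font_sim
```

If the first weights sum to 1 and every input similarity lies in [0, 1], the total similarity in this sketch also stays in [0, 1], which makes a single threshold usable across word pairs.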

11. The text error correction method of claim 10, wherein the method further comprises:

weighting the coding similarity and the font similarity according to an initial first weight and an initial second weight to obtain initial total similarity of any two candidate confusion words;

obtaining a labeling grade corresponding to the any two candidate confusion words;

and adjusting the initial first weight and the initial second weight according to the labeling grade and a candidate word confusion grade corresponding to the initial total similarity, to obtain the first weight and the second weight.
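
As a non-limiting illustration, one simple way to adjust a weight against labeling grades is a grid search, sketched in Python below. This is a simplification that tunes only a single scalar weight between one coding similarity and one font similarity; the grade thresholds, the three-level grade scale and the grid search itself are assumptions for this sketch, not the adjustment procedure of the claim.

```python
def confusion_grade(total, thresholds=(0.4, 0.7)):
    """Map a total similarity to a grade: 0 (low), 1 (medium) or 2 (high).

    The thresholds are hypothetical values chosen for illustration.
    """
    return sum(total >= t for t in thresholds)


def adjust_weight(pairs, labeling_grades, steps=10):
    """Pick the weight in [0, 1] whose predicted grades best match the labels.

    pairs: list of (coding_similarity, font_similarity), one per word pair.
    labeling_grades: annotated grade for each word pair, same order as pairs.
    """
    best_weight, best_hits = 0.0, -1
    for k in range(steps + 1):
        w = k / steps
        hits = sum(
            confusion_grade(w * coding + (1 - w) * font) == grade
            for (coding, font), grade in zip(pairs, labeling_grades)
        )
        if hits > best_hits:  # keep the first weight that reaches the best score
            best_weight, best_hits = w, hits
    return best_weight
```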

12. A text error correction apparatus, comprising:

a feature extraction unit for extracting coding features and font features of a plurality of candidate confusion words;

a similarity determining unit, configured to determine coding similarities corresponding to the coding features of any two candidate confusion words, and determine font similarities corresponding to the font features of any two candidate confusion words;

the weighting unit is used for carrying out weighting processing on the coding similarity and the font similarity to obtain the total similarity of the arbitrary two candidate confusion words;

a word set determining unit, configured to determine a confusion word set from the plurality of candidate confusion words according to the total similarity;

and the error correction unit is used for correcting the text to be corrected through the confusion word set.

13. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps in the text error correction method as claimed in any one of claims 1 to 11.

14. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the text error correction method of any of claims 1 to 11.

15. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the text error correction method of any of claims 1 to 11.