CN108280051A - Method, apparatus, and device for detecting erroneous characters in text data - Google Patents

Method, apparatus, and device for detecting erroneous characters in text data

Info

Publication number
CN108280051A
Authority
CN
China
Prior art keywords
character
similar
text data
detected
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810067388.5A
Other languages
Chinese (zh)
Other versions
CN108280051B (en
Inventor
刘英博
王建民
张育萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201810067388.5A
Publication of CN108280051A
Application granted
Publication of CN108280051B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting

Abstract

The present invention provides a method, apparatus, and device for detecting erroneous characters in text data. The method includes: counting the occurrences of each character in the text data to be checked and obtaining the target characters that occur frequently in that text data; obtaining, from a pre-built error-prone character library, the similar character set that contains a target character, where the similar character set includes similar characters whose shape resembles that of the target character; and, if the number of occurrences of a similar character in the text data to be checked is greater than zero and less than a preset threshold, confirming that the similar character in the text data to be checked is an erroneous character. By obtaining the target characters that occur frequently in the text and judging whether characters of similar shape that appear in the text are erroneous, the invention takes full account of the shape-similar erroneous characters produced during manual data entry, detects erroneous characters in text data effectively, replaces manual correction, and improves detection efficiency.

Description

Method, apparatus, and device for detecting erroneous characters in text data
Technical Field
The present invention relates to the technical field of text recognition, and in particular to a method, apparatus, and device for detecting erroneous characters in text data.
Background
Society is becoming more information-driven by the day, and nearly every social activity can be converted into data and stored in a database. Besides data that computers generate automatically, such as log data and behavioral data, a large amount of data still cannot be generated automatically and has to be entered into systems by hand. Text data is a typical example. Typing text into a computer is part of most people's life and work: maintenance staff fill in a maintenance log after every service call, and finance staff record the destination and content of every expense.
Data that cannot be generated automatically poses challenges for text processing. Slips are inevitable during data entry, and the wrongly entered characters are often homophones or look-alikes of the correct characters. Look-alike characters are one of the main sources of such errors. A computer's character repertoire contains many characters of similar shape. Some pairs have the same meaning but entirely different encodings; for example, Arabic numerals and English letters each have a half-width and a full-width form. Other pairs are similar in shape but differ in meaning; for example, the repertoire contains many characters that resemble the Arabic numeral "1", including the Chinese character "Shu" and the English letter "I". Without a clear convention, a person entering information is likely to choose either the half-width or the full-width form at will, or to type a look-alike character by mistake. When text data from several sources is pooled, erroneous characters and inconsistent formatting inevitably appear in many places.
Besides input errors, regional and cultural differences also lead to inconsistent character formats. For example, Japanese text commonly uses full-width digits and English letters, while Chinese text commonly uses half-width digits and English letters. When text recorded under the two conventions is aggregated, half-width and full-width characters become mixed, and the resulting format inconsistencies leave the documents in disorder.
The ambiguity introduced by erroneous characters therefore makes text data very difficult to organize and analyze statistically. In the prior art, large volumes of logs or text data usually have to be checked by hand to unify formats or correct ambiguous characters. Such tedious work wastes human resources and is inefficient.
Summary of the Invention
In view of the above defects in the prior art, the present invention provides a method, apparatus, and device for detecting erroneous characters in text data.
One aspect of the present invention provides a method for detecting erroneous characters in text data, including: counting the occurrences of each character in the text data to be checked and obtaining the target characters that occur frequently in that text data; obtaining, from a pre-built error-prone character library, the similar character set that contains a target character, where the similar character set includes similar characters whose shape resembles that of the target character; and, if the number of occurrences of a similar character in the text data to be checked is greater than zero and less than a preset threshold, confirming that the similar character in the text data to be checked is an erroneous character.
After the step of confirming that the similar character in the text data to be checked is an erroneous character, the method further includes: obtaining the number of occurrences in the text data to be checked of each character in the similar character set to which the erroneous character belongs, and correcting the erroneous character to the character with the most occurrences.
Before the step of obtaining, from the pre-built error-prone character library, the similar character set that contains the target character, the method further includes: obtaining a character set and applying size normalization to the image data corresponding to each character in the character set; obtaining the shape similarity between characters from the image data corresponding to each character; and clustering the characters according to the shape similarity between them to obtain similar character sets. The similarity between any two characters in a similar character set is greater than a preset similarity, and the error-prone character library includes at least one similar character set.
The step of obtaining the shape similarity between characters specifically includes: computing the similarity between characters with each of several similarity calculation methods; and obtaining the shape similarity between characters from the weight assigned in advance to each similarity calculation method and the similarity obtained by each method.
The several similarity calculation methods include pixel-by-pixel comparison, projection-block comparison, and aspect-ratio matching.
Before the step of applying size normalization to the image data corresponding to each character in the character set, the method further includes: recording meta-information for the image data corresponding to each character, the meta-information including the aspect ratio of the image data. Correspondingly, the step of computing the similarity between characters with aspect-ratio matching specifically includes: comparing the aspect ratios recorded in the meta-information of the characters' image data to obtain the similarity for aspect-ratio matching.
The step of obtaining the target characters that occur frequently in the text data to be checked specifically includes: sorting the characters by number of occurrences in descending order and taking the characters within a preset leading proportion of the ranking as target characters, and/or taking the characters whose number of occurrences exceeds a preset count as target characters.
Another aspect of the present invention provides an apparatus for detecting erroneous characters in text data, including: a statistics module, configured to count the occurrences of each character in the text data to be checked and obtain the target characters that occur frequently in that text data; an acquisition module, configured to obtain, from a pre-built error-prone character library, the similar character set that contains a target character, where the similar character set includes similar characters whose shape resembles that of the target character; and a confirmation module, configured to confirm that a similar character in the text data to be checked is an erroneous character if its number of occurrences in the text data to be checked is greater than zero and less than a preset threshold.
Another aspect of the present invention provides a device for detecting erroneous characters in text data, including: at least one processor; and at least one memory communicatively connected to the processor. The memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method for detecting erroneous characters in text data provided by the above aspect of the present invention, for example including: counting the occurrences of each character in the text data to be checked and obtaining the target characters that occur frequently in that text data; obtaining, from a pre-built error-prone character library, the similar character set that contains a target character, where the similar character set includes similar characters whose shape resembles that of the target character; and, if the number of occurrences of a similar character in the text data to be checked is greater than zero and less than a preset threshold, confirming that the similar character in the text data to be checked is an erroneous character.
Another aspect of the present invention provides a non-transitory computer-readable storage medium that stores computer instructions. The computer instructions cause a computer to perform the method for detecting erroneous characters in text data provided by the above aspect of the present invention, for example including: counting the occurrences of each character in the text data to be checked and obtaining the target characters that occur frequently in that text data; obtaining, from a pre-built error-prone character library, the similar character set that contains a target character, where the similar character set includes similar characters whose shape resembles that of the target character; and, if the number of occurrences of a similar character in the text data to be checked is greater than zero and less than a preset threshold, confirming that the similar character in the text data to be checked is an erroneous character.
The method, apparatus, and device for detecting erroneous characters in text data provided by the present invention obtain the target characters that occur frequently in the text and judge whether characters of similar shape that appear in the text are erroneous. This takes full account of the shape-similar erroneous characters produced during manual data entry, detects erroneous characters in text data effectively, replaces manual correction, reduces labor cost, and improves detection efficiency.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the method for detecting erroneous characters in text data provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a character before and after size normalization in the method for detecting erroneous characters in text data provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of the apparatus for detecting erroneous characters in text data provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of the device for detecting erroneous characters in text data provided by an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flow diagram of the method for detecting erroneous characters in text data provided by an embodiment of the present invention. As shown in Fig. 1, the method includes: Step 101, counting the occurrences of each character in the text data to be checked and obtaining the target characters that occur frequently in that text data; Step 102, obtaining, from a pre-built error-prone character library, the similar character set that contains a target character, where the similar character set includes similar characters whose shape resembles that of the target character; Step 103, if the number of occurrences of a similar character in the text data to be checked is greater than zero and less than a preset threshold, confirming that the similar character in the text data to be checked is an erroneous character.
In step 101, the frequency of every character in the text data to be checked is counted first; the frequency of a character is the rate or number of times it occurs in the text data to be checked. Target characters are obtained from the occurrence counts: a target character is a character that occurs many times in the text to be checked. Alternatively, a target character may be a character that the user has designated as frequently mistyped.
Frequently occurring characters are chosen as target characters for the following reason. If a character does not occur often in the document, occurrence counts cannot support the conclusion that a similarly shaped character appearing in the document is an entry error. Characters that occur more often are more likely to be the ones being entered, so characters that resemble them are more likely to have been typed by mistake; and a character entered by mistake is very unlikely to be a frequent character itself.
In step 102, the target character obtained in step 101 is looked up in the error-prone character library to obtain its similar character set. The similar character set includes multiple characters, each of which strongly resembles the target character in shape. A user entering text data may therefore type one of these similar characters into the text data to be checked in place of the target character.
In step 103, the similar characters in the set obtained in step 102 are searched for in the text data to be checked (for example, by looking up each character's hash value). If a similar character appears in the text to be checked and its number of occurrences in the document is less than the preset threshold (for example, 0.1% of the total number of characters in the text data), the similar character appearing in the text is considered an erroneous or ambiguous character.
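Steps 101 to 103 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `detect_error_characters`, the representation of the error-prone character library as a list of Python sets, and the default ratios (taken from the 5% and 0.1% examples in the text) are all assumptions.

```python
from collections import Counter


def detect_error_characters(text, error_prone_library, top_ratio=0.05, rare_ratio=0.001):
    """Return {suspect_char: target_char} for look-alikes that occur rarely.

    error_prone_library is assumed to be a list of sets of look-alike characters.
    """
    counts = Counter(text)
    if not counts:
        return {}
    # Step 101: the most frequent characters become target characters.
    ranked = [ch for ch, _ in counts.most_common()]
    n_targets = max(1, int(len(ranked) * top_ratio))
    targets = ranked[:n_targets]
    # Step 103 threshold, expressed as a fraction of the total character count.
    threshold = len(text) * rare_ratio
    suspects = {}
    for target in targets:
        # Step 102: find the similar character sets that contain the target.
        for similar_set in error_prone_library:
            if target not in similar_set:
                continue
            for ch in similar_set:
                # Step 103: present in the text, but rarer than the threshold.
                if ch != target and 0 < counts[ch] < threshold:
                    suspects[ch] = target
    return suspects
```

A character that never occurs in the text has a count of zero and is not reported, which matches the "greater than zero" condition in step 103.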
The method for detecting erroneous characters in text data provided by this embodiment obtains the target characters that occur frequently in the text and judges whether characters of similar shape that appear in the text are erroneous. This takes full account of the shape-similar erroneous characters produced during manual data entry, detects erroneous characters in text data effectively, replaces manual correction, reduces labor cost, and improves detection efficiency.
On the basis of any of the above embodiments, after the step of confirming that the similar character in the text data to be checked is an erroneous character, the method further includes: obtaining the number of occurrences in the text data to be checked of each character in the similar character set to which the erroneous character belongs, and correcting the erroneous character to the character with the most occurrences.
Specifically, once an erroneous character has been confirmed in the text data to be checked, it can be assumed to be a character of similar shape that was typed by hand in place of the target character. The occurrences in the text data to be checked of the characters that resemble the erroneous character (the other characters in the same similar character set) can therefore be counted, and the erroneous character is taken to stand for the similar character with the most occurrences.
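The correction step can be sketched as below. The function name `correct_error_character` is hypothetical, and replacing every occurrence with `str.replace` is an assumed, simple way to apply the correction; the patent text only states that the erroneous character is corrected to the most frequent member of its similar character set.

```python
from collections import Counter


def correct_error_character(text, error_char, similar_set):
    """Replace error_char with the most frequent character of its similar set."""
    counts = Counter(text)
    # Sorting first makes the choice deterministic when two counts tie.
    best = max(sorted(similar_set), key=lambda ch: counts[ch])
    if best == error_char:
        return text  # the suspect is itself the most frequent form; nothing to do
    return text.replace(error_char, best)
```

For example, with the similar set `{"1", "l", "I"}`, the text `"11111l1"` becomes `"1111111"`.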
On the basis of any of the above embodiments, before the step of obtaining, from the pre-built error-prone character library, the similar character set that contains the target character, the method further includes: obtaining a character set and applying size normalization to the image data corresponding to each character in the character set; obtaining the shape similarity between characters from the image data corresponding to each character; and clustering the characters according to the shape similarity between them to obtain similar character sets. The similarity between any two characters in a similar character set is greater than a preset similarity, and the error-prone character library includes at least one similar character set.
A character set includes multiple characters; common character sets that cover Chinese include Unicode, BIG5, and GB2312. Unicode includes multiple glyph forms of the same Chinese character; for example, the Chinese character "family" has two forms, "family" and "Kobe", which provides the basis for the subsequent clustering of similar characters.
Specifically, before text detection is carried out, the error-prone character library must first be created. A character set is selected, and then, to simplify the similarity computation, the image data of the characters is brought to a uniform size. As shown in Fig. 2, for example, the image data of each character in the character set is given an aspect ratio of 1:1 and stretched or compressed to a 100×100-pixel image. The image can be viewed as a 100×100 square frame into which the glyph is fitted tightly; this is the size normalization.
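A minimal sketch of the size normalization follows. It assumes each glyph is available as a list of rows of 0/1 values (1 for ink), crops to the ink bounding box so the glyph fills the frame, and uses nearest-neighbour sampling. The bitmap representation, the cropping step, the sampling method, and the name `normalize_bitmap` are assumptions; the text specifies only the 1:1 aspect ratio and the 100×100 target size.

```python
def normalize_bitmap(bitmap, size=100):
    """Crop a 0/1 bitmap to its ink bounding box and stretch it to size x size."""
    if not bitmap or not bitmap[0]:
        return [[0] * size for _ in range(size)]
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    if not rows or not cols:
        return [[0] * size for _ in range(size)]  # blank glyph
    top, left = rows[0], cols[0]
    height = rows[-1] - top + 1
    width = cols[-1] - left + 1
    # Nearest-neighbour sampling: map each output pixel back to a source pixel.
    return [
        [bitmap[top + y * height // size][left + x * width // size] for x in range(size)]
        for y in range(size)
    ]
```

Note that a thin vertical stroke is stretched to fill the whole frame, which is why narrow and wide glyphs can become indistinguishable after this step.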
The shape similarity between each pair of characters is then obtained from the image data, and the characters are clustered according to that similarity. When the similarity of two characters exceeds the preset similarity (for example, 90%), the two characters are considered to belong to the same class, that is, they are assigned to the same similar character set. Digits, letters, and common Chinese characters are clustered in this way: the characters whose similarity to them is 90% or higher are found and grouped into one class, which finally yields the error-prone character library.
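The text does not name a clustering algorithm. The sketch below uses a simple greedy pass, chosen here as an assumption because it guarantees the stated property that every pair of characters in a set exceeds the preset similarity. The name `cluster_similar_characters` and the `similarity` callback are hypothetical.

```python
def cluster_similar_characters(chars, similarity, threshold=0.9):
    """Greedily group characters so that every pair in a group exceeds threshold.

    similarity(a, b) is assumed to return a shape similarity in [0, 1].
    """
    clusters = []
    for ch in chars:
        for cluster in clusters:
            # Join a cluster only if ch is similar to every current member.
            if all(similarity(ch, other) > threshold for other in cluster):
                cluster.append(ch)
                break
        else:
            clusters.append([ch])
    # Characters with no look-alikes are left out of the library.
    return [set(cluster) for cluster in clusters if len(cluster) > 1]
```

The result depends on the order in which characters are visited; a production library might prefer a standard complete-linkage clustering instead.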
On the basis of any of the above embodiments, the step of obtaining the shape similarity between characters specifically includes: computing the similarity between characters with each of several similarity calculation methods; and obtaining the shape similarity between characters from the weight assigned in advance to each similarity calculation method and the similarity obtained by each method.
Specifically, when the shape similarity is computed, several calculation methods are used and each is assigned a corresponding weight. The sum of the products of each method's weight and its result is the final shape similarity between two characters. Using several similarity calculation methods lets the final shape similarity take various factors into account and reflect accurately how alike the characters are.
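The weighted combination is a plain weighted sum, sketched below. The dictionary layout, the method names used as keys, and the function name `combined_similarity` are assumptions, and the source does not give weight values.

```python
def combined_similarity(scores, weights):
    """Sum of weight * score over all similarity calculation methods.

    scores and weights are dicts keyed by method name (hypothetical keys).
    """
    return sum(weights[name] * scores[name] for name in weights)
```

If each score lies in [0, 1] and the weights sum to 1, the combined value also lies in [0, 1], which keeps it comparable with a preset similarity such as 90%.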
On the basis of any of the above embodiments, the several similarity calculation methods include pixel-by-pixel comparison, projection-block comparison, and aspect-ratio matching.
Specifically, this embodiment combines pixel-by-pixel comparison, projection-block comparison, and aspect-ratio matching, assigns a weight to each of the three methods, and adds the weighted results to obtain the final similarity. The basic idea of pixel-by-pixel comparison is to compare each pixel of the normalized image data of two characters and divide the number of matching pixels by the total number of pixels to obtain the similarity. The basic idea of projection-block comparison is to compute the sum of black pixels in every row and every column of a character's normalized image data and to match the similarity of two characters on that basis.
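The two image-based measures can be sketched as follows, assuming both glyphs are equal-sized bitmaps given as lists of rows of 0/1 values. The pixel measure follows the description directly. The text does not say how the row and column sums are compared, so the projection measure below uses one minus a normalized L1 distance between the profiles as an assumed metric, and it does not model any block subdivision. All function names are hypothetical.

```python
def pixel_similarity(a, b):
    """Fraction of positions at which two equal-sized 0/1 bitmaps agree."""
    total = len(a) * len(a[0])
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def projections(bitmap):
    """Black-pixel counts for every row, followed by counts for every column."""
    row_sums = [sum(row) for row in bitmap]
    col_sums = [sum(col) for col in zip(*bitmap)]
    return row_sums + col_sums


def projection_similarity(a, b):
    """1 minus the normalized L1 distance between projection profiles (assumed metric)."""
    pa, pb = projections(a), projections(b)
    # Each row sum is at most the width and each column sum at most the height.
    max_diff = 2 * len(a) * len(a[0])
    diff = sum(abs(x - y) for x, y in zip(pa, pb))
    return 1 - diff / max_diff
```

The projection measure is coarser than the pixel measure: two glyphs with the same row and column sums score 1.0 even when individual pixels differ, which is one reason for combining the methods.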
On the basis of any of the above embodiments, before the step of applying size normalization to the image data corresponding to each character in the character set, the method further includes: recording meta-information for the image data corresponding to each character, the meta-information including the aspect ratio of the image data. Correspondingly, the step of computing the similarity between characters with aspect-ratio matching specifically includes: comparing the aspect ratios recorded in the meta-information of the characters' image data to obtain the similarity for aspect-ratio matching.
Specifically, size normalization makes some characters indistinguishable. For example, the uppercase English letter Z and the lowercase letter z have the same shape after normalization and cannot be told apart. Therefore, before normalization, the meta-information of each character's original image data must be recorded, such as the character's original aspect ratio and original bitmap. In the aspect-ratio matching computation, the aspect ratios recorded in the meta-information of the two characters are compared, which compensates for the character information lost during normalization.
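A sketch of recording the aspect ratio before normalization and comparing two recorded ratios is given below. The text does not define how two ratios are turned into a similarity; dividing the smaller ratio by the larger is an assumption, as are the names `record_meta` and `aspect_ratio_similarity` and the 0/1 bitmap representation.

```python
def record_meta(bitmap):
    """Record the width/height ratio of a glyph's ink bounding box."""
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for c, col in enumerate(zip(*bitmap)) if any(col)]
    if not rows:
        raise ValueError("bitmap has no ink")
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return {"aspect_ratio": width / height}


def aspect_ratio_similarity(meta_a, meta_b):
    """Smaller aspect ratio divided by the larger, giving a value in (0, 1]."""
    ra, rb = meta_a["aspect_ratio"], meta_b["aspect_ratio"]
    return min(ra, rb) / max(ra, rb)
```

Two glyphs with the same proportions score 1.0, while a tall narrow stroke compared with a square glyph scores low even if the two look identical after normalization.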
On the basis of any of the above embodiments, the step of obtaining the target characters that occur frequently in the text data to be checked specifically includes: sorting the characters by number of occurrences in descending order and taking the characters within a preset leading proportion of the ranking as target characters, and/or taking the characters whose number of occurrences exceeds a preset count as target characters.
Specifically, once the occurrences of all characters have been counted, the characters are sorted by count in descending order and those within the leading proportion of the ranking (for example, 5%) are taken as target characters; alternatively, characters whose count exceeds a certain value, for example 500, are taken as target characters.
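The two selection rules can be sketched as below. Treating "and/or" as the union of whichever rules are enabled is an assumption, as are the function name `select_target_characters` and the choice to round the leading proportion up to a whole number of characters.

```python
import math
from collections import Counter


def select_target_characters(text, top_ratio=None, min_count=None):
    """Pick target characters by leading proportion and/or by minimum count."""
    ranked = Counter(text).most_common()  # sorted by count, descending
    targets = set()
    if top_ratio is not None:
        # Characters within the leading proportion of the ranking.
        k = math.ceil(len(ranked) * top_ratio)
        targets.update(ch for ch, _ in ranked[:k])
    if min_count is not None:
        # Characters whose count exceeds the preset number of occurrences.
        targets.update(ch for ch, n in ranked if n > min_count)
    return targets
```

For instance, `select_target_characters(text, top_ratio=0.05)` applies only the proportion rule, and `select_target_characters(text, min_count=500)` applies only the count rule.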
The method for detecting erroneous characters in text data provided by this embodiment has the following advantages:
The library is built once and reused many times. The character set needs to be scanned only once to build the error-prone character library, which can then be used to correct large volumes of documents, reducing the cost of each round of text correction. The library built this way is completely independent of the content of any particular document, so it can be used to correct ambiguous characters in essentially any document; it is highly general and widely applicable.
Where a large amount of text data has to be proofread, the great majority of erroneous characters can be corrected automatically, completing part of the correction work that would otherwise be done by hand. Where the errors in the text data are of a typical kind, manual work can even be replaced entirely, and the method of the present invention greatly reduces labor cost.
Text data processed by this method has a uniform format and a very low character error rate. Compared with the original data, the disambiguated data is far more convenient for subsequent statistical analysis, saves data-cleaning time, and improves the accuracy of data-mining results.
The method is highly adaptable. It can be used for general text correction and for personalized text correction needs: the error-prone character library, the frequent characters, and the erroneous-character threshold can all be modified to suit the individual needs of the user until the result is satisfactory.
Fig. 3 is a structural diagram of the apparatus for detecting erroneous characters in text data provided by an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a statistics module 301, configured to count the occurrences of each character in the text data to be checked and obtain the target characters that occur frequently in that text data; an acquisition module 302, configured to obtain, from a pre-built error-prone character library, the similar character set that contains a target character, where the similar character set includes similar characters whose shape resembles that of the target character; and a confirmation module 303, configured to confirm that a similar character in the text data to be checked is an erroneous character if its number of occurrences in the text data to be checked is greater than zero and less than a preset threshold.
The statistics module 301 first counts the frequency of every character in the text data to be checked; the frequency of a character is the rate or number of times it occurs in the text data to be checked. The statistics module 301 obtains the target characters from the occurrence counts: a target character is a character that occurs many times in the text to be checked. Alternatively, a target character may be a character that the user has designated as frequently mistyped.
The acquisition module 302 looks up the target character obtained by the statistics module 301 in the error-prone character library to obtain its similar character set. The similar character set includes multiple characters, each of which strongly resembles the target character in shape. A user entering text data may therefore type one of these similar characters into the text data to be checked in place of the target character.
The confirmation module 303 searches the text data to be checked for the similar characters in the set obtained by the acquisition module 302 (for example, by looking up each character's hash value). If a similar character appears in the text to be checked and its number of occurrences in the document is less than the preset threshold (for example, 0.1% of the total number of characters in the text data), the confirmation module 303 confirms that the similar character appearing in the text is an erroneous or ambiguous character.
The apparatus for detecting erroneous characters in text data provided by this embodiment obtains the target characters that occur frequently in the text and judges whether characters of similar shape that appear in the text are erroneous. This takes full account of the shape-similar erroneous characters produced during manual data entry, detects erroneous characters in text data effectively, replaces manual correction, reduces labor cost, and improves detection efficiency.
Based on any of the above embodiments, described device further includes:Module is corrected, for obtaining belonging to error character Similar character set in occurrence number of each character in text data to be detected, and by error character correction be occurrence number Most characters.
Based on any of the above embodiments, described device further includes:Normalized module, for obtaining character Collection carries out size normalized to the corresponding image data of each character in character set;And according to the corresponding picture number of each character According to obtaining the shape similarity between each character;Cluster module, for according to the shape similarity between character, to character into Row cluster, obtains similar character set;Wherein, the similarity between any two character in the similar character set is more than Default similarity, the fallibility character repertoire include at least one similar character set.
On the basis of any of the above embodiments, the normalization module specifically includes: a computing unit, configured to calculate the similarity between characters separately with each of multiple similarity calculation methods; and an acquiring unit, configured to obtain the shape similarity between characters from the weight assigned in advance to each similarity calculation method and the similarity obtained by each method.
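Combining the per-method similarities can be sketched as a weighted average. The function name `combined_similarity`, the dictionary keys, and the division by the weight total (so that the weights need not sum to 1) are assumptions; the text only states that pre-assigned weights and per-method similarities are used.

```python
def combined_similarity(scores, weights):
    """Weighted average of per-method similarity scores.

    scores  -- mapping from method name to similarity in [0, 1]
    weights -- mapping from method name to its pre-assigned weight
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total_weight
```

With scores of 0.8, 0.6, and 1.0 and weights of 0.5, 0.3, and 0.2, the combined value is 0.5·0.8 + 0.3·0.6 + 0.2·1.0 = 0.78.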
On the basis of any of the above embodiments, the multiple similarity calculation methods include pixel-by-pixel comparison, projection block comparison, and aspect-ratio matching.
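The text does not define these methods in detail here. As one plausible reading of pixel-by-pixel comparison, the sketch below scores two size-normalized binary bitmaps by the fraction of pixel positions at which they agree. The representation (lists of rows of 0/1 values) and the name `pixel_similarity` are assumptions.

```python
def pixel_similarity(a, b):
    """Fraction of positions at which two equal-sized binary bitmaps agree."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have identical dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("bitmaps must not be empty")
    # Count matching pixels row by row, column by column.
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total
```

The dimension check is why size normalization must come first: glyph images of different sizes cannot be compared position by position.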
On the basis of any of the above embodiments, the device further includes a recording unit, configured to record meta-information of the image data corresponding to each character, the meta-information including the aspect ratio of the image data. Correspondingly, the computing unit is specifically configured to compare the aspect ratios recorded in the meta-information of the characters' image data to obtain the similarity corresponding to the aspect-ratio matching method.
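The aspect ratio has to be recorded before size normalization, because normalization scales every glyph image to the same dimensions and discards the original proportions. The sketch below records the original width and height and scores two glyphs by the ratio of the smaller aspect ratio to the larger. The class `GlyphMeta` and this particular scoring formula are assumptions; the text only says that the recorded aspect ratios are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphMeta:
    """Meta-information recorded for a glyph image before normalization."""
    width: int
    height: int

    @property
    def aspect_ratio(self) -> float:
        return self.width / self.height


def aspect_ratio_similarity(a: GlyphMeta, b: GlyphMeta) -> float:
    """Return a value in (0, 1]; 1.0 means identical aspect ratios."""
    ra, rb = a.aspect_ratio, b.aspect_ratio
    return min(ra, rb) / max(ra, rb)
```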
On the basis of any of the above embodiments, the statistical module 301 is specifically configured to: sort the characters by number of occurrences in descending order and take the characters within a preset leading proportion of the ranking as target characters, and/or take the characters whose number of occurrences exceeds a preset number as target characters.
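Both selection rules can be sketched in Python as follows. The function name `select_target_characters` and its parameters are hypothetical, and treating "and/or" as the union of whichever rules are enabled is an interpretation rather than something the text states.

```python
import math
from collections import Counter


def select_target_characters(text, top_ratio=None, min_count=None):
    """Pick frequently occurring characters by rank proportion and/or count."""
    counts = Counter(text)
    targets = set()
    if top_ratio is not None:
        # Characters sorted by number of occurrences, most frequent first.
        ranked = [ch for ch, _ in counts.most_common()]
        keep = math.ceil(len(ranked) * top_ratio)
        targets.update(ranked[:keep])
    if min_count is not None:
        # Characters whose number of occurrences exceeds the preset number.
        targets.update(ch for ch, n in counts.items() if n > min_count)
    return targets
```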
Fig. 4 is a schematic structural diagram of the device for detecting error characters in text data provided by an embodiment of the present invention. As shown in Fig. 4, the device includes at least one processor 401 and at least one memory 402 communicatively connected to the processor 401, wherein the memory 402 stores program instructions executable by the processor 401, and the processor 401 calls the program instructions to perform the method for detecting error characters in text data provided by the above embodiments, for example including: counting the number of occurrences of characters in the text data to be detected to obtain the target characters that occur frequently in the text data to be detected; obtaining, according to a pre-created fallibility character repertoire, a similar character set containing the target character, wherein the similar character set includes similar characters that are similar in shape to the target character; and, if the number of occurrences of a similar character in the text data to be detected is greater than zero and less than a preset threshold, confirming that the similar character in the text data to be detected is an error character.
An embodiment of the present invention also provides a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions causing a computer to perform the method for detecting error characters in text data provided by the corresponding embodiments, for example including: counting the number of occurrences of characters in the text data to be detected to obtain the target characters that occur frequently in the text data to be detected; obtaining, according to a pre-created fallibility character repertoire, a similar character set containing the target character, wherein the similar character set includes similar characters that are similar in shape to the target character; and, if the number of occurrences of a similar character in the text data to be detected is greater than zero and less than a preset threshold, confirming that the similar character in the text data to be detected is an error character.
The embodiments described above, such as the device for detecting error characters in text data, are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments or of certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting error characters in text data, characterized by comprising:
counting the number of occurrences of characters in text data to be detected to obtain the target characters that occur frequently in the text data to be detected;
obtaining, according to a pre-created fallibility character repertoire, a similar character set containing the target character, wherein the similar character set includes similar characters that are similar in shape to the target character;
if the number of occurrences of a similar character in the text data to be detected is greater than zero and less than a preset threshold, confirming that the similar character in the text data to be detected is an error character.
2. The method according to claim 1, characterized in that, after the step of confirming that the similar character in the text data to be detected is an error character, the method further comprises:
obtaining the number of occurrences, in the text data to be detected, of each character in the similar character set to which the error character belongs, and correcting the error character to the character with the most occurrences.
3. The method according to claim 1, characterized in that, before the step of obtaining, according to the pre-created fallibility character repertoire, the similar character set containing the target character, the method further comprises:
obtaining a character set and performing size normalization on the image data corresponding to each character in the character set; and obtaining the shape similarity between characters according to the image data corresponding to each character;
clustering the characters according to the shape similarity between characters to obtain similar character sets; wherein the similarity between any two characters in a similar character set is greater than a preset similarity, and the fallibility character repertoire includes at least one similar character set.
4. The method according to claim 3, characterized in that the step of obtaining the shape similarity between characters specifically comprises:
calculating the similarity between characters separately with each of multiple similarity calculation methods;
obtaining the shape similarity between characters according to the weight assigned in advance to each similarity calculation method and the similarity obtained by each similarity calculation method.
5. The method according to claim 4, characterized in that the multiple similarity calculation methods include pixel-by-pixel comparison, projection block comparison, and aspect-ratio matching.
6. The method according to claim 5, characterized in that, before the step of performing size normalization on the image data corresponding to each character in the character set, the method further comprises:
recording meta-information of the image data corresponding to each character, the meta-information including the aspect ratio of the image data;
correspondingly, the step of calculating the similarity between characters using the aspect-ratio matching method specifically comprises:
comparing the aspect ratios recorded in the meta-information of the characters' image data to obtain the similarity corresponding to the aspect-ratio matching method.
7. The method according to claim 1, characterized in that the step of obtaining the target characters that occur frequently in the text data to be detected specifically comprises:
sorting the characters by number of occurrences in descending order and taking the characters within a preset leading proportion of the ranking as target characters, and/or taking the characters whose number of occurrences exceeds a preset number as target characters.
8. A device for detecting error characters in text data, characterized by comprising:
a statistical module, configured to count the number of occurrences of characters in text data to be detected to obtain the target characters that occur frequently in the text data to be detected;
an acquisition module, configured to obtain, according to a pre-created fallibility character repertoire, a similar character set containing the target character, wherein the similar character set includes similar characters that are similar in shape to the target character;
a confirmation module, configured to confirm that a similar character in the text data to be detected is an error character if the number of occurrences of the similar character in the text data to be detected is greater than zero and less than a preset threshold.
9. A device for detecting error characters in text data, characterized by comprising:
at least one processor;
and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the method according to any one of claims 1 to 7.
CN201810067388.5A 2018-01-22 2018-01-22 Detection method, device and the equipment of error character in a kind of text data Active CN108280051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810067388.5A CN108280051B (en) 2018-01-22 2018-01-22 Detection method, device and the equipment of error character in a kind of text data

Publications (2)

Publication Number Publication Date
CN108280051A true CN108280051A (en) 2018-07-13
CN108280051B CN108280051B (en) 2019-04-05

Family

ID=62804831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810067388.5A Active CN108280051B (en) 2018-01-22 2018-01-22 Detection method, device and the equipment of error character in a kind of text data

Country Status (1)

Country Link
CN (1) CN108280051B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004265004A (en) * 2003-02-28 2004-09-24 Techno Network Shikoku Co Ltd System and method for acknowledging error in inputting character string of peculiar information
CN1670723A (en) * 2004-03-16 2005-09-21 微软公司 Systems and methods for improved spell checking
CN102375807A (en) * 2010-08-27 2012-03-14 汉王科技股份有限公司 Method and device for proofing characters
CN103853702A (en) * 2012-12-06 2014-06-11 富士通株式会社 Device and method for correcting idiom error in linguistic data
CN105512110A (en) * 2015-12-15 2016-04-20 江苏科技大学 Wrong word knowledge base construction method based on fuzzy matching and statistics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
许君,等: "云环境中的近似复制文本检测", 《计算机研究与发展》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783811A (en) * 2018-12-26 2019-05-21 东软集团股份有限公司 A kind of method, apparatus, equipment and storage medium identifying text editing mistake
CN109783811B (en) * 2018-12-26 2023-10-31 东软集团股份有限公司 Method, device, equipment and storage medium for identifying text editing errors
CN109858473A (en) * 2018-12-28 2019-06-07 天津幸福生命科技有限公司 A kind of adaptive method for correcting error, device, readable medium and electronic equipment
CN109858473B (en) * 2018-12-28 2023-03-07 天津幸福生命科技有限公司 Self-adaptive deviation rectifying method and device, readable medium and electronic equipment
CN111797838A (en) * 2019-04-08 2020-10-20 上海怀若智能科技有限公司 Blind denoising system, method and device for picture documents
CN110147549A (en) * 2019-04-19 2019-08-20 阿里巴巴集团控股有限公司 For executing the method and system of text error correction
CN111368918A (en) * 2020-03-04 2020-07-03 拉扎斯网络科技(上海)有限公司 Text error correction method and device, electronic equipment and storage medium
CN111368918B (en) * 2020-03-04 2024-01-05 拉扎斯网络科技(上海)有限公司 Text error correction method and device, electronic equipment and storage medium
WO2022121172A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Text error correction method and apparatus, electronic device, and computer readable storage medium
CN113159035A (en) * 2021-05-10 2021-07-23 北京世纪好未来教育科技有限公司 Image processing method, device, equipment and storage medium
CN113159035B (en) * 2021-05-10 2022-06-07 北京世纪好未来教育科技有限公司 Image processing method, device, equipment and storage medium
CN113591440A (en) * 2021-07-29 2021-11-02 百度在线网络技术(北京)有限公司 Text processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN108280051B (en) 2019-04-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant