CN110533020B - Character information identification method and device and storage medium - Google Patents

Character information identification method and device and storage medium Download PDF

Info

Publication number
CN110533020B
CN110533020B (application number CN201810516918.XA)
Authority
CN
China
Prior art keywords
character
word
area
image
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810516918.XA
Other languages
Chinese (zh)
Other versions
CN110533020A (en)
Inventor
王盛涛
温广滔
明细龙
蒋健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201810516918.XA priority Critical patent/CN110533020B/en
Publication of CN110533020A publication Critical patent/CN110533020A/en
Application granted granted Critical
Publication of CN110533020B publication Critical patent/CN110533020B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The invention discloses a method, an apparatus, and a storage medium for recognizing character information, belonging to the technical field of image recognition and aimed at improving the recognition accuracy of text content in an image. The character information recognition method comprises the following steps: identifying a text region in the image; selecting a character in the text region as a reference character, and determining the words formed by the reference character and its adjacent characters under different word orders; from the words formed under the different word orders, determining the word whose occurrence frequency in a corpus meets a preset condition; taking the word order of that word as the character arrangement order of the text region; and outputting each line of characters recognized from the text region according to that character arrangement order.

Description

Character information identification method and device and storage medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a method and an apparatus for recognizing text information, and a storage medium.
Background
In recent years, with the rapid development of image recognition technology, the demand for recognizing the text content in the image is increasing, and how to improve the recognition accuracy of the text content in the image so as to accurately express the semantic meaning of the text content in the image is a technical problem to be considered.
Disclosure of Invention
The embodiment of the invention provides a method and a device for identifying character information and a storage medium, which are used for improving the identification accuracy of character contents in an image so as to accurately express the semantics of the character contents in the image.
In a first aspect, an embodiment of the present invention provides a method for identifying text information, including:
identifying a text region in the image;
selecting a character in the text region as a reference character, and determining the words formed by the reference character and its adjacent characters under different word orders;
from the words formed under the different word orders, determining the word whose occurrence frequency in a corpus meets a preset condition;
taking the word order of the word whose occurrence frequency meets the preset condition as the character arrangement order in the text region;
and outputting each line of characters recognized from the text region according to the character arrangement order.
In the character information recognition method provided by the embodiment of the invention, a character is first selected from a recognized text region as a reference character; the words formed by the reference character and its adjacent characters under different word orders are then determined, and from all the words so formed, the word whose occurrence frequency in a corpus meets a preset condition is identified. The preset condition may be, for example, that the occurrence frequency is the maximum, or that the occurrence frequency is greater than a preset threshold. The word order of that word is then taken as the character arrangement order of the text region, and each line of characters recognized from the text region is output according to it. Because the word order of the text content may differ between images, and a single image may even contain multiple word orders, words formed by at least two adjacent characters under different word orders are selected from the text region, the occurrence probabilities of those words are used to determine which word order is the correct one, and each line of characters recognized from the text region is output in the determined order, thereby improving the accuracy of text recognition.
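The frequency-based selection of a word order described above can be sketched as follows. This is an illustrative assumption, not the patent's implementation: `pick_word_order`, the toy rows, and `CORPUS_FREQ` are stand-ins, and in practice the characters would be CJK characters checked against a real corpus.

```python
# Illustrative sketch: decide the word order of a detected character row by
# forming the candidate word under each order and keeping the order whose
# word occurs most often in a corpus. Toy frequency table below.

CORPUS_FREQ = {
    "hello world": 0.90,   # frequent in the (toy) corpus
    "world hello": 0.02,   # rare in the (toy) corpus
}

def pick_word_order(row, threshold=0.5):
    """row: character cells in spatial left-to-right order.
    Returns (order, frequency) for the order whose candidate word's corpus
    frequency meets the preset condition (maximum and above the threshold),
    or (None, frequency) if no candidate qualifies."""
    candidates = {
        "left-to-right": " ".join(row),
        "right-to-left": " ".join(reversed(row)),
    }
    order, word = max(candidates.items(),
                      key=lambda kv: CORPUS_FREQ.get(kv[1], 0.0))
    freq = CORPUS_FREQ.get(word, 0.0)
    return (order, freq) if freq > threshold else (None, freq)

# A row detected spatially as ["world", "hello"] was written right-to-left:
print(pick_word_order(["world", "hello"]))  # ('right-to-left', 0.9)
```

The same comparison generalizes to vertical orders by forming candidates from top-to-bottom and bottom-to-top readings.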
Optionally, before taking the word order of the word whose occurrence frequency meets the preset condition as the character arrangement order in the text region, the method further includes: determining that the occurrence frequency of that word is higher than a set first threshold.
In the embodiment of the present invention, it may first be determined that the occurrence frequency of the word meeting the preset condition is higher than the set first threshold, and only then is its word order taken as the character arrangement order in the text region. This excludes character combinations that could cause the word order to be misjudged: strings that are not independently usable words in terms of meaning but still have some occurrence frequency in the corpus. The accuracy of character recognition can thereby be further improved.
Optionally, the method further includes:
for a text region containing multiple lines of characters, further determining the first character and the last character of each line in the text region;
for each pair of adjacent lines, determining the occurrence frequency in the corpus of the words formed, under different arrangement orders, by the last character of one line and the first character of the other;
determining the order of the lines in the text region according to the arrangement order of the word whose occurrence frequency is greater than a second threshold;
and, according to the determined line order, determining the next line of characters to output after each line of characters output from the text region.
In the embodiment of the invention, when the text region contains multiple lines of characters, the line order of the text content can be judged and the recognized content arranged in the correct line order, further improving the accuracy of character recognition.
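The adjacent-line check can be sketched as follows. The junction words and their frequencies are toy stand-ins (hypothetical, not from the patent); the idea is only that the word spanning the line boundary in the correct reading order should be frequent in the corpus.

```python
# Illustrative sketch: decide which of two lines comes first by checking
# which junction word (last character of one line + first character of the
# other) is more frequent in a corpus. Toy frequency table below.

CORPUS_FREQ = {"sunshine": 0.8}  # hypothetical junction-word frequency

def line_order(line_a, line_b, threshold=0.1):
    """Each line is a sequence of character cells in reading order.
    Returns ('a_first' or 'b_first', junction-word frequency), or
    (None, 0.0) if neither junction word clears the threshold."""
    a_then_b = line_a[-1] + line_b[0]   # last char of A + first char of B
    b_then_a = line_b[-1] + line_a[0]   # last char of B + first char of A
    fa = CORPUS_FREQ.get(a_then_b, 0.0)
    fb = CORPUS_FREQ.get(b_then_a, 0.0)
    best, freq = ("a_first", fa) if fa >= fb else ("b_first", fb)
    return (best, freq) if freq > threshold else (None, 0.0)

# "sun" + "shine" across the boundary forms a frequent word, so A comes first:
print(line_order(["rain", "sun"], ["shine", "rise"]))  # ('a_first', 0.8)
```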
Optionally, the different arrangement orders are determined according to the character arrangement order.
Optionally, after the characters contained in the text region are output according to the character arrangement order, the method further includes:
taking the occurrence frequency of the word whose occurrence frequency meets the preset condition as the confidence of the characters output according to the character arrangement order, and outputting the confidence.
In the embodiment of the invention, the occurrence frequency of the word meeting the preset condition can be output as a confidence value indicating the credibility of the character information recognized by the recognition method, thereby enriching the presentation of the recognition result.
Optionally, recognizing the text region in the image specifically includes:
performing binarization processing on the image;
performing dilation processing on the binarized image according to a preset dilation coefficient and erosion processing according to a preset erosion coefficient to obtain a plurality of white regions in the image;
determining, from the plurality of white regions, a target white region whose area is larger than a third threshold and whose aspect ratio is larger than a fourth threshold;
determining the pixel points corresponding to the target white region in the image, wherein the region formed by these pixel points is the text region of the image; and recognizing the text region.
In the embodiment of the present invention, the image may first be binarized, and the binarized image then dilated according to a preset dilation coefficient and eroded according to a preset erosion coefficient to remove small black noise and small white noise. A target white region whose area is larger than a third threshold and whose aspect ratio is larger than a fourth threshold is then determined from the denoised image, and the region formed by the corresponding pixel points in the original image is the text region. The text region can thus be extracted accurately, which helps further improve recognition accuracy.
In a second aspect, an embodiment of the present invention provides a text information recognition apparatus, including:
a recognition unit, configured to recognize a text region in the image;
a first determining unit, configured to select a character in the text region as a reference character and determine the words formed by the reference character and its adjacent characters under different word orders;
a second determining unit, configured to determine, from the words formed under the different word orders, the word whose occurrence frequency in the corpus meets a preset condition;
and an output unit, configured to take the word order of the word whose occurrence frequency meets the preset condition as the character arrangement order in the text region, and to output each line of characters recognized from the text region according to that order.
Optionally, the second determining unit is further configured to:
and determining the occurrence frequency of the words with the occurrence frequency meeting the preset conditions, wherein the occurrence frequency is higher than a set first threshold value.
Optionally, the output unit is further configured to:
for a text region containing multiple lines of characters, further determine the first character and the last character of each line in the text region;
for each pair of adjacent lines, determine the occurrence frequency in the corpus of the words formed, under different arrangement orders, by the last character of one line and the first character of the other;
determine the order of the lines in the text region according to the arrangement order of the word whose occurrence frequency is greater than a second threshold;
and, according to the determined line order, determine the next line of characters to output after each line of characters output from the text region.
Optionally, the different arrangement orders are determined according to the character arrangement order.
Optionally, the output unit is further configured to:
and take the occurrence frequency of the word whose occurrence frequency meets the preset condition as the confidence of the characters output according to the character arrangement order, and output the confidence.
Optionally, the identification unit is further configured to:
perform binarization processing on the image;
perform dilation processing on the binarized image according to a preset dilation coefficient and erosion processing according to a preset erosion coefficient to obtain a plurality of white regions in the image;
determine, from the plurality of white regions, a target white region whose area is larger than a third threshold and whose aspect ratio is larger than a fourth threshold;
determine the pixel points corresponding to the target white region in the image, wherein the region formed by these pixel points is the text region of the image; and recognize the text region.
In a third aspect, an embodiment of the present invention further provides a text information recognition apparatus, including at least one processor and at least one memory, where the memory stores a computer program, and when the program is executed by the processor, the processor is caused to execute the steps of the method in the first aspect.
In a fourth aspect, embodiments of the present invention further provide a storage medium storing computer instructions, which, when executed on a computer, cause the computer to perform the steps of the method according to the first aspect.
According to the technical solution in the embodiment of the invention, a character is selected from the recognized text region as a reference character; the words formed by the reference character and its adjacent characters under different word orders are determined; the word whose occurrence frequency in the corpus meets the preset condition is identified among them; the word order of that word is taken as the character arrangement order in the text region; and each line of characters recognized from the text region is output according to that order.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;
fig. 2a is a flowchart of a text information recognition method according to an embodiment of the present invention;
fig. 2b is a schematic diagram of an image after performing gray scale processing according to an embodiment of the present invention;
fig. 2c is a schematic diagram of an image after binarization processing according to an embodiment of the present invention;
FIG. 2d is a schematic diagram illustrating an image after dilation processing according to an embodiment of the present invention;
FIG. 2e is a schematic diagram illustrating an image after erosion processing according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an image including two text regions according to an embodiment of the present invention;
FIG. 4a is a diagram illustrating the words formed by a reference character and its adjacent characters under different word orders in text area 1 according to an embodiment of the present invention;
FIG. 4b is a diagram illustrating the words formed by a reference character and its adjacent characters under different word orders in text area 2 according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating each line of characters recognized from a text region for output according to an embodiment of the present invention;
fig. 6 is a flowchart of determining a line order of text regions according to an embodiment of the present invention;
fig. 7a is a schematic diagram illustrating a method for determining a line order of text regions according to an embodiment of the present invention;
fig. 7b is a schematic diagram of another method for determining the line order of text regions according to the embodiment of the present invention;
fig. 8 is a schematic diagram illustrating another method for determining a line order of text regions according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a sequence of commonly used text regions according to an embodiment of the present invention;
FIG. 10 is a flowchart of determining a sequence of text regions according to an embodiment of the present invention;
fig. 11a is a schematic diagram illustrating a sequence of a determined text region according to an embodiment of the present invention;
FIG. 11b is a schematic diagram illustrating another example of determining a sequence of text regions according to an embodiment of the present invention;
fig. 12 is a schematic view of a text message identification apparatus according to an embodiment of the present invention;
fig. 13 is a schematic view of another text information recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the technical solutions of the present invention. All other embodiments obtained by a person skilled in the art without any inventive work based on the embodiments described in the present application are within the scope of the protection of the technical solution of the present invention.
Some concepts related to the embodiments of the present invention are described below.
NLP: natural Language Processing, is a subject of research on Language problems of human interaction with computers. On the one hand, it is a branch of linguistic information processing, and on the other hand, it is one of the core topics of AI (Artificial Intelligence).
OCR: optical Character Recognition, refers to a process in which an electronic device, such as a scanner, a digital camera, etc., checks a printed Character on paper, determines its shape by detecting dark and light patterns, and then translates the shape into a computer word by a Character Recognition method, i.e., a process in which an image is scanned and then the word in the image is analyzed to obtain word information in the image.
OpenCV: a cross-platform computer vision library released under the BSD (open-source) license that runs on Linux, Windows, Android, and Mac OS. It is lightweight and efficient, consists of a series of C functions and a small number of C++ classes, provides interfaces for languages such as Python, Ruby, and MATLAB, and implements many general-purpose algorithms in image processing and computer vision.
Bayesian formula: british mathematician Bayes 1702-1761 was developed to describe the relationship between two conditional probabilities, typically the probability of event A occurring under event B is different from the probability of event B occurring under event A.
Word order: the combination or arrangement order of characters in a word, for example, an order running from left to right, from right to left, from top to bottom, or from bottom to top.
Line order: the arrangement order of the lines in a text region, for example, lines ordered from top to bottom or from bottom to top.
In practice, the inventor of the present invention found that existing image recognition technology recognizes characters in an image with low accuracy and easily disorders their sequence, so that the semantics of the text content cannot be accurately expressed; a more accurate character recognition method is therefore needed to improve the accuracy of text content recognition.
To this end, the inventor considered that, for a line of text, the correct word order is a key factor in correctly understanding the content. The word order of text in different images may differ: some runs from left to right, some from right to left, some from top to bottom, and some even from bottom to top. For example, in a comic image, different dialog boxes may use different word orders for richness and variety, so the dialog content of a single comic may contain multiple word orders. Moreover, the word order used by different image creators may differ. If the text content of an image is recognized under a single word order indiscriminately, the recognized characters will inevitably be arranged incorrectly and the content cannot be reproduced accurately. Therefore, to improve recognition accuracy, the word order of a text region must be determined accurately, so that the recognized text expresses the intended semantics of the text content in the image. In addition, the line order of the text content can be judged and the recognized content arranged in the correct line order, further improving the accuracy of character recognition.
When text content is read in the correct word order, a word composed of several adjacent characters is likely to be a real word, that is, one with a high probability in normal use. The embodiment of the invention exploits this property of words to determine the correct word order of a row of characters, using an occurrence frequency that meets a preset condition as the criterion. The preset condition may be that the occurrence frequency is the maximum: words composed of at least two adjacent characters under different word orders are selected from the text content, and the word order of the most frequent word is taken as the correct one. The preset condition may instead be that the occurrence frequency is greater than a preset threshold, such as 80%, 90%, or 95%; for example, with a threshold of 90%, a word whose occurrence frequency exceeds 90% is selected from the words composed of at least two adjacent characters under different word orders, and its word order is taken as the correct word order of the text content. The preset condition may also combine the two, requiring the frequency to be both the maximum and greater than a preset threshold. These are merely examples.
The usage probability of a word can be judged from its probability of appearing in an existing lexicon. The technical solutions in the embodiments of the present invention are explained in detail below.
The embodiment of the invention provides a character information recognition method: a character in the text region is selected as a reference character; the words formed by the reference character and its adjacent characters under different word orders are determined; from all the words so formed, the word order of the word whose occurrence frequency in a corpus meets the preset condition is selected as the character arrangement order of the image's text region; and the characters contained in the text region are output according to the selected order.
The character information recognition method in the embodiment of the present invention may be applied to the application scenario shown in fig. 1, which includes a terminal device 10 and a cloud system 11. The terminal device 10 is any intelligent electronic device, such as a computer or a mobile phone, capable of operating according to a program and automatically processing large amounts of data at high speed. The terminal device 10 is connected to the cloud system 11 through a network, and the image to be recognized, which may be one frame or multiple frames, is stored in the cloud system 11. The terminal device 10 may obtain the image to be recognized from the cloud system 11, for example by accessing the website where the image is located, and may then recognize the character information in it according to the method provided in the embodiment of the present invention, as described in the embodiments hereinafter.
In another possible application scenario, for example when the image to be recognized is stored locally on the terminal device, the terminal device may directly obtain the locally stored image and then recognize the character information in it according to the character recognition method provided by the embodiment of the present invention.
In yet another possible application scenario, for example when the image to be recognized corresponds to a physical object such as a comic book, a passport, or a business photograph, the scenario may include a terminal device and a scanning device connected through a data line or a network. The image corresponding to the object is obtained through the scanning function of the scanning device and sent to the terminal device, which can then recognize the character information in the image according to the character recognition method provided by the embodiment of the invention.
It should be noted that the above-mentioned application scenarios are only presented for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied in any scenario where applicable.
The following describes a method for recognizing text information according to an embodiment of the present invention with reference to an application scenario shown in fig. 1.
As shown in fig. 2a, a method for identifying text information according to an embodiment of the present invention includes:
step S201: and carrying out binarization processing on the image.
In the embodiment of the present invention, as described above, a terminal device such as a computer or smart phone may obtain the image to be recognized from the cloud system, for example by accessing the website of the image in the cloud system; the image may be one frame or multiple frames.
The terminal device may first binarize the image, converting the value of every pixel to 0 or 255, where 0 represents black and 255 represents white. For example, a threshold may be set; all pixels whose value exceeds the threshold are converted to 255, and all pixels whose value is less than or equal to the threshold are converted to 0. Suppose the threshold is 125: pixels whose value exceeds 125 are usually judged to belong to a specific object and are converted to 255, that is, white, while the remaining pixels are judged to be background and converted to 0, that is, black. The image is thus converted into a binary image with a clear black-and-white contrast, which facilitates subsequent processing.
Of course, in practical applications, if the image obtained by the terminal device is stored in RGB format, it may be converted to gray scale before binarization, which reduces the amount of computation in subsequent processing and improves efficiency. For example, fig. 2b shows an image obtained by gray-scale processing of an RGB image, so that all pixel values lie between 0 and 255, and fig. 2c shows the result of binarizing the image in fig. 2b, so that all pixel values are converted to 0 or 255.
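The grayscale-then-threshold step can be sketched as follows; this is a minimal illustration on plain numpy arrays, not the patent's code, and the channel-averaging grayscale conversion is an assumption (real pipelines often use a weighted luma sum). The threshold of 125 follows the example above.

```python
import numpy as np

def to_gray(rgb):
    """Convert an H x W x 3 RGB array to grayscale by channel averaging."""
    return rgb.mean(axis=2)

def binarize(gray, threshold=125):
    """Map every pixel above the threshold to 255 (white) and every
    other pixel to 0 (black), as in step S201."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

rgb = np.array([[[200, 210, 220], [10, 20, 30]]])  # one row, two pixels
print(binarize(to_gray(rgb)))  # bright pixel -> 255, dark pixel -> 0
```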
Step S202: and performing expansion processing according to a preset expansion coefficient and performing corrosion processing according to a preset corrosion coefficient on the image after binarization processing to obtain a plurality of white areas in the image.
After the binarization processing of the image, the terminal device may perform expansion processing on the image according to a preset expansion coefficient, where the expansion processing is mainly to remove some small black point noise in the image, and after the expansion processing, some small black point noise in the image may be filled in a white area, where the preset expansion coefficient may be flexibly set according to specific needs, for example, to be 2, 3, and the like.
For example, when the binarized image shown in fig. 2c is subjected to dilation processing, the image shown in fig. 2d is obtained. It can be seen that the dilated image of fig. 2d shows larger white areas than the image before dilation, i.e., fig. 2c; the additional white corresponds to the positions where the small black-point noise of fig. 2c was filled in by the dilation processing.
The terminal device may also perform erosion processing on the dilated image; the purpose of erosion is mainly to remove small white-point noise in the image. Similarly, the erosion coefficient may be set flexibly according to actual needs, without any limitation here. For example, the image shown in fig. 2d is subjected to erosion processing to remove some small white-point noise, yielding the image shown in fig. 2e. In practical applications, the eroded image may be dilated again according to actual needs to remove noise once more, so that only a plurality of large, solid white areas finally remain in the image.
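The dilation and erosion steps can be sketched with a 3×3 structuring element; this is an illustrative toy implementation, not the patent's (a real pipeline would typically call a library such as OpenCV or scipy.ndimage):

```python
# Toy morphological operations on a 0/255 binary image stored as nested
# lists, with a fixed 3x3 neighborhood clamped at the image borders.

def _neighbors(img, y, x):
    """All pixel values in the 3x3 window centered on (y, x)."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))]

def dilate(img):
    """White grows: small black-point noise inside white areas is filled."""
    return [[255 if max(_neighbors(img, y, x)) == 255 else 0
             for x in range(len(img[0]))] for y in range(len(img))]

def erode(img):
    """White shrinks: small white-point noise in the background is removed."""
    return [[255 if min(_neighbors(img, y, x)) == 255 else 0
             for x in range(len(img[0]))] for y in range(len(img))]

img = [[255, 255, 255],
       [255,   0, 255],   # one black speck inside a white block
       [255, 255, 255]]
closed = dilate(img)      # the black speck is filled in with white
```

A larger "dilation coefficient", as mentioned above, would correspond to a larger window or to applying the operation repeatedly.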
Step S203: determining, from the plurality of white areas, a target white area whose area is larger than a third threshold and whose aspect ratio is larger than a fourth threshold.
In the embodiment of the present invention, the inventor observed that after the dilation and erosion processing described above, a text region in an image generally forms a white rectangular region with a substantial area and a large aspect ratio, whereas background regions that do not belong to text generally form white regions that are small in area and irregular in shape. Therefore, a third threshold for the area of a white region and a fourth threshold for its aspect ratio may be set; after the two thresholds are set, a target white region whose area is larger than the third threshold and whose aspect ratio is larger than the fourth threshold can be determined from the plurality of white regions. The number of target white regions determined by the terminal device may be one or more.
Since a text area appears as a white rectangular region after the dilation and erosion processing, the value of the fourth threshold may lie between 1 and 2, for example 1.5, 2, and so on. In practice, when setting the third threshold, the number of text areas typically contained in different types of images may also be taken into account, and the third threshold may be adjusted dynamically for different image types.
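The area and aspect-ratio filtering of step S203 can be sketched as follows; the bounding-box representation of a white region and the concrete threshold values are assumptions for illustration:

```python
# Sketch of step S203: keep only white regions whose area exceeds the
# third threshold and whose aspect ratio exceeds the fourth threshold.
# Each region is assumed to be described by a bounding box (w, h).

def select_text_regions(regions, area_threshold=500, aspect_threshold=1.5):
    """regions: list of dicts with 'w' and 'h' bounding-box sizes."""
    selected = []
    for r in regions:
        area = r["w"] * r["h"]
        aspect = r["w"] / r["h"]   # text lines tend to be wide rectangles
        if area > area_threshold and aspect > aspect_threshold:
            selected.append(r)
    return selected

regions = [{"w": 120, "h": 20},   # text-line-like: area 2400, aspect 6.0
           {"w": 15,  "h": 14}]   # background speck: area 210, aspect ~1.07
targets = select_text_regions(regions)
```

With the assumed thresholds, only the wide, large region survives, matching the reasoning that background noise forms small irregular blobs.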
Step S204: determining a plurality of pixel points corresponding to the target white area in the image, wherein the area formed by these pixel points is the text area of the image.
If only one target white region is determined, a plurality of pixel points corresponding to that target white region are determined from the image, and the region formed by these pixel points is the text region of the image. If, for example, two target white regions are determined, referred to for convenience as target white region 1 and target white region 2, then the pixel points corresponding to target white region 1 are determined from the image, and the region they form is text region 1; that is, text region 1 is obtained from target white region 1. Likewise, the pixel points corresponding to target white region 2 are determined from the image, and the region they form is text region 2; that is, text region 2 is obtained from target white region 2.
Step S205: a text region in the image is identified.
After the text area in the image is determined, the text area can be recognized to obtain the characters it includes. In the embodiment of the present invention, specifically, OCR character recognition technology is used to recognize the characters included in the text area of the image.
When the image includes only one text area, the OCR character recognition technology recognizes the text area row by row: the characters included in the first row of the text area are recognized first, and then, following the top-to-bottom order of the rows, the characters included in the second row are recognized, and so on. When the image includes a plurality of text areas, for example text area 1 and text area 2, the terminal device may recognize the characters in text area 1 first and then those in text area 2, or of course recognize text area 2 first and then text area 1.
In the embodiment of the present invention, specifically, taking the image shown in fig. 3 as an example, in the image, a text area 1 and a text area 2 are included, and it is assumed that the terminal device uses an OCR character recognition technology to recognize the text area 1 in the image and then recognizes the text area 2 in the image, when recognizing the text area 1, the first row of the text area includes the characters that: "complete", then identify the character that the second line includes: this may be several! ". When recognizing the text area 2, the characters included in the first line are sequentially recognized: "human results", the second line includes the characters: "ashen," the third row includes the characters: ". Big on the same track ", the fourth row contains the characters: "greater than" and the fifth row includes the characters: "feel sister everywhere".
It should be noted that, in the embodiment of the present invention, steps S201 to S204, executed before step S205, are optional. Performing binarization, dilation, and erosion on the image before extracting the text region increases the accuracy of text-region extraction, and thereby improves the accuracy of subsequently recognizing the characters in the text region with the OCR character recognition technology. Of course, in practical applications, steps S201 to S204 may be omitted: the text region of the image may be recognized directly using the OCR character recognition technology, or the text region may be extracted by another method.
Step S206: selecting a character in the text area as a reference character, and determining the words formed by the reference character and its adjacent characters under different word orders.
Continuing with the example of fig. 3, and taking the recognition of text area 1 shown in fig. 3 as an example: the embodiment of the present invention makes full use of the fact, reflected in the Bayesian formula, that the probability of event A occurring given event B differs from the probability of event B occurring given event A. One character in the text region can therefore be selected as a reference character, words can be composed of the reference character and its adjacent characters under different word orders, and the occurrence probability of the words composed under the different word orders can be determined in the next step. Specifically, for fig. 3, after the characters included in the first line and the second line of text region 1 have been recognized, any character may be selected as the reference character in text region 1, and the word composed of the reference character and one adjacent character under each word order, or of the reference character and several adjacent characters (e.g., 2 or 3 characters) under each word order, may be determined.
Similarly, for text area 2 in fig. 3, after the characters included in each of the five lines of text area 2 have been recognized, any character may be selected as the reference character in text area 2, and the word composed of the reference character and one adjacent character under each word order, or of the reference character and several adjacent characters (e.g., 2 or 3 characters) under each word order, may be determined.
In practical applications, word orders can be roughly divided into four categories: left-to-right, right-to-left, top-to-bottom, and bottom-to-top, of which the most common is the left-to-right order. Therefore, after a character is selected as the reference character in the text area, four candidate words can be determined: the word composed of the reference character and its adjacent character under the left-to-right order, under the right-to-left order, under the top-to-bottom order, and under the bottom-to-top order, respectively.
For example, for text area 1 in fig. 3, any character in text area 1 may be selected as the reference character. Here the first character of the first row of text area 1, as shown in fig. 4a, is taken as the reference character: under the left-to-right word order it forms a word with the character adjacent to it on the right, and under the top-to-bottom word order it forms a word with the character adjacent to it below. Since there is no character to its left for the right-to-left word order, and no character above it for the bottom-to-top word order, no words are formed under those two word orders; thus two words are formed in total, namely "complete" and "complete".
For the text area 2, the third character "taste" in the second row in the text area 2 shown in fig. 4b is selected as a reference character, a word "taste" is formed by the left-to-right language sequence and the adjacent character "ran", a word "taste" is formed by the right-to-left language sequence and the adjacent character "don't", a word "taste" is formed by the top-to-bottom language sequence and the adjacent character "dao", and a word "taste pair" is formed by the bottom-to-top language sequence and the adjacent character "pair", that is, four words are formed in total in the direction of the four language sequences, namely, "taste", "ran", "taste", and "taste pair", respectively.
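The candidate-word construction of step S206 can be sketched as follows, assuming the recognized characters are stored as a grid of one-character strings; the function and word-order labels are my own names, not the patent's:

```python
# Sketch of step S206: given a grid of recognized characters and the
# position of a reference character, form the candidate word for each of
# the four reading orders that has an adjacent character available.

def candidate_words(grid, row, col):
    """grid: list of rows, each a list of single-character strings."""
    ref = grid[row][col]
    words = {}
    if col + 1 < len(grid[row]):                          # left-to-right
        words["ltr"] = ref + grid[row][col + 1]
    if col - 1 >= 0:                                      # right-to-left
        words["rtl"] = ref + grid[row][col - 1]
    if row + 1 < len(grid) and col < len(grid[row + 1]):  # top-to-bottom
        words["ttb"] = ref + grid[row + 1][col]
    if row - 1 >= 0 and col < len(grid[row - 1]):         # bottom-to-top
        words["btt"] = ref + grid[row - 1][col]
    return words

grid = [["c", "o", "m"],
        ["p", "l", "e"]]
words = candidate_words(grid, 0, 0)  # corner reference: only two neighbors
```

As in the fig. 4a example, a reference character in a corner yields only two candidate words, while an interior character can yield up to four.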
Step S207: determining, from the words composed under the different word orders, the word whose occurrence frequency in the corpus meets a preset condition.
In the embodiment of the present invention, the case where "the occurrence frequency meets the preset condition" means "the occurrence frequency is the maximum" is taken as an example to describe the method in detail. The corpus in step S207 may be a corpus stored locally on the terminal device or a corpus on the internet; it contains a large number of words together with their occurrence frequencies. In practice, the occurrence frequency of a word may also be called the usage frequency of the word, or simply the word frequency.
For the two words "complete" and "complete" formed in text region 1, the terminal device may obtain their occurrence frequencies from the corpus, so as to determine which of the two has the higher frequency. Suppose the frequency of the first word "complete" in the corpus is 80%. Since the second word cannot count as an independently usable unit of meaning in the language, it may not be recorded in the corpus at all; in that case its occurrence frequency may be taken as 0, or a very small occurrence frequency, such as 0.1% or 1%, may be assigned to it. Taking its frequency as 0 here, the word with the larger occurrence frequency of the two is obviously the first word "complete".
For four words, namely "tasteful", "natural", "taste", and "taste pair", formed in the text region 2, the terminal device may also obtain the occurrence frequencies of the four words from the corpus, respectively, so as to determine the word with the highest occurrence frequency among the four words. It is assumed here that the frequency of occurrence of the word "taste" is 0, the frequency of occurrence of the word "no" is 0, the frequency of occurrence of the word "taste" is 90%, and the frequency of occurrence of the word "taste pair" is 0, and then the word with the highest frequency of occurrence among the four words is "taste".
Step S208: taking the word order of the word whose occurrence frequency meets the preset condition as the word order in which the characters in the text area are arranged.
After the terminal device obtains the word with the maximum occurrence frequency for the text area, the word order of that word can be used as the word order in which the characters in the text area are arranged. For example, for text area 1, after determining that the higher-frequency word of the two words "complete" and "complete" is "complete", the word order corresponding to that word may be used as the character-arrangement word order of text area 1. Since that word was composed of the reference character (the first character of the first row of text area 1) and its adjacent character under the left-to-right word order, the corresponding word order is the left-to-right order, and the left-to-right order is therefore used as the character-arrangement word order of text area 1.
In the embodiment of the present invention, in order to further improve the recognition accuracy, before step S208 is executed it may additionally be determined that the occurrence frequency of the word with the maximum occurrence frequency is higher than a set first threshold.
For example, after determining that the higher-frequency word of the two words "complete" and "complete" is "complete", it may also be determined whether the occurrence frequency of that word is higher than a set threshold. Suppose the threshold is 50%: since the frequency of "complete" is 80%, which is higher than 50%, the word order corresponding to "complete", i.e., the left-to-right order, is taken as the character-arrangement word order of text area 1. Checking that the maximum occurrence frequency exceeds the set threshold further excludes words which, among those composed of the reference character and its adjacent characters under the different word orders, are not usable units of meaning in the language but have nevertheless been assigned a small occurrence frequency in the corpus, thereby further improving the recognition accuracy.
For text area 2, after determining that the word with the maximum occurrence frequency among "tasteful", "natural", "taste", and "taste pair" is "taste", the word order corresponding to "taste" can be used as the character-arrangement word order of text area 2. Since "taste" is composed of the reference character (the third character of the second row of text area 2) and the character adjacent to it under the top-to-bottom word order, the corresponding word order is the top-to-bottom order, and the top-to-bottom order is therefore used as the character-arrangement word order of text area 2.
Similarly, for text area 2, before step S208 is executed, the occurrence frequency of the word with the maximum occurrence frequency may first be determined to be higher than the set first threshold, and only then is the word order of that word used as the character-arrangement word order of the text area; the description is not repeated here.
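Steps S207 and S208, together with the first-threshold check just described, can be sketched as follows; the toy corpus and all frequencies in it are invented for illustration, and unknown words fall back to frequency 0 as in the "complete" example above:

```python
# Sketch of steps S207-S208: pick the candidate word with the highest
# corpus frequency, and accept its word order only if that frequency
# also exceeds the first threshold.

CORPUS = {"co": 0.80}   # invented toy corpus; missing words -> frequency 0

def choose_word_order(candidates, first_threshold=0.50):
    """candidates: dict mapping a word-order label to its candidate word."""
    best_order = max(candidates, key=lambda k: CORPUS.get(candidates[k], 0.0))
    best_freq = CORPUS.get(candidates[best_order], 0.0)
    if best_freq > first_threshold:
        return best_order, best_freq
    return None, best_freq   # inconclusive: try another reference character

order, freq = choose_word_order({"ltr": "co", "ttb": "cp"})
# "co" has frequency 0.80 > 0.50, so the left-to-right order is chosen
```

Returning the winning frequency alongside the order matches the later use of that frequency as a confidence value for the output text.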
Step S209: outputting each line of characters recognized from the text area according to the character-arrangement word order.
In the embodiment of the present invention, after determining the character-arrangement word order of a text region, the terminal device outputs all characters included in the text region according to the determined word order. In practical applications, when the word order of a text region is left-to-right or right-to-left and the region includes multiple lines of characters, the inventor considers that the line order most commonly corresponding to the correct semantics of the text is that shown in fig. 5: the first line, the second line, and so on to the last line, arranged from top to bottom. Therefore, once the word order of the text area is determined to be left-to-right or right-to-left, the characters included in the first line, then those in the second line, and so on, can be output directly in sequence according to the word order of the text area.
For example, in the text area 1 listed above, after the terminal device takes the word sequence arranged from left to right as the word sequence in the text area 1, the terminal device sequentially outputs the words "completed" included in the first row in the text area 1 and then outputs the words "included in the second row in the text area 1 according to the word sequence, which are several points |. "i.e. the text output by text area 1 is: "it will be a few points after it is finished! ".
In the embodiment of the present invention, after outputting each row of characters in the text region according to its character-arrangement word order, the occurrence frequency of the word with the maximum occurrence frequency may be used as the confidence of the characters output from the text region under that word order, and this confidence may also be output. For example, the maximum occurrence frequency for text region 1 above is 80%, so 80% may be used as the confidence of the characters output from text region 1 under the left-to-right order, and the confidence may be output, for example, immediately after the characters of text region 1 themselves.
It should be noted that, in the embodiment of the present invention, to prevent erroneous determination, different reference characters may be selected and steps S206 to S208 shown in fig. 2 performed multiple times to verify the word-order determination result; that is, the determined word order is finally taken as the correct word order when the word orders determined in the multiple runs, or in most of them, are the same.
In the embodiment of the present invention, in order to further improve the accuracy of text recognition, the line order of the text content may additionally be determined for a text area containing multiple rows of characters. Usually, once the character-arrangement word order of a text area is determined, the possible arrangement orders of its lines, i.e., its line orders, are limited to a few options; likewise the possible arrangement orders of its columns (a column may be regarded as a vertically arranged line) are limited. For example, if the word order of the text area is left-to-right or right-to-left, the line order of the text area is either top-to-bottom or bottom-to-top; if the word order of the text area is top-to-bottom or bottom-to-top, the columns of the text area are arranged either right-to-left or left-to-right.
It should be noted that, for the multiple rows of characters described here, each row may be horizontally arranged, vertically arranged (i.e., a column), diagonally arranged, or even arranged along an arc or curve; the description mainly takes the two common cases of horizontal and vertical arrangement as examples.
Thus, for a text region containing multiple lines of characters, the steps shown in fig. 6 may also be performed to determine the line order in the text region. The steps shown in fig. 6 include:
step S601: determining the first character and the last character of each line in the character area;
step S602: for any two adjacent lines, respectively determining, for different line arrangement orders, the occurrence frequency in the corpus of the word formed by the last character of the previous line and the first character of the next line;
step S603: determining the line order of the characters in the text area according to the line arrangement order corresponding to the word whose occurrence frequency is greater than a second threshold;
step S604: and according to the determined line sequence, determining the next line of characters to be output after outputting one line of characters from the character area.
In practical applications, once the character-arrangement word order of the text area is determined, the first and last characters of each row are also determined; for example, when the word order is left-to-right, the leftmost character of each row is the first character of the row and the rightmost character is the last. The process of determining the line order of a text area is described in detail below using the text areas shown in figs. 7a and 7b, whose character-arrangement word order is left-to-right, so that the leftmost character of each line is the first character of the line and the rightmost character is the last.
For the text areas shown in figs. 7a and 7b, under the left-to-right word order the first and last characters of each row can be determined. For any selected pair of adjacent rows in the text area, the occurrence frequency in the corpus of the word formed by the last character of the previous line and the first character of the next line is determined for each candidate line arrangement order. For example, under the line order indicated by the black bold arrows in fig. 7a, i.e., with the uppermost line of the text area as the first line and the lowermost as the last, the last character "further" of the first row and the first character "wanted" of the second row may be selected to form the word "further wanted", and the occurrence frequency of "further wanted" in the corpus determined; assume that frequency is 90%.
Conversely, under the line arrangement order indicated by the black thick arrows in fig. 7b, i.e., with the uppermost line of the text area as the last line and the lowermost as the first, the last character "pan" of the second line and the first character "true" of the next line, i.e., the third line, are selected to form the word "pan true"; assume the occurrence frequency of "pan true" in the corpus is 0.
Suppose the second threshold in step S603 is 50%: the occurrence frequency of the word "further wanted" is greater than the threshold, while that of "pan true" is less than it, so the line arrangement order corresponding to "further wanted" can be determined as the line order of the characters in the text region. That is, the lines of the text region are arranged as shown in fig. 7a, with the uppermost line as the first line and the lowermost as the last, so that the next line of characters to output can be determined after each line is output, and the characters included in each line of the text region are output in sequence. That is, after the text "really good and me still" included in the first line is output, the text "want to eat one dish again" included in the second line is output, followed by the text "row not go?" in the third line; the complete output text is: "really good to eat I want to eat one dish again and go wrong?".
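Steps S601 to S604 can be sketched as follows for left-to-right text, mirroring the worked example above; the corpus entry and its frequency are invented, and placeholder ASCII characters stand in for the example text:

```python
# Sketch of steps S601-S604: decide the line order by checking, for each
# candidate order, whether the word formed by the last character of one
# line and the first character of the next is frequent in the corpus.

CORPUS = {"dg": 0.90}   # invented stand-in for a word like "further wanted"

def choose_line_order(lines, second_threshold=0.50):
    """lines: the rows of characters as stored, top row first."""
    top_down = lines          # candidate order 1: uppermost line first
    bottom_up = lines[::-1]   # candidate order 2: lowermost line first
    for order, name in ((top_down, "top-down"), (bottom_up, "bottom-up")):
        for prev, nxt in zip(order, order[1:]):
            word = prev[-1] + nxt[0]   # last char of prev + first of next
            if CORPUS.get(word, 0.0) > second_threshold:
                return name
    return "top-down"   # neither order confirmed: fall back to the common one

lines = ["abcd", "gh"]
# under the top-down order, "d" + "g" forms "dg" with frequency 0.90
```

The fallback branch corresponds to the case described below where no cross-line word clears the threshold and the common top-to-bottom order is used directly.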
It should be noted that, in the embodiment of the present invention, in order to further improve the accuracy of character recognition, different pairs of adjacent rows may be selected and steps S602 and S603 performed multiple times to verify the line-order determination result; that is, the determined line order is finally taken as the correct line order when the line orders determined in the multiple runs, or in most of them, are the same.
In practical applications it may also happen that, for every selected pair of adjacent lines in the text area and every candidate line arrangement order, the occurrence frequency in the corpus of the word formed by the last character of the previous line and the first character of the next line is below the threshold. For example, in the text area shown in fig. 8, under the arrangement order with the uppermost line of the text area first and the lowermost line last, the last character of the first line (enclosed by a square in fig. 8) and the first character of the second line (also enclosed by a square in fig. 8) form a word whose occurrence frequency in the corpus is 0, below the second threshold; under the arrangement order with the uppermost line last and the lowermost line first, the last character of the first line (enclosed by an oval in fig. 8) and the first character of the next line (also enclosed by an oval in fig. 8) form a word whose occurrence frequency is likewise 0, also below the second threshold. In this case the line arrangement order common in practical applications can be used directly as the line order of the text area, that is, the first line, the second line, and so on to the last line, from top to bottom.
In the embodiment of the present invention, when the character-arrangement word order of a text region is top-to-bottom or bottom-to-top, the arrangement order of its columns is likewise limited to either left-to-right or right-to-left. The inventor considers that in practical applications, when the word order is top-to-bottom or bottom-to-top, the column arrangement most commonly corresponding to the correct semantics of the text is that shown in fig. 9: the first column, the second column, and so on to the last column, from right to left. Therefore, when the word order of the text region is determined to be top-to-bottom or bottom-to-top, the characters included in the first column, then those in the second column, and so on until the last column, may be output in sequence according to the common column arrangement shown in fig. 9.
That is, after the terminal device takes the top-to-bottom order as the character-arrangement word order of the text area shown in fig. 9, it outputs in sequence, according to the common column arrangement, the characters "miss big and small" included in the first column, the characters "feel like the taste" included in the second column, the characters "know different from the normal" included in the third column, and the characters "people." included in the fourth column; that is, the text finally output for the text area is: "all size sisters have a different perception of taste than everyone."
In the embodiment of the present invention, in order to further improve the accuracy of character recognition, when the text area contains multiple columns of characters, the column order of the text area may also be determined according to the steps shown in fig. 10, which include:
step S701: determining the first character and the last character of each column in the text area;
step S702: for any two adjacent columns, respectively determining, for different column arrangement orders, the occurrence frequency in the corpus of the word formed by the last character of the previous column and the first character of the next column;
step S703: determining the column order of the characters in the text area according to the column arrangement order corresponding to the word whose occurrence frequency is greater than a fifth threshold;
step S704: according to the determined column order, determining the next column of characters to be output after outputting one column of characters from the text area.
The process of determining the column order of a text region is described below, taking as an example the text regions shown in figs. 11a and 11b, whose character-arrangement word order is top-to-bottom. Similarly, once the word order is known to be top-to-bottom, the first and last characters of each column are determined: the topmost character of each column is the first character of the column, and the bottommost character is the last.
For example, as shown in fig. 11a, assuming that the rightmost column of the text area is the first column and the leftmost column is the last column, the last character "feel" of the second column and the first character "know" of the third column may be selected to form the word "feeling", and the frequency of occurrence of the word "feeling" in the corpus is determined; assume here that this frequency is 90%. Then, in the column order shown in fig. 11b, where the leftmost column of the text area is the first column and the rightmost column is the last (fourth) column, the last character "Van" of the second column and the first character "pair" of the third column are selected to form the word "Van pair", whose frequency of occurrence in the corpus is assumed to be 0.
Assume that the fifth threshold in step S703 is also 50%. The frequency of the word "feeling" is then greater than the fifth threshold, while the frequency of the word "Van pair" is less than it, so the arrangement order corresponding to the word "feeling" can be taken as the column order of the characters in the text area. That is, the columns of the text area are arranged as shown in fig. 11a, with the first column on the right and the last column on the left. According to this column order, the next column of characters to be output can be determined after each column is output, and the characters contained in each column of the text area are output in sequence: after the characters "miss big and small" of the first column, the characters "feeling of taste" of the second column, the characters "known to be different from that" of the third column, and the characters "person." of the fourth column are output in turn. The complete text finally output is: "all size sisters have a different perception of taste than everyone."
In practical application, it may also happen that, for any two adjacent columns of the text area, the frequency of occurrence in the corpus of the word formed by the last character of the previous column and the first character of the next column is smaller than the threshold under both arrangement orders. In that case, the column order common in practical application, namely the rightmost column first and the leftmost column last, can be used directly as the column order of the text area.
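The column-ordering heuristic of steps S701 to S704 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name, the candidate orders (the given order and its reverse), and the 50% threshold are assumptions, and `bigram_freq` stands in for a real corpus frequency lookup.

```python
def order_columns(columns, bigram_freq, threshold=0.5):
    """Pick a reading order for the columns of a text area.

    columns: list of column strings, given in one candidate order
             (e.g. rightmost column first).
    bigram_freq: dict mapping a two-character word to its corpus
                 frequency (a hypothetical stand-in for a corpus).
    Returns the columns in the order whose cross-column words
    (last char of a column + first char of the next column) all
    exceed the threshold; otherwise falls back to the given order.
    """
    for candidate in (columns, list(reversed(columns))):
        ok = True
        for prev, nxt in zip(candidate, candidate[1:]):
            word = prev[-1] + nxt[0]  # word spanning the column break
            if bigram_freq.get(word, 0.0) <= threshold:
                ok = False
                break
        if ok:
            return candidate
    # Neither order clears the threshold: keep the common default order.
    return columns
```

For instance, with `bigram_freq = {"感觉": 0.9}` the call `order_columns(["觉Y", "X感"], bigram_freq)` detects that only the reversed order produces the frequent cross-column word, and returns `["X感", "觉Y"]`.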
It can thus be seen that the above method treats correct word order as the key factor for correctly understanding text content. The word order of the text in different images may differ, and a single image may even contain text in different word orders. The character information recognition method in the embodiment of the invention therefore selects one character in the text area as a reference character, determines the words formed by the reference character and its adjacent characters under different word orders, and selects, from among the words formed under the different word orders, the word order of the word whose frequency of occurrence in the corpus meets a preset condition. If the preset condition is that the frequency of occurrence is the largest, the word order of the most frequent word is taken as the character arrangement order of the text area, and the characters contained in the text area are output according to the selected order, thereby improving the accuracy of character recognition.
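The reference-character step summarized above can be sketched as follows. This is an illustrative sketch only: the order names, the `corpus_freq` lookup, and the tie-breaking by maximum frequency are assumptions consistent with the "largest frequency" preset condition described in the text.

```python
def choose_word_order(ref_char, neighbors, corpus_freq):
    """Pick the word order under which the word (reference character +
    adjacent character) is most frequent in the corpus.

    neighbors: dict mapping an order name (e.g. "left_to_right",
               "right_to_left") to the character adjacent to ref_char
               when the text is read in that order.
    corpus_freq: hypothetical corpus frequency lookup.
    Returns (best_order, frequency_of_its_word).
    """
    best_order, best_freq = None, -1.0
    for order, adjacent in neighbors.items():
        word = ref_char + adjacent  # word read in this candidate order
        freq = corpus_freq.get(word, 0.0)
        if freq > best_freq:
            best_order, best_freq = order, freq
    return best_order, best_freq
```

The returned frequency could also serve as the confidence value output alongside the recognized text, as the embodiment describes.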
In practical application, the character information recognition method in the embodiment of the present invention can be implemented as a dedicated image text recognition tool in a programming language such as C++ or C, for example with the help of the OpenCV library, and applied to any scenario in which the text in an image needs to be recognized, for example browsing cartoon pictures with a dedicated cartoon Application (APP), or feeding the recognized text into tasks in the field of NLP.
The character information recognition method in the embodiment of the invention can also be applied to scenarios in which the text content of an image is further mined and analyzed, such as extracting the type of a cartoon, extracting the style of a cartoon, clustering similar cartoons, recommending other cartoons related to a given cartoon type, and the like.
For example, in a scenario in which the cartoon type is extracted from the recognized text information, subject terms reflecting the theme of the cartoon can be further extracted from the recognized text, and the extracted subject terms can then be used to analyze whether the cartoon is, for instance, a romance, fantasy, or crossover cartoon. Likewise, in a scenario in which the cartoon style is extracted from the recognized text information, words related to the cartoon characters can be further extracted, and the style of the cartoon can be analyzed as, for instance, a children's series or a girls' series.
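The genre-extraction scenario above can be sketched with a simple keyword-matching classifier. Everything here is an illustrative assumption: the genre names, the keyword lists, and the scoring rule are placeholders for whatever analysis a real system would apply to the extracted subject terms.

```python
def classify_genre(subject_terms, genre_keywords):
    """Score each genre by how many of its keywords appear among the
    subject terms extracted from the recognized text, and return the
    best-scoring genre. Both the genres and keywords are hypothetical.

    subject_terms: set of terms extracted from recognized text.
    genre_keywords: dict mapping genre name -> list of keywords.
    """
    scores = {genre: sum(term in subject_terms for term in keywords)
              for genre, keywords in genre_keywords.items()}
    # Pick the genre with the highest keyword overlap.
    return max(scores, key=scores.get)
```

For example, with genres `{"romance": ["love", "heart"], "fantasy": ["magic", "dragon"]}`, the subject terms `{"magic", "sword"}` would be classified as "fantasy".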
Based on the same inventive concept, an embodiment of the present invention provides a character information recognition apparatus. For the specific implementation of the character information recognition method by the apparatus, reference may be made to the description of the above method embodiment; repeated parts are not described again. As shown in fig. 12, the apparatus includes:
the identification unit is used for identifying a character area in the image;
a first determining unit 80, configured to select a character in the text area as a reference character, and determine the words formed by the reference character and its adjacent characters under different word orders;
a second determining unit 81, configured to determine, from among words composed in different word orders, a word whose occurrence frequency in the corpus meets a preset condition;
and the output unit 83 is configured to use the word sequence of the word whose occurrence frequency meets a preset condition as the character arrangement word sequence in the character region, and output each line of characters identified from the character region according to the character arrangement word sequence.
Optionally, the second determining unit is further configured to:
and determining that the occurrence frequency of the word whose occurrence frequency meets the preset condition is higher than a set first threshold.
Optionally, the output unit is further configured to:
for a text region containing a plurality of lines of characters, also determining a first character and a last character of each line in the text region;
for two adjacent lines, respectively determining the frequency of occurrence in the corpus of the words formed by the last character of the previous line and the first character of the next line under different arrangement orders;
determining the sequence of the characters in each line in the character area according to the sequence of the words with the occurrence frequency larger than a second threshold value;
and according to the determined line sequence, determining the next line of characters to be output after outputting one line of characters from the character area.
Optionally, the different arrangement orders are determined according to the word arrangement word order.
Optionally, the output unit is further configured to:
and taking the occurrence frequency of the word with the occurrence frequency meeting the preset condition as the confidence coefficient of the characters contained in the character region according to the character arrangement word order, and outputting the confidence coefficient.
Optionally, the identification unit is further configured to:
carrying out binarization processing on the image;
performing dilation processing on the binarized image according to a preset dilation coefficient and erosion processing according to a preset erosion coefficient, so as to obtain a plurality of white areas in the image;
determining, from the plurality of white areas, a target white area with an area larger than a third threshold and an aspect ratio larger than a fourth threshold;
determining a plurality of pixel points in the image corresponding to the target white area, wherein the area formed by the pixel points is the text area of the image; and recognizing the text area.
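The binarize / dilate / erode / filter pipeline used by the recognition unit can be sketched in pure NumPy as follows. This is a minimal sketch under stated assumptions: the 3x3 structuring element, the 127 binarization cutoff, the placeholder area and aspect-ratio thresholds, and the function name are all illustrative; a production implementation would more likely use OpenCV's `cv2.threshold`, `cv2.dilate`, `cv2.erode`, and `cv2.connectedComponentsWithStats`.

```python
import numpy as np

def find_text_regions(gray, dilate_iter=1, erode_iter=1,
                      area_thresh=4, aspect_thresh=1.0):
    """Locate candidate text regions following the described recipe:
    binarize, dilate, erode, then keep white regions whose area exceeds
    a third threshold and whose aspect ratio exceeds a fourth threshold.
    gray: 2-D uint8 array. Returns a list of (x, y, w, h) boxes.
    """
    # 1. Binarization: bright pixels become white (1).
    binary = (gray > 127).astype(np.uint8)

    # 2. Dilation then erosion with a 3x3 structuring element,
    #    implemented as sliding max / min filters.
    def morph(img, op, iters):
        out = img
        for _ in range(iters):
            pad_val = 0 if op is np.max else 1
            padded = np.pad(out, 1, constant_values=pad_val)
            stacked = np.stack(
                [padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
                 for dy in range(3) for dx in range(3)])
            out = op(stacked, axis=0)
        return out

    closed = morph(morph(binary, np.max, dilate_iter), np.min, erode_iter)

    # 3. Gather connected white components via iterative flood fill.
    h, w = closed.shape
    visited = np.zeros_like(closed, dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if closed[sy, sx] and not visited[sy, sx]:
                stack, pts = [(sy, sx)], []
                visited[sy, sx] = True
                while stack:
                    y, x = stack.pop()
                    pts.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and closed[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                ys = [p[0] for p in pts]
                xs = [p[1] for p in pts]
                bw = max(xs) - min(xs) + 1
                bh = max(ys) - min(ys) + 1
                # 4. Keep regions passing the area (third) and
                #    aspect-ratio (fourth) thresholds.
                if len(pts) > area_thresh and bw / bh > aspect_thresh:
                    boxes.append((min(xs), min(ys), bw, bh))
    return boxes
```

The wide bounding boxes this returns would then be passed to the character recognizer; small or tall components (specks, vertical strokes of artwork) are filtered out by the two thresholds.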
Based on the same inventive concept, an apparatus for recognizing text information is further provided in the embodiment of the present invention, as shown in fig. 13, including at least one processor 90 and at least one memory 91, where the memory 91 stores a computer program, and when the program is executed by the processor 90, the processor 90 executes the steps of the method for recognizing text information in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a storage medium, where the storage medium stores computer instructions, and when the computer instructions are run on a computer, the computer is caused to execute the steps of the text information identification method in the embodiment of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (14)

1. A method for recognizing text information is characterized by comprising the following steps:
identifying a text area in the image;
selecting a character in the text area as a reference character, and determining words formed by the reference character and its adjacent characters under different word orders;
determining words with the occurrence frequency meeting preset conditions in a corpus from various words composed under different word orders;
taking the word order of the words with the occurrence frequency meeting the preset conditions as the word arrangement word order in the word area;
and outputting each line of characters identified from the character area according to the character arrangement language sequence.
2. The method according to claim 1, wherein before taking the word order of the word whose occurrence frequency meets the preset condition as the character arrangement word order in the text area, the method further comprises: determining that the occurrence frequency of the word whose occurrence frequency meets the preset condition is higher than a set first threshold.
3. The method of claim 1, wherein the method further comprises:
for a text region containing a plurality of lines of characters, also determining a first character and a last character of each line in the text region;
for two adjacent lines, respectively determining the frequency of occurrence in the corpus of the words formed by the last character of the previous line and the first character of the next line under different arrangement orders;
determining the sequence of the characters in each line in the character area according to the sequence of the words with the occurrence frequency larger than a second threshold value;
and according to the determined line sequence, determining the next line of characters to be output after outputting one line of characters from the character area.
4. The method according to claim 3, wherein said different arrangement order is determined according to said word arrangement language order.
5. The method of claim 1, wherein after outputting the characters included in the text region in the text arrangement language order, the method further comprises:
and taking the occurrence frequency of the word with the occurrence frequency meeting the preset condition as the confidence coefficient of the characters contained in the character region according to the character arrangement word order, and outputting the confidence coefficient.
6. The method according to any one of claims 1 to 5, wherein the identifying the text area in the image specifically comprises:
carrying out binarization processing on the image;
performing dilation processing on the binarized image according to a preset dilation coefficient and erosion processing according to a preset erosion coefficient, so as to obtain a plurality of white areas in the image;
determining, from the plurality of white areas, a target white area with an area larger than a third threshold and an aspect ratio larger than a fourth threshold;
determining a plurality of pixel points in the image corresponding to the target white area, wherein the area formed by the pixel points is the text area of the image; and recognizing the text area.
7. A character information recognition apparatus, comprising:
the identification unit is used for identifying a character area in the image;
the first determining unit is used for selecting one character in the character area as a reference character and determining words formed by the reference character in different language sequences and adjacent characters;
the second determining unit is used for determining words of which the occurrence frequency accords with preset conditions in the corpus from the words formed under different word orders;
and the output unit is used for taking the word sequence of the words with the occurrence frequency meeting the preset conditions as the character arrangement word sequence in the character area and outputting each line of characters identified from the character area according to the character arrangement word sequence.
8. The apparatus of claim 7, wherein the second determining unit is further configured to:
and determining that the occurrence frequency of the word whose occurrence frequency meets the preset condition is higher than a set first threshold.
9. The apparatus of claim 7, wherein the output unit is further configured to:
for a text region containing a plurality of lines of characters, also determining a first character and a last character of each line in the text region;
for two adjacent lines, respectively determining the frequency of occurrence in the corpus of the words formed by the last character of the previous line and the first character of the next line under different arrangement orders;
determining the sequence of the characters in each line in the character area according to the sequence of the words with the occurrence frequency larger than a second threshold value;
and according to the determined line sequence, determining the next line of characters to be output after outputting one line of characters from the character area.
10. The apparatus of claim 9, wherein said different arrangement order is determined according to said word arrangement language order.
11. The apparatus of claim 7, wherein the output unit is further configured to:
and taking the occurrence frequency of the word with the occurrence frequency meeting the preset condition as the confidence coefficient of the characters contained in the character region according to the character arrangement word order, and outputting the confidence coefficient.
12. The apparatus according to any one of claims 7 to 11, wherein the identification unit is further configured to:
carrying out binarization processing on the image;
performing dilation processing on the binarized image according to a preset dilation coefficient and erosion processing according to a preset erosion coefficient, so as to obtain a plurality of white areas in the image;
determining, from the plurality of white areas, a target white area with an area larger than a third threshold and an aspect ratio larger than a fourth threshold;
determining a plurality of pixel points in the image corresponding to the target white area, wherein the area formed by the pixel points is the text area of the image; and recognizing the text area.
13. A text information recognition apparatus comprising at least one processor and at least one memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any one of claims 1 to 6.
14. A storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the steps of the method according to any one of claims 1 to 6.
CN201810516918.XA 2018-05-25 2018-05-25 Character information identification method and device and storage medium Active CN110533020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810516918.XA CN110533020B (en) 2018-05-25 2018-05-25 Character information identification method and device and storage medium

Publications (2)

Publication Number Publication Date
CN110533020A CN110533020A (en) 2019-12-03
CN110533020B CN110533020B (en) 2022-08-12

Family

ID=68656973





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant