CN112668600B - Text recognition method and device - Google Patents

Text recognition method and device

Info

Publication number
CN112668600B
CN112668600B
Authority
CN
China
Prior art keywords
feature
slices
characteristic
feature sequence
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910983555.5A
Other languages
Chinese (zh)
Other versions
CN112668600A (en)
Inventor
胡文阳 (Hu Wenyang)
侯军 (Hou Jun)
蔡晓聪 (Cai Xiaocong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sensetime International Pte Ltd
Original Assignee
Sensetime International Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sensetime International Pte Ltd
Priority to CN201910983555.5A
Publication of CN112668600A
Application granted
Publication of CN112668600B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure provides a text recognition method and device. In the method, an initial feature map of a text picture is pooled to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices; dependency information among the plurality of first feature slices is obtained based on the first feature sequence; a second feature sequence is obtained based on the first feature sequence and the dependency information among the plurality of first feature slices; and a text recognition result of the text picture is obtained based on the second feature sequence. By learning dependency information among the feature slices of the text picture and performing text recognition on that basis, the accuracy of text recognition is improved.

Description

Text recognition method and device
Technical Field
The present disclosure relates to the field of computers, and in particular, to a text recognition method and apparatus.
Background
Text recognition is widely applied in various fields, such as license plate recognition, natural scene text recognition, and document text line recognition. The recognition accuracy of existing text recognition models needs to be further improved.
Disclosure of Invention
The disclosure provides a text recognition method and device.
In a first aspect, a text recognition method is provided, including:
pooling the initial feature map of the text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices;
obtaining dependency information among the plurality of first feature slices based on the first feature sequence;
obtaining a second feature sequence based on the first feature sequence and dependency information among the plurality of first feature slices;
and obtaining a text recognition result of the text picture based on the second feature sequence.
In one implementation, the obtaining dependency information between the plurality of first feature slices based on the first feature sequence includes:
performing dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
performing fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
and obtaining the dependency information among the plurality of first feature slices based on the fully-connected processing result.
In yet another implementation, the obtaining dependency information among the plurality of first feature slices based on the fully-connected processing result includes:
performing nonlinear mapping processing on the fully-connected processing result to obtain a mapping result;
and replacing, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices.
In yet another implementation, the obtaining a second feature sequence based on the first feature sequence and dependency information among the plurality of first feature slices includes:
performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain a second feature sequence.
In yet another implementation, the performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain a second feature sequence includes:
performing graph convolution on the adjacency matrix included in the dependency information among the plurality of first feature slices, the degree matrix of the adjacency matrix, and the first feature sequence to obtain a second feature sequence.
In yet another implementation, the obtaining, based on the second feature sequence, a text recognition result of the text picture includes:
obtaining, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtaining a text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
In a second aspect, there is provided a scene text recognition apparatus, comprising:
a pooling processing unit, configured to pool an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices;
a first obtaining unit, configured to obtain dependency information among the plurality of first feature slices based on the first feature sequence;
a second obtaining unit, configured to obtain a second feature sequence based on the dependency information among the plurality of first feature slices and the first feature sequence;
and a third obtaining unit, configured to obtain a text recognition result of the text picture based on the second feature sequence.
In one implementation, the first obtaining unit includes:
a dilated convolution unit, configured to perform dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
a fully-connected processing unit, configured to perform fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
and a fourth obtaining unit, configured to obtain the dependency information among the plurality of first feature slices based on the fully-connected processing result.
In yet another implementation, the fourth obtaining unit is configured to:
perform nonlinear mapping processing on the fully-connected processing result to obtain a mapping result; and
replace, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices.
In yet another implementation, the second obtaining unit is configured to perform a graph convolution process on the dependency information among the plurality of first feature slices and the first feature sequence to obtain a second feature sequence.
In yet another implementation, the second obtaining unit is configured to perform graph convolution on an adjacency matrix included in the dependency information between the plurality of first feature slices, a degree matrix of the adjacency matrix, and the first feature sequence, to obtain a second feature sequence.
In yet another implementation, the third obtaining unit is configured to:
obtain, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtain a text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
In a third aspect, there is provided a text recognition apparatus, the apparatus comprising: input means, output means, memory and a processor; wherein the memory stores a set of program code and the processor is configured to invoke the program code stored in the memory to perform the method as described in the first aspect or any of the implementations of the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method described in the first aspect or any implementation of the first aspect.
In a fifth aspect, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method described in the first aspect or any implementation of the first aspect.
The text recognition method and device provided by the disclosure have the following beneficial effects:
An initial feature map of a text picture is pooled to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices; dependency information among the plurality of first feature slices is obtained based on the first feature sequence; a second feature sequence is obtained based on the first feature sequence and the dependency information among the plurality of first feature slices; and a text recognition result of the text picture is obtained based on the second feature sequence. Text recognition is thus performed based on dependency information learned among the plurality of feature slices of the text picture, which improves the accuracy of text recognition.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
Fig. 1 is a schematic flow chart of a text recognition method according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a text recognition model according to an embodiment of the present disclosure;
Fig. 3 is a schematic flow chart of yet another text recognition method according to an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a model structure for graph convolution calculation according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of a model structure for constructing an adjacency matrix according to an embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of still another text recognition apparatus according to an embodiment of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
Text recognition includes scene text recognition and the like. Scene text recognition recognizes text in natural scenes, such as license plate recognition, road sign recognition, and other sequence object recognition, and may also include document text line recognition. Because scenes contain backgrounds, especially complex backgrounds, text recognition in a scene is more difficult than general text recognition.
The embodiments of the present disclosure provide a text recognition method and apparatus, in which an initial feature map of a text picture is pooled to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices; dependency information among the plurality of first feature slices is obtained based on the first feature sequence; a second feature sequence is obtained based on the first feature sequence and the dependency information among the plurality of first feature slices; and a text recognition result of the text picture is obtained based on the second feature sequence. Text recognition is thus performed based on dependency information learned among the plurality of feature slices of the text picture, which improves the accuracy of text recognition.
In addition, compared with text recognition approaches based on Connectionist Temporal Classification (CTC), the text recognition method provided by the embodiments of the present disclosure does not need to feed in feature slices in a specific order, which facilitates parallel computation and thus improves the text recognition speed.
Fig. 1 is a schematic flow chart of a text recognition method according to an embodiment of the disclosure.
S101, pooling an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices.
In some embodiments, the text recognition method may employ the text recognition model shown in fig. 2, into which the text picture is input. The text recognition model includes a deep convolutional neural network, a graph convolutional neural network, and a fully-connected classifier. Processed by the text recognition model, the input yields a text classification prediction, from which the text recognition result is obtained.
Specifically, in this step, the input text picture is processed by the deep convolutional neural network shown in fig. 2 to obtain an initial feature map of size C×H×W, where C is the number of channels, H is the height of the picture, and W is the width of the picture. The deep convolutional neural network can extract features of text against many complex backgrounds.
Then, the initial feature map of size C×H×W is condensed along the height dimension by an average pooling layer to obtain a first feature sequence of size C×1×W.
The first feature sequence comprises a plurality of first feature slices of the text picture, where each first feature slice is a slice of the first feature sequence along the W dimension, of size C×1×1. The text picture includes text in the scene and the scene background, so the plurality of first feature slices are feature slices containing the text, or the text and the background.
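As an illustration of this pooling step, the following is a minimal sketch assuming PyTorch (the disclosure does not name a framework) and the 2048×4×40 example dimensions used below:

```python
# Hedged sketch of S101: average-pool an initial feature map over its height.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 2048, 4, 40)                        # N x C x H x W initial feature map
seq = F.avg_pool2d(feat, kernel_size=(feat.shape[2], 1))  # average over height -> 1 x 2048 x 1 x 40
# Each of the W = 40 columns is one "first feature slice" of size C x 1 x 1.
slices = seq.squeeze(2).transpose(1, 2)                   # 1 x 40 x 2048, one row per slice
print(slices.shape)
```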
S102, obtaining dependency information among the plurality of first feature slices based on the first feature sequence.
Dependency relationships exist among the plurality of first feature slices of the first feature sequence, and learning the dependency information among the plurality of first feature slices helps improve the accuracy of text recognition.
In some embodiments, the coverage of convolution over the first feature sequence is enlarged, that is, the receptive field of the model is enlarged, so that the relationships between adjacent first feature slices are learned; the relationships among the plurality of first feature slices are then learned as a whole, thereby obtaining the dependency information among the plurality of first feature slices.
S103, obtaining a second feature sequence based on the dependency information among the plurality of first feature slices and the first feature sequence.
Based on the learned dependency information among the plurality of first feature slices and the first feature sequence, feature fusion can be performed on the plurality of first feature slices to obtain a second feature sequence. The second feature sequence fuses the features of closely related slices.
The above steps S102 and S103 may be implemented in the graph convolutional neural network shown in fig. 2. Compared with constructing the feature sequence with a BiLSTM, the computation is less time-consuming. Because the embodiments of the present disclosure use a graph convolutional neural network, in which matrix operations are performed, the feature slices can be computed in parallel, yielding a faster recognition speed. Moreover, in the embodiments of the present disclosure, the dependency relationships among the feature slices can be learned automatically by the network.
S104, obtaining a text recognition result of the text picture based on the second feature sequence.
The second feature sequence is input into the fully-connected classifier shown in fig. 2 to obtain the classification probabilities corresponding to each feature slice, and the maximum classification probability of each feature slice is selected to obtain the recognition result of that feature slice, that is, to recognize whether the feature slice is text.
In some embodiments, a fully-connected layer is used as the classifier. For example, the fully-connected layer has C input neurons; the second feature sequence is of size C×1×W, that is, W feature vectors each of size C×1; the number of classes is N; and the recognition results of N neurons are output.
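A minimal sketch of such a fully-connected classifier, assuming PyTorch and illustrative values of C, W, and the class count N (none of which are fixed by the disclosure):

```python
import torch
import torch.nn as nn

C, W, N = 2048, 40, 37                            # channels, slices, classes: illustrative values
classifier = nn.Linear(C, N)                      # fully-connected layer with C inputs, N outputs

second_seq = torch.randn(1, W, C)                 # W second feature slices, each of size C
probs = classifier(second_seq).softmax(dim=-1)    # per-slice classification probabilities
pred = probs.argmax(dim=-1)                       # maximum classification probability per slice
print(pred.shape)                                 # -> torch.Size([1, 40])
```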
According to the text recognition method provided by the embodiments of the present disclosure, text recognition is performed based on dependency information learned among a plurality of feature slices of the text picture, which improves the accuracy of text recognition.
Referring to fig. 3, a flowchart of another text recognition method according to an embodiment of the disclosure is shown, where the method may include the following steps:
S201, pooling an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices.
First, features are extracted from the text picture by a deep convolutional neural network to obtain an initial feature map of the text picture. The deep convolutional neural network comprises convolutional layers, downsampling layers, normalization layers, and the like. The difference between the embodiments of the present disclosure and existing convolutional neural networks is that the convolutional features of the scene text picture are max-downsampled to obtain the initial feature map: the deep convolutional neural network performs max downsampling with a window scale of (2, 1) and a stride of (2, 1). The size of the initial feature map is C×H×W, where C is the number of channels (the R, G, and B channels for the input picture), H is the height of the picture or feature slice, and W is the width of the picture or feature slice. A window scale of (2, 1) means the sliding window is 2 pixels wide and 1 pixel high; a stride of (2, 1) means the window slides by 2 pixels from left to right and by 1 pixel from top to bottom. Of course, the above values of the scale and stride are merely examples and are not limiting in this disclosure.
For example, the size of the input text picture is 3×64×160; after convolution, max downsampling, and similar processing, the size of the initial feature map is 2048×4×40. That is, the 64×160 text picture is scaled down proportionally and irrelevant information is removed, so that the information is more concentrated. Regarding the 4×40 grid as a plane of feature slices, the features at each point on the plane are expressed through 2048 channels, which improves the accuracy of feature extraction.
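One such max-downsampling stage might be sketched as follows, assuming PyTorch, whose pooling arguments follow a (height, width) convention; which axis the (2, 1) window reduces, and the channel count, are assumptions here, and several such stages combined with other downsampling would be needed to reach the 2048×4×40 example above:

```python
import torch
import torch.nn as nn

# One max-downsampling stage with a (2, 1) window and (2, 1) stride.
pool = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))

x = torch.randn(1, 256, 64, 160)   # hypothetical intermediate feature map
y = pool(x)                        # -> 1 x 256 x 32 x 160 (one dimension halved, the other kept)
print(y.shape)
```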
Then, the initial feature map of the text picture is pooled to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices.
As shown in the schematic model structure of the graph convolution calculation in fig. 4, the initial feature map of size C×H×W is condensed along the height dimension by the average pooling layer to obtain a first feature sequence X of size C×1×W.
For example, the initial feature map of size 2048×4×40 is processed by the average pooling layer to obtain a first feature sequence of size 2048×1×40.
S202, performing dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result.
The following steps S202 to S204 obtain the dependency information among the plurality of first feature slices, that is, construct an adjacency matrix; the model for constructing an adjacency matrix shown in fig. 5 may be used. The first feature sequence obtained above is input into the model, which includes a dilated convolution layer and a fully-connected layer, and the model outputs the adjacency matrix.
Specifically, the first feature sequence is filtered by dilated convolution, which enlarges the coverage of the convolution over the first feature sequence. The dilated convolution layer comprises (C×W) convolution filters; each filter may, for example, be 3×3 in size with a dilation rate of 2, in pixels. For example, for 3 adjacent taps, 2 zeros may be inserted at set positions after or between them, that is, the 3×3 filter is expanded to cover a 5×5 region, so that the coverage of the convolution over the first feature sequence becomes larger and, accordingly, the receptive field of the model is enlarged.
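The receptive-field effect can be illustrated with a small sketch, assuming PyTorch; the output channel count is an assumption:

```python
import torch
import torch.nn as nn

# 3x3 kernel with dilation 2: the taps are spread out so the kernel covers a 5x5 region
# without adding parameters.
dilated = nn.Conv2d(in_channels=2048, out_channels=64,   # out_channels is illustrative
                    kernel_size=3, dilation=2, padding=2)

seq = torch.randn(1, 2048, 1, 40)  # first feature sequence, C x 1 x W
out = dilated(seq)                 # padding keeps the 1 x W spatial size
print(out.shape)                   # -> torch.Size([1, 64, 1, 40])
```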
S203, performing fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result.
By convolving the first feature sequence, the relationships between adjacent slices among the plurality of first feature slices can be obtained; as the coverage of the convolution over the first feature sequence is enlarged, more relationship information between adjacent first feature slices is obtained.
Specifically, the first feature sequence of size C×1×W is passed through the (C×W) convolution filters to obtain a W×W feature vector matrix.
S204, obtaining the dependency information among the plurality of first feature slices based on the fully-connected processing result.
After the relationship between each group of adjacent first feature slices is obtained, the relationships among the plurality of first feature slices may also be obtained through the fully-connected layer shown in fig. 5, that is, the relationships among the plurality of first feature slices may be learned as a whole.
Specifically, the fully-connected layer has (W×W) input neurons and (W×W) output neurons. The W×W feature vector matrix is input into the fully-connected layer, so that the relationships between the first feature slices in the W×W feature vector matrix can be learned. Based on the relationship between each group of adjacent first feature slices and the relationships among the plurality of first feature slices, an adjacency matrix A of size W×W is obtained.
For the initial feature map of size 2048×4×40 obtained above, the size of the resulting adjacency matrix is 40×40.
S204 specifically includes: performing nonlinear mapping processing on the fully-connected processing result to obtain a mapping result; and replacing, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain the adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices.
In a specific implementation, the fully-connected processing result needs to be constrained within a set range, so nonlinear mapping processing is performed on the fully-connected processing result to obtain the mapping result; that is, a range of values of the adjacency matrix is preset, for example 0 to 1. The values obtained above may fall outside the set range and can be constrained to between 0 and 1 by a Sigmoid function.
In addition, in the above adjacency matrix, the W diagonal values are the relationship values between each first feature slice and itself, and the relationship of a first feature slice with itself is the closest. However, the diagonal values of the actually computed adjacency matrix may be slightly lower than 1, so the diagonal values of the adjacency matrix may be uniformly replaced with 1 to correct the constructed adjacency matrix.
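Putting steps S202 to S204 together, a hedged sketch of the adjacency-matrix branch is given below, assuming PyTorch; the layer sizes and the exact way the W×W relation matrix is produced are assumptions rather than the patent's precise configuration:

```python
import torch
import torch.nn as nn

class AdjacencyBranch(nn.Module):
    """Dilated convolution -> fully-connected layer -> Sigmoid -> diagonal set to 1."""
    def __init__(self, channels: int, width: int):
        super().__init__()
        # Dilated convolution producing one W x W relation matrix per input (assumed layout).
        self.dilated = nn.Conv2d(channels, width, kernel_size=3, dilation=2, padding=2)
        # Fully-connected layer with W*W input neurons and W*W output neurons.
        self.fc = nn.Linear(width * width, width * width)
        self.width = width

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: N x C x 1 x W first feature sequence
        rel = self.dilated(seq).flatten(1)         # N x (W*W) relation features
        a = torch.sigmoid(self.fc(rel))            # constrain values to (0, 1)
        a = a.view(-1, self.width, self.width)
        eye = torch.eye(self.width, device=a.device)
        return a * (1 - eye) + eye                 # diagonal uniformly corrected to 1

branch = AdjacencyBranch(channels=2048, width=40)
adj = branch(torch.randn(1, 2048, 1, 40))          # -> 1 x 40 x 40 adjacency matrix
print(adj.shape)
```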
S205, performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain a second feature sequence.
Based on the learned dependency information among the plurality of first feature slices and the first feature sequence, feature fusion can be performed on the plurality of first feature slices to obtain a second feature sequence. The second feature sequence fuses the features of closely related slices.
Further, graph convolution may be performed on the adjacency matrix included in the dependency information among the plurality of first feature slices, the degree matrix of the adjacency matrix, and the first feature sequence to obtain the second feature sequence.
Specifically, based on the adjacency matrix A constructed above, the degree matrix D of the adjacency matrix can be calculated; the calculation may follow the standard definition of a degree matrix.
As shown in fig. 4, based on the adjacency matrix A, the degree matrix D of the adjacency matrix, and the first feature sequence X, the second feature sequence of the plurality of feature slices is calculated according to the graph convolution formula f(X, A) = D⁻¹AX. The degree matrix normalizes the adjacency matrix A so that the value distribution of the adjacency matrix is more uniform. The second feature sequence fuses the features of closely related slices.
The second feature sequence may include values less than 0 and values greater than 0. Values less than 0 usually correspond to the background in the scene rather than to text, so values less than 0 in the second feature sequence may be set to 0 to cull the non-text feature slices.
Specifically, as shown in fig. 4, the non-text feature slices may be culled by setting values less than 0 in the second feature sequence to 0 through the ReLU activation function.
The range of values of a feature sequence is typically 0 to 100, and after the non-text feature slices are removed, values in the second feature sequence may exceed 100, for example 300, affecting the accuracy of subsequent classification. As shown in fig. 4, the second feature sequence is therefore multiplied by a set weight W so that the weighted second feature sequence lies within a suitable range. For example, if a value of the second feature sequence is 300, multiplying it by the set weight W = 1/3 yields 100, which lies in a range more favourable to accurately predicting the classification probabilities.
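The whole computation of S205 can be sketched as follows, assuming PyTorch; the set weight is shown as a scalar-scaled identity matrix purely for illustration:

```python
import torch

def graph_conv(x: torch.Tensor, a: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """One graph-convolution step as in the text: ReLU(D^-1 A X) W.

    x: N x W x C first feature sequence (one row per feature slice)
    a: N x W x W adjacency matrix
    w: C x C set weight
    """
    d = a.sum(dim=-1, keepdim=True)    # node degrees, i.e. the diagonal of D
    fused = torch.relu((a / d) @ x)    # D^-1 A X; ReLU culls negative (non-text) values
    return fused @ w                   # apply the set weight

x = torch.randn(1, 40, 2048)
a = torch.rand(1, 40, 40)
w = torch.eye(2048) / 3                # the 1/3 value follows the text's example
second_seq = graph_conv(x, a, w)       # -> 1 x 40 x 2048
print(second_seq.shape)
```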
S206, based on the second feature sequence, obtaining a classification result of each second feature slice in the plurality of second feature slices included in the second feature sequence.
The second feature sequence is input into the fully-connected classifier shown in fig. 2 to obtain the classification probabilities corresponding to each feature slice, and the maximum classification probability of each feature slice is selected to obtain the recognition result of that feature slice, that is, to recognize whether the feature slice is text.
Specifically, a fully-connected layer is used as the classifier. The fully-connected layer has C input neurons; the second feature sequence is of size C×1×W, that is, W feature vectors each of size C×1; the number of classes is N; and the recognition results of N neurons are output.
S207, obtaining a text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
After the classification result of each second feature slice is obtained, the classification results of the plurality of second feature slices are spliced to obtain the text recognition result of the text picture.
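A hedged sketch of this splicing step; the character table and the handling of blank or repeated slices are assumptions, since the disclosure only states that the per-slice results are concatenated:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"   # hypothetical class table; class 0 = blank

def splice(pred_classes, blank_id: int = 0) -> str:
    """Concatenate per-slice classification results into the recognised text."""
    chars, prev = [], blank_id
    for c in pred_classes:
        if c != blank_id and c != prev:              # assumed: skip blanks and repeats
            chars.append(ALPHABET[c - 1])
        prev = c
    return "".join(chars)

print(splice([0, 3, 3, 0, 2, 11, 11]))               # -> "21a"
```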
According to the text recognition method provided by the embodiments of the present disclosure, text recognition is performed based on dependency information learned among a plurality of feature slices of the text picture, which improves the accuracy of text recognition. Meanwhile, because matrix operations are used in the graph convolutional neural network, the feature slices can be computed in parallel, yielding a faster recognition speed. Max downsampling of the convolutional features of the text picture improves the accuracy of feature extraction. The adjacency matrix is constructed through the dilated convolution and the fully-connected layer, and the dependency relationships between the feature slices are learned, so that automatic graph composition over the feature slices is realized. By applying the set weight to the second feature sequence, the weighted second feature sequence lies within the range suitable for the classification probabilities, so the classification probability corresponding to each feature slice can be obtained accurately.
Based on the same conception as the text recognition method, an embodiment of the present disclosure also provides a text recognition apparatus. Fig. 6 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure; the text recognition apparatus 1000 includes: a pooling processing unit 11, a first obtaining unit 12, a second obtaining unit 13, and a third obtaining unit 14. Wherein:
the pooling processing unit 11 is configured to pool an initial feature map of a text picture to obtain a first feature sequence, where the first feature sequence includes a plurality of first feature slices;
the first obtaining unit 12 is configured to obtain dependency information among the plurality of first feature slices based on the first feature sequence;
the second obtaining unit 13 is configured to obtain a second feature sequence based on the first feature sequence and the dependency information among the plurality of first feature slices;
and the third obtaining unit 14 is configured to obtain a text recognition result of the text picture based on the second feature sequence.
In one implementation, the first obtaining unit 12 includes:
a dilated convolution unit 121, configured to perform dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
a fully-connected processing unit 122, configured to perform fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
and a fourth obtaining unit 123, configured to obtain the dependency information among the plurality of first feature slices based on the fully-connected processing result.
In yet another implementation, the fourth obtaining unit 123 is configured to:
perform nonlinear mapping processing on the fully-connected processing result to obtain a mapping result; and
replace, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, where the adjacency matrix includes a dependency value between any two feature slices among the plurality of first feature slices.
In yet another implementation, the second obtaining unit 13 is configured to perform a graph convolution process on the dependency information between the plurality of first feature slices and the first feature sequence to obtain a second feature sequence.
In yet another implementation, the second obtaining unit 13 is configured to perform graph convolution on an adjacency matrix included in the dependency information between the plurality of first feature slices, a degree matrix of the adjacency matrix, and the first feature sequence, to obtain a second feature sequence.
In yet another implementation, the third obtaining unit 14 is configured to:
obtain, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtain a text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
For specific implementations of the pooling processing unit 11, the first obtaining unit 12, the second obtaining unit 13, and the third obtaining unit 14, reference may be made to the detailed descriptions of the method embodiments shown in fig. 1 or fig. 3.
According to the text recognition apparatus provided by the embodiments of the present disclosure, text recognition is performed based on dependency information learned among a plurality of feature slices of the text picture, which improves the accuracy of text recognition.
The embodiment of the disclosure also provides a device for executing the text recognition method. Some or all of the methods described above may be implemented in hardware, or may be implemented in software or firmware.
Alternatively, the apparatus may be a chip or an integrated circuit when embodied.
Alternatively, when part or all of the text recognition method of the above embodiment is implemented by software or firmware, it may be implemented by an apparatus provided in fig. 7. As shown in fig. 7, the apparatus may include:
an input device, an output device, a memory, and a processor (there may be one or more processors in the apparatus; one processor is taken as an example in fig. 7). In the embodiments of the present disclosure, the input device, output device, memory, and processor may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 7.
Wherein the processor is configured to perform the method steps in the embodiments shown in fig. 1 or fig. 3 described above.
Alternatively, the program of the above text recognition method may be stored in the memory. The memory may be a physically separate unit or may be integrated with the processor. The memory may also be used to store data.
Alternatively, when part or all of the text recognition method of the above embodiments is implemented by software, the apparatus may include only the processor. The memory for storing the program is located outside the apparatus, and the processor is connected to the memory through circuits or wires and is configured to read and execute the program stored in the memory.
The processor may be a central processing unit (CPU), a network processor (NP), or a WLAN device.
The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above types of memory.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the division of the unit is merely a logic function division, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. The coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present disclosure.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in, or transmitted through, a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center containing an integration of one or more usable media. The usable medium may be a read-only memory (ROM), a random-access memory (RAM), a magnetic medium such as a floppy disk, a hard disk, a magnetic tape, or a magnetic disk, an optical medium such as a digital versatile disc (DVD), or a semiconductor medium such as a solid-state drive (SSD).

Claims (5)

1. A text recognition method, comprising:
pooling an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices;
obtaining dependency information among the plurality of first feature slices based on the first feature sequence;
obtaining a second feature sequence based on the first feature sequence and the dependency information among the plurality of first feature slices;
obtaining a text recognition result of the text picture based on the second feature sequence;
the obtaining dependency information among the plurality of first feature slices based on the first feature sequence includes:
performing dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
performing fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
obtaining the dependency information among the plurality of first feature slices based on the fully-connected processing result;
the obtaining the dependency information among the plurality of first feature slices based on the fully-connected processing result includes:
performing nonlinear mapping processing on the fully-connected processing result to obtain a mapping result;
replacing, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices;
the obtaining a second feature sequence based on the dependency information among the plurality of first feature slices and the first feature sequence includes:
performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain the second feature sequence;
the performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain the second feature sequence includes:
performing graph convolution on the adjacency matrix included in the dependency information among the plurality of first feature slices, the degree matrix of the adjacency matrix, and the first feature sequence to obtain the second feature sequence;
the obtaining the text recognition result of the text picture based on the second feature sequence includes:
obtaining, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtaining the text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
2. A scene text recognition apparatus, comprising:
a pooling processing unit, configured to pool an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices;
a first obtaining unit, configured to obtain dependency information among the plurality of first feature slices based on the first feature sequence;
a second obtaining unit, configured to obtain a second feature sequence based on the dependency information among the plurality of first feature slices and the first feature sequence;
a third obtaining unit, configured to obtain a text recognition result of the text picture based on the second feature sequence;
the first obtaining unit includes:
a dilated convolution unit, configured to perform dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
a fully-connected processing unit, configured to perform fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
a fourth obtaining unit, configured to obtain the dependency information among the plurality of first feature slices based on the fully-connected processing result;
the fourth obtaining unit is configured to:
perform nonlinear mapping processing on the fully-connected processing result to obtain a mapping result; and
replace, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices;
the second obtaining unit is configured to perform graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain the second feature sequence;
the second obtaining unit is configured to perform graph convolution on the adjacency matrix included in the dependency information among the plurality of first feature slices, the degree matrix of the adjacency matrix, and the first feature sequence to obtain the second feature sequence;
the third obtaining unit is configured to:
obtain, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtain the text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
3. A text recognition device, the device comprising: input means, output means, memory and a processor; wherein the memory stores a set of program code and the processor is configured to invoke the program code stored in the memory to perform the method of claim 1.
4. A computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of claim 1.
5. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of claim 1.
CN201910983555.5A 2019-10-16 2019-10-16 Text recognition method and device Active CN112668600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910983555.5A CN112668600B (en) 2019-10-16 2019-10-16 Text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910983555.5A CN112668600B (en) 2019-10-16 2019-10-16 Text recognition method and device

Publications (2)

Publication Number Publication Date
CN112668600A CN112668600A (en) 2021-04-16
CN112668600B (en) 2024-05-21

Family

Family ID: 75400214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910983555.5A Active CN112668600B (en) 2019-10-16 2019-10-16 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN112668600B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9014481B1 (en) * 2014-04-22 2015-04-21 King Fahd University Of Petroleum And Minerals Method and apparatus for Arabic and Farsi font recognition
CN109145927A (en) * 2017-06-16 2019-01-04 杭州海康威视数字技术股份有限公司 The target identification method and device of a kind of pair of strain image
CN109726657A (en) * 2018-12-21 2019-05-07 万达信息股份有限公司 A kind of deep learning scene text recognition sequence method
CN109993164A (en) * 2019-03-20 2019-07-09 上海电力学院 A kind of natural scene character recognition method based on RCRNN neural network
CN110008961A (en) * 2019-04-01 2019-07-12 深圳市华付信息技术有限公司 Text real-time identification method, device, computer equipment and storage medium
CN110033000A (en) * 2019-03-21 2019-07-19 华中科技大学 A kind of text detection and recognition methods of bill images
CN110084240A (en) * 2019-04-24 2019-08-02 网易(杭州)网络有限公司 A kind of Word Input system, method, medium and calculate equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646202B2 (en) * 2015-01-16 2017-05-09 Sony Corporation Image processing system for cluttered scenes and method of operation thereof

Also Published As

Publication number Publication date
CN112668600A (en) 2021-04-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant