CN112668600B - Text recognition method and device - Google Patents

Text recognition method and device

Info

Publication number
CN112668600B
CN112668600B
Authority
CN
China
Prior art keywords
feature
slices
characteristic
feature sequence
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910983555.5A
Other languages
Chinese (zh)
Other versions
CN112668600A (en)
Inventor
胡文阳 (Hu Wenyang)
侯军 (Hou Jun)
蔡晓聪 (Cai Xiaocong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sensetime International Pte Ltd
Original Assignee
Sensetime International Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sensetime International Pte Ltd
Priority to CN201910983555.5A
Publication of CN112668600A
Application granted
Publication of CN112668600B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure provides a text recognition method and device. In the method, an initial feature map of a text picture is pooled to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices; dependency information among the plurality of first feature slices is obtained based on the first feature sequence; a second feature sequence is obtained based on the first feature sequence and the dependency information among the plurality of first feature slices; and a text recognition result of the text picture is obtained based on the second feature sequence. By learning dependency information among the feature slices of the text picture and performing text recognition on that basis, the accuracy of text recognition is improved.

Description

Text recognition method and device
Technical Field
The present disclosure relates to the field of computers, and in particular, to a text recognition method and apparatus.
Background
Text recognition is widely applied in various fields, such as license plate recognition, natural scene text recognition, and document text line recognition. The recognition accuracy of existing text recognition models needs to be further improved.
Disclosure of Invention
The disclosure provides a text recognition method and device.
In a first aspect, a text recognition method is provided, including:
pooling the initial feature map of the text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices;
obtaining dependency information among the plurality of first feature slices based on the first feature sequence;
obtaining a second feature sequence based on the first feature sequence and dependency information among the plurality of first feature slices;
and obtaining a text recognition result of the text picture based on the second feature sequence.
In one implementation, the obtaining dependency information between the plurality of first feature slices based on the first feature sequence includes:
performing dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
performing fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
and obtaining the dependency information among the plurality of first feature slices based on the fully-connected processing result.
In yet another implementation, the obtaining dependency information among the plurality of first feature slices based on the fully-connected processing result includes:
performing nonlinear mapping processing on the fully-connected processing result to obtain a mapping result;
and replacing, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices.
In yet another implementation, the obtaining a second feature sequence based on the first feature sequence and dependency information among the plurality of first feature slices includes:
performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain a second feature sequence.
In yet another implementation, the performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain a second feature sequence includes:
performing graph convolution on the adjacency matrix included in the dependency information among the plurality of first feature slices, the degree matrix of the adjacency matrix, and the first feature sequence to obtain a second feature sequence.
In yet another implementation, the obtaining, based on the second feature sequence, a text recognition result of the text picture includes:
obtaining, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtaining a text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
In a second aspect, there is provided a scene text recognition apparatus, comprising:
a pooling processing unit, configured to pool an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices;
a first obtaining unit, configured to obtain dependency information among the plurality of first feature slices based on the first feature sequence;
a second obtaining unit, configured to obtain a second feature sequence based on the dependency information among the plurality of first feature slices and the first feature sequence;
and a third obtaining unit, configured to obtain a text recognition result of the text picture based on the second feature sequence.
In one implementation, the first obtaining unit includes:
a dilated convolution unit, configured to perform dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
a fully-connected processing unit, configured to perform fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
and a fourth obtaining unit, configured to obtain the dependency information among the plurality of first feature slices based on the fully-connected processing result.
In yet another implementation, the fourth obtaining unit is configured to:
perform nonlinear mapping processing on the fully-connected processing result to obtain a mapping result; and
replace, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices.
In yet another implementation, the second obtaining unit is configured to perform a graph convolution process on the dependency information among the plurality of first feature slices and the first feature sequence to obtain a second feature sequence.
In yet another implementation, the second obtaining unit is configured to perform graph convolution on an adjacency matrix included in the dependency information between the plurality of first feature slices, a degree matrix of the adjacency matrix, and the first feature sequence, to obtain a second feature sequence.
In yet another implementation, the third obtaining unit is configured to:
obtain, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtain a text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
In a third aspect, there is provided a text recognition apparatus, the apparatus comprising: input means, output means, memory and a processor; wherein the memory stores a set of program code and the processor is configured to invoke the program code stored in the memory to perform the method as described in the first aspect or any of the implementations of the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method described in the first aspect or any implementation of the first aspect.
In a fifth aspect, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method described in the first aspect or any implementation of the first aspect.
The text recognition method and device provided by the disclosure have the following beneficial effects:
An initial feature map of a text picture is pooled to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices; dependency information among the plurality of first feature slices is obtained based on the first feature sequence; a second feature sequence is obtained based on the first feature sequence and the dependency information among the plurality of first feature slices; and a text recognition result of the text picture is obtained based on the second feature sequence. Text recognition is thus performed based on dependency information learned among the plurality of feature slices of the text picture, which improves the accuracy of text recognition.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
Fig. 1 is a schematic flow chart of a text recognition method according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a text recognition model according to an embodiment of the present disclosure;
Fig. 3 is a schematic flow chart of yet another text recognition method according to an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a model structure for graph convolution calculation according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of a model structure for constructing an adjacency matrix according to an embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of still another text recognition apparatus according to an embodiment of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
Text recognition includes scene text recognition and the like. Scene text recognition recognizes text in natural scenes, such as license plate recognition, road sign recognition, and other sequence object recognition, and may also include document text line recognition. Because scenes contain backgrounds, especially complex backgrounds, text recognition in a scene is more difficult than general text recognition.
The embodiments of the present disclosure provide a text recognition method and apparatus, in which an initial feature map of a text picture is pooled to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices; dependency information among the plurality of first feature slices is obtained based on the first feature sequence; a second feature sequence is obtained based on the first feature sequence and the dependency information among the plurality of first feature slices; and a text recognition result of the text picture is obtained based on the second feature sequence. Text recognition is thus performed based on dependency information learned among the plurality of feature slices of the text picture, which improves the accuracy of text recognition.
In addition, compared with text recognition approaches based on Connectionist Temporal Classification (CTC), the text recognition method provided by the embodiments of the present disclosure does not need to feed in feature slices in a specific order, which facilitates parallel computation and thus improves the text recognition speed.
Fig. 1 is a schematic flow chart of a text recognition method according to an embodiment of the disclosure.
S101, pooling an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices.
In some embodiments, the text recognition method may employ the text recognition model shown in fig. 2, into which the text picture is input. The text recognition model includes a deep convolutional neural network, a graph convolutional neural network, and a fully-connected classifier. Processed by the text recognition model, the input yields a text classification prediction, from which the text recognition result is obtained.
Specifically, in this step, the input text picture is processed by the deep convolutional neural network shown in fig. 2 to obtain an initial feature map of size C×H×W, where C is the number of channels, H is the height of the picture, and W is the width of the picture. The deep convolutional neural network can extract features of text against many complex backgrounds.
Then, the initial feature map of size C×H×W is condensed along the height dimension by an average pooling layer to obtain a first feature sequence of size C×1×W.
The first feature sequence comprises a plurality of first feature slices of the text picture, where each first feature slice is a slice of the first feature sequence along the W dimension, of size C×1×1. The text picture includes text in the scene and the scene background, so the plurality of first feature slices are feature slices containing the text, or the text and the background.
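As an illustration of this pooling step, the following is a minimal sketch assuming PyTorch (the disclosure does not name a framework) and the 2048×4×40 example dimensions used below:

```python
# Hedged sketch of S101: average-pool an initial feature map over its height.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 2048, 4, 40)                        # N x C x H x W initial feature map
seq = F.avg_pool2d(feat, kernel_size=(feat.shape[2], 1))  # average over height -> 1 x 2048 x 1 x 40
# Each of the W = 40 columns is one "first feature slice" of size C x 1 x 1.
slices = seq.squeeze(2).transpose(1, 2)                   # 1 x 40 x 2048, one row per slice
print(slices.shape)
```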
S102, obtaining dependency information among the plurality of first feature slices based on the first feature sequence.
Dependency relationships exist among the plurality of first feature slices of the first feature sequence, and learning the dependency information among the plurality of first feature slices helps improve the accuracy of text recognition.
In some embodiments, the coverage of convolution over the first feature sequence is enlarged, that is, the receptive field of the model is enlarged, so that the relationships between adjacent first feature slices are learned; the relationships among the plurality of first feature slices are then learned as a whole, thereby obtaining the dependency information among the plurality of first feature slices.
S103, obtaining a second feature sequence based on the dependency information among the plurality of first feature slices and the first feature sequence.
Based on the learned dependency information among the plurality of first feature slices and the first feature sequence, feature fusion can be performed on the plurality of first feature slices to obtain a second feature sequence. The second feature sequence fuses the features of closely related slices.
The above steps S102 and S103 may be implemented in the graph convolutional neural network shown in fig. 2. Compared with constructing the feature sequence with a BiLSTM, the computation is less time-consuming. Because the embodiments of the present disclosure use a graph convolutional neural network, in which matrix operations are performed, the feature slices can be computed in parallel, yielding a faster recognition speed. Moreover, in the embodiments of the present disclosure, the dependency relationships among the feature slices can be learned automatically by the network.
S104, obtaining a text recognition result of the text picture based on the second feature sequence.
The second feature sequence is input into the fully-connected classifier shown in fig. 2 to obtain the classification probabilities corresponding to each feature slice, and the maximum classification probability of each feature slice is selected to obtain the recognition result of that feature slice, that is, to recognize whether the feature slice is text.
In some embodiments, a fully-connected layer is used as the classifier. For example, the fully-connected layer has C input neurons; the second feature sequence is of size C×1×W, that is, W feature vectors each of size C×1; the number of classes is N; and the recognition results of N neurons are output.
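A minimal sketch of such a fully-connected classifier, assuming PyTorch and illustrative values of C, W, and the class count N (none of which are fixed by the disclosure):

```python
import torch
import torch.nn as nn

C, W, N = 2048, 40, 37                            # channels, slices, classes: illustrative values
classifier = nn.Linear(C, N)                      # fully-connected layer with C inputs, N outputs

second_seq = torch.randn(1, W, C)                 # W second feature slices, each of size C
probs = classifier(second_seq).softmax(dim=-1)    # per-slice classification probabilities
pred = probs.argmax(dim=-1)                       # maximum classification probability per slice
print(pred.shape)                                 # -> torch.Size([1, 40])
```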
According to the text recognition method provided by the embodiments of the present disclosure, text recognition is performed based on dependency information learned among a plurality of feature slices of the text picture, which improves the accuracy of text recognition.
Referring to fig. 3, a flowchart of another text recognition method according to an embodiment of the disclosure is shown, where the method may include the following steps:
S201, pooling an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices.
First, features are extracted from the text picture by a deep convolutional neural network to obtain an initial feature map of the text picture. The deep convolutional neural network comprises convolutional layers, downsampling layers, normalization layers, and the like. The difference between the embodiments of the present disclosure and existing convolutional neural networks is that the convolutional features of the scene text picture are max-downsampled to obtain the initial feature map: the deep convolutional neural network performs max downsampling with a window scale of (2, 1) and a stride of (2, 1). The size of the initial feature map is C×H×W, where C is the number of channels (the R, G, and B channels for the input picture), H is the height of the picture or feature slice, and W is the width of the picture or feature slice. A window scale of (2, 1) means the sliding window is 2 pixels wide and 1 pixel high; a stride of (2, 1) means the window slides by 2 pixels from left to right and by 1 pixel from top to bottom. Of course, the above values of the scale and stride are merely examples and are not limiting in this disclosure.
For example, the size of the input text picture is 3×64×160; after convolution, max downsampling, and similar processing, the size of the initial feature map is 2048×4×40. That is, the 64×160 text picture is scaled down proportionally and irrelevant information is removed, so that the information is more concentrated. Regarding the 4×40 grid as a plane of feature slices, the features at each point on the plane are expressed through 2048 channels, which improves the accuracy of feature extraction.
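One such max-downsampling stage might be sketched as follows, assuming PyTorch, whose pooling arguments follow a (height, width) convention; which axis the (2, 1) window reduces, and the channel count, are assumptions here, and several such stages combined with other downsampling would be needed to reach the 2048×4×40 example above:

```python
import torch
import torch.nn as nn

# One max-downsampling stage with a (2, 1) window and (2, 1) stride.
pool = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))

x = torch.randn(1, 256, 64, 160)   # hypothetical intermediate feature map
y = pool(x)                        # -> 1 x 256 x 32 x 160 (one dimension halved, the other kept)
print(y.shape)
```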
Then, the initial feature map of the text picture is pooled to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices.
As shown in the schematic model structure of the graph convolution calculation in fig. 4, the initial feature map of size C×H×W is condensed along the height dimension by the average pooling layer to obtain a first feature sequence X of size C×1×W.
For example, the initial feature map of size 2048×4×40 is processed by the average pooling layer to obtain a first feature sequence of size 2048×1×40.
S202, performing dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result.
The following steps S202 to S204 obtain the dependency information among the plurality of first feature slices, that is, construct an adjacency matrix; the model for constructing an adjacency matrix shown in fig. 5 may be used. The first feature sequence obtained above is input into the model, which includes a dilated convolution layer and a fully-connected layer, and the model outputs the adjacency matrix.
Specifically, the first feature sequence is filtered by dilated convolution, which enlarges the coverage of the convolution over the first feature sequence. The dilated convolution layer comprises (C×W) convolution filters; each filter may, for example, be 3×3 in size with a dilation rate of 2, in pixels. For example, for 3 adjacent taps, 2 zeros may be inserted at set positions after or between them, that is, the 3×3 filter is expanded to cover a 5×5 region, so that the coverage of the convolution over the first feature sequence becomes larger and, accordingly, the receptive field of the model is enlarged.
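The receptive-field effect can be illustrated with a small sketch, assuming PyTorch; the output channel count is an assumption:

```python
import torch
import torch.nn as nn

# 3x3 kernel with dilation 2: the taps are spread out so the kernel covers a 5x5 region
# without adding parameters.
dilated = nn.Conv2d(in_channels=2048, out_channels=64,   # out_channels is illustrative
                    kernel_size=3, dilation=2, padding=2)

seq = torch.randn(1, 2048, 1, 40)  # first feature sequence, C x 1 x W
out = dilated(seq)                 # padding keeps the 1 x W spatial size
print(out.shape)                   # -> torch.Size([1, 64, 1, 40])
```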
S203, performing fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result.
By convolving the first feature sequence, the relationships between adjacent slices among the plurality of first feature slices can be obtained; as the coverage of the convolution over the first feature sequence is enlarged, more relationship information between adjacent first feature slices is obtained.
Specifically, the first feature sequence of size C×1×W is passed through the (C×W) convolution filters to obtain a W×W feature vector matrix.
S204, obtaining the dependency information among the plurality of first feature slices based on the fully-connected processing result.
After the relationship between each group of adjacent first feature slices is obtained, the relationships among the plurality of first feature slices may also be obtained through the fully-connected layer shown in fig. 5, that is, the relationships among the plurality of first feature slices may be learned as a whole.
Specifically, the fully-connected layer has (W×W) input neurons and (W×W) output neurons. The W×W feature vector matrix is input into the fully-connected layer, so that the relationships between the first feature slices in the W×W feature vector matrix can be learned. Based on the relationship between each group of adjacent first feature slices and the relationships among the plurality of first feature slices, an adjacency matrix A of size W×W is obtained.
For the initial feature map of size 2048×4×40 obtained above, the size of the resulting adjacency matrix is 40×40.
S204 specifically includes: performing nonlinear mapping processing on the fully-connected processing result to obtain a mapping result; and replacing, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain the adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices.
In a specific implementation, the fully-connected processing result needs to be constrained within a set range, so nonlinear mapping processing is performed on the fully-connected processing result to obtain the mapping result; that is, a range of values of the adjacency matrix is preset, for example 0 to 1. The values obtained above may fall outside the set range and can be constrained to between 0 and 1 by a Sigmoid function.
In addition, in the above adjacency matrix, the W diagonal values are the relationship values between each first feature slice and itself, and the relationship of a first feature slice with itself is the closest. However, the diagonal values of the actually computed adjacency matrix may be slightly lower than 1, so the diagonal values of the adjacency matrix may be uniformly replaced with 1 to correct the constructed adjacency matrix.
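Putting steps S202 to S204 together, a hedged sketch of the adjacency-matrix branch is given below, assuming PyTorch; the layer sizes and the exact way the W×W relation matrix is produced are assumptions rather than the patent's precise configuration:

```python
import torch
import torch.nn as nn

class AdjacencyBranch(nn.Module):
    """Dilated convolution -> fully-connected layer -> Sigmoid -> diagonal set to 1."""
    def __init__(self, channels: int, width: int):
        super().__init__()
        # Dilated convolution producing one W x W relation matrix per input (assumed layout).
        self.dilated = nn.Conv2d(channels, width, kernel_size=3, dilation=2, padding=2)
        # Fully-connected layer with W*W input neurons and W*W output neurons.
        self.fc = nn.Linear(width * width, width * width)
        self.width = width

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: N x C x 1 x W first feature sequence
        rel = self.dilated(seq).flatten(1)         # N x (W*W) relation features
        a = torch.sigmoid(self.fc(rel))            # constrain values to (0, 1)
        a = a.view(-1, self.width, self.width)
        eye = torch.eye(self.width, device=a.device)
        return a * (1 - eye) + eye                 # diagonal uniformly corrected to 1

branch = AdjacencyBranch(channels=2048, width=40)
adj = branch(torch.randn(1, 2048, 1, 40))          # -> 1 x 40 x 40 adjacency matrix
print(adj.shape)
```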
S205, performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain a second feature sequence.
Based on the learned dependency information among the plurality of first feature slices and the first feature sequence, feature fusion can be performed on the plurality of first feature slices to obtain a second feature sequence. The second feature sequence fuses the features of closely related slices.
Further, graph convolution may be performed on the adjacency matrix included in the dependency information among the plurality of first feature slices, the degree matrix of the adjacency matrix, and the first feature sequence to obtain the second feature sequence.
Specifically, based on the adjacency matrix A constructed above, the degree matrix D of the adjacency matrix can be calculated; the calculation may follow the standard definition of a degree matrix.
As shown in fig. 4, based on the adjacency matrix A, the degree matrix D of the adjacency matrix, and the first feature sequence X, the second feature sequence of the plurality of feature slices is calculated according to the graph convolution formula f(X, A) = D⁻¹AX. The degree matrix normalizes the adjacency matrix A so that the value distribution of the adjacency matrix is more uniform. The second feature sequence fuses the features of closely related slices.
The second feature sequence may include values less than 0 and values greater than 0. Values less than 0 usually correspond to the background in the scene rather than to text, so values less than 0 in the second feature sequence may be set to 0 to cull the non-text feature slices.
Specifically, as shown in fig. 4, the non-text feature slices may be culled by setting values less than 0 in the second feature sequence to 0 through the ReLU activation function.
The range of values of a feature sequence is typically 0 to 100, and after the non-text feature slices are removed, values in the second feature sequence may exceed 100, for example 300, affecting the accuracy of subsequent classification. As shown in fig. 4, the second feature sequence is therefore multiplied by a set weight W so that the weighted second feature sequence lies within a suitable range. For example, if a value of the second feature sequence is 300, multiplying it by the set weight W = 1/3 yields 100, which lies in a range more favourable to accurately predicting the classification probabilities.
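The whole computation of S205 can be sketched as follows, assuming PyTorch; the set weight is shown as a scalar-scaled identity matrix purely for illustration:

```python
import torch

def graph_conv(x: torch.Tensor, a: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """One graph-convolution step as in the text: ReLU(D^-1 A X) W.

    x: N x W x C first feature sequence (one row per feature slice)
    a: N x W x W adjacency matrix
    w: C x C set weight
    """
    d = a.sum(dim=-1, keepdim=True)    # node degrees, i.e. the diagonal of D
    fused = torch.relu((a / d) @ x)    # D^-1 A X; ReLU culls negative (non-text) values
    return fused @ w                   # apply the set weight

x = torch.randn(1, 40, 2048)
a = torch.rand(1, 40, 40)
w = torch.eye(2048) / 3                # the 1/3 value follows the text's example
second_seq = graph_conv(x, a, w)       # -> 1 x 40 x 2048
print(second_seq.shape)
```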
S206, based on the second feature sequence, obtaining a classification result of each second feature slice in the plurality of second feature slices included in the second feature sequence.
The second feature sequence is input into the fully-connected classifier shown in fig. 2 to obtain the classification probabilities corresponding to each feature slice, and the maximum classification probability of each feature slice is selected to obtain the recognition result of that feature slice, that is, to recognize whether the feature slice is text.
Specifically, a fully-connected layer is used as the classifier. The fully-connected layer has C input neurons; the second feature sequence is of size C×1×W, that is, W feature vectors each of size C×1; the number of classes is N; and the recognition results of N neurons are output.
S207, obtaining a text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
After the classification result of each second feature slice is obtained, the classification results of the plurality of second feature slices are spliced to obtain the text recognition result of the text picture.
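A hedged sketch of this splicing step; the character table and the handling of blank or repeated slices are assumptions, since the disclosure only states that the per-slice results are concatenated:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"   # hypothetical class table; class 0 = blank

def splice(pred_classes, blank_id: int = 0) -> str:
    """Concatenate per-slice classification results into the recognised text."""
    chars, prev = [], blank_id
    for c in pred_classes:
        if c != blank_id and c != prev:              # assumed: skip blanks and repeats
            chars.append(ALPHABET[c - 1])
        prev = c
    return "".join(chars)

print(splice([0, 3, 3, 0, 2, 11, 11]))               # -> "21a"
```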
According to the text recognition method provided by the embodiments of the present disclosure, text recognition is performed based on dependency information learned among a plurality of feature slices of the text picture, which improves the accuracy of text recognition. Meanwhile, because matrix operations are used in the graph convolutional neural network, the feature slices can be computed in parallel, yielding a faster recognition speed. Max downsampling of the convolutional features of the text picture improves the accuracy of feature extraction. The adjacency matrix is constructed through the dilated convolution and the fully-connected layer, and the dependency relationships between the feature slices are learned, so that automatic graph composition over the feature slices is realized. By applying the set weight to the second feature sequence, the weighted second feature sequence lies within the range suitable for the classification probabilities, so the classification probability corresponding to each feature slice can be obtained accurately.
Based on the same conception as the text recognition method, an embodiment of the present disclosure also provides a text recognition apparatus. Fig. 6 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure; the text recognition apparatus 1000 includes: a pooling processing unit 11, a first obtaining unit 12, a second obtaining unit 13, and a third obtaining unit 14. Wherein:
the pooling processing unit 11 is configured to pool an initial feature map of a text picture to obtain a first feature sequence, where the first feature sequence includes a plurality of first feature slices;
the first obtaining unit 12 is configured to obtain dependency information among the plurality of first feature slices based on the first feature sequence;
the second obtaining unit 13 is configured to obtain a second feature sequence based on the first feature sequence and the dependency information among the plurality of first feature slices;
and the third obtaining unit 14 is configured to obtain a text recognition result of the text picture based on the second feature sequence.
In one implementation, the first obtaining unit 12 includes:
a dilated convolution unit 121, configured to perform dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
a fully-connected processing unit 122, configured to perform fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
and a fourth obtaining unit 123, configured to obtain the dependency information among the plurality of first feature slices based on the fully-connected processing result.
In yet another implementation, the fourth obtaining unit 123 is configured to:
perform nonlinear mapping processing on the fully-connected processing result to obtain a mapping result; and
replace, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, where the adjacency matrix includes a dependency value between any two feature slices among the plurality of first feature slices.
In yet another implementation, the second obtaining unit 13 is configured to perform a graph convolution process on the dependency information between the plurality of first feature slices and the first feature sequence to obtain a second feature sequence.
In yet another implementation, the second obtaining unit 13 is configured to perform graph convolution on an adjacency matrix included in the dependency information between the plurality of first feature slices, a degree matrix of the adjacency matrix, and the first feature sequence, to obtain a second feature sequence.
In yet another implementation, the third obtaining unit 14 is configured to:
obtain, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtain a text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
For specific implementations of the pooling processing unit 11, the first obtaining unit 12, the second obtaining unit 13, and the third obtaining unit 14, reference may be made to the detailed descriptions of the method embodiments shown in fig. 1 or fig. 3.
According to the text recognition apparatus provided by the embodiments of the present disclosure, text recognition is performed based on dependency information learned among a plurality of feature slices of the text picture, which improves the accuracy of text recognition.
The embodiment of the disclosure also provides a device for executing the text recognition method. Some or all of the methods described above may be implemented in hardware, or may be implemented in software or firmware.
Alternatively, the apparatus may be a chip or an integrated circuit when embodied.
Alternatively, when part or all of the text recognition method of the above embodiment is implemented by software or firmware, it may be implemented by an apparatus provided in fig. 7. As shown in fig. 7, the apparatus may include:
an input device, an output device, a memory, and a processor (there may be one or more processors in the apparatus; one processor is taken as an example in fig. 7). In the embodiments of the present disclosure, the input device, output device, memory, and processor may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 7.
Wherein the processor is configured to perform the method steps in the embodiments shown in fig. 1 or fig. 3 described above.
Alternatively, the program of the above text recognition method may be stored in the memory. The memory may be a physically separate unit or may be integrated with the processor. The memory may also be used to store data.
Alternatively, when part or all of the text recognition method of the above embodiments is implemented by software, the apparatus may include only the processor. The memory for storing the program is located outside the apparatus, and the processor is connected to the memory through circuits or wires and is configured to read and execute the program stored in the memory.
The processor may be a central processing unit (CPU), a network processor (NP), or a WLAN device.
The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above types of memory.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the division of the unit is merely a logic function division, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. The coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present disclosure.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in, or transmitted through, a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center containing an integration of one or more usable media. The usable medium may be a read-only memory (ROM), a random-access memory (RAM), a magnetic medium such as a floppy disk, a hard disk, a magnetic tape, or a magnetic disk, an optical medium such as a digital versatile disc (DVD), or a semiconductor medium such as a solid-state drive (SSD).

Claims (5)

1. A text recognition method, comprising:
pooling an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices;
obtaining dependency information among the plurality of first feature slices based on the first feature sequence;
obtaining a second feature sequence based on the first feature sequence and the dependency information among the plurality of first feature slices;
obtaining a text recognition result of the text picture based on the second feature sequence;
the obtaining dependency information among the plurality of first feature slices based on the first feature sequence includes:
performing dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
performing fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
obtaining the dependency information among the plurality of first feature slices based on the fully-connected processing result;
the obtaining the dependency information among the plurality of first feature slices based on the fully-connected processing result includes:
performing nonlinear mapping processing on the fully-connected processing result to obtain a mapping result;
replacing, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices;
the obtaining a second feature sequence based on the dependency information among the plurality of first feature slices and the first feature sequence includes:
performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain the second feature sequence;
the performing graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain the second feature sequence includes:
performing graph convolution on the adjacency matrix included in the dependency information among the plurality of first feature slices, the degree matrix of the adjacency matrix, and the first feature sequence to obtain the second feature sequence;
the obtaining the text recognition result of the text picture based on the second feature sequence includes:
obtaining, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtaining the text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
2. A scene text recognition apparatus, comprising:
a pooling processing unit, configured to pool an initial feature map of a text picture to obtain a first feature sequence, wherein the first feature sequence comprises a plurality of first feature slices;
a first obtaining unit, configured to obtain dependency information among the plurality of first feature slices based on the first feature sequence;
a second obtaining unit, configured to obtain a second feature sequence based on the dependency information among the plurality of first feature slices and the first feature sequence;
a third obtaining unit, configured to obtain a text recognition result of the text picture based on the second feature sequence;
the first obtaining unit includes:
a dilated convolution unit, configured to perform dilated convolution processing on the first feature sequence to obtain a dilated convolution processing result;
a fully-connected processing unit, configured to perform fully-connected processing on the dilated convolution processing result to obtain a fully-connected processing result;
a fourth obtaining unit, configured to obtain the dependency information among the plurality of first feature slices based on the fully-connected processing result;
the fourth obtaining unit is configured to:
perform nonlinear mapping processing on the fully-connected processing result to obtain a mapping result; and
replace, in the mapping result, the values corresponding to pairs of identical feature slices among the plurality of first feature slices with 1 to obtain an adjacency matrix, wherein the adjacency matrix comprises a dependency value between any two feature slices among the plurality of first feature slices;
the second obtaining unit is configured to perform graph convolution processing on the dependency information among the plurality of first feature slices and the first feature sequence to obtain the second feature sequence;
the second obtaining unit is configured to perform graph convolution on the adjacency matrix included in the dependency information among the plurality of first feature slices, the degree matrix of the adjacency matrix, and the first feature sequence to obtain the second feature sequence;
the third obtaining unit is configured to:
obtain, based on the second feature sequence, a classification result of each of a plurality of second feature slices included in the second feature sequence;
and obtain the text recognition result of the text picture based on the classification result of each of the plurality of second feature slices.
3. A text recognition device, the device comprising: input means, output means, memory and a processor; wherein the memory stores a set of program code and the processor is configured to invoke the program code stored in the memory to perform the method of claim 1.
4. A computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of claim 1.
5. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of claim 1.
CN201910983555.5A 2019-10-16 2019-10-16 Text recognition method and device Active CN112668600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910983555.5A CN112668600B (en) 2019-10-16 2019-10-16 Text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910983555.5A CN112668600B (en) 2019-10-16 2019-10-16 Text recognition method and device

Publications (2)

Publication Number Publication Date
CN112668600A CN112668600A (en) 2021-04-16
CN112668600B (en) 2024-05-21

Family

Family ID: 75400214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910983555.5A Active CN112668600B (en) 2019-10-16 2019-10-16 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN112668600B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9014481B1 (en) * 2014-04-22 2015-04-21 King Fahd University Of Petroleum And Minerals Method and apparatus for Arabic and Farsi font recognition
CN109145927A (en) * 2017-06-16 2019-01-04 杭州海康威视数字技术股份有限公司 The target identification method and device of a kind of pair of strain image
CN109726657A (en) * 2018-12-21 2019-05-07 万达信息股份有限公司 A kind of deep learning scene text recognition sequence method
CN109993164A (en) * 2019-03-20 2019-07-09 上海电力学院 A kind of natural scene character recognition method based on RCRNN neural network
CN110008961A (en) * 2019-04-01 2019-07-12 深圳市华付信息技术有限公司 Text real-time identification method, device, computer equipment and storage medium
CN110033000A (en) * 2019-03-21 2019-07-19 华中科技大学 A kind of text detection and recognition methods of bill images
CN110084240A (en) * 2019-04-24 2019-08-02 网易(杭州)网络有限公司 A kind of Word Input system, method, medium and calculate equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646202B2 (en) * 2015-01-16 2017-05-09 Sony Corporation Image processing system for cluttered scenes and method of operation thereof

Also Published As

Publication number Publication date
CN112668600A (en) 2021-04-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant