WO2023273196A1 - Text recognition method and related apparatus (一种文本识别方法及相关装置) - Google Patents

Text recognition method and related apparatus (一种文本识别方法及相关装置)

Info

Publication number: WO2023273196A1
Application number: PCT/CN2021/138066
Authority: WIPO (PCT)
Prior art keywords: text, information, recognition, sequence information, picture
Other languages: English (en), French (fr)
Inventors: 李明, 付彬, 乔宇
Applicant / original assignee: 中国科学院深圳先进技术研究院


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Definitions

  • The present application relates to the field of scene text recognition (STR) technology, and in particular to a text recognition method and related devices.
  • Scene text recognition means inputting a text picture that contains text information in a specific scene into a program, which converts the input picture into text symbols that a computer can understand.
  • Scene text recognition is an important branch of computer vision and plays an important role, with broad prospects, in application scenarios such as autonomous driving and assistance for the blind.
  • At present, a commonly used scene text recognition method is to input the text picture into a convolutional neural network to extract the local visual information of the picture, and then to input that local visual information into a recurrent neural network to obtain the final recognition result for the text sequence.
  • The embodiments of the present application provide a text recognition method and related devices. In the process of recognizing a text picture with a text recognition network, the local visual information and the context sequence information of the text picture are extracted in parallel and fused interactively, so that both kinds of information can be used at every level of the text recognition network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • In a first aspect, an embodiment of the present application provides a text recognition method. The method includes: acquiring a text picture, the text picture being a picture that includes target text; and inputting the text picture into a text recognition network for recognition to obtain the target text. Each level of the text recognition network simultaneously uses the local information and the sequence information of the text picture to recognize the target text; the local information includes structural information of the target text, and the sequence information includes context sequence information of the target text.
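  • To make the claimed flow concrete, the following is a minimal sketch of the two method steps in PyTorch-style Python. The network argument, its weights, and the decoding to a string are hypothetical placeholders, not the patent's implementation:

```python
import torch
from PIL import Image
from torchvision import transforms

def recognize(picture_path: str, network: torch.nn.Module) -> str:
    """Sketch of the claimed method: acquire a text picture, then feed it to
    a text recognition network whose levels are assumed to use local and
    sequence information of the picture simultaneously."""
    # Step 1: acquire the text picture (a picture that includes target text).
    picture = Image.open(picture_path).convert("RGB")
    x = transforms.ToTensor()(picture).unsqueeze(0)  # shape: 1 x C x H x W
    # Step 2: input the picture into the text recognition network.
    with torch.no_grad():
        target_text = network(x)  # assumed to return the decoded string
    return target_text
```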
  • At present, the commonly used text recognition method for scene text first inputs the text picture into a complete convolutional neural network to extract high-level features of the entire picture, and then feeds these high-level features directly into a recurrent neural network that classifies each character of the text, yielding the recognition result for the final target text sequence. However, this recognition mode ignores the role of context sequence information on low-level features.
  • Compared with the currently common methods, the text recognition method in the embodiments of the present application introduces the context sequence information of the text already at the low-level features, enabling long-term and short-term information to interact from the lowest layers. Concretely, the local visual information and the context sequence information of the text picture are extracted in parallel during recognition, so that the dual information of the text picture can be used at all levels of the text recognition network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • In a possible implementation, inputting the text picture into the text recognition network for recognition to obtain the target text includes: acquiring the local information and acquiring the sequence information; and obtaining the target text according to the result of fusing the local information and the sequence information.
  • In this embodiment, by inputting the text picture into the text recognition network, every level of the network simultaneously extracts the local information and the sequence information of the text picture and fuses the two; finally, a decoder based on the attention mechanism produces the recognition result of the final target text. Compared with the serial binary-relation extraction mode of existing methods, which extract local information first and only use sequence information at the end, the parallel binary-relation extraction mode adopted here lets every level of the text recognition network use both kinds of information of the text picture at the same time, improving the accuracy and efficiency of text recognition.
  • In a possible implementation, obtaining the target text according to the result of fusing the local information and the sequence information includes: computing a weighted sum of the local information and the sequence information; and obtaining the target text according to the result of the weighted summation.
  • This embodiment provides a method for fusing the local information and the sequence information. After the two are acquired, each level of the text recognition network can either add them directly during fusion or weight and sum them in the form of a gate, and the target text is recognized according to the result of the summation, as sketched below.
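  • As an illustration of the two fusion options (direct addition, or a gate-weighted sum), here is a minimal PyTorch sketch; the 1x1-convolution gate and the broadcasting of the height-1 sequence map are assumptions, since the patent does not fix these details:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse the local and sequence feature maps with a learned gate;
    the direct alternative is simply `local + seq`."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution producing a per-position gate in (0, 1).
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, local: torch.Tensor, seq: torch.Tensor) -> torch.Tensor:
        if seq.shape[-2] == 1:          # sequence branch emits height-1 maps
            seq = seq.expand_as(local)  # broadcast along the height axis
        g = self.gate(torch.cat([local, seq], dim=1))
        return g * local + (1.0 - g) * seq  # weighted summation of the two
```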
  • In a possible implementation, acquiring the local information includes: extracting the visual features of the text picture based on a topological structure to obtain the local information.
  • In this embodiment, an information extraction mode based on a topological structure extracts the visual features of the text picture to obtain the local information. Compared with the existing extraction mode using conventional convolution, the extraction mode in this embodiment yields more accurate local information.
  • In a possible implementation, acquiring the sequence information includes: compressing the features of the text picture; and extracting the structural features of the compressed text picture to obtain the sequence information.
  • In this embodiment, feature compression is performed on the text picture first, and then the structural features of the compressed picture are extracted to obtain the sequence information. Compared with the existing extraction mode, which simply stacks a recurrent neural network on the extracted one-dimensional feature map to extract sequence features, the extraction in this embodiment uses different feature compression modes and sequence information extraction modes, which can meet different text picture recognition requirements.
  • In a second aspect, an embodiment of the present application provides a text recognition device, which includes:
  • an acquisition unit configured to acquire a text picture, the text picture being a picture that includes target text; and
  • a recognition unit configured to input the text picture into a text recognition network for recognition to obtain the target text, where each level of the text recognition network simultaneously uses the local information and the sequence information of the text picture to recognize the target text, the local information includes structural information of the target text, and the sequence information includes context sequence information of the target text.
  • This embodiment provides a text recognition method for scene text. Specifically, a text picture is acquired, the text picture being a picture that includes target text information in a specific scene; the text picture is input into a text recognition network, which performs target text recognition on it to obtain the target text contained in the picture. Each level of the text recognition network can simultaneously use the local information and the sequence information of the text picture to recognize the target text; the local information includes the structural information of the target text, and the sequence information includes the context sequence information of the target text.
  • At present, the commonly used text recognition method for scene text first inputs the text picture into a complete convolutional neural network to extract high-level features of the entire picture, and then feeds these high-level features directly into a recurrent neural network that classifies each character of the text, yielding the recognition result for the final target text sequence. However, this recognition mode ignores the role of context sequence information on low-level features.
  • Compared with the currently common methods, the text recognition method in the embodiments of the present application introduces the context sequence information of the text already at the low-level features, enabling long-term and short-term information to interact from the lowest layers. Concretely, the local visual information and the context sequence information of the text picture are extracted in parallel and fused interactively during recognition, so that both kinds of information can be used at all levels of the text recognition network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • In a possible implementation, the acquisition unit is further configured to acquire the local information and to acquire the sequence information; and the recognition unit is specifically configured to obtain the target text according to the result of fusing the local information and the sequence information.
  • In this embodiment, by inputting the text picture into the text recognition network, every level of the network simultaneously extracts the local information and the sequence information of the text picture and fuses the two; finally, a decoder based on the attention mechanism produces the recognition result of the final target text. Compared with the serial binary-relation extraction mode of existing methods, which extract local information first and only use sequence information at the end, the parallel binary-relation extraction mode adopted here lets every level of the text recognition network use both kinds of information of the text picture at the same time, improving the accuracy and efficiency of text recognition.
  • In a possible implementation, the recognition unit is specifically configured to compute a weighted sum of the local information and the sequence information, and further configured to obtain the target text according to the result of the weighted summation.
  • This embodiment provides a method for fusing the local information and the sequence information. After the two are acquired, each level of the text recognition network can either add them directly during fusion or weight and sum them in the form of a gate, and the target text is recognized according to the result of the summation.
  • In a possible implementation, the acquisition unit is specifically configured to extract the visual features of the text picture based on a topological structure to obtain the local information.
  • In this embodiment, an information extraction mode based on a topological structure extracts the visual features of the text picture to obtain the local information. Compared with the existing extraction mode using conventional convolution, the extraction mode in this embodiment yields more accurate local information.
  • In a possible implementation, the acquisition unit is further configured to compress the features of the text picture, and further configured to extract the structural features of the compressed text picture to obtain the sequence information.
  • In this embodiment, feature compression is performed on the text picture first, and then the structural features of the compressed picture are extracted to obtain the sequence information. Compared with the existing extraction mode, which simply stacks a recurrent neural network on the extracted one-dimensional feature map to extract sequence features, the extraction in this embodiment uses different feature compression modes and sequence information extraction modes, which can meet different text picture recognition requirements.
  • In a third aspect, an embodiment of the present application provides a text recognition device that includes a processor and a memory. The memory is used to store computer-executable instructions; the processor is used to execute the computer-executable instructions stored in the memory, so that the text recognition device performs the method of the first aspect and any of its possible implementations. Optionally, the text recognition device further includes a transceiver configured to receive or send signals.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing instructions or a computer program; when the instructions or the computer program are executed, the method described in the first aspect and any of its possible implementations is implemented.
  • In a fifth aspect, an embodiment of the present application provides a computer program product that includes instructions or a computer program; when the instructions or the computer program are executed, the method described in the first aspect and any of its possible implementations is implemented.
  • In a sixth aspect, an embodiment of the present application provides a chip that includes a processor configured to execute instructions; when the processor executes the instructions, the chip performs the method described in the first aspect and any of its possible implementations. Optionally, the chip further includes a communication interface used for receiving or sending signals.
  • In a seventh aspect, an embodiment of the present application provides a system that includes at least one text recognition device according to the second or third aspect, or at least one chip according to the sixth aspect.
  • In addition, during execution of the method of the first aspect and any of its possible implementations, the processes of sending and/or receiving information in the method can be understood as the process of the processor outputting information and/or the process of the processor receiving input information. When outputting information, the processor may output it to a transceiver (or a communication interface, or a sending module) for transmission; after being output by the processor, the information may require additional processing before it reaches the transceiver. Similarly, when the processor receives input information, the transceiver (or communication interface, or receiving module) receives the information and inputs it to the processor; after the transceiver receives the information, it may require other processing before being input to the processor.
  • On this basis, for example, the sending of information mentioned in the foregoing method can be understood as the processor outputting information, and receiving information can be understood as the processor receiving input information. Unless otherwise specified, or unless it conflicts with their actual role or internal logic in the relevant description, operations such as transmitting, sending, and receiving involved in the processor can more generally be understood as output, receive, and input operations of the processor.
  • Optionally, the processor described above may be a processor specially designed to perform these methods, or a processor, such as a general-purpose processor, that performs them by executing computer instructions stored in a memory. The memory may be a non-transitory memory, such as a read-only memory (ROM); it may be integrated with the processor on the same chip or arranged on a different chip. The embodiments of the present application do not limit the type of memory or the way the memory and the processor are arranged.
  • In a possible implementation, the at least one memory is located outside the device. In another possible implementation, the at least one memory is located inside the device. In yet another possible implementation, part of the at least one memory is located inside the device and another part is located outside the device. In this application, the processor and the memory may also be integrated into one component, that is, they may be integrated together.
  • In the embodiments of the present application, during recognition of a text picture by the text recognition network, the local visual information and the context sequence information of the text picture are extracted in parallel and fused interactively, so that the dual information of the text picture can be used at all levels of the network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • FIG. 1 is a schematic diagram of a text recognition architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a binary relationship module provided in an embodiment of the present application.
  • FIG. 3 is a schematic flow diagram of a text recognition method provided in an embodiment of the present application.
  • FIG. 4a is a schematic structural diagram of a sequence information extraction module provided by an embodiment of the present application.
  • FIG. 4b is a schematic structural diagram of another sequence information extraction module provided by an embodiment of the present application.
  • FIG. 4c is a schematic structural diagram of yet another sequence information extraction module provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a local information extraction module provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a text recognition effect provided by the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a text recognition device provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The terms "first" and "second" in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a specific order; the terms "include" and "have" and their variants are intended to cover non-exclusive inclusion. Reference to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
  • In this application, "at least one (item)" means one or more, "multiple" means two or more, and "at least two (items)" means two, three, or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
  • This application provides a text recognition method; to describe the scheme more clearly, some knowledge related to text recognition is introduced first. Text picture: a picture that contains text information.
  • Scene text recognition: inputting a text picture that contains text information in a specific scene into a program, which converts the input picture into text symbols that a computer can understand.
  • Scene text recognition is an important branch of computer vision and plays an important role, with broad prospects, in application scenarios such as autonomous driving and assistance for the blind.
  • At present, a commonly used scene text recognition method is to input the text picture into a convolutional neural network to extract the local visual information of the picture, and then to input that local visual information into a recurrent neural network to obtain the final recognition result for the text sequence. However, this method is prone to focus errors when predicting text characters, causing characters to be missed or misplaced and resulting in low text recognition accuracy and efficiency.
  • To address the low accuracy and efficiency of the above text recognition method, the present application provides a text recognition architecture and, based on it, proposes a new text recognition method. With the provided architecture and method, during recognition of a text picture by the text recognition network, the local visual information and the context sequence information of the picture are extracted in parallel and fused interactively, so that the dual information of the text picture can be used at all levels of the network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • FIG. 1 is a schematic diagram of a text recognition architecture provided by an embodiment of the present application.
  • As shown in FIG. 1, the text recognition architecture mainly consists of several modules based on a parallel binary-relation extraction mode, which replace the residual modules of the original convolutional neural network. After the text picture is rectified to an approximately horizontal orientation, it is input into the text recognition network for target text recognition. Each of these binary relation modules simultaneously extracts the context sequence information and the local visual information of the entire text picture and fuses the two interactively, so that every layer of the text recognition network can use the dual information of the text picture at the same time.
  • The residual modules of the convolutional neural networks usually used for text recognition often ignore the role that the context sequence information of the text picture plays in low-level features, merely adding a recurrent neural network at the end to inject sequence information. However, because text is by nature a sequential arrangement of characters, it has an obvious sequence structure even at low levels; in low-level features this appears as the regular alternation of characters and the extension direction of the whole text. If low-level sequence information is ignored, the text recognition model is therefore likely to mis-focus when predicting characters, producing missing or misplaced characters.
  • The binary relation network in the embodiments of this application introduces sequence information at the low layers and, as the network deepens, each layer fuses the overall sequence information and the local visual information in stages, ensuring that the two kinds of information guide and reinforce each other.
  • Specifically, for the process by which each level of the text recognition network extracts the local information and the sequence information of the text picture and fuses them in stages, see FIG. 2, which is a schematic diagram of the architecture of a binary relation module provided by an embodiment of the present application. The following describes how any one of the binary relation modules of FIG. 1 processes a text picture, taking binary relation module 1 as an example. After the text picture is rectified, the feature map is input into the text recognition network for target text recognition.
  • The feature map first passes through a 1x1 convolution, after which sequence information and local information are extracted from it in parallel. For the sequence branch, feature compression is performed on the feature map, and sequence extraction is then performed on the compressed feature map. There are three feature compression modes, namely pooling, group convolution, and conventional convolution, all aimed at compressing a feature map whose height is not 1 into a feature map of height 1. Sequence information extraction usually uses a bidirectional long short-term memory network, and a temporal convolutional network can also be used.
  • Local information extraction adopts an information extraction mode based on a topological structure, extracting the visual features of the text picture, which then pass through a 1x1 convolution to obtain the local information.
  • After parallel extraction from the feature map yields the binary relation of local information and sequence information, the two are processed separately and then fused, and the fusion result is used as the feature input of the next binary relation module for further extraction and fusion of local visual information and sequence information. As the network deepens, each layer fuses the overall sequence information and the local visual information in stages, and finally a decoder based on the attention mechanism produces the recognition result of the target text. A sketch of such a module follows.
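  • The following is a minimal PyTorch sketch of one such module; the branch submodules are abstracted here, and stacking several of these modules in place of residual blocks, followed by an attention-based decoder, gives the FIG. 1 architecture:

```python
import torch
import torch.nn as nn

class BinaryRelationModule(nn.Module):
    """Sketch of one binary relation module (FIG. 2): a 1x1 convolution
    feeds two parallel branches whose outputs are fused; the fusion result
    becomes the feature input of the next module."""

    def __init__(self, channels: int, seq_branch: nn.Module,
                 local_branch: nn.Module, fusion: nn.Module):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, kernel_size=1)  # initial 1x1 conv
        self.seq_branch = seq_branch      # e.g. feature compression + BiLSTM
        self.local_branch = local_branch  # e.g. topology extraction + 1x1 conv
        self.fusion = fusion              # e.g. direct addition or a gated sum

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pre(x)
        seq = self.seq_branch(x)      # context sequence information
        local = self.local_branch(x)  # local visual information
        return self.fusion(local, seq)  # feature input of the next module
```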
  • Based on the text recognition architecture of FIG. 1 and FIG. 2, the present application also provides a new text recognition method, described below with reference to FIG. 3. FIG. 3 is a schematic flowchart of a text recognition method provided by an embodiment of the present application; the method includes, but is not limited to, the following steps.
  • Step 301: Acquire a text picture. The electronic device acquires a text picture, where the text picture is a picture that includes the target text.
  • The electronic device in the embodiments of the present application is a device equipped with a processor that can execute computer instructions; it may be a computer, a server, or the like, and is used to perform target text recognition on acquired text pictures, improving the accuracy and efficiency of text recognition.
  • Step 302: Input the text picture into the text recognition network for recognition to obtain the target text. The electronic device inputs the text picture into the text recognition network for recognition to obtain the target text. Each level of the text recognition network can simultaneously use the local information and the sequence information of the text picture to recognize the target text. The local information includes the structural information of the target text (specifically, the structural information of each character that makes up the target text), and the sequence information includes the context sequence information of the target text.
  • The following describes the local information extraction and the sequence information extraction of the text picture separately. For sequence information extraction, see FIG. 4a to FIG. 4c, which are schematic structural diagrams of three different sequence information extraction modules provided by embodiments of the present application. The sequence information extraction method first compresses the features of the text picture and then extracts the structural features of the compressed picture to obtain the sequence information. There are three feature compression modes, namely pooling, group convolution, and conventional convolution, all aimed at compressing a feature map whose height is not 1 into a feature map of height 1.
  • For the feature compression of the group convolution network, given an input feature map of size H*W*C at the current layer, the scheme first performs a reshape operation to convert the feature map to size 1*W*(H*C), then applies a convolution with C groups of 1*3 kernels, finally obtaining a feature map of size 1*W*C. A bidirectional long short-term memory network is then used to extract sequence information from the 1*W*C feature map.
  • As shown in FIG. 4a, X is the input feature map of the current layer, Y is the feature map after sequence information has been extracted, X̃ is the feature map after the reshape transformation, and X̄ is the feature map after group convolution feature extraction. The correspondence between X and X̃ before and after reshaping was rendered only as an image in the source; it maps each value of X at position (i, j, k), where i, j, and k index the H, W, and C directions respectively, to the corresponding position of the 1*W*(H*C) map X̃.
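  • A minimal PyTorch sketch of this group convolution compression follows; the exact channel ordering of the reshape is an assumption (here each original channel contributes its H rows to one convolution group):

```python
import torch
import torch.nn as nn

class GroupConvCompression(nn.Module):
    """Sketch of FIG. 4a: compress an H*W*C map to 1*W*C by reshaping to
    1*W*(H*C) and applying C groups of 1*3 convolution kernels."""

    def __init__(self, height: int, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(height * channels, channels, kernel_size=(1, 3),
                              padding=(0, 1), groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape           # PyTorch layout: N x C x H x W
        x = x.reshape(n, c * h, 1, w)  # reshape: fold the height into the channels
        return self.conv(x)            # N x C x 1 x W, a height-1 feature map
```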
  • For the feature compression by max pooling, given an input feature map of size H*W*C, the scheme uses a pooling kernel of size H*1 and, with a horizontal stride of 1 pixel, selects the maximum value within each H*1 region, thereby obtaining the compressed 1*W*C feature map. Here H, W, and C correspond to the height, width, and channels of the feature map, respectively. As shown in FIG. 4b, X is the input feature map of the current layer, Y is the feature map after sequence information has been extracted by a bidirectional long short-term memory network (BiLSTM), and X̄ is the feature map after the H*1 max pooling.
  • For the feature compression by conventional convolution, given an input feature map of size H*W*C, the scheme uses a convolution kernel of size H*3 with a horizontal stride of 1 pixel; the resulting 1*W*C feature map is the compressed feature. As shown in FIG. 4c, X is the input feature map of the current layer, Y is the feature map after sequence information has been extracted by a bidirectional long short-term memory network (BiLSTM), and X̄ is the feature map after the conventional H*3 convolution.
  • In addition to the commonly used bidirectional long short-term memory network, sequence information can also be extracted from the compressed 1*W*C feature map with a temporal convolutional network or similar. Compared with the existing extraction mode, which simply stacks a recurrent neural network on the extracted one-dimensional feature map to extract sequence features, the extraction in the embodiments of this application offers different feature compression modes and sequence information extraction modes, which can meet different text picture recognition requirements. The sketch below illustrates the pooling and convolution compression modes together with BiLSTM sequence extraction.
  • In parallel, the topology-based information extraction mode extracts the visual features of the text picture to obtain the local information. For local information extraction, see FIG. 5, which is a schematic structural diagram of a local information extraction module provided by an embodiment of the present application. As shown in FIG. 5, the local information extraction method uses a topology-based information extraction mode to extract the visual features of the text picture.
  • Here X is the input feature map of the current layer and Y denotes the output feature map after topology-based extraction. f(·) and g(·) denote two different linear transformation layers; f(x_i) denotes the value at position i in the feature obtained by applying f(·) to X; X(R_i) denotes the pixels in the 3*3 region adjacent to the pixel at position i in X; and g(x_j), with j ∈ R(i), denotes the pixels in the 3*3 region adjacent to position i in the feature obtained by applying g(·) to X. α is the topological weight computed from f(x_i) and g(x_j), and ⊙ denotes the point-wise multiplication and accumulation of X(R_i) with the topological weights. The formula for the topological weights, which survives only as an image in the source, normalizes the exponentials exp(·) of the products of f(x_i) and g(x_j) over R_i, where N is the total number of pixels in R_i and exp() is the exponential operation. After the topological weights are computed, the output value Y at position i is obtained by the weighted accumulation ⊙ of X(R_i) with α (this formula, too, survives only as an image in the source).
  • The output feature map from the topology-based extraction then passes through a 1x1 convolution to obtain the local information. Compared with the existing extraction mode using conventional convolution, the extraction mode in the embodiments of this application yields more accurate local information. One plausible reading of this branch is sketched below.
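  • Since the weight formulas survive only as images, the following PyTorch sketch encodes one plausible reading: a softmax of the f(x_i)·g(x_j) products over the N = 9 pixels of the 3*3 neighborhood R_i, followed by weighted accumulation and the 1x1 convolution. Every detail here is an assumption rather than the patent's exact formula:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopologyLocalExtraction(nn.Module):
    """Sketch of the FIG. 5 local information branch under one plausible
    reading of the image-only formulas."""

    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, kernel_size=1)    # linear layer f(.)
        self.g = nn.Conv2d(channels, channels, kernel_size=1)    # linear layer g(.)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)  # final 1x1 conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        fi = self.f(x).view(n, c, 1, h * w)                          # f(x_i)
        gj = F.unfold(self.g(x), 3, padding=1).view(n, c, 9, h * w)  # g(x_j), j in R_i
        xr = F.unfold(x, 3, padding=1).view(n, c, 9, h * w)          # X(R_i)
        # Topological weights: softmax over the 9 neighborhood pixels.
        alpha = torch.softmax((fi * gj).sum(dim=1), dim=1)           # N x 9 x (H*W)
        # Dot product and accumulation of X(R_i) with the weights.
        y = (xr * alpha.unsqueeze(1)).sum(dim=2).view(n, c, h, w)
        return self.out(y)  # local information
```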
  • After parallel extraction from the feature map yields the binary relation of local information and sequence information, the two are fused, and the fusion result is used as the feature input of the next binary relation module for further feature extraction and fusion. As the network deepens, each layer fuses the overall sequence information and the local visual information in stages, and finally a decoder based on the attention mechanism produces the recognition result of the target text. Because the local information extraction branch and the sequence information extraction branch of the binary relation module have different configurations, different configurations can be selected for different experiments and deployment environments to suit text recognition applications under different conditions.
  • By inputting the text picture into the text recognition network, the embodiments of this application let every level of the network simultaneously extract the local information and the sequence information of the picture, process the two separately, and then fuse them to obtain the recognition result of the target text. Compared with the serial binary-relation extraction mode of existing methods, which extract local information first and only use sequence information at the end, the parallel binary-relation extraction mode adopted here lets every level of the text recognition network use both kinds of information of the text picture at the same time, improving the accuracy and efficiency of text recognition.
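  • The patent names only "a decoder based on the attention mechanism"; as a reference point, here is a generic additive-attention character decoder over the fused, width-indexed features. This is an assumed stand-in, not the patent's decoder:

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Minimal additive-attention decoder over fused features of shape N x W x C."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.rnn = nn.GRUCell(channels + num_classes, channels)
        self.proj = nn.Linear(channels, channels)
        self.score = nn.Linear(channels, 1)
        self.cls = nn.Linear(channels, num_classes)

    def forward(self, feats: torch.Tensor, max_len: int = 25) -> torch.Tensor:
        n, w, c = feats.shape
        hidden = feats.new_zeros(n, c)
        prev = feats.new_zeros(n, self.cls.out_features)  # previous character
        outputs = []
        for _ in range(max_len):
            # Additive attention over the W feature columns.
            e = self.score(torch.tanh(self.proj(feats) + hidden.unsqueeze(1)))
            a = torch.softmax(e, dim=1)                 # N x W x 1
            glimpse = (a * feats).sum(dim=1)            # N x C attended feature
            hidden = self.rnn(torch.cat([glimpse, prev], dim=1), hidden)
            logits = self.cls(hidden)                   # per-character class scores
            prev = torch.softmax(logits, dim=1)
            outputs.append(logits)
        return torch.stack(outputs, dim=1)              # N x max_len x num_classes
```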
  • The improvement in text recognition accuracy and efficiency described above is illustrated with reference to FIG. 6, a schematic diagram of a text recognition effect provided by an embodiment of the present application. FIG. 6 shows, as examples, eight text pictures containing text information in specific scenes, each input into a recognition program. With existing scene text recognition techniques, the recognition results are "jlir", "annuversary", "f_ound", "xi", "them_", "beaut_", "farst", and "result", which deviate from the real target text contained in the pictures, so the accuracy of the recognition results is low. With the scene text recognition method provided by the embodiments of the present application, the recognition results are "jur", "anniversary", "ground", "spa", "temt", "beauty", "first", and "restaurant", markedly improving the accuracy and efficiency of text recognition.
  • The text recognition method in the embodiments of the present application can be applied in various fields.
  • In autonomous driving, the various text signs along the road must be recognized correctly to ensure driving stability, and it is extremely common for a moving vehicle to capture blurred road signs. With the text recognition method in the embodiments of the present application, the content of road signs can be recognized correctly and effectively, improving the safety of autonomous driving.
  • In assistance for the blind, stable recognition of scene text can serve as the eyes of blind users and bring them great convenience; applications derived from it, such as recognizing menus, courier orders, and receipts, can greatly improve the daily lives of the blind.
  • In product packaging recognition, which is widely used in unmanned supermarkets, the text on packaging is easily distorted by the viewing angle. The text recognition method in the embodiments of the present application can largely solve this problem and improve the accuracy and efficiency of recognition.
  • The method of the embodiments of the present application has been described in detail above; the devices of the embodiments are provided below. FIG. 7 is a schematic structural diagram of a text recognition device provided by an embodiment of the present application. The text recognition device 70 may include an acquisition unit 701 and a recognition unit 702, described as follows:
  • the acquisition unit 701 is configured to acquire a text picture, the text picture being a picture that includes target text; and
  • the recognition unit 702 is configured to input the text picture into a text recognition network for recognition to obtain the target text, where each level of the text recognition network simultaneously uses the local information and the sequence information of the text picture to recognize the target text, the local information includes structural information of the target text, and the sequence information includes context sequence information of the target text.
  • This embodiment provides a text recognition method for scene text. Specifically, a text picture is acquired, the text picture being a picture that includes target text information in a specific scene; the text picture is input into a text recognition network, which performs target text recognition on it to obtain the target text contained in the picture. Each level of the text recognition network can simultaneously use the local information and the sequence information of the text picture to recognize the target text; the local information includes the structural information of the target text, and the sequence information includes the context sequence information of the target text.
  • At present, the commonly used text recognition method for scene text first inputs the text picture into a complete convolutional neural network to extract high-level features of the entire picture, and then feeds these high-level features directly into a recurrent neural network that classifies each character of the text, yielding the recognition result for the final target text sequence. However, this recognition mode ignores the role of context sequence information on low-level features.
  • Compared with the currently common methods, the text recognition method in the embodiments of the present application introduces the context sequence information of the text already at the low-level features, enabling long-term and short-term information to interact from the lowest layers. Concretely, the local visual information and the context sequence information of the text picture are extracted in parallel and fused interactively during recognition, so that both kinds of information can be used at all levels of the text recognition network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • In a possible implementation, the acquisition unit 701 is further configured to acquire the local information and to acquire the sequence information; and the recognition unit 702 is specifically configured to obtain the target text according to the result of fusing the local information and the sequence information.
  • In this embodiment, by inputting the text picture into the text recognition network, every level of the network simultaneously extracts the local information and the sequence information of the text picture and fuses the two; finally, a decoder based on the attention mechanism produces the recognition result of the final target text. Compared with the serial binary-relation extraction mode of existing methods, which extract local information first and only use sequence information at the end, the parallel binary-relation extraction mode adopted here lets every level of the text recognition network use both kinds of information of the text picture at the same time, improving the accuracy and efficiency of text recognition.
  • In a possible implementation, the recognition unit 702 is specifically configured to compute a weighted sum of the local information and the sequence information, and further configured to obtain the target text according to the result of the weighted summation.
  • This embodiment provides a method for fusing the local information and the sequence information. After the two are acquired, each level of the text recognition network can either add them directly during fusion or weight and sum them in the form of a gate, and the target text is recognized according to the result of the summation.
  • In a possible implementation, the acquisition unit 701 is specifically configured to extract the visual features of the text picture based on a topological structure to obtain the local information.
  • In this embodiment, an information extraction mode based on a topological structure extracts the visual features of the text picture to obtain the local information. Compared with the existing extraction mode using conventional convolution, the extraction mode in this embodiment yields more accurate local information.
  • In a possible implementation, the acquisition unit 701 is further configured to compress the features of the text picture, and further configured to extract the structural features of the compressed text picture to obtain the sequence information.
  • In this embodiment, feature compression is performed on the text picture first, and then the structural features of the compressed picture are extracted to obtain the sequence information. Compared with the existing extraction mode, which simply stacks a recurrent neural network on the extracted one-dimensional feature map to extract sequence features, the extraction in this embodiment uses different feature compression modes and sequence information extraction modes, which can meet different text picture recognition requirements.
  • According to the embodiments of the present application, the units of the device shown in FIG. 7 may be separately or wholly combined into one or several additional units, or one (or some) of the units may be further split into multiple functionally smaller units; this achieves the same operations without affecting the technical effects of the embodiments. The above units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the network-based device may also include other units, and in practical applications these functions may be assisted by other units and realized cooperatively by multiple units. It should be noted that the implementation of each unit may also refer to the corresponding description of the method embodiment shown in FIG. 3.
  • In the text recognition device 70 described in FIG. 7, during recognition of a text picture by the text recognition network, the local visual information and the context sequence information of the picture are extracted in parallel and fused interactively, so that the dual information of the text picture can be used at all levels of the network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • FIG. 8 is a schematic structural diagram of an electronic device 80 provided by an embodiment of the present application. The electronic device 80 may include a memory 801 and a processor 802. Optionally, it may further include a communication interface 803 and a bus 804, with the memory 801, the processor 802, and the communication interface 803 connected to each other through the bus 804.
  • The communication interface 803 is used for data interaction with the above-mentioned text recognition device 70.
  • The memory 801 is used to provide storage space, in which data such as the operating system and computer programs can be stored.
  • The memory 801 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), and portable read-only memory (compact disc read-only memory, CD-ROM).
  • The processor 802 is a module that performs arithmetic and logic operations, and may be one or a combination of processing modules such as a central processing unit (CPU), a graphics processing unit (GPU), or a microprocessor (MPU).
  • A computer program is stored in the memory 801, and the processor 802 calls it to perform the text recognition method shown in FIG. 3 above: acquiring a text picture, the text picture being a picture that includes target text; and inputting the text picture into a text recognition network for recognition to obtain the target text, where each level of the text recognition network simultaneously uses the local information and the sequence information of the text picture to recognize the target text, the local information includes structural information of the target text, and the sequence information includes context sequence information of the target text.
  • The specific steps performed by the processor 802 are described with reference to FIG. 3 above and are not repeated here. Correspondingly, the processor 802, calling the computer program stored in the memory 801, can also be used to execute the method steps performed by the units of the text recognition device 70 shown in FIG. 7; see the description of FIG. 7 for details, which are not repeated here.
  • In the electronic device 80 described in FIG. 8, during recognition of a text picture by the text recognition network, the local visual information and the context sequence information of the picture are extracted in parallel and fused interactively, so that the dual information of the text picture can be used at all levels of the network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program runs on one or more processors, the method shown in FIG. 3 above can be implemented.
  • An embodiment of the present application further provides a computer program product, where the computer program product includes a computer program, and when the computer program product runs on a processor, the method shown in FIG. 3 above can be implemented.
  • An embodiment of the present application further provides a chip that includes a processor configured to execute instructions; when the processor executes the instructions, the method shown in FIG. 3 above can be implemented.
  • Optionally, the chip further includes a communication interface, which is used for inputting or outputting signals.
  • An embodiment of the present application further provides a system that includes at least one text recognition device 70, electronic device 80, or chip as described above.
  • In the embodiments of the present application, during recognition of a text picture by the text recognition network, the local visual information and the context sequence information of the picture are extracted in parallel and fused interactively, so that the dual information of the text picture can be used at all levels of the network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.
  • A person of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be completed by a computer program instructing relevant hardware; the computer program can be stored in a computer-readable storage medium, and when executed it may include the processes of the foregoing method embodiments. The aforementioned storage medium includes various media capable of storing computer program code, such as read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.


Abstract

The present application discloses a text recognition method and related apparatus. The method includes: acquiring a text picture, the text picture being a picture that includes target text; and inputting the text picture into a text recognition network for recognition to obtain the target text. Each level of the text recognition network simultaneously uses the local information and the sequence information of the text picture to recognize the target text; the local information includes structural information of the target text, and the sequence information includes context sequence information of the target text. During recognition of the text picture by the text recognition network, the method extracts the local visual information and the context sequence information of the picture in parallel and fuses them interactively, so that the dual information of the text picture can be used at all levels of the network at the same time. This solves the problem of text characters being missed or misplaced during recognition and improves the accuracy and efficiency of text recognition.

Description

一种文本识别方法及相关装置 技术领域
本申请涉及场景文本识别(scene text recognition,STR)技术领域,尤其涉及一种文本识别方法及相关装置。
背景技术
场景文本识别指的是通过将特定场景中包含文本信息的文本图片输入到程序中,由程序将输入的包含文本信息的文本图片转换成计算机可理解的文本符号。场景文本识别在计算机视觉领域中为一个重要的分支,在自动驾驶、盲人辅助等应用场景中有着重要作用及前景。
目前,较为常用的场景文本识别方法是将文本图片输入到卷积神经网络中,提取得到文本图片的局域视觉信息。然后再将文本图片的局域视觉信息输入到循环神经网络中,得到最终的文本序列的识别结果。
但是,上述场景文本识别方法在预测文本字符时容易出现聚焦错误,导致文本字符遗漏或是错位的问题,从而导致文本识别准确率及效率较低。
发明内容
本申请实施例提供了一种文本识别方法及相关装置,在基于文本识别网络对文本图片进行识别的过程中,通过并行提取文本图片的局域视觉信息和上下文序列信息,并将文本图片的局域视觉信息和上下文序列信息交互融合,使得在文本识别网络的各个层级都能同时利用文本图片的二元信息,解决了识别过程中文本字符遗漏或是错位的问题,提高了文本识别的准确率及效率。
第一方面,本申请实施例提供了一种文本识别方法,该方法包括:
获取文本图片;所述文本图片为包括目标文本的图片;
将所述文本图片输入至文本识别网络进行识别,得到所述目标文本;所述文本识别网络的各个层级同时利用所述文本图片的局域信息和序列信息对所述目标文本进行识别,所述局域信息包括所述目标文本的结构信息,所述序列信息包括所述目标文本的上下文序列信息。
目前,对场景文本较为常用的文本识别方法,先将文本图片输入一个完整的卷积神经网络中,提取得到整个文本图片的高层特征,然后将这高层特征直接送入一个循环神经网络中,对整个文本中的每个字符进行分类,得到最终的目标文本序列的识别结果。然而,这种识别模式忽略了上下文序列信息在低层特征上的作用。
本申请实施例中的文本识别方法,与目前常用的文本识别方法相比,能够在低层特征时就引入文本的上下文序列信息,能够使长期与短期信息从低层就开始交互。具体表现为,在基于文本识别网络对文本图片进行识别的过程中,通过并行提取文本图片的局域视觉信息和上下文序列信息,使得在文本识别网络的各个层级都能同时利用文本图片的二元信息,解决了识别过程中文本字符遗漏或是错位的问题,提高了文本识别的准确率及效率。
在一种可能的实施方式中,所述将所述文本图片输入至文本识别网络进行识别,得到所述目标文本,包括:
获取所述局域信息,以及获取所述序列信息;
根据所述局域信息和所述序列信息融合处理的结果,得到所述目标文本。
在本申请实施例中,通过将文本图片输入至文本识别网络中,使文本识别网络的每一层级都能同时提取文本图片的局域信息和序列信息,并将二者融合,最后使用基于注意力机制的解码器得到最终目标文本的识别结果。与现有方法中遵循先提取局域信息,最后再利用序列信息的串行二元关系提取模式相比,本申请实施例中采用并行的二元关系提取模式,使得在文本识别网络的各个层级都能同时利用文本图片的二元信息,提高了文本识别的准确率及效率。
在一种可能的实施方式中,所述根据所述局域信息和所述序列信息融合处理的结果,得到所述目标文本,包括:
将所述局域信息和所述序列信息加权求和;
根据所述局域信息和所述序列信息加权求和的结果,得到所述目标文本。
在本申请实施例中,提供了一种对局域信息和序列信息融合处理的方法。在获取局域信息和序列信息之后,文本识别网络的各个层级均能在将二者融合时直接相加,也可以用门的形式对其进行加权求和,根据求和的结果,识别得到目标文本。
在一种可能的实施方式中,所述获取所述局域信息,包括:
基于拓扑结构提取所述文本图片的视觉特征,得到所述局域信息。
在本申请实施例中,基于拓扑结构的信息提取模式,提取文本图片的视觉特征,得到局域信息,相比于现有的采用常规卷积的提取模式,本申请实施例中的提取方式所得到的局域信息准确性更高。
在一种可能的实施方式中,所述获取所述序列信息,包括:
对所述文本图片的特征压缩;
提取压缩后的所述文本图片的结构特征,得到所述序列信息。
在本申请实施例中,先对文本图片进行特征压缩,然后再对压缩后的文本图片提取其结构特征,得到序列信息,相比于现有的单纯在提取得到的一维特征图上堆叠一个循环神经网络用以提取序列特征的提取模式,本申请实施例中的提取方式使用了不同的特征压缩模式以及序列信息提取模式,可以满足不同的文本图片识别需求。
第二方面,本申请实施例提供了一种文本识别装置,该装置包括:
获取单元,用于获取文本图片;所述文本图片为包括目标文本的图片;
识别单元,用于将所述文本图片输入至文本识别网络进行识别,得到所述目标文本;所述文本识别网络的各个层级同时利用所述文本图片的局域信息和序列信息对所述目标文本进行识别,所述局域信息包括所述目标文本的结构信息,所述序列信息包括所述目标文本的上下文序列信息。
本申请实施例中,提供了一种对场景文本的文本识别方法。具体为,获取文本图片,该文本图片为包括了特定场景中包含目标文本信息的图片,将该文本图片输入至文本识别网络中,对其进行目标文本识别,得到文本图片中包含的目标文本。其中,该文本识别网络的各个层级均能同时利用文本图片的局域信息和序列信息对目标文本进行识别,该局域信息包括目标文本的结构信息,该序列信息包括目标文本的上下文序列信息。
目前,对场景文本较为常用的文本识别方法,先将文本图片输入一个完整的卷积神经网络中,提取得到整个文本图片的高层特征,然后将这高层特征直接送入一个循环神经网络中,对整个文本中的每个字符进行分类,得到最终的目标文本序列的识别结果。然而,这种识别模式忽略了上下文序列信息在低层特征上的作用。
本申请实施例中的文本识别方法,与目前常用的文本识别方法相比,能够在低层特征时 就引入文本的上下文序列信息,能够使长期与短期信息从低层就开始交互。具体表现为,在基于文本识别网络对文本图片进行识别的过程中,通过并行提取文本图片的局域视觉信息和上下文序列信息,并将文本图片的局域视觉信息和上下文序列信息交互融合,使得在文本识别网络的各个层级都能同时利用文本图片的二元信息,解决了识别过程中文本字符遗漏或是错位的问题,提高了文本识别的准确率及效率。
在一种可能的实施方式中,所述获取单元,还用于获取所述局域信息,以及获取所述序列信息;
所述识别单元,具体用于根据所述局域信息和所述序列信息融合处理的结果,得到所述目标文本。
在本申请实施例中,通过将文本图片输入至文本识别网络中,使文本识别网络的每一层级都能同时提取文本图片的局域信息和序列信息,并将二者融合,最后使用基于注意力机制的解码器得到最终目标文本的识别结果。与现有方法中遵循先提取局域信息,最后再利用序列信息的串行二元关系提取模式相比,本申请实施例中采用并行的二元关系提取模式,使得在文本识别网络的各个层级都能同时利用文本图片的二元信息,提高了文本识别的准确率及效率。
在一种可能的实施方式中,所述识别单元,具体用于将所述局域信息和所述序列信息加权求和;
所述识别单元,具体还用于根据所述局域信息和所述序列信息加权求和的结果,得到所述目标文本。
在本申请实施例中,提供了一种对局域信息和序列信息融合处理的方法。在获取局域信息和序列信息之后,文本识别网络的各个层级均能在将二者融合时直接相加,也可以用门的形式对其进行加权求和,根据求和的结果,识别得到目标文本。
在一种可能的实施方式中,所述获取单元,具体用于基于拓扑结构提取所述文本图片的视觉特征,得到所述局域信息。
在本申请实施例中,基于拓扑结构的信息提取模式,提取文本图片的视觉特征,得到局域信息,相比于现有的采用常规卷积的提取模式,本申请实施例中的提取方式所得到的局域信息准确性更高。
在一种可能的实施方式中,所述获取单元,具体还用于对所述文本图片的特征压缩;
所述获取单元,具体还用于提取压缩后的所述文本图片的结构特征,得到所述序列信息。
在本申请实施例中,先对文本图片进行特征压缩,然后再对压缩后的文本图片提取其结构特征,得到序列信息,相比于现有的单纯在提取得到的一维特征图上堆叠一个循环神经网络用以提取序列特征的提取模式,本申请实施例中的提取方式使用了不同的特征压缩模式以及序列信息提取模式,可以满足不同的文本图片识别需求。
第三方面,本申请实施例提供一种文本识别装置,所述文本识别装置包括处理器和存储器;所述存储器用于存储计算机执行指令;所述处理器用于执行所述存储器所存储的计算机执行指令,以使所述文本识别装置执行如上述第一方面以及任一项可能的实施方式的方法。可选的,所述文本识别装置还包括收发器,所述收发器,用于接收信号或者发送信号。
第四方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质用于存储指令或计算机程序;当所述指令或所述计算机程序被执行时,使得第一方面以及任一项可能的实施方式所述的方法被实现。
第五方面,本申请实施例提供一种计算机程序产品,所述计算机程序产品包括指令或计 算机程序;当所述指令或所述计算机程序被执行时,使得第一方面以及任一项可能的实施方式所述的方法被实现。
第六方面,本申请实施例提供一种芯片,该芯片包括处理器,所述处理器用于执行指令,当该处理器执行所述指令时,使得该芯片执行如第一方面以及任一项可能的实施方式所述的方法。可选的,该芯片还包括通信接口,所述通信接口用于接收信号或发送信号。
第七方面,本申请实施例提供一种系统,所述系统包括至少一个如第二方面或第三方面所述的文本识别装置或第六方面所述的芯片。
此外,在执行上述第一方面以及任一项可能的实施方式所述的方法的过程中,上述方法中有关发送信息和/或接收信息等的过程,可以理解为由处理器输出信息的过程,和/或,处理器接收输入的信息的过程。在输出信息时,处理器可以将信息输出给收发器(或者通信接口、或发送模块),以便由收发器进行发射。信息在由处理器输出之后,还可能需要进行其他的处理,然后才到达收发器。类似的,处理器接收输入的信息时,收发器(或者通信接口、或发送模块)接收信息,并将其输入处理器。更进一步的,在收发器收到该信息之后,该信息可能需要进行其他的处理,然后才输入处理器。
基于上述原理,举例来说,前述方法中提及的发送信息可以理解为处理器输出信息。又例如,接收信息可以理解为处理器接收输入的信息。
可选的,对于处理器所涉及的发射、发送和接收等操作,如果没有特殊说明,或者,如果未与其在相关描述中的实际作用或者内在逻辑相抵触,则均可以更加一般性的理解为处理器输出和接收、输入等操作。
可选的,在执行上述第一方面以及任一项可能的实施方式所述的方法的过程中,上述处理器可以是专门用于执行这些方法的处理器,也可以是通过执行存储器中的计算机指令来执行这些方法的处理器,例如通用处理器。上述存储器可以为非瞬时性(non-transitory)存储器,例如只读存储器(Read Only Memory,ROM),其可以与处理器集成在同一块芯片上,也可以分别设置在不同的芯片上,本申请实施例对存储器的类型以及存储器与处理器的设置方式不做限定。
在一种可能的实施方式中,上述至少一个存储器位于装置之外。
在又一种可能的实施方式中,上述至少一个存储器位于装置之内。
在又一种可能的实施方式之中,上述至少一个存储器的部分存储器位于装置之内,另一部分存储器位于装置之外。
本申请中,处理器和存储器还可能集成于一个器件中,即处理器和存储器还可以被集成在一起。
本申请实施例中,在基于文本识别网络对文本图片进行识别的过程中,通过并行提取文本图片的局域视觉信息和上下文序列信息,并将文本图片的局域视觉信息和上下文序列信息交互融合,使得在文本识别网络的各个层级都能同时利用文本图片的二元信息,解决了识别过程中文本字符遗漏或是错位的问题,提高了文本识别的准确率及效率。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对本申请实施例中所需要使用的附图作简单地介绍,显而易见地,下面所描述的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种文本识别的架构示意图;
图2为本申请实施例提供的一种二元关系模块的架构示意图;
图3为本申请实施例提供的一种文本识别方法的流程示意图;
图4a为本申请实施例提供的一种序列信息提取模块的结构示意图;
图4b为本申请实施例提供的另一种序列信息提取模块的结构示意图;
图4c为本申请实施例提供的又一种序列信息提取模块的结构示意图;
图5为本申请实施例提供的一种局域信息提取模块的结构示意图;
图6为本申请实施例提供的一种文本识别的效果示意图;
图7为本申请实施例提供的一种文本识别装置的结构示意图;
图8为本申请实施例提供的一种电子设备的结构示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图对本申请实施例进行描述。
本申请的说明书、权利要求书及附图中的术语“第一”和“第二”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备等,没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元等,或可选地还包括对于这些过程、方法、产品或设备等固有的其它步骤或单元。
在本文中提及的“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员可以显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
应当理解,在本申请中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上,“至少两个(项)”是指两个或三个及三个以上,“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,“a和b”,“a和c”,“b和c”,或“a和b和c”,其中a,b,c可以是单个,也可以是多个。
本申请提供了一种文本识别方法,为了更清楚地描述本申请的方案,下面先介绍一些与文本识别相关的知识。
文本图片:指的是包含了文本信息的图片。
场景文本识别:指的是通过将特定场景中包含文本信息的文本图片输入到程序中,由程序将输入的包含文本信息的文本图片转换成计算机可理解的文本符号。场景文本识别在计算机视觉领域中为一个重要的分支,在自动驾驶、盲人辅助等应用场景中有着重要作用及前景。
目前,较为常用的场景文本识别方法是将文本图片输入到卷积神经网络中,提取得到文本图片的局域视觉信息,然后再将文本图片的局域视觉信息输入到循环神经网络中,得到最终的文本序列的识别结果。但是,上述场景文本识别方法在预测文本字符时容易出现聚焦错误,导致文本字符遗漏或是错位的问题,从而导致文本识别准确率及效率较低。
针对上述文本识别方法中存在的文本识别准确率及效率较低的问题,本申请提供了一种文本识别架构,并基于该文本识别架构提出了一种新的文本识别方法,通过实施本申请所提 供的文本识别架构和文本识别方法,可以在基于文本识别网络对文本图片进行识别的过程中,通过并行提取文本图片的局域视觉信息和上下文序列信息,并将文本图片的局域视觉信息和上下文序列信息交互融合,使得在文本识别网络的各个层级都能同时利用文本图片的二元信息,解决了识别过程中文本字符遗漏或是错位的问题,提高了文本识别的准确率及效率。
下面结合本申请实施例中的附图对本申请实施例进行描述。
请参阅图1,图1为本申请实施例提供的一种文本识别的架构示意图。
如图1所示,本文本识别架构主要包括了若干个基于二元关系并行提取模式的模块,这些模块代替了原来卷积神经网络中的残差模块。对文本图片矫正为近似水平的程度之后,输入至文本识别网络中进行目标文本识别。上述每个二元关系模块都会同时提取整个文本图片的上下文序列信息以及局域视觉信息,并将二者信息交互融合,从而达到在文本识别网络中的每一层都能同时利用文本图片的二元信息。
通常文本识别使用的卷积神经网络中的残差模块,往往忽略文本图片中上下文序列信息在低层特征中的作用,仅仅在最后增加一个循环神经网络来增加序列信息,然而由于文本的本质是一个字符的序列排列,即使在低层也有着明显的序列结构,在低层特征的表现为字符有规律的交替出现以及整个文本的延伸方向,因此若忽略了低层的序列信息,很可能导致文本识别模型在预测字符时出现聚焦错误,从而出现字符遗漏或是错位的问题。本申请实施例中的二元关系网络在低层就引入了序列信息,并且随着网络加深,每一层网络都会对整体序列信息以及局域视觉信息进行阶段性融合,从而保证了这两种信息互相引导互相促进。
具体的,对于文本识别网络中每一层级提取文本图片中的局域信息和序列信息,并进行阶段性融合的过程,可参阅图2,图2为本申请实施例提供的一种二元关系模块的架构示意图。
如图2所示,对上述图1中的任意一个二元关系模块处理文本图片的流程进行说明,以二元关系模块1为例。对文本图片矫正之后,将特征图输入至文本识别网络中进行目标文本识别。特征图首先经过一个1x1卷积,然后分别对其进行序列信息提取和局域信息提取。具体为,对特征图进行特征压缩,再对特征压缩后的特征图进行序列提取,其中,特征压缩有三种模式,分别为池化、组卷机以及常规卷积,目的在于将原来高度不为1的特征图压缩为高度为1的特征图;序列信息提取常采用双向的长短期记忆网络,也可以采用时序卷积网络等方式。局域信息提取采用基于拓扑结构的信息提取模式,提取文本图片的视觉特征,再经过1x1卷积,得到局域信息。在并行提取特征图得到局域信息和序列信息二元关系之后,将二者分别处理后再融合,并将二者融合的结果作为下一个二元关系模块的特征输入,进行进一步局域视觉信息和序列信息的特征提取处理并将二者融合,随着网络加深,每一层网络都会对整体序列信息以及局域视觉信息进行阶段性融合,最终使用基于注意力机制的解码器得到目标文本的识别结果。
基于上述图1和图2中的文本识别架构,本申请还提供了一种新的文本识别方法,下面将结合图3对其进行说明。
请参阅图3,图3为本申请实施例提供的一种文本识别方法的流程示意图,该方法包括但不限于如下步骤:
步骤301:获取文本图片。
电子设备获取文本图片,该文本图片为包括目标文本的图片。
其中,本申请实施例中的电子设备为搭载了可用于执行计算机执行指令的处理器的设备,该电子设备可以是计算机、服务器等,用于对获取到的文本图片进行目标文本识别,提高文 本识别的准确率及效率。
步骤302:将文本图片输入至文本识别网络进行识别,得到目标文本。
电子设备将文本图片输入至文本识别网络中进行识别,得到目标文本。其中,该文本识别网络的各个层级均能同时利用该文本图片的局域信息和序列信息对目标文本进行识别,该局域信息包括目标文本的结构信息,该结构信息具体包括组成目标文本的各个字符的结构信息,该序列信息包括目标文本的上下文序列信息。
具体的,下面将分别对文本图片的局域信息提取和序列信息提取进行说明。
文本图片的序列信息提取可参阅图4a至图4c,图4a至图4c分别为本申请实施例提供的三种不同的序列信息提取模块的结构示意图。文本图片的序列信息提取方法,先对文本图片进行特征压缩,然后再对压缩后的文本图片提取其结构特征,得到序列信息。其中,特征压缩有三种模式,分别为池化、组卷积以及常规卷积,目的在于将原来高度不为1的特征图压缩为高度为1的特征图。
组卷积网络的特征压缩,对于当前层输入的H*W*C的特征图,该方案先进行了一个重塑操作,将特征图转换为1*W*(H*C)的大小,然后使用C组1*3大小的卷积核,进行卷积操作,最终得到1*W*C的特征图。最后使用一个双向的长短期记忆网络对1*W*C大小的特征图进行序列信息提取。
如图4a所示,X为当前层的输入特征图,Y为提取了序列信息后的特征图,
Figure PCTCN2021138066-appb-000001
为经过重塑(Reshape)变换之后的特征图,
Figure PCTCN2021138066-appb-000002
为经过组卷积特征提取之后的特征图。其中,重塑前后X与
Figure PCTCN2021138066-appb-000003
的对应关系可以如下表示:
Figure PCTCN2021138066-appb-000004
其中,i、j、k分别表示H、W、C方向的值。
最大池化的特征压缩,对于输入的H*W*C的特征图,该方案使用H*1大小的池化核,以横向1像素的步长,在每个H*1的区域中选取其中的最大值,以此得到压缩后的1*W*C的特征图。其中,上述H、W、C分别对应特征图的高、宽以及通道。
如图4b所示,X为当前层的输入特征图,Y为经过双向的长短期记忆网络(BiLSTM)提取了序列信息后的特征图,
Figure PCTCN2021138066-appb-000005
为经过H*1的最大池化之后的特征图。
常规卷积网络的特征压缩,对于输入的H*W*C的特征图,该方案使用H*3大小的卷积核,以横向1像素的步长进行卷积计算,最终得到的1*W*C的特征图即为压缩后的特征。
如图4c所示,X为当前层的输入特征图,Y为经过双向的长短期记忆网络(BiLSTM)提取了序列信息后的特征图,
Figure PCTCN2021138066-appb-000006
为经过H*3的常规卷积操作之后的特征图。
此外,序列信息提取除了常采用双向的长短期记忆网络,也可以采用时序卷积网络等方式,对特征压缩后得到的1*W*C大小的特征图进行序列信息提取。相比于现有的单纯在提取得到的一维特征图上堆叠一个循环神经网络用以提取序列特征的提取模式,本申请实施例中的提取方式使用了不同的特征压缩模式以及序列信息提取模式,可以满足不同的文本图片识别需求。
与此同时,基于拓扑结构的信息提取模式,提取文本图片的视觉特征,得到局域信息。
文本图片的局域信息提取可参阅图5,图5为本申请实施例提供的一种局域信息提取模块的结构示意图。如图5所示,文本图片的局域信息提取方法,采用基于拓扑结构的信息提取模式,提取文本图片的视觉特征。
其中,X为当前层的输入特征图,Y表示经过拓扑结构提取后的输出特征图,f(·),g(·)分 别表示两个不同的线性变换层,f(x i)表示X经过f(·)线性变换之后的特征中i位置的值,X(R i)表示在X中i位置的像素相邻的3*3区域的像素,g(x j)表示X经过g(·)线性变换后的特征中i位置的像素相邻的3*3区域的像素。其中,j∈R(i),α为由f(x i)和g(x j)计算得到的拓扑权重,⊙为X(R i)与拓扑权重的点乘与累加,拓扑权重的计算公式可以如下表示:
Figure PCTCN2021138066-appb-000007
其中,N为R i中的pixel总数,exp()为指数运算。
计算得到拓扑权重后,i位置的输出值计算公式可以如下表示:
Figure PCTCN2021138066-appb-000008
经过上述拓扑结构提取后的输出特征图,再经过1x1卷积,得到局域信息。相比于现有的采用常规卷积的提取模式,本申请实施例中的提取方式所得到的局域信息准确性更高。
在并行提取特征图得到局域信息和序列信息二元关系之后,将二者融合,并将二者融合的结果作为下一个二元关系模块的特征输入,进行进一步特征提取处理并融入,随着网络加深,每一层网络都会对整体序列信息以及局域视觉信息进行阶段性融合,最终使用基于注意力机制的解码器得到目标文本的识别结果。
由于本申请实施例中的二元关系模块中的局域信息提取分支和序列信息提取分支有着不同的配置,因此可以根据不同的实验、部署环境来选用不同的配置,以满足不同条件下的文本识别应用。
本申请实施例通过将文本图片输入至文本识别网络中,使文本识别网络的每一层级都能同时提取文本图片的局域信息和序列信息,并将二者分别处理后再融合,得到目标文本的识别结果。与现有方法中遵循先提取局域信息,最后再利用序列信息的串行二元关系提取模式相比,本申请实施例中采用并行的二元关系提取模式,使得在文本识别网络的各个层级都能同时利用文本图片的二元信息,提高了文本识别的准确率及效率。
Based on the text recognition architectures of FIG. 1 and FIG. 2 above, the improvement in the accuracy and efficiency of text recognition described for the text recognition method of FIG. 3 is illustrated below with reference to FIG. 6.
Referring to FIG. 6, FIG. 6 is a schematic diagram of a text recognition effect provided by an embodiment of the present application.
As shown in FIG. 6, eight text pictures containing text information in specific scenes are shown by way of example. Using scene text recognition, these text pictures are input into the program, which converts them into computer-understandable text symbols. The recognition results of the existing scene text recognition technology are "jlir", "annuversary", "f_ound", "xi", "them_", "beaut_", "farst" and "result", which deviate from the real target text contained in the pictures, so the accuracy of the recognition results is low. By contrast, the recognition results of the scene text recognition method provided by the embodiments of the present application are "jur", "anniversary", "ground", "spa", "temt", "beauty", "first" and "restaurant", markedly improving the accuracy and efficiency of text recognition.
The text recognition method in the embodiments of the present application can be applied in various fields. In autonomous driving, for example, the text signs along the road must be recognized correctly to ensure driving stability, and it is extremely common for a moving vehicle to capture blurry road signs; the text recognition method in the embodiments of the present application can correctly recognize the content of road signs, improving the safety of autonomous driving. In assistance for the blind, stable recognition of scene text can serve as the user's eyes and bring great convenience; applications derived from it, such as recognizing menus, express delivery forms and receipts, can greatly improve the daily experience of blind users. As another example, product packaging recognition is widely used in unmanned supermarkets, yet the text on packaging is easily distorted by the viewing angle; the text recognition method in the embodiments of the present application can largely solve this problem and improve the accuracy and efficiency of recognition.
The method of the embodiments of the present application has been described in detail above; the apparatus of the embodiments of the present application is presented below.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of a text recognition apparatus provided by an embodiment of the present application. The text recognition apparatus 70 may include an acquisition unit 701 and a recognition unit 702, described as follows:
the acquisition unit 701 is configured to acquire a text picture, the text picture being a picture that includes target text;
the recognition unit 702 is configured to input the text picture into a text recognition network for recognition to obtain the target text; each level of the text recognition network simultaneously uses the local information and the sequence information of the text picture to recognize the target text, the local information including the structural information of the target text and the sequence information including the context sequence information of the target text.
The embodiments of the present application provide a text recognition method for scene text. Specifically, a text picture is acquired, the text picture being a picture containing target text information in a specific scene, and the text picture is input into a text recognition network for target text recognition, obtaining the target text contained in the text picture. Each level of the text recognition network can simultaneously use the local information and the sequence information of the text picture to recognize the target text, the local information including the structural information of the target text and the sequence information including the context sequence information of the target text.
At present, the commonly used text recognition method for scene text first inputs the text picture into a complete convolutional neural network to extract high-level features of the whole picture, and then feeds these high-level features directly into a recurrent neural network to classify each character of the whole text, obtaining the final recognition result of the target text sequence. However, this recognition mode ignores the role of context sequence information in low-level features.
Compared with the currently common text recognition methods, the text recognition method in the embodiments of the present application introduces the context sequence information of the text already at the low-level features, allowing long-term and short-term information to interact from the low levels onward. Concretely, in the process of recognizing the text picture based on the text recognition network, the local visual information and the context sequence information of the text picture are extracted in parallel and interactively fused, so that every level of the text recognition network can simultaneously use the binary information of the text picture, which solves the problem of missing or misplaced characters during recognition and improves the accuracy and efficiency of text recognition.
In a possible implementation, the acquisition unit 701 is further configured to acquire the local information and to acquire the sequence information;
the recognition unit 702 is specifically configured to obtain the target text according to the result of fusion processing of the local information and the sequence information.
In the embodiments of the present application, the text picture is input into the text recognition network so that every level of the network can simultaneously extract the local information and the sequence information of the text picture and fuse the two, and an attention-based decoder finally obtains the recognition result of the target text. Compared with the serial binary-relation extraction mode of existing methods, which extracts local information first and uses sequence information only at the end, the embodiments of the present application adopt a parallel binary-relation extraction mode, so that every level of the text recognition network can simultaneously use the binary information of the text picture, improving the accuracy and efficiency of text recognition.
In a possible implementation, the recognition unit 702 is specifically configured to compute a weighted sum of the local information and the sequence information;
the recognition unit 702 is further specifically configured to obtain the target text according to the result of the weighted sum of the local information and the sequence information.
In the embodiments of the present application, a method for the fusion processing of the local information and the sequence information is provided. After the local information and the sequence information are acquired, each level of the text recognition network can either add the two directly when fusing them or compute a gated weighted sum, and the target text is recognized according to the result of the summation.
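One plausible reading of the gated weighted sum is the sketch below, assuming PyTorch; the learned per-channel sigmoid gate is an assumption, since the text does not fix the gate's exact form:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Weighted sum of the two branches with a learned gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, local, seq):        # both N, C, H, W (seq broadcast over height)
        g = torch.sigmoid(self.gate(torch.cat([local, seq], dim=1)))
        return g * local + (1 - g) * seq  # convex weighted sum of the two branches
```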
In a possible implementation, the acquisition unit 701 is specifically configured to extract the visual features of the text picture based on a topological structure to obtain the local information.
In the embodiments of the present application, a topology-based information extraction mode extracts the visual features of the text picture to obtain the local information; compared with the existing extraction mode based on regular convolution, the extraction manner in the embodiments of the present application yields more accurate local information.
In a possible implementation, the acquisition unit 701 is further specifically configured to compress the features of the text picture;
the acquisition unit 701 is further specifically configured to extract the structural features of the compressed text picture to obtain the sequence information.
In the embodiments of the present application, the features of the text picture are compressed first, and the structural features of the compressed text picture are then extracted to obtain the sequence information. Compared with the existing extraction mode that simply stacks a recurrent neural network on an extracted one-dimensional feature map, the extraction manner in the embodiments of the present application uses different feature compression modes and sequence information extraction modes, and can meet different text picture recognition requirements.
According to the embodiments of the present application, the units of the apparatus shown in FIG. 7 may be partly or wholly combined into one or several other units, or one or more of them may be split into multiple functionally smaller units; this achieves the same operations without affecting the technical effects of the embodiments of the present application. The above units are divided by logical function; in practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present application, the network-based device may also include other units, and in practical applications these functions may be implemented with the assistance of, or cooperatively by, multiple other units.
It should be noted that the implementation of each unit may also refer to the corresponding description of the method embodiment shown in FIG. 3 above.
In the text recognition apparatus 70 described in FIG. 7, in the process of recognizing the text picture based on the text recognition network, the local visual information and the context sequence information of the text picture are extracted in parallel and interactively fused, so that every level of the text recognition network can simultaneously use the binary information of the text picture, which solves the problem of missing or misplaced characters during recognition and improves the accuracy and efficiency of text recognition.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of an electronic device 80 provided by an embodiment of the present application. The electronic device 80 may include a memory 801 and a processor 802, and may further optionally include a communication interface 803 and a bus 804, where the memory 801, the processor 802 and the communication interface 803 are communicatively connected to one another through the bus 804. The communication interface 803 is used for data interaction with the above text recognition apparatus 70.
The memory 801 provides storage space, which may store data such as an operating system and computer programs. The memory 801 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM).
The processor 802 is a module that performs arithmetic and logical operations, and may be one or a combination of processing modules such as a central processing unit (CPU), a graphics processing unit (GPU) or a microprocessor unit (MPU).
The memory 801 stores a computer program, and the processor 802 calls the computer program stored in the memory 801 to execute the text recognition method shown in FIG. 3 above:
acquiring a text picture, the text picture being a picture that includes target text;
inputting the text picture into a text recognition network for recognition to obtain the target text; each level of the text recognition network simultaneously uses the local information and the sequence information of the text picture to recognize the target text, the local information including the structural information of the target text and the sequence information including the context sequence information of the target text.
For the specific content of the method executed by the processor 802, refer to FIG. 3 above; details are not repeated here.
Correspondingly, the processor 802, calling the computer program stored in the memory 801, may also be used to execute the method steps performed by the units of the text recognition apparatus 70 shown in FIG. 7 above; for the specific content, refer to FIG. 7 above, not repeated here.
In the electronic device 80 described in FIG. 8, in the process of recognizing the text picture based on the text recognition network, the local visual information and the context sequence information of the text picture are extracted in parallel and interactively fused, so that every level of the text recognition network can simultaneously use the binary information of the text picture, which solves the problem of missing or misplaced characters during recognition and improves the accuracy and efficiency of text recognition.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program runs on one or more processors, the method shown in FIG. 3 above can be implemented.
An embodiment of the present application further provides a computer program product comprising a computer program; when the computer program product runs on a processor, the method shown in FIG. 3 above can be implemented.
An embodiment of the present application further provides a chip comprising a processor configured to execute instructions; when the processor executes the instructions, the method shown in FIG. 3 above can be implemented. Optionally, the chip further includes a communication interface for inputting or outputting signals.
An embodiment of the present application further provides a system that includes at least one of the above text recognition apparatus 70, electronic device 80, or chip.
In summary, in the process of recognizing the text picture based on the text recognition network, the local visual information and the context sequence information of the text picture are extracted in parallel and interactively fused, so that every level of the text recognition network can simultaneously use the binary information of the text picture, which solves the problem of missing or misplaced characters during recognition and improves the accuracy and efficiency of text recognition.
A person of ordinary skill in the art can understand that all or part of the processes of the above method embodiments may be completed by hardware related to a computer program; the computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes media capable of storing computer program code, such as read-only memory (ROM), random access memory (RAM), magnetic disks, or optical discs.

Claims (10)

  1. A text recognition method, characterized by comprising:
    acquiring a text picture, the text picture being a picture that includes target text;
    inputting the text picture into a text recognition network for recognition to obtain the target text, wherein each encoding level of the text recognition network simultaneously uses local information and sequence information obtained by extraction from the text picture to recognize the target text, the local information comprising structural information of the target text, and the sequence information comprising context sequence information of the target text.
  2. The method according to claim 1, wherein the inputting the text picture into a text recognition network for recognition to obtain the target text comprises:
    acquiring the local information, and acquiring the sequence information;
    obtaining the target text according to a result of fusion processing of the local information and the sequence information.
  3. The method according to claim 2, wherein the obtaining the target text according to a result of fusion processing of the local information and the sequence information comprises:
    computing a weighted sum of the local information and the sequence information;
    obtaining the target text according to a result of the weighted sum of the local information and the sequence information.
  4. The method according to claim 2 or 3, wherein the acquiring the local information comprises:
    extracting visual features of the text picture based on a topological structure to obtain the local information.
  5. The method according to claim 4, wherein the acquiring the sequence information comprises:
    compressing features of the text picture;
    extracting structural features of the compressed text picture to obtain the sequence information.
  6. A text recognition apparatus, characterized by comprising:
    an acquisition unit configured to acquire a text picture, the text picture being a picture that includes target text;
    a recognition unit configured to input the text picture into a text recognition network for recognition to obtain the target text, wherein each level of the text recognition network simultaneously uses local information and sequence information of the text picture to recognize the target text, the local information comprising structural information of the target text, and the sequence information comprising context sequence information of the target text.
  7. The apparatus according to claim 6, wherein the acquisition unit is further configured to acquire the local information and to acquire the sequence information;
    the recognition unit is specifically configured to obtain the target text according to a result of fusion processing of the local information and the sequence information.
  8. A text recognition apparatus, characterized by comprising: a processor and a memory;
    the memory being configured to store computer-executable instructions;
    the processor being configured to execute the computer-executable instructions stored in the memory, so that the text recognition apparatus performs the method according to any one of claims 1 to 5.
  9. A computer-readable storage medium, characterized by comprising:
    the computer-readable storage medium being configured to store instructions or a computer program; when the instructions or the computer program are executed, the method according to any one of claims 1 to 5 is implemented.
  10. A computer program product, characterized by comprising: instructions or a computer program;
    when the instructions or the computer program are executed, the method according to any one of claims 1 to 5 is implemented.
PCT/CN2021/138066 2021-06-30 2021-12-14 Text recognition method and related device WO2023273196A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110740206.8A CN113627243B (zh) 2021-06-30 2021-06-30 Text recognition method and related device
CN202110740206.8 2021-06-30

Publications (1)

Publication Number Publication Date
WO2023273196A1 true WO2023273196A1 (zh) 2023-01-05

Family

ID=78378722

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138066 WO2023273196A1 (zh) Text recognition method and related device 2021-06-30 2021-12-14

Country Status (2)

Country Link
CN (1) CN113627243B (zh)
WO (1) WO2023273196A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627243B (zh) 2021-06-30 2022-09-30 中国科学院深圳先进技术研究院 Text recognition method and related device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508377A (zh) * 2018-11-26 2019-03-22 南京云思创智信息科技有限公司 Text feature extraction method and apparatus based on a fusion model, chatbot, and storage medium
CN111428593A (zh) * 2020-03-12 2020-07-17 北京三快在线科技有限公司 Character recognition method and apparatus, electronic device, and storage medium
US20210081695A1 (en) * 2018-05-30 2021-03-18 Samsung Electronics Co., Ltd. Image processing method, apparatus, electronic device and computer readable storage medium
CN112784841A (zh) * 2021-02-26 2021-05-11 北京市商汤科技开发有限公司 Text recognition method and apparatus
CN112990172A (zh) * 2019-12-02 2021-06-18 阿里巴巴集团控股有限公司 Text recognition method, character recognition method, and apparatus
CN113627243A (zh) * 2021-06-30 2021-11-09 中国科学院深圳先进技术研究院 Text recognition method and related device


Also Published As

Publication number Publication date
CN113627243B (zh) 2022-09-30
CN113627243A (zh) 2021-11-09

Similar Documents

Publication Publication Date Title
US11482023B2 (en) Method and apparatus for detecting text regions in image, device, and medium
US11423701B2 (en) Gesture recognition method and terminal device and computer readable storage medium using the same
CN111476067B (zh) Image text recognition method and apparatus, electronic device, and readable storage medium
US11741578B2 (en) Method, system, and computer-readable medium for improving quality of low-light images
US20210174135A1 (en) Method of matching image and apparatus thereof, device, medium and program product
US11398016B2 (en) Method, system, and computer-readable medium for improving quality of low-light images
WO2023174098A1 (zh) Real-time gesture detection method and apparatus
CN111191582B (zh) Three-dimensional object detection method, detection device, terminal device, and computer-readable storage medium
WO2022001091A1 (zh) Dangerous driving behavior recognition method and apparatus, electronic device, and storage medium
CN110852311A (zh) Method and apparatus for locating three-dimensional hand keypoints
CN114429637B (zh) Document classification method, apparatus, device, and storage medium
CN114519858B (zh) Document image recognition method and apparatus, storage medium, and electronic device
CN110599455A (zh) Display screen defect detection network model, method, apparatus, electronic device, and storage medium
WO2023273196A1 (zh) Text recognition method and related device
CN115810197A (zh) Multimodal power form recognition method and apparatus
CN111104941B (zh) Image orientation correction method and apparatus, and electronic device
CN115482529A (zh) Close-range color fruit image recognition method, device, storage medium, and apparatus
CN117746467A (zh) Cross-modal person re-identification method with modality enhancement and compensation
CN112287945A (zh) Screen-crack determination method and apparatus, computer device, and computer-readable storage medium
CN113727050B (zh) Video super-resolution processing method and apparatus for mobile devices, and storage medium
CN113887470B (zh) High-resolution remote sensing image ground-object extraction method based on a multi-task attention mechanism
CN116266259A (zh) Structured output method and apparatus for text in images, electronic device, and storage medium
CN114842482A (zh) Image classification method, apparatus, device, and storage medium
CN113971830A (zh) Face recognition method and apparatus, storage medium, and electronic device
Shumeng et al. A semantic segmentation method for remote sensing images based on multiple contextual feature extraction

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE