CN115984888A - Information generation method, information processing apparatus, electronic device, and medium

Publication number: CN115984888A
Authority: CN (China)
Prior art keywords: text, information, feature map, image, category
Legal status: Pending
Application number: CN202310023575.4A
Other languages: Chinese (zh)
Inventors: 于海鹏, 李煜林, 钦夏孟, 姚锟
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310023575.4A
Publication of CN115984888A

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure provides an information generation method, an information processing method, an apparatus, an electronic device, and a medium, relating to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing, and computer vision, and applicable to scenarios such as optical character recognition (OCR). The specific implementation scheme is as follows: performing text detection on a text image to obtain detection information, wherein the detection information comprises first detection information and second detection information, the first detection information comprises category information and first position information of each of a plurality of first text regions, and the second detection information comprises second position information of each of at least one second text region; acquiring text region images corresponding to the plurality of first text regions according to the first position information and the text image; performing text recognition on the text region images to obtain recognition information; and generating structured information of the text image according to the category information, the second detection information, and the recognition information.

Description

Information generation method, information processing apparatus, electronic device, and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to the fields of deep learning, image processing, and computer vision, and is applicable to scenarios such as optical character recognition (OCR). More particularly, it relates to an information generation method, an information processing method, an apparatus, an electronic device, and a medium.
Background
With the development of computer technology, artificial intelligence technology has also been developed. For example, artificial intelligence technology can be used to perform entity recognition and relationship extraction on an image containing text data to obtain text structured information in the image.
Disclosure of Invention
The present disclosure provides an information generation method, an information processing method, an apparatus, an electronic device, and a medium.
According to an aspect of the present disclosure, there is provided an information generation method including: performing text detection on a text image to obtain detection information, wherein the detection information comprises first detection information and second detection information, the first detection information comprises category information and first position information of each of a plurality of first text regions, the second detection information comprises second position information of each of at least one second text region, and the second text region comprises two first text regions whose category information satisfies a predetermined condition; acquiring text region images corresponding to the plurality of first text regions according to the first position information and the text image; performing text recognition on the text region images to obtain recognition information, wherein the recognition information comprises text recognition information of each of the text region images; and generating structured information of the text image according to the category information, the second detection information, and the recognition information.
According to another aspect of the present disclosure, there is provided an information processing method including: processing a text image to be processed by using the information generation method to acquire structured information of the text image to be processed; and performing information processing by using the structured information of the text image to be processed.
According to another aspect of the present disclosure, there is provided an information generation apparatus including: a text detection module configured to perform text detection on a text image to obtain detection information, wherein the detection information comprises first detection information and second detection information, the first detection information comprises category information and first position information of each of a plurality of first text regions, the second detection information comprises second position information of each of at least one second text region, and the second text region comprises two first text regions whose category information satisfies a predetermined condition; a first acquisition module configured to acquire, according to the first position information and the text image, text region images corresponding to the plurality of first text regions; a text recognition module configured to perform text recognition on the text region images to obtain recognition information, wherein the recognition information comprises text recognition information of each of the text region images; and a generation module configured to generate structured information of the text image according to the category information, the second detection information, and the recognition information.
According to another aspect of the present disclosure, there is provided an information processing apparatus including: a second acquisition module configured to process a text image to be processed by using the information generation apparatus to acquire structured information of the text image to be processed; and an information processing module configured to perform information processing by using the structured information of the text image to be processed.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described in the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically shows an exemplary system architecture to which the information generation method, the information processing method, and the apparatus according to the embodiments of the present disclosure can be applied;
FIG. 2 schematically shows a flow diagram of an information generation method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a method for text detection of a text image to obtain detection information according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates an example schematic diagram of a process of feature extraction on a text image to obtain a first feature map of at least one scale according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a flowchart of a method of obtaining a text region image corresponding to each of a plurality of first text regions from first location information and a text image according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a flowchart of a method of generating structured information of a text image according to category information, second detection information, and recognition information according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates an example schematic of an information generation process according to an embodiment of this disclosure;
FIG. 8 schematically shows an example schematic of an information generation process according to another embodiment of the disclosure;
FIG. 9 schematically shows a flow chart of an information processing method according to an embodiment of the present disclosure;
fig. 10 schematically shows a block diagram of an information generating apparatus according to an embodiment of the present disclosure;
fig. 11 schematically shows a block diagram of an information processing apparatus according to an embodiment of the present disclosure; and
fig. 12 schematically shows a block diagram of an electronic device adapted to implement an information generating method and an information processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is acquired or collected.
Fig. 1 schematically shows an exemplary system architecture to which the information generation method, the information processing method, and the apparatus according to the embodiments of the present disclosure can be applied.
It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the information generation method, the information processing method, and the information processing apparatus may be applied may include a terminal device, but the terminal device may implement the information generation method, the information processing method, and the information processing apparatus provided in the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is used to provide a medium of communication links between the first terminal device 101, the second terminal device 102, the third terminal device 103 and the server 105. The network 104 may include various connection types. E.g., at least one of wired and wireless communication links, etc. The terminal device may comprise at least one of the first terminal device 101, the second terminal device 102 and the third terminal device 103.
The user may interact with the server 105 via the network 104 using at least one of the first terminal device 101, the second terminal device 102 and the third terminal device 103 to receive or send messages or the like. At least one of the first terminal device 101, the second terminal device 102, and the third terminal device 103 may be installed with various communication client applications. For example, at least one of a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, and the like.
The first terminal device 101, the second terminal device 102, and the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing. For example, the electronic device may include at least one of a smartphone, a tablet, a laptop portable computer, a desktop computer, and the like.
The server 105 may be a server that provides various services. For example, the server 105 may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and remedies the defects of high management difficulty and weak service scalability in conventional physical hosts and VPS (Virtual Private Server) services.
It should be noted that the information generating method and the information processing method provided by the embodiments of the present disclosure may be generally executed by one of the first terminal device 101, the second terminal device 102, and the third terminal device 103. Accordingly, the information generating apparatus and the information processing apparatus provided in the embodiments of the present disclosure may also be provided in one of the first terminal device 101, the second terminal device 102, and the third terminal device 103.
Alternatively, the information generating method and the information processing method provided by the embodiments of the present disclosure may also be generally performed by the server 105. Accordingly, the information generating apparatus and the information processing apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The information generation method and the information processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 105 and is capable of communicating with at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. Accordingly, the information generating apparatus and the information processing apparatus provided in the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105.
It should be understood that the numbers of first terminal devices, second terminal devices, third terminal devices, networks, and servers in fig. 1 are merely illustrative. There may be any number of first terminal devices, second terminal devices, third terminal devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely used as a representation of the operations for description, and should not be construed as representing the execution order of the respective operations. The method need not be performed in the exact order shown, unless explicitly stated.
Fig. 2 schematically shows a flow chart of an information generation method according to an embodiment of the present disclosure.
As shown in FIG. 2, the method 200 includes operations S210-S240.
In operation S210, text detection is performed on the text image to obtain detection information.
In operation S220, a text region image corresponding to each of the plurality of first text regions is acquired according to the first position information and the text image.
In operation S230, text recognition is performed on the text region image to obtain recognition information.
In operation S240, structured information of the text image is generated according to the category information, the second detection information, and the recognition information.
According to an embodiment of the present disclosure, the detection information may include first detection information and second detection information. The first detection information may include category information and first position information of each of the plurality of first text regions. The second detection information may include second position information of each of the at least one second text region. The second text region may include two first text regions whose category information satisfies a predetermined condition. The recognition information may include text recognition information of each of the plurality of text region images.
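For illustration only, the detection information described above could be organized as in the following minimal Python sketch; the class and field names (FirstDetection, SecondDetection, DetectionInfo) are hypothetical and not part of the disclosure.
    from dataclasses import dataclass
    from typing import List, Tuple

    # A four-corner box: four (x, y) corner points in image coordinates.
    QuadBox = Tuple[Tuple[float, float], Tuple[float, float],
                    Tuple[float, float], Tuple[float, float]]

    @dataclass
    class FirstDetection:
        # First detection information for one first text region.
        category: str      # category information, e.g. "keyword" or "value"
        position: QuadBox  # first position information

    @dataclass
    class SecondDetection:
        # Second detection information for one second text region.
        position: QuadBox  # second position information

    @dataclass
    class DetectionInfo:
        first: List[FirstDetection]
        second: List[SecondDetection]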
According to an embodiment of the present disclosure, a text image may refer to an image including text content. The text content in the text image belongs to unstructured information, and the unstructured text content in the text image can be extracted according to the information generation method provided by the embodiment of the disclosure to generate the structured information of the text image.
According to an embodiment of the present disclosure, the text image may be of various types; for example, the text image may include a medical text image, a goods list text image, or a financial text image, etc. The file format of the text image may include JPG (Joint Photographic Experts Group), TIFF (Tag Image File Format), PNG (Portable Network Graphics), PDF (Portable Document Format), GIF (Graphics Interchange Format), and the like. The file format of the text image is not limited in the embodiments of the present disclosure.
According to the embodiment of the disclosure, the text image may be acquired by real-time acquisition, for example, the text image may be acquired by shooting or scanning the entity text. Alternatively, the text image may be stored in the database in advance, for example, for an electronic document including text information, the text image is obtained by capturing a picture of the document. Alternatively, the text image may be received from other terminal devices. The embodiment of the present disclosure does not limit the manner of acquiring the text image.
According to an embodiment of the present disclosure, after the text image is obtained, text detection may be performed on the text image by using a first text detection model to obtain first detection information corresponding to the text image. The first text detection model may be obtained by training a first predetermined model using a first training sample set and a first label set. The first predetermined model may comprise a deep learning model or a traditional model. The deep learning model may include a first text detection model based on candidate boxes, a first text detection model based on segmentation, or a first text detection model based on a mixture of the two, etc. The traditional model may include a first text detection model based on SWT (Stroke Width Transform), a first text detection model based on EdgeBoxes, or the like.
According to an embodiment of the present disclosure, the first detection information may include category information and first position information of each of the plurality of first text regions. The category information may characterize a category of the text content included in the first text region. The category information may include at least one of: a keyword category or a numerical category. The keyword category may characterize a category attribute of the text content included in the first text region. The numerical category may characterize content attributes of the text content included in the first text region.
For example, if a first text region includes the text content "hospital in city A", the category information of the first text region is a numerical category. If a first text region includes the text content "name", the category information of the first text region is a keyword category. If a first text region includes the text content "Zhang San", the category information of the first text region is a numerical category.
According to embodiments of the present disclosure, the first position information may characterize the position at which the first text region is located. The first position information may be used as a basis for extracting the text region image corresponding to the first text region from the text image. The first position information may be characterized using a first text detection box. The first text detection box may comprise a four-corner box, i.e., the first position information may be characterized using four corner coordinates.
According to an embodiment of the present disclosure, after the text image is obtained, text detection may be performed on the text image by using a second text detection model to obtain second detection information corresponding to the text image. The second text detection model may be obtained by training a second predetermined model using a second training sample set and a second label set.
According to an embodiment of the present disclosure, the second predetermined model may comprise at least one of: a second predetermined model based on the Connectionist Text Proposal Network (CTPN), a second predetermined model based on the Rotation Region Proposal Network (RRPN), a second predetermined model based on the Fused Text Segmentation Network (FTSN), a second predetermined model based on the Efficient and Accurate Scene Text detector (EAST), a second predetermined model based on SegLink, and a second predetermined model based on PixelLink.
According to an embodiment of the present disclosure, the second text region may include two first text regions whose category information satisfies a predetermined condition. The predetermined condition may be configured according to actual service requirements and is not limited herein. For example, the predetermined condition may be set such that the category information of the two first text regions is different, in which case the second text region may include one first text region whose category information is a keyword category (e.g., a first text region including the text content "name") and one first text region whose category information is a numerical category (e.g., a first text region including the text content "Zhang San").
Alternatively, the predetermined condition may be set such that the category information of the two first text regions is the same, in which case the second text region may include two first text regions whose category information is a keyword category (e.g., a first text region including the text content "name" and a first text region including the text content "gender"). Alternatively, the second text region may include two first text regions whose category information is a numerical category (e.g., a first text region including the text content "Zhang San" and a first text region including the text content "male").
According to an embodiment of the present disclosure, the second detection information may include second position information of each of the at least one second text region. The second position information may characterize the position at which the second text region is located. The second position information may be characterized using a second text detection box. The second text detection box may comprise a four-corner box, i.e., the second position information may be characterized using four corner coordinates.
According to an embodiment of the present disclosure, after obtaining the first position information of each of the plurality of first text regions, the text region image corresponding to each of the plurality of first text regions may be obtained by performing image segmentation processing on the text image according to the first position information using a predetermined image segmentation method. The predetermined image segmentation method may include at least one of: a threshold-based image segmentation method, a region-based image segmentation method, an edge-based image segmentation method, a specific theory-based image segmentation method, a gene coding-based image segmentation method, a wavelet transform-based image segmentation method, and a neural network-based image segmentation method.
According to an embodiment of the present disclosure, after the text region images corresponding to the plurality of first text regions are obtained, text recognition may be performed on the text region images by using a text recognition model to obtain recognition information. The recognition information may include text recognition information of each of the plurality of text region images. The text recognition information may be used to characterize the text content corresponding to a field composed of continuous characters in the text region image. The text recognition model may be obtained by training a third predetermined model using a third training sample set and a third label set. The third predetermined model may include a pattern matching model, a machine learning model, or a deep learning model. The deep learning model may include a text recognition model based on single-character recognition or a text recognition model based on holistic recognition.
According to an embodiment of the present disclosure, after the recognition information is obtained, structured information of the text image may be generated according to the category information, the second detection information, and the recognition information. For example, the second position information corresponding to a second text region may be determined based on the second detection information. The first text regions corresponding to the second position information are determined according to the second position information. The structured information of the text image is generated according to the category information corresponding to the first text regions and the text recognition information corresponding to the text region images.
According to an embodiment of the present disclosure, operations S210 to S240 may be performed by an electronic device. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal device may be the first terminal device 101, the second terminal device 102, or the third terminal device 103 in fig. 1.
According to embodiments of the present disclosure, since the second detection information includes the second position information of each of the at least one second text region, and each second text region includes two first text regions whose category information satisfies the predetermined condition, text regions whose category information belongs to the same pairing are aggregated. Further, since the structured information of the text image is generated from the category information, the second detection information, and the recognition information, the accuracy of the structured information is improved.
The foregoing is merely an exemplary embodiment and is not limiting; other information generation methods known in the art may also be included, as long as the accuracy of the structured information can be improved.
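As a non-authoritative illustration, operations S210 to S240 could be composed as in the following sketch; detector, recognizer, crop_region, and box_inside are hypothetical helpers (a crop sketch appears with the fig. 5 discussion and a containment sketch with the fig. 6 discussion), and "keyword"/"value" stand in for the two category types.
    def generate_structured_info(text_image, detector, recognizer):
        # S210: text detection yields first and second detection information.
        info = detector(text_image)
        # S220: extract a text region image for each first text region.
        crops = [crop_region(text_image, d.position) for d in info.first]
        # S230: text recognition on each text region image.
        texts = [recognizer(c) for c in crops]
        # S240: for each second text region, pair the two first text regions
        # it contains (a keyword and a value) into one structured entry.
        structured = {}
        for sd in info.second:
            members = [i for i, d in enumerate(info.first)
                       if box_inside(d.position, sd.position)]
            keys = [texts[i] for i in members
                    if info.first[i].category == "keyword"]
            vals = [texts[i] for i in members
                    if info.first[i].category == "value"]
            if keys and vals:
                structured[keys[0]] = vals[0]
        return structured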
The method shown in fig. 2 is further described with reference to fig. 3-8 in conjunction with specific embodiments.
According to an embodiment of the present disclosure, the text image includes a medical text image.
According to an embodiment of the present disclosure, medical text is an important way of preserving information in medical scenarios and contains a great deal of structured information about a user. Acquiring the structured information helps in understanding the health condition of the user, so that targeted analysis and processing can be performed. At the same time, a complete database and user profile may also be established. Medical text may exist in image form, and how to extract the required structured information from a medical text image is a technical difficulty in medical scenarios, which can be addressed by the information generation scheme provided by the embodiments of the present disclosure.
Fig. 3 schematically shows a flowchart of a method for performing text detection on a text image to obtain detection information according to an embodiment of the present disclosure.
As shown in fig. 3, the method 300 further specifies operation S210 in fig. 2, and the method 300 may include operations S311 to S315.
In operation S311, feature extraction is performed on the text image to obtain a first feature map of at least one scale.
In operation S312, a second feature map is obtained according to the first feature map of at least one scale.
In operation S313, a third feature map is obtained according to the first feature map of at least one scale.
In operation S314, first detection information is acquired according to the second feature map.
In operation S315, second detection information is acquired according to the third feature map.
According to embodiments of the present disclosure, a scale may refer to an image resolution. Each scale may have at least one first feature map corresponding to the scale.
According to the embodiment of the disclosure, the text image can be processed based on a single-stage tandem method, and the first feature map of at least one scale is obtained. Alternatively, the text image may be processed based on a multi-stage tandem method, resulting in a first feature map of at least one scale. Alternatively, the text image may be processed based on a multi-stage parallel method, resulting in a first feature map of at least one scale.
According to the embodiment of the disclosure, after the first feature map of at least one scale is obtained, the second feature map can be obtained according to the first feature map of at least one scale. For example, the first feature map of at least one scale may be fused to obtain a first fused feature map. And acquiring a second feature map according to the first fusion feature map. For example, the first fused feature map may be determined as the second feature map. Alternatively, the first fused feature map may be processed to obtain the second feature map.
According to the embodiment of the disclosure, after the first feature map of at least one scale is obtained, the third feature map can be obtained according to the first feature map of at least one scale. For example, the first feature map of at least one scale may be fused to obtain a second fused feature map. And acquiring a third feature map according to the second fusion feature map. For example, the second fused feature map may be determined as the third feature map. Alternatively, the second fused feature map may be processed to obtain a third feature map.
According to an embodiment of the present disclosure, the feature information represented by the second feature map and the third feature map differs at least in part; that is, the second feature map may represent first feature information of the first feature map, and the third feature map may represent second feature information of the first feature map. The first feature information and the second feature information differ at least in part. For example, the first feature information may characterize the feature information used to determine the category information and the first position information of the first text regions. The second feature information may characterize the feature information used to determine the second position information of the second text regions. For example, feature extraction may be performed on the text image by using a backbone network to obtain the first feature map of at least one scale. The first feature map of at least one scale is processed by using a first branch network to obtain the second feature map. The first feature map of at least one scale is processed by using a second branch network to obtain the third feature map. The model parameters of the first branch network and the second branch network differ at least in part.
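A minimal PyTorch sketch of the backbone-plus-two-branches structure described above, assuming the backbone returns a single first feature map with a fixed channel count; all module shapes are illustrative, not the disclosed network.
    import torch
    from torch import nn

    class TwoBranchDetector(nn.Module):
        def __init__(self, backbone: nn.Module, channels: int):
            super().__init__()
            self.backbone = backbone  # produces the first feature map(s)
            # First branch network: features for category + first position info.
            self.branch1 = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1))
            # Second branch network: same shape as branch1, but its
            # parameters are learned separately.
            self.branch2 = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1))

        def forward(self, image: torch.Tensor):
            first_fm = self.backbone(image)
            second_fm = self.branch1(first_fm)  # second feature map
            third_fm = self.branch2(first_fm)   # third feature map
            return second_fm, third_fm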
According to an embodiment of the present disclosure, after the second feature map is obtained, the first detection information may be acquired according to the second feature map. For example, a heat map of each of the plurality of first text regions may be obtained according to the second feature map. The category information and the first position information of each of the plurality of first text regions are determined according to the heat map of each of the plurality of first text regions. Alternatively, the second feature map may be processed based on a position regression method to obtain the category information and the first position information of each of the plurality of first text regions.
According to an embodiment of the present disclosure, after the third feature map is obtained, the second detection information may be acquired according to the third feature map. For example, a heat map of each of the at least one second text region may be derived based on the third feature map. The second position information of each of the at least one second text region is determined according to the heat map of each of the at least one second text region. Alternatively, the third feature map may be processed based on a position regression method to obtain the second position information of each of the at least one second text region.
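One plausible way to turn a per-region heat map into position information, sketched with OpenCV; the thresholding-plus-connected-components decoding below is an assumption for illustration, not the disclosed method.
    import cv2
    import numpy as np

    def boxes_from_heat_map(heat_map: np.ndarray, thresh: float = 0.5):
        # heat_map: (H, W) array of per-pixel text-region scores in [0, 1].
        binary = (heat_map > thresh).astype(np.uint8)
        num_labels, labels = cv2.connectedComponents(binary)
        boxes = []
        for label in range(1, num_labels):  # label 0 is the background
            ys, xs = np.nonzero(labels == label)
            boxes.append((int(xs.min()), int(ys.min()),
                          int(xs.max()), int(ys.max())))
        return boxes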
According to an embodiment of the present disclosure, operations S311 to S315 may be performed by an electronic device. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal device may be the first terminal device 101, the second terminal device 102, or the third terminal device 103 in fig. 1.
According to the embodiment of the disclosure, since the first feature map of at least one scale can provide richer information, the first detection information and the second detection information are obtained by using the first feature map of at least one scale, and the accuracy of the detection information is improved.
Operation S315 may include the following operations according to an embodiment of the present disclosure.
The second feature map and the third feature map are fused to obtain a fused feature map, and the second detection information is acquired according to the fused feature map.
According to an embodiment of the present disclosure, after the second feature map and the third feature map are obtained, the second feature map and the third feature map may be fused to obtain a fused feature map, and the second detection information may be acquired according to the fused feature map. For example, a heat map of each of the plurality of second text regions may be obtained from the fused feature map. The second position information of each of the plurality of second text regions is determined according to the heat map of each of the plurality of second text regions. Alternatively, the fused feature map may be processed based on a position regression method to obtain the second position information of each of the plurality of second text regions.
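A one-function sketch of the fusion step, assuming the two maps share spatial size; whether the fusion uses concatenation, addition, or another operation is a design choice not fixed here.
    import torch

    def fuse_feature_maps(second_fm: torch.Tensor, third_fm: torch.Tensor):
        # Channel-wise concatenation of (N, C1, H, W) and (N, C2, H, W);
        # element-wise addition would also work when C1 == C2.
        return torch.cat([second_fm, third_fm], dim=1)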
Operation S311 may include the following operations according to an embodiment of the present disclosure.
M stages of feature extraction are performed on the text image to obtain at least one first feature map corresponding to the M-th stage, and the first feature map of at least one scale is obtained according to the at least one first feature map corresponding to the M-th stage.
According to an embodiment of the present disclosure, the m-th stage has T_m parallel levels; the image resolutions of the first feature maps at the same parallel level are the same, and the image resolutions of the first feature maps at different parallel levels are different.
According to an embodiment of the present disclosure, M may be an integer greater than or equal to 1, m may be an integer greater than or equal to 1 and less than or equal to M, and T_m may be an integer greater than or equal to 1.
According to an embodiment of the present disclosure, the M stages may include an input stage, intermediate stages, and an output stage. The input stage may be referred to as the 1st stage. The output stage may be referred to as the M-th stage. The intermediate stages may refer to the 2nd to (M-1)-th stages. The number of parallel levels of each stage may be the same or different. In the 1st to (M-1)-th stages, the current stage may have at least one more parallel level than the previous stage. The M-th stage may have the same number of parallel levels as the (M-1)-th stage. M may be configured according to actual service requirements and is not limited herein. For example, M = 4. In the 1st to 3rd stages, the current stage may have at least one more parallel level than the previous stage. The 1st stage has T_1 = 2 parallel levels. The 2nd stage has T_2 = 3 parallel levels. The 3rd stage has T_3 = 4 parallel levels. The 4th stage has T_4 = 4 parallel levels.
According to the embodiment of the disclosure, the image resolutions of the first feature maps of the same parallel level are the same. The image resolution of the first feature map of different parallel levels is different, for example, the image resolution of the first feature map of the current parallel level is smaller than that of the first feature map of the previous parallel level. The image resolution of the first feature map of the current parallel level of the current stage may be determined according to the image resolution of the first feature map of the last parallel level of the previous stage. For example, the image resolution of the first feature map of the current stage may be obtained by down-sampling the image resolution of the first feature map of the upper parallel level of the previous stage.
According to an embodiment of the present disclosure, in the case where M > 1, performing M stages of feature extraction on the text image to obtain at least one first feature map corresponding to the M-th stage may include: in response to m = 1, performing feature extraction on the text image to obtain an intermediate first feature map of at least one scale corresponding to the 1st stage, and obtaining the first feature map of at least one scale corresponding to the 1st stage according to the intermediate first feature map of at least one scale corresponding to the 1st stage; and in response to 1 < m ≤ M, performing feature extraction on the first feature map of at least one scale corresponding to the (m-1)-th stage to obtain an intermediate first feature map of at least one scale corresponding to the m-th stage, and obtaining the first feature map of at least one scale corresponding to the m-th stage according to the intermediate first feature map of at least one scale corresponding to the m-th stage.
According to an embodiment of the present disclosure, in the case where M = 1, performing M stages of feature extraction on the text image to obtain at least one first feature map corresponding to the M-th stage may include: performing feature extraction on the text image to obtain an intermediate first feature map of at least one scale corresponding to the 1st stage, and obtaining the first feature map of at least one scale corresponding to the 1st stage according to the intermediate first feature map of at least one scale corresponding to the 1st stage.
According to an embodiment of the present disclosure, obtaining the first feature map of at least one scale according to at least one first feature map corresponding to the mth stage may include: and determining at least one first feature map corresponding to the Mth stage as the first feature map of at least one scale.
According to the embodiment of the disclosure, since the image resolutions of the first feature maps of the same parallel level are the same, and the image resolutions of the first feature maps of different parallel levels are different, it is possible to maintain the feature representation of high resolution throughout the feature extraction process and gradually increase the parallel level from high resolution to low resolution. Deep semantic information is directly extracted on the feature representation of high resolution instead of being used as supplement of low-level feature information of the image, so that the method has enough classification capability and avoids loss of effective spatial resolution. At least one parallel hierarchy can capture context information and obtain rich global and local information. In addition, the information is repeatedly exchanged on the parallel hierarchy to realize multi-scale fusion of the features, and more accurate position information can be obtained, so that the accuracy of the first detection information and the second detection information is improved.
According to the embodiment of the disclosure, in the case that M is an integer greater than 1, performing M stages of feature extraction on the text image to obtain at least one first feature map corresponding to an M-th stage may include the following operations.
Convolution processing is performed on the at least one first feature map corresponding to the (m-1)-th stage to obtain at least one intermediate first feature map corresponding to the m-th stage. Feature fusion is performed on the at least one intermediate first feature map corresponding to the m-th stage to obtain at least one first feature map corresponding to the m-th stage.
According to an embodiment of the present disclosure, m may be an integer greater than 1 and less than or equal to M.
According to an embodiment of the present disclosure, for the (m-1)-th stage, for a first feature map in the at least one first feature map, convolution processing may be performed on the first feature map to obtain an intermediate first feature map of the m-th stage; in this way, at least one intermediate first feature map of the m-th stage may be obtained.
According to an embodiment of the present disclosure, performing feature fusion on the at least one intermediate first feature map corresponding to the m-th stage to obtain at least one first feature map corresponding to the m-th stage may include: for an intermediate first feature map in the at least one intermediate first feature map corresponding to the m-th stage, fusing the intermediate first feature map with the intermediate first feature maps of the other parallel levels of the m-th stage to obtain the first feature map corresponding to the intermediate first feature map in the m-th stage. The other parallel levels may refer to at least part of the parallel levels of the m-th stage other than the parallel level at which the intermediate first feature map resides.
According to an embodiment of the present disclosure, performing feature fusion on the at least one intermediate first feature map corresponding to the m-th stage to obtain at least one first feature map corresponding to the m-th stage may include the following operation.
For the i-th parallel level among the T_m parallel levels, the first feature map corresponding to the i-th parallel level is obtained according to the other intermediate first feature maps corresponding to the i-th parallel level and the intermediate first feature map corresponding to the i-th parallel level.
According to an embodiment of the present disclosure, the other intermediate first feature maps corresponding to the i-th parallel level may be the intermediate first feature maps corresponding to at least part of the parallel levels other than the i-th parallel level among the T_m parallel levels. i may be an integer greater than or equal to 1 and less than or equal to T_m.
According to an embodiment of the present disclosure, in the case where 1 < i < T_m, the at least one first other intermediate first feature map is upsampled to obtain an upsampled first feature map corresponding to the at least one first other intermediate first feature map, and the at least one second other intermediate first feature map is downsampled to obtain a downsampled first feature map corresponding to the at least one second other intermediate first feature map. A first other intermediate first feature map refers to an other intermediate first feature map at a parallel level greater than the i-th parallel level among the T_m parallel levels. A second other intermediate first feature map refers to an other intermediate first feature map at a parallel level smaller than the i-th parallel level among the T_m parallel levels. The image resolution of the upsampled first feature map is the same as that of the intermediate first feature map of the i-th parallel level. The image resolution of the downsampled first feature map is the same as that of the intermediate first feature map of the i-th parallel level.
According to an embodiment of the present disclosure, in the case where i = 1, the at least one first other intermediate first feature map is upsampled to obtain an upsampled first feature map corresponding to the at least one first other intermediate first feature map. A first other intermediate first feature map refers to an other intermediate first feature map at a parallel level greater than the 1st parallel level among the T_m parallel levels. The image resolution of the upsampled first feature map is the same as that of the intermediate first feature map of the 1st parallel level.
According to an embodiment of the present disclosure, in the case where i = T_m, the at least one second other intermediate first feature map is downsampled to obtain a downsampled first feature map corresponding to the at least one second other intermediate first feature map. A second other intermediate first feature map refers to an other intermediate first feature map at a parallel level smaller than the T_m-th parallel level among the T_m parallel levels. The image resolution of the downsampled first feature map is the same as that of the intermediate first feature map of the T_m-th parallel level.
According to an embodiment of the present disclosure, the first feature map corresponding to the i-th parallel level is obtained from the upsampled first feature map corresponding to the at least one first other intermediate first feature map, the downsampled first feature map corresponding to the at least one second other intermediate first feature map, and the intermediate first feature map of the i-th parallel level. For example, the upsampled first feature map corresponding to the at least one first other intermediate first feature map, the downsampled first feature map corresponding to the at least one second other intermediate first feature map, and the intermediate first feature map of the i-th parallel level may be fused to obtain the first feature map corresponding to the i-th parallel level. The fusing may include at least one of: concatenation and addition.
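A sketch of the per-level fusion just described, assuming all parallel levels share a channel count and are ordered from highest to lowest image resolution; real implementations typically use strided convolutions for downsampling, which is simplified to interpolation here.
    import torch
    import torch.nn.functional as F

    def fuse_parallel_levels(inter_maps):
        # inter_maps: intermediate first feature maps of the T_m parallel
        # levels, ordered from highest to lowest image resolution.
        fused = []
        for i, target in enumerate(inter_maps):
            h, w = target.shape[-2:]
            acc = target
            for j, other in enumerate(inter_maps):
                if j == i:
                    continue
                # j > i: lower resolution, so this upsamples (first other maps);
                # j < i: higher resolution, so this downsamples (second other maps).
                acc = acc + F.interpolate(other, size=(h, w), mode="bilinear",
                                          align_corners=False)
            fused.append(acc)  # fusion by addition
        return fused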
Fig. 4 schematically illustrates an example schematic diagram of a process of performing M stages of feature extraction on a text image to obtain at least one first feature map corresponding to an M-th stage according to an embodiment of the disclosure.
As shown in fig. 4, in 400, M = 4; the stages are, for example, a 1st stage 401, a 2nd stage 402, a 3rd stage 403, and a 4th stage 404. The 1st stage 401 has two parallel levels, e.g., a 1st parallel level 405 and a 2nd parallel level 406. The 2nd stage 402 has three parallel levels, e.g., the 1st parallel level 405, the 2nd parallel level 406, and a 3rd parallel level 407. The 3rd stage 403 has four parallel levels, e.g., the 1st parallel level 405, the 2nd parallel level 406, the 3rd parallel level 407, and a 4th parallel level 408.
The at least one first feature map corresponding to the 4th stage may include a first feature map 409, a first feature map 410, a first feature map 411, and a first feature map 412. Furthermore, the "upper right arrows" between the last two columns of each stage in fig. 4 represent "upsampling", and the "lower left arrows" represent "downsampling".
Operation S311 may include the following operations according to an embodiment of the present disclosure.
Feature extraction of N cascade levels is performed on the text image to obtain the first feature map of at least one scale.
According to an embodiment of the present disclosure, N may be an integer greater than 1. N may be configured according to an actual service requirement, which is not limited herein. For example, N =4.
According to the embodiment of the disclosure, feature extraction of N cascade levels can be performed on a text image, and at least one first feature map corresponding to the N cascade levels is obtained. And obtaining a first feature map of at least one scale according to at least one first feature map corresponding to the N cascade levels. For example, for the nth cascade level of the N cascade levels, the first feature map of the scale corresponding to the nth cascade level is obtained according to the first feature maps of other cascade levels and the first feature map corresponding to the nth cascade level. The other cascade levels may refer to at least a portion of the N cascade levels other than the nth cascade level.
According to the embodiment of the disclosure, since the first feature map of at least one scale can provide richer information, the first detection information and the second detection information are subsequently determined according to the first feature map of at least one scale, and the accuracy of the first detection information and the second detection information can be improved.
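A sketch of one such cascade-level scheme under the assumption that it behaves like a top-down feature pyramid; lateral_convs (one 1x1 convolution per level, mapping each backbone output to a common channel count) is a hypothetical detail, not specified by the disclosure.
    import torch.nn.functional as F

    def cascade_first_feature_maps(level_maps, lateral_convs):
        # level_maps: backbone outputs of the N cascade levels, ordered from
        # highest to lowest image resolution.
        laterals = [conv(fm) for conv, fm in zip(lateral_convs, level_maps)]
        outs = [laterals[-1]]                 # start from the coarsest level
        for fm in reversed(laterals[:-1]):    # walk back to finer levels
            up = F.interpolate(outs[-1], size=fm.shape[-2:], mode="nearest")
            outs.append(fm + up)              # fuse with the coarser result
        return list(reversed(outs))           # first feature map per scale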
Fig. 5 schematically shows a flowchart of a method of acquiring a text region image corresponding to each of a plurality of first text regions from first position information and a text image according to an embodiment of the present disclosure.
As shown in fig. 5, the method 500 further specifies operation S220 in fig. 2, and the method 500 may include operations S521 to S522.
In operation S521, the first location information is converted into target location information using affine transformation.
In operation S522, images corresponding to the respective first text regions are extracted from the text image according to the target position information, resulting in text region images corresponding to the respective first text regions.
According to an embodiment of the present disclosure, an affine transformation is a linear transformation from two-dimensional coordinates to two-dimensional coordinates that maintains the "straightness" and "parallelism" of a two-dimensional figure. Straightness may be understood as meaning that a straight line remains a straight line after the transformation, without bending, and a circular arc remains a circular arc. Parallelism may be understood as keeping the relative positional relationship between different two-dimensional figures unchanged: parallel lines remain parallel lines, and the included angle of intersecting straight lines is unchanged. The affine transformation may be achieved by at least one of translation, scaling, flipping, rotation, and shearing, among others.
According to an embodiment of the present disclosure, converting the first location information corresponding to the first text region into the target location information using affine transformation may include: the first text region in the form of a quadrangular box may be converted into the first text region in the form of a rectangular box using affine transformation, and the first position information corresponding to the first text region in the form of a rectangular box is determined as target position information, so that text region images corresponding to a plurality of the first text regions may be extracted from the text image based on the target position information.
For example, the first text region is a four-corner box, which may be characterized by {P1, P2, P3, P4}, where P1 represents the point at the upper left corner of the four-corner box, P2 the point at the upper right corner, P3 the point at the lower left corner, and P4 the point at the lower right corner. P1 may be characterized as {x1, y1}, P2 as {x2, y2}, P3 as {x3, y3}, and P4 as {x4, y4}. The affine transformation maps P1→P1', P2→P2', P3→P3', and P4→P4', yielding a rectangular box {P1', P2', P3', P4'}, where P1' may be characterized as {x1', y1'}, P2' as {x2', y2'}, P3' as {x3', y3'}, and P4' as {x4', y4'}.
According to an embodiment of the present disclosure, operations S521 to S522 may be performed by an electronic device. The electronic device may comprise a server or a terminal device. The server may be the server 105 in fig. 1. The terminal device may be the first terminal device 101, the second terminal device 102, or the third terminal device 103 in fig. 1.
According to an embodiment of the present disclosure, since the target position information is obtained by converting the first position information using an affine transformation, the target position information can be determined automatically, which improves the efficiency and accuracy of determining the target position information. On this basis, since the text region images are obtained by extracting the images corresponding to the plurality of first text regions from the text image according to the target position information, the text region images can be extracted automatically, which improves the efficiency and accuracy of obtaining the text region images.
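A minimal OpenCV sketch of operations S521 to S522 under the corner convention of the example above (P1 upper left, P2 upper right, P3 lower left); an affine transform is fully determined by three point pairs, so the fourth corner is implied.
    import cv2
    import numpy as np

    def crop_text_region(image: np.ndarray, quad) -> np.ndarray:
        # quad: [P1, P2, P3, P4] corners of the four-corner box, each (x, y).
        p1, p2, p3 = (np.float32(quad[0]), np.float32(quad[1]),
                      np.float32(quad[2]))
        w = max(1, int(round(np.linalg.norm(p2 - p1))))  # width from top edge
        h = max(1, int(round(np.linalg.norm(p3 - p1))))  # height from left edge
        src = np.float32([p1, p2, p3])                # S521: three source corners
        dst = np.float32([[0, 0], [w, 0], [0, h]])    # target rectangle corners
        matrix = cv2.getAffineTransform(src, dst)     # affine transform matrix
        return cv2.warpAffine(image, matrix, (w, h))  # S522: extract the region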
Fig. 6 schematically shows a flowchart of a method of generating structured information of a text image according to category information, second detection information, and recognition information according to an embodiment of the present disclosure.
As shown in fig. 6, the method 600 further specifies operation S240 in fig. 2, and the method 600 may include operations S641 to S642.
In operation S641, category information and text recognition information corresponding to each of the at least one second text region are determined from the category information and the recognition information according to the second detection information.
In operation S642, structured information of the text image is generated according to the category information and the text recognition information corresponding to each of the at least one second text region.
According to an embodiment of the present disclosure, the category information may include one of a keyword category and a numerical category.
According to an embodiment of the present disclosure, the second position information of each of the at least one second text region may be determined according to the second detection information. For each second text region, the two first text regions included in that second text region may be determined according to its second position information. According to the two first text regions, the category information and the text identification information corresponding to each of the two first text regions may be determined from the category information and the identification information, and these are determined as the category information and the text identification information corresponding to the second text region.
According to an embodiment of the present disclosure, after the category information and the text identification information corresponding to each of the at least one second text region are determined, the category information and the text identification information corresponding to each second text region may be combined according to its second position information to generate the region structured information corresponding to that second text region. The structured information of the text image may then be generated according to the region structured information corresponding to each second text region.
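Purely as an illustrative sketch of this pairing step (the disclosure does not prescribe an implementation), the following Python code determines, for each second text region, the two first text regions it contains and combines their keyword-category and value-category text into one structured entry. The (x1, y1, x2, y2) box format, the category labels, and the assumption of exactly one keyword/value pair per second text region are all assumptions introduced here.

def contains(outer, inner):
    # True if axis-aligned box `inner` lies entirely inside box `outer`;
    # both boxes are (x1, y1, x2, y2) tuples.
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def build_structured_info(second_regions, first_regions):
    # first_regions: list of (box, category, text) with category in
    # {"keyword", "value"}; second_regions: list of boxes.
    structured = {}
    for outer in second_regions:
        members = [r for r in first_regions if contains(outer, r[0])]
        keyword = next(text for _, cat, text in members if cat == "keyword")
        value = next(text for _, cat, text in members if cat == "value")
        structured[keyword.rstrip(":")] = value
    return structured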
According to an embodiment of the present disclosure, operations S641 through S642 may be performed by an electronic device. The electronic device may include a server or a terminal device. The server may be the server 105 in fig. 1. The terminal device may be the first terminal device 101, the second terminal device 102, or the third terminal device 103 in fig. 1.
According to the embodiment of the present disclosure, the category information and the text recognition information corresponding to each of the at least one second text region are determined from the category information and the recognition information according to the second detection information, and the structured information of the text image is generated from them; this improves the accuracy of generating the structured information of the text image.
Fig. 7 schematically shows an example schematic of an information generation process according to an embodiment of the disclosure.
As shown in fig. 7, in 700, feature extraction may be performed on a text image 701 to obtain a first feature map 702 of at least one scale.
The second feature map 703 may be obtained according to the first feature map 702 of at least one scale. After the second feature map 703 is obtained, the first detection information 705 may be obtained according to the second feature map 703. The first detection information 705 may include first location information 7051 and category information 7052 for each of the plurality of first text regions.
The third feature map 704 may be obtained according to the first feature map 702 of at least one scale. After the third feature map 704 is obtained, the second detection information 706 may be obtained according to the third feature map 704. The second detection information 706 may include second location information 7061 for each of the at least one second text region.
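The disclosure does not fix a concrete network architecture for these two branches; the following PyTorch module is only an assumed sketch of how a shared first feature map could feed one head that yields the first detection information (box geometry plus category logits) and a second head that yields the second detection information (pair-region box geometry). All channel counts and layer choices are illustrative.

import torch
from torch import nn

class TwoHeadTextDetector(nn.Module):
    def __init__(self, in_channels=256, num_categories=2):
        super().__init__()
        # Branch corresponding to the second feature map -> first detection info.
        self.first_head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, 4 + num_categories, 1),  # box offsets + category logits
        )
        # Branch corresponding to the third feature map -> second detection info.
        self.second_head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, 4, 1),  # pair-region box offsets
        )

    def forward(self, first_feature_map):
        return self.first_head(first_feature_map), self.second_head(first_feature_map)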
The first location information 7051 may be converted to the target location information 707 using an affine transformation. Based on the target position information 707, images corresponding to the respective plurality of first text regions are extracted from the text image 701, resulting in text region images 708 corresponding to the respective plurality of first text regions.
After the text region images 708 corresponding to the respective plurality of first text regions are obtained, text recognition may be performed on the text region images 708 to obtain recognition information 709.
After obtaining the identification information 709, category information 710 corresponding to each of the at least one second text region may be determined from the category information 7052 and the second location information 7061. Text recognition information 711 corresponding to each of the at least one second text region is determined based on the recognition information 709 and the second position information 7061.
After the category information 710 and the text recognition information 711 corresponding to each of the at least one second text region are obtained, the structured information 712 of the text image may be generated according to the category information 710 and the text recognition information 711.
Fig. 8 schematically shows an example schematic of an information generation process according to another embodiment of the present disclosure.
As shown in fig. 8, in an information generation process 800, a first text detection model 802 and a second text detection model 803 may perform text detection on a text image 801 to obtain detection information. For example, the first text detection model 802 may perform text detection on the text image 801 to obtain first detection information. The first detection information may include category information and first position information of each of a plurality of first text regions. The plurality of first text regions may include a first text region 804_1, a first text region 804_2, a first text region 804_3, a first text region 804_4, a first text region 804_5, a first text region 804_6, a first text region 804_7, and a first text region 804_8.
For example, the category information may include a keyword category or a numerical category. The category information of the first text region 804_1, the first text region 804_3, the first text region 804_5, and the first text region 804_7 may be the keyword category. The category information of the first text region 804_2, the first text region 804_4, the first text region 804_6, and the first text region 804_8 may be the numerical category.
The second text detection model 803 may perform text detection on the text image 801 to obtain second detection information. The second detection information may include second position information of each of at least one second text region. The at least one second text region may include a second text region 804_9, a second text region 804_10, a second text region 804_11, and a second text region 804_12.
For example, the second text region 804_9 may include the first text region 804_1 and the first text region 804_2, whose category information satisfies a predetermined condition. The second text region 804_10 may include the first text region 804_3 and the first text region 804_4, whose category information satisfies a predetermined condition. The second text region 804_11 may include the first text region 804_5 and the first text region 804_6, whose category information satisfies a predetermined condition. The second text region 804_12 may include the first text region 804_7 and the first text region 804_8, whose category information satisfies a predetermined condition.
After the first detection information and the second detection information are obtained, a text region image corresponding to each of the plurality of first text regions may be acquired based on the first position information of each of the plurality of first text regions and the text image 801. The text region images corresponding to the plurality of first text regions may include a text region image 805_1 (i.e., a text region image including the text content "name:"), a text region image 805_2 (i.e., "Zhang San"), a text region image 805_3 (i.e., "gender:"), a text region image 805_4 (i.e., "male"), a text region image 805_5 (i.e., "age:"), a text region image 805_6 (i.e., "42 years"), a text region image 805_7 (i.e., "detection result:"), and a text region image 805_8 (i.e., "XX").
For example, the text region image 805_1 corresponding to the first text region 804_1 may be acquired from the first position information of the first text region 804_1 and the text image 801. The text region image 805_2 corresponding to the first text region 804_2 is acquired from the first position information of the first text region 804_2 and the text image 801. By analogy, the text region image 805_8 corresponding to the first text region 804_8 is acquired from the first position information of the first text region 804_8 and the text image 801.
After the text region images corresponding to the first text regions are obtained, a text recognition model 806 may perform text recognition on the text region images to obtain the text identification information of each text region image. The text identification information of the plurality of text region images may include text identification information 807_1 (i.e., "name:"), text identification information 807_2 (i.e., "Zhang San"), text identification information 807_3 (i.e., "gender:"), text identification information 807_4 (i.e., "male"), text identification information 807_5 (i.e., "age:"), text identification information 807_6 (i.e., "42 years"), text identification information 807_7 (i.e., "detection result:"), and text identification information 807_8 (i.e., "XX").
For example, the text recognition model 806 may be used to perform text recognition on the text region image 805_1 to obtain the text identification information 807_1 of the text region image 805_1. The text recognition model 806 performs text recognition on the text region image 805_2 to obtain the text identification information 807_2 of the text region image 805_2. By analogy, the text recognition model 806 is used to perform text recognition on the text region image 805_8 to obtain the text identification information 807_8 of the text region image 805_8.
After obtaining the text identification information of each of the plurality of text region images, structured information of the text image may be generated based on the category information, the second detection information, and the identification information.
For example, from the second text region 804_9, it can be determined that the first text region 804_1 corresponds to the keyword category and text identification information 807_1, and that the first text region 804_2 corresponds to the numerical category and text identification information 807_2. On this basis, the region structured information 808_1 of the text image (i.e., "name: Zhang San") can be generated.
Similarly, from the second text region 804_10, it can be determined that the first text region 804_3 corresponds to the keyword category and text identification information 807_3, and that the first text region 804_4 corresponds to the numerical category and text identification information 807_4, from which the region structured information 808_2 of the text image (i.e., "gender: male") can be generated.
From the second text region 804_11, it can be determined that the first text region 804_5 corresponds to the keyword category and text identification information 807_5, and that the first text region 804_6 corresponds to the numerical category and text identification information 807_6, from which the region structured information 808_3 of the text image (i.e., "age: 42 years") can be generated.
From the second text region 804_12, it can be determined that the first text region 804_7 corresponds to the keyword category and text identification information 807_7, and that the first text region 804_8 corresponds to the numerical category and text identification information 807_8, from which the region structured information 808_4 of the text image (i.e., "detection result: XX") can be generated.
The structured information of the text image can then be generated from the region structured information 808_1, the region structured information 808_2, the region structured information 808_3, and the region structured information 808_4 of the text image.
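For the walkthrough above, the final assembly can be pictured with a short Python sketch; the key/value pairs are taken from the example in fig. 8, and representing the structured information as a dictionary is only one possible serialization, not one prescribed by the disclosure.

# One (keyword text, value text) pair per second text region in fig. 8.
pairs = [
    ("name:", "Zhang San"),          # from second text region 804_9
    ("gender:", "male"),             # from second text region 804_10
    ("age:", "42 years"),            # from second text region 804_11
    ("detection result:", "XX"),     # from second text region 804_12
]
structured_info = {k.rstrip(":"): v for k, v in pairs}
# {'name': 'Zhang San', 'gender': 'male', 'age': '42 years', 'detection result': 'XX'}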
Fig. 9 schematically shows a flow chart of an information processing method according to an embodiment of the present disclosure.
As shown in fig. 9, the method 900 includes operations S910 to S920.
In operation S910, the text image to be processed is processed by using the information generation method 200, and the structured information of the text image to be processed is acquired.
In operation S920, information processing is performed using the structured information of the text image to be processed.
According to an embodiment of the present disclosure, the structured information of the text image to be processed may be determined by using the information generation method according to an embodiment of the present disclosure. That is, the text image to be processed is processed by using the information generation method to acquire its structured information, and information processing is then performed by using that structured information.
According to an embodiment of the present disclosure, operations S910 through S920 may be performed by an electronic device. The electronic device may include a server or a terminal device. The server may be the server 105 in fig. 1. The terminal device may be the first terminal device 101, the second terminal device 102, or the third terminal device 103 in fig. 1.
According to the embodiment of the present disclosure, the structured information of the text image to be processed is acquired by processing the text image to be processed with the information generation method, and since the structured information is generated by using both semantic relation information and visual information, the accuracy of the structured information is improved. On this basis, performing information processing with the structured information of the text image to be processed improves the accuracy of the information processing.
The information processing is not limited to this; other information processing methods known in the art may also be used, as long as the accuracy of the information processing can be improved.
Fig. 10 schematically shows a block diagram of an information generating apparatus according to an embodiment of the present disclosure.
As shown in fig. 10, the information generating apparatus 1000 may include a text detecting module 1010, a first obtaining module 1020, a text recognizing module 1030, and a generating module 1040.
The text detection module 1010 is configured to perform text detection on the text image to obtain detection information, where the detection information includes first detection information and second detection information, the first detection information includes category information and first position information of each of the plurality of first text regions, the second detection information includes second position information of each of at least one second text region, and the second text region includes two first text regions whose category information satisfies a predetermined condition.
A first obtaining module 1020, configured to obtain, according to the first position information and the text image, a text region image corresponding to each of the plurality of first text regions.
The text recognition module 1030 is configured to perform text recognition on the text region images to obtain recognition information, where the recognition information includes text recognition information of each of the text region images.
The generating module 1040 is configured to generate the structured information of the text image according to the category information, the second detection information, and the identification information.
According to an embodiment of the present disclosure, the text detection module 1010 may include a feature extraction sub-module, a first obtaining sub-module, a second obtaining sub-module, a third obtaining sub-module, and a fourth obtaining sub-module.
The feature extraction sub-module is configured to perform feature extraction on the text image to obtain a first feature map of at least one scale.
The first obtaining sub-module is configured to obtain a second feature map according to the first feature map of at least one scale.
The second obtaining sub-module is configured to obtain a third feature map according to the first feature map of at least one scale.
The third obtaining sub-module is configured to obtain the first detection information according to the second feature map.
The fourth obtaining sub-module is configured to obtain the second detection information according to the third feature map.
According to an embodiment of the present disclosure, the fourth obtaining sub-module may include a fusion unit and an obtaining unit.
The fusion unit is configured to fuse the second feature map and the third feature map to obtain a fused feature map.
The obtaining unit is configured to obtain the second detection information according to the fused feature map.
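The disclosure states only that the two feature maps are fused; one common realization, assumed here purely for illustration, is channel-wise concatenation followed by a 1x1 convolution, as in the following PyTorch sketch (element-wise addition would be another option).

import torch
from torch import nn

class FusionUnit(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Mix the concatenated channels back down to the original width.
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, second_feature_map, third_feature_map):
        # Assumes both maps share spatial size and channel count.
        fused = torch.cat([second_feature_map, third_feature_map], dim=1)
        return self.mix(fused)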
According to an embodiment of the present disclosure, the feature extraction sub-module may include a first feature extraction unit and an obtaining unit.
The first feature extraction unit is configured to perform feature extraction on the text image in M stages to obtain at least one first feature map corresponding to the M-th stage.
The obtaining unit is configured to obtain the first feature map of at least one scale according to the at least one first feature map corresponding to the M-th stage.
According to an embodiment of the present disclosure, the m-th stage has T_m parallel levels; the image resolutions of the first feature maps in the same parallel level are the same, and the image resolutions of the first feature maps in different parallel levels are different.
According to an embodiment of the present disclosure, M is an integer greater than or equal to 1, m is an integer greater than or equal to 1 and less than or equal to M, and T_m is an integer greater than or equal to 1.
According to an embodiment of the present disclosure, in the case where M is an integer greater than 1, the first feature extraction unit may include a convolution processing subunit and a feature fusion subunit.
The convolution processing subunit is configured to perform convolution processing on the at least one first feature map corresponding to the (m-1)-th stage to obtain at least one intermediate feature map corresponding to the m-th stage.
The feature fusion subunit is configured to perform feature fusion on the at least one intermediate feature map corresponding to the m-th stage to obtain at least one first feature map corresponding to the m-th stage.
According to an embodiment of the present disclosure, M is an integer greater than 1 and less than or equal to M.
According to an embodiment of the present disclosure, the feature fusion subunit may be configured to: for the i-th of the T_m parallel levels, obtain a first feature map corresponding to the i-th parallel level according to other intermediate feature maps corresponding to the i-th parallel level and the intermediate feature map corresponding to the i-th parallel level.
According to an embodiment of the present disclosure, the other intermediate feature maps corresponding to the i-th parallel level are the intermediate feature maps corresponding to at least some of the T_m parallel levels other than the i-th parallel level, and i is an integer greater than or equal to 1 and less than or equal to T_m.
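As an assumed illustration of this cross-level fusion, the following PyTorch sketch resizes every other parallel level's intermediate feature map to the resolution of level i and sums them with level i's own map. It assumes all levels share a channel count; practical multi-resolution networks typically insert 1x1 convolutions to match channels first.

import torch.nn.functional as F

def fuse_parallel_levels(intermediate_maps, i):
    # intermediate_maps: list of 4D tensors (N, C, H_j, W_j), one per parallel level.
    target = intermediate_maps[i]
    fused = target
    for j, other in enumerate(intermediate_maps):
        if j == i:
            continue
        # Bring level j to level i's spatial resolution before summing.
        resized = F.interpolate(other, size=target.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = fused + resized
    return fused  # first feature map for parallel level i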
According to an embodiment of the present disclosure, the feature extraction sub-module may include a second feature extraction unit.
The second feature extraction unit is configured to perform feature extraction at N cascade levels on the text image to obtain a first feature map of at least one scale, where N is an integer greater than 1.
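By contrast with the parallel-level scheme, the cascade scheme can be pictured as a chain of down-sampling stages whose outputs together form the first feature maps of at least one scale; the following PyTorch module is an assumed minimal example, with the channel width and number of cascade levels chosen arbitrarily.

from torch import nn

class CascadeBackbone(nn.Module):
    def __init__(self, in_channels=3, width=64, n_levels=4):
        super().__init__()
        levels, c = [], in_channels
        for _ in range(n_levels):
            # Each cascade level halves the spatial resolution.
            levels.append(nn.Sequential(
                nn.Conv2d(c, width, 3, stride=2, padding=1), nn.ReLU()))
            c = width
        self.levels = nn.ModuleList(levels)

    def forward(self, image):
        feature_maps, x = [], image
        for level in self.levels:
            x = level(x)
            feature_maps.append(x)  # one first feature map per scale
        return feature_maps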
According to an embodiment of the present disclosure, the first obtaining module 1020 may include a conversion sub-module and an obtaining sub-module.
The conversion sub-module is configured to convert the first position information into target position information using an affine transformation.
The obtaining sub-module is configured to extract images corresponding to the plurality of first text regions from the text image according to the target position information, to obtain the text region images corresponding to the plurality of first text regions.
According to an embodiment of the present disclosure, the category information includes one of a keyword category and a numerical category.
According to an embodiment of the present disclosure, the generating module 1040 may include a determining submodule and a generating submodule.
The determining sub-module is configured to determine, according to the second detection information, the category information and the text identification information corresponding to each of the at least one second text region from the category information and the identification information.
The generating sub-module is configured to generate the structured information of the text image according to the category information and the text identification information corresponding to each of the at least one second text region, wherein the structured information includes the keyword category, the text identification information corresponding to the keyword category, the numerical category, and the text identification information corresponding to the numerical category.
According to an embodiment of the present disclosure, the text image includes a medical text image.
Fig. 11 schematically shows a block diagram of an information processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 11, the information processing apparatus 1100 may include a second acquisition module 1110 and an information processing module 1120.
The second obtaining module 1110 is configured to process the text image to be processed by using the information generating apparatus 1000, and obtain the structured information of the text image to be processed.
The information processing module 1120 is configured to perform information processing by using the structured information of the text image to be processed.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the present disclosure.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium has stored thereon computer instructions for causing a computer to perform the methods described in the present disclosure.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements the methods described in the present disclosure.
Fig. 12 schematically shows a block diagram of an electronic device adapted to implement the information generation method and the information processing method according to embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the apparatus 1200 includes a computing unit 1201 which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
A plurality of components in the device 1200 are connected to the I/O interface 1205, including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208 such as a magnetic disk, an optical disk, or the like; and a communication unit 1209 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1201 executes the respective methods and processes described above, such as the information generation method and the information processing method. For example, in some embodiments, the information generation method and the information processing method may be implemented as computer software programs that are tangibly embodied on a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the information generating method and the information processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured in any other suitable way (e.g., by means of firmware) to perform the information generation method and the information processing method.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (16)

1. An information generating method, comprising:
performing text detection on the text image to obtain detection information, wherein the detection information comprises first detection information and second detection information, the first detection information comprises category information and first position information of a plurality of first text regions, the second detection information comprises second position information of at least one second text region, and the second text region comprises two first text regions of which the category information meets a preset condition;
acquiring text region images corresponding to the plurality of first text regions respectively according to the first position information and the text images;
performing text recognition on the text region images to obtain recognition information, wherein the recognition information comprises text recognition information of each of the text region images; and
and generating the structured information of the text image according to the category information, the second detection information, and the identification information.
2. The method of claim 1, wherein the text detection of the text image to obtain detection information comprises:
performing feature extraction on the text image to obtain a first feature map of at least one scale;
acquiring a second feature map according to the first feature map of at least one scale;
acquiring a third feature map according to the first feature map of at least one scale;
acquiring the first detection information according to the second feature map; and
acquiring the second detection information according to the third feature map.
3. The method of claim 2, wherein the obtaining the second detection information according to the third feature map comprises:
fusing the second feature map and the third feature map to obtain a fused feature map; and
acquiring the second detection information according to the fused feature map.
4. The method according to claim 2 or 3, wherein the extracting the features of the text image to obtain a first feature map of at least one scale comprises:
performing feature extraction on the text image in M stages to obtain at least one first feature map corresponding to the M-th stage; and
obtaining a first feature map of at least one scale according to at least one first feature map corresponding to the Mth stage;
wherein the m-th stage has T_m parallel levels, the image resolutions of the first feature maps in the same parallel level are the same, and the image resolutions of the first feature maps in different parallel levels are different;
wherein M is an integer greater than or equal to 1, m is an integer greater than or equal to 1 and less than or equal to M, and T_m is an integer greater than or equal to 1.
5. The method according to claim 4, wherein, in a case that M is an integer greater than 1, the performing M stages of feature extraction on the text image to obtain at least one first feature map corresponding to an M-th stage comprises:
performing convolution processing on at least one first feature map corresponding to the m-1 stage to obtain at least one intermediate feature map corresponding to the m stage; and
performing feature fusion on at least one intermediate feature map corresponding to the mth stage to obtain at least one first feature map corresponding to the mth stage;
wherein M is an integer greater than 1 and less than or equal to M.
6. The method according to claim 5, wherein the feature fusing the at least one intermediate feature map corresponding to the mth stage to obtain at least one first feature map corresponding to the mth stage comprises:
for the i-th of the T_m parallel levels,
obtaining a first feature map corresponding to the i-th parallel level according to other intermediate feature maps corresponding to the i-th parallel level and the intermediate feature map corresponding to the i-th parallel level;
wherein the other intermediate feature maps corresponding to the i-th parallel level are the intermediate feature maps corresponding to at least some of the T_m parallel levels other than the i-th parallel level, and i is an integer greater than or equal to 1 and less than or equal to T_m.
7. The method according to claim 2 or 3, wherein the extracting the features of the text image to obtain a first feature map of at least one scale comprises:
and performing N cascade levels of feature extraction on the text image to obtain a first feature map of the at least one scale, wherein N is an integer greater than 1.
8. The method according to any one of claims 1 to 7, wherein acquiring a text region image corresponding to each of the plurality of first text regions based on the first position information and the text image comprises:
converting the first location information into target location information using affine transformation; and
and extracting images corresponding to the first text regions from the text image according to the target position information to obtain text region images corresponding to the first text regions.
9. The method of any of claims 1-8, wherein the category information includes one of a keyword category and a numerical category;
wherein the generating of the structured information of the text image according to the category information, the second detection information, and the identification information includes:
determining category information and text identification information corresponding to the at least one second text region from the category information and the identification information according to the second detection information; and
and generating structured information of the text image according to the category information and the text identification information which respectively correspond to the at least one second text region, wherein the structured information comprises the keyword category, the text identification information which corresponds to the keyword category, the numerical value category and the text identification information which corresponds to the numerical value category.
10. The method of any of claims 1-9, wherein the text image comprises a medical text image.
11. An information processing method comprising:
processing a text image to be processed by using the method according to any one of claims 1 to 10, and acquiring structural information of the text image to be processed; and
and performing information processing by using the structural information of the text image to be processed.
12. An information generating apparatus comprising:
a text detection module configured to perform text detection on a text image to obtain detection information, wherein the detection information comprises first detection information and second detection information, the first detection information comprises category information and first position information of a plurality of first text regions, the second detection information comprises second position information of at least one second text region, and the second text region comprises two first text regions of which the category information meets a preset condition;
a first obtaining module, configured to obtain, according to the first location information and the text image, text region images corresponding to the plurality of first text regions, respectively;
a text recognition module configured to perform text recognition on the text region images to obtain recognition information, wherein the recognition information comprises respective text recognition information of the text region images; and
a generating module configured to generate the structured information of the text image according to the category information, the second detection information, and the identification information.
13. An information processing apparatus comprising:
a second obtaining module, configured to process the text image to be processed by using the apparatus according to claim 12, and obtain structural information of the text image to be processed; and
and the information processing module is used for processing information by using the structural information of the text image to be processed.
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 11.
15. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of claims 1 to 11.
16. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.
Priority Applications (1)

CN202310023575.4A (priority date 2023-01-06, filing date 2023-01-06): Information generation method, information processing apparatus, electronic device, and medium. Status: Pending.

Publications (1)

CN115984888A, published 2023-04-18.



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination