CN111914654A - Text layout analysis method, device, equipment and medium - Google Patents


Info

Publication number
CN111914654A
Authority
CN
China
Prior art keywords
resolution
semantic features
stage
text image
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010635621.2A
Other languages
Chinese (zh)
Inventor
王波
张百灵
周炬
朱华柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Auntec Co ltd
Original Assignee
Auntec Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Auntec Co ltd
Priority to CN202010635621.2A
Publication of CN111914654A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a text layout analysis method, a text layout analysis device, a text layout analysis medium and electronic equipment. The method comprises the following steps: acquiring a text image to be analyzed and preprocessing the text image; and inputting the text image into a semantic segmentation algorithm model for layout analysis to determine the layout elements in the text image. The semantic segmentation algorithm model comprises an encoding stage and a decoding stage. The encoding stage fuses, by element-wise addition, the high-level semantic features from different stages of the residual network model with the high-resolution semantic features in the high-resolution network branch. The decoding stage up-samples the high-level semantic features extracted at the last stage of the encoding stage and fuses them, by concatenation, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image. The embodiment of the invention improves the recognition effect of layout analysis.

Description

Text layout analysis method, device, equipment and medium
Technical Field
The embodiment of the invention relates to an image processing technology based on deep learning, in particular to a text layout analysis method, a text layout analysis device, text layout analysis equipment and a text layout analysis medium.
Background
With the exponential growth in the production and storage of electronic documents, higher requirements are placed on automatic document retrieval and layout analysis. Complex background images cause poor robustness and generalization that limit image recognition, so pixel-level semantic segmentation methods are increasingly used to solve the layout analysis problem.
In the prior art, semantic segmentation can provide accurate pixel-level localization, and a typical semantic segmentation network is divided into an encoding stage and a decoding stage. The encoding stage performs down-sampling to obtain a large receptive field, which loses spatial information. The decoding stage uses multi-layer skip connections to achieve high-precision, high-efficiency segmentation, which causes problems such as low feature-fusion efficiency, high-level semantic features being covered by low-level, low-semantic features, slow memory access, and long inference time.
Disclosure of Invention
The embodiment of the invention provides a text layout analysis method, a text layout analysis device, text layout analysis equipment and a text layout analysis medium, so as to reduce the loss of spatial information in the layout analysis process and achieve the aim of improving the recognition result of layout analysis.
In a first aspect, an embodiment of the present invention provides a text layout analysis method, including:
acquiring a text image to be analyzed, and preprocessing the text image;
inputting the text image into a semantic segmentation algorithm model for layout analysis to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for fusing, by element-wise addition, the high-level semantic features of different stages in the residual network model with the high-resolution semantic features in the high-resolution network branch;
and the decoding stage is used for up-sampling the high-level semantic features extracted at the last stage of the encoding stage and fusing them, by concatenation, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
In a second aspect, an embodiment of the present invention provides a text layout analysis apparatus, including:
the preprocessing module is used for acquiring a text image to be analyzed and preprocessing the text image;
the layout analysis module is used for inputting the text image into a semantic segmentation algorithm model for layout analysis so as to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for fusing, by element-wise addition, the high-level semantic features of different stages in the residual network model with the high-resolution semantic features in the high-resolution network branch;
and the decoding stage is used for up-sampling the high-level semantic features extracted at the last stage of the encoding stage and fusing them, by concatenation, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text layout analysis method provided by any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the text layout analysis method provided in any embodiment of the present invention.
According to the text layout analysis method, the text image to be analyzed is first preprocessed, layout analysis is then performed through a semantic segmentation algorithm model, and the layout elements in the text image are determined. This achieves the technical effect that low-level features retain high resolution while more high-level semantic features are integrated, solves the problem in existing layout analysis technology that the top-down skip-connection style of feature fusion causes low-level semantic features to cover high-level semantic features so that the latter become gradually blurred, and improves the recognition effect of layout analysis.
Drawings
Fig. 1 is a flowchart of a text layout analysis method according to a first embodiment of the present invention;
fig. 2 is a schematic structural diagram of a text layout analysis method according to an embodiment of the present invention;
fig. 3 is a flowchart of a text layout analysis method in the second embodiment of the present invention;
fig. 4 is a flowchart of a text layout analysis method according to a third embodiment of the present invention;
fig. 5 is a flowchart of a text layout analysis method in a fourth embodiment of the present invention;
fig. 6 is a schematic structural diagram of a text layout analysis apparatus in a sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a text layout analysis method according to an embodiment of the present invention. The present embodiment is applicable to various layout analysis scenarios with complex text backgrounds. The method may be implemented by a text layout analysis recognition device, which may be implemented in software and/or hardware, and may specifically be integrated in an electronic device with storage and computing capabilities for performing text layout analysis.
As shown in fig. 1, a text layout analysis method is provided, which specifically includes the following steps:
step 110, acquiring a text image to be analyzed, and preprocessing the text image;
the text image may be a text image containing complex backgrounds such as multiple elements, multiple structures, multiple scales, etc., and is used as an original analysis image for extracting image features.
The preprocessing of the text image may be part of the training process of the semantic segmentation algorithm model, which specifically includes: preprocessing a sample image; and inputting the preprocessed sample image into the semantic segmentation algorithm model for training. As shown in fig. 2, the preprocessing of the text image includes at least one of: random rotation, random scaling, random cropping, random flipping, random contrast/brightness enhancement, random RGB-grayscale-RGB color space conversion, random addition of different Gaussian or salt-and-pepper noise, image normalization, and Gaussian bilateral filtering.
It should be noted that the connection manner shown in fig. 2 is only an example and does not limit the order or combination of the preprocessing operations.
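By way of non-limiting illustration, two of the preprocessing operations listed above (image normalization and random flipping) can be sketched in pure Python as follows. The function names, the mean/std values, and the restriction of augmentation to the training phase are illustrative assumptions, not part of this disclosure:

```python
import random

def normalize(img, mean, std):
    """Image normalization: scale pixels to roughly zero mean / unit variance."""
    return [[(p - mean) / std for p in row] for row in img]

def random_hflip(img, p=0.5, rng=None):
    """Random horizontal flip, one of the random augmentations listed above."""
    rng = rng or random
    return [row[::-1] for row in img] if rng.random() < p else img

def preprocess(img, mean=127.5, std=127.5, train=True, rng=None):
    """Sketch of the preprocessing pipeline: augment only during training."""
    if train:
        img = random_hflip(img, rng=rng)
    return normalize(img, mean, std)

img = [[0, 255], [255, 0]]
print(preprocess(img, train=False))  # [[-1.0, 1.0], [1.0, -1.0]]
```

In a real pipeline the same normalization (but not the random augmentations) would also be applied to test images, as described in Example two below.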
Step 120, inputting the text image into a semantic segmentation algorithm model for layout analysis to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for fusing, by element-wise addition, the high-level semantic features of different stages in the residual network model with the high-resolution semantic features in the high-resolution network branch;
and the encoding stage is a partial algorithm in the semantic segmentation algorithm model and is used for encoding the preprocessed text image to obtain high semantic features of different stages and high semantic features in high-resolution network branches, and feature fusion is performed in an element addition mode. In the encoding stage, high semantic features of different stages can be acquired, and the acquired high semantic features of each stage have high-resolution semantic features corresponding to the high semantic features in the high-resolution network branches.
And the high semantic features of different stages in the residual error network model are not extracted simultaneously but extracted sequentially.
The high-resolution network branches all have high-resolution semantic features corresponding to the high-resolution network branches, which are semantic features of the high-resolution network branches that need to be subjected to fusion processing, before fusion, the high-resolution network branches all have high-resolution semantic features corresponding to the high-resolution network branches, which can be high-resolution semantic features of the text image after preprocessing, and after fusion, the semantic features to be fused can be semantic features obtained by fusing at least one stage of high-resolution semantic features with the corresponding high-resolution semantic features, and have attributes of the high-resolution semantic features.
And performing feature fusion in an element addition mode, namely performing corresponding addition process of semantic features on the at least one stage high semantic feature and the corresponding high-resolution semantic feature according to the elements.
The decoding stage is used for up-sampling the high-level semantic features extracted at the last stage of the encoding stage and fusing them, by concatenation, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
The decoding stage is part of the semantic segmentation algorithm model and can be regarded as the subsequent processing of the image features extracted in the encoding stage. Specifically, the high-level semantic features extracted at the last stage of the encoding stage are up-sampled and concatenated with the high-resolution semantic features output by the last feature fusion unit, and the layout elements are determined from the concatenated result.
The layout elements can be understood as the image features output by the semantic segmentation algorithm model, namely the segmentation and recognition results for the different layout elements of the text image.
According to the text layout analysis method, the text image to be analyzed is first preprocessed, layout analysis is then performed through a semantic segmentation algorithm model, and the layout elements in the text image are determined. This achieves the technical effect that low-level features retain high resolution while more high-level semantic features are integrated, solves the problem in existing layout analysis technology that the top-down skip-connection style of feature fusion causes low-level semantic features to cover high-level semantic features so that the latter become gradually blurred, and improves the recognition effect of layout analysis.
Example two
The embodiment of the invention provides a text layout analysis method, which specifically comprises the following steps:
step 210, acquiring a text image to be analyzed, and preprocessing the text image;
step 211, performing random data enhancement operation on the text image in a deep network model training stage;
step 212 performs image normalization and gaussian bilateral filtering on the training and test text images.
Step 220, inputting the text image into a semantic segmentation algorithm model for layout analysis to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for fusing, by element-wise addition, the high-level semantic features of different stages in the residual network model with the high-resolution semantic features in the high-resolution network branch;
and the decoding stage is used for up-sampling the high-level semantic features extracted at the last stage of the encoding stage and fusing them, by concatenation, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
The encoding stage is composed of a residual network model and a DenseASPP model, and high-level semantic features of different stages are extracted through the residual network model as follows: the residual network model Resnet-50 contains 4 network units; each network unit extracts the high-level semantic features of the corresponding stage. Extracting the high-level semantic features of the corresponding stage comprises: the network units of the 4 different stages each comprise several bottleneck residual modules; in each of the first 3 network units, the 1st bottleneck residual module down-samples the input text image features to update the resolution of the current unit's input features, while the subsequent bottleneck residual modules in each unit extract high-level semantic features, and the high-level semantic features extracted by the current network unit are input both to the next network unit and to the feature fusion unit as the high-level semantic features output by the current stage; the bottleneck residual modules in the fourth network unit adopt dilated/atrous convolution to expand the receptive field while keeping the fourth-stage feature resolution; and the high-level semantic features output by the fourth network unit are input into the DenseASPP model for multi-scale feature fusion to extract the final high-level semantic features.
Each network unit comprises several bottleneck residual modules, wherein the first network unit comprises 3 bottleneck residual modules, the second comprises 4, the third comprises 6, and the fourth comprises 3.
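The resolutions produced by the encoder described above can be sketched with simple shape bookkeeping: a stem that down-samples to 1/4, four Resnet-50 units with (3, 4, 6, 3) bottleneck modules, the first three units halving the resolution in their first bottleneck, and the fourth keeping its resolution via dilated convolution. The input size and the helper name below are illustrative assumptions:

```python
def encoder_resolutions(h, w):
    """Track feature-map resolutions through the encoder described above."""
    res = [(h // 4, w // 4)]       # after the 3-layer stem: 1/4 size
    for unit in range(4):
        ph, pw = res[-1]
        if unit < 3:               # first bottleneck of units 1-3 down-samples 2x
            ph, pw = ph // 2, pw // 2
        # unit 4: dilated/atrous convolution keeps the stage-4 resolution
        res.append((ph, pw))
    return res

print(encoder_resolutions(512, 512))
# [(128, 128), (64, 64), (32, 32), (16, 16), (16, 16)]
```

Note that the last two entries are equal: the fourth unit expands the receptive field without further down-sampling, which is what preserves spatial detail for the decoder.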
As shown in fig. 3, the encoding stage in the semantic segmentation algorithm model specifically includes:
The preprocessed text image features extracted in step S212 are input into the FLN (fusion network), a 3-layer network in which each layer consists of a 3 × 3 convolutional layer, a BN (Batch Normalization) layer and a ReLU (Rectified Linear Unit) activation layer; after the 3-layer network, the features of the preprocessed text image are down-sampled to 1/4 of the original size. The fusion network then splits into two network branches with different resolutions: the first branch is composed of the residual network model and the DenseASPP model, and the second branch is composed of 3 HFMs (feature fusion modules). The high-level semantic features of the preprocessed text image are input into the first network unit of Resnet-50 to extract the high-level semantic features output by the first stage, which are then input both into the second network unit of Resnet-50 and into the first HFM. In the first HFM, the first-stage high-level semantic features are fused with the corresponding high-resolution semantic features of the high-resolution network branch, and the fused semantic features serve as the high-resolution semantic features corresponding to the second stage. This operation is repeated to extract the high-level semantic features output by the fourth stage and the high-resolution semantic features output by the third HFM. The high-level semantic features output by the fourth network unit are input into the DenseASPP model for multi-scale feature fusion to extract the final high-level semantic features.
The DenseASPP model is composed of a base network layer and a series of cascaded atrous convolutional layers. It combines the advantages of using atrous convolutional layers both in parallel and in cascade, so that its final output feature map not only covers a large range of semantic information but also covers that range in a very dense manner.
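The "large range" covered by cascaded atrous convolutions can be made concrete with a receptive-field calculation: a single 3 × 3 convolution with dilation d has receptive field 2d + 1, and each additional cascaded layer widens it by 2d. The dilation rates below are illustrative (those commonly cited for DenseASPP), not values taken from this patent:

```python
def cascade_receptive_field(dilations, kernel=3):
    """Receptive field of a cascade of dilated `kernel` x `kernel` convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d   # each cascaded layer widens the RF by 2*d
    return rf

rates = [3, 6, 12, 18, 24]       # illustrative dilation rates
print(cascade_receptive_field(rates))  # 127
```

Because DenseASPP densely connects every layer to all later layers, intermediate combinations of these rates also appear, which is why the coverage is dense as well as large.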
Optionally, after the text image is input into the semantic segmentation algorithm model for layout analysis to determine the layout elements in the text image, the method further includes: after binarizing the pixels of the determined layout elements, applying morphological operations such as dilation and erosion, together with conditions such as minimum area, length, width and height thresholds, to determine the category and region bounding-box coordinates of each layout element.
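A minimal pure-Python sketch of this optional post-processing follows: dilate a binary mask, then keep connected regions above a minimum area and report their bounding boxes. The structuring element, connectivity, threshold and helper names are illustrative assumptions:

```python
def dilate(mask):
    """3x3 binary dilation of a 2-D 0/1 mask."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(mask[ny][nx]
                   for ny in range(max(0, y - 1), min(h, y + 2))
                   for nx in range(max(0, x - 1), min(w, x + 2))):
                out[y][x] = 1
    return out

def boxes(mask, min_area=2):
    """Bounding boxes (x0, y0, x1, y1) of 4-connected regions with area >= min_area."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    result = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                stack, pts = [(y, x)], []
                seen[y][x] = True
                while stack:                     # flood fill one region
                    cy, cx = stack.pop()
                    pts.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(pts) >= min_area:         # area threshold
                    ys = [p[0] for p in pts]
                    xs = [p[1] for p in pts]
                    result.append((min(xs), min(ys), max(xs), max(ys)))
    return result

m = [[0, 1, 1, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 0]]
print(boxes(m))  # [(1, 0, 2, 1)]
```

In practice these operations would be performed per layout-element class, so each box carries the class label of the mask it came from.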
According to the text layout analysis method, the text image to be analyzed is first preprocessed, layout analysis is then performed through a semantic segmentation algorithm model, and the layout elements in the text image are determined. This achieves the technical effect that low-level features retain high resolution while more high-level semantic features are integrated, solves the problem in existing layout analysis technology that the top-down skip-connection style of feature fusion causes low-level semantic features to cover high-level semantic features so that the latter become gradually blurred, and improves the recognition effect of layout analysis.
Example three
The embodiment of the invention provides a text layout analysis method, which specifically comprises the following steps:
step 310, acquiring a text image to be analyzed, and preprocessing the text image;
step 320, inputting the text image into a semantic segmentation algorithm model for layout analysis to determine layout elements in the text image;
the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for fusing, by element-wise addition, the high-level semantic features of different stages in the residual network model with the high-resolution semantic features in the high-resolution network branch;
and the decoding stage is used for up-sampling the high-level semantic features extracted at the last stage of the encoding stage and fusing them, by concatenation, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
As shown in fig. 4, the fusion process of the feature fusion module specifically includes the following steps:
The high-resolution features of the high-resolution network branch pass through a 3 × 3 convolutional layer and a BN (Batch Normalization) layer to obtain the high-resolution features to be fused. Meanwhile, the high-level semantic features output by the first stage of the residual network model pass through a 1 × 1 convolutional layer and a BN layer, reducing their channel count to match that of the high-resolution features to be fused, and bilinear interpolation up-sampling makes their size consistent with that of the high-resolution features to be fused, giving the processed first-stage high-resolution fusion semantic features. The processed first-stage features and the high-resolution features to be fused are then fused by element-wise addition to obtain the fused high-resolution semantic features. After a ReLU activation layer, the fused high-resolution semantic features are processed sequentially by a 3 × 3 convolutional layer, a BN layer and a ReLU activation layer, and the result serves as the high-resolution semantic features input to the second stage. These steps are repeated to obtain the high-level semantic features output by the fourth stage of the residual network model and the high-resolution semantic features output by the third feature fusion unit;
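The core of the HFM step above, bringing the low-resolution semantic features to the high-resolution branch's size, adding element-wise, then applying ReLU, can be sketched as follows. Nearest-neighbor up-sampling stands in for the bilinear interpolation in the text, channel handling is omitted, and all names are illustrative:

```python
def upsample2x(f):
    """Nearest-neighbor 2x up-sampling of a 2-D feature map."""
    out = []
    for row in f:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def relu(f):
    return [[max(0.0, v) for v in row] for row in f]

def fuse(high_res, low_res):
    """Element-wise addition fusion followed by ReLU (encoder HFM sketch)."""
    up = upsample2x(low_res)
    summed = [[a + b for a, b in zip(r1, r2)]
              for r1, r2 in zip(high_res, up)]
    return relu(summed)

hr = [[1.0, -1.0], [0.5, 0.5]]   # high-resolution branch features
lr = [[-2.0]]                    # low-resolution semantic features
print(fuse(hr, lr))  # [[0.0, 0.0], [0.0, 0.0]]
```

Element-wise addition keeps the channel count unchanged, which is why the 1 × 1 convolution in the text must first reduce the semantic features to the same number of channels.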
The high-level semantic features output by the fourth stage are input into the DenseASPP model for processing, up-sampled, and fused by concatenation with the high-resolution semantic features output by the third feature fusion unit, and the fused output image features are determined through a 3 × 3 convolutional layer and a BN + ReLU layer.
Optionally, the decoding stage, which up-samples the high-level semantic features extracted at the last stage of the encoding stage and fuses them by concatenation with the high-resolution features output by the last feature fusion unit of the encoding stage, further includes: applying the sequence 3 × 3 convolutional layer + BN layer + ReLU activation layer twice to the fused output image features, followed by 4× bilinear interpolation up-sampling to restore the output feature map to the resolution of the text image.
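The decoder fusion above can be sketched by treating feature maps as lists of 2-D channels: "splicing" is concatenation along the channel dimension, followed by up-sampling back toward the input resolution. Nearest-neighbor up-sampling again stands in for bilinear interpolation, both inputs are assumed to already share the same spatial size (as they do after the up-sampling step in the text), and all names are illustrative:

```python
def upsample(f, factor):
    """Nearest-neighbor up-sampling of a 2-D channel by an integer factor."""
    out = []
    for row in f:
        wide = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out

def decode(semantic, high_res, factor=4):
    """Concatenate channel lists (splicing), then up-sample every channel."""
    fused = semantic + high_res          # channel-wise concatenation
    return [upsample(ch, factor) for ch in fused]

sem = [[[1]]]          # 1 semantic channel, 1x1 spatial size
hr = [[[2]], [[3]]]    # 2 high-resolution channels, 1x1 spatial size
out = decode(sem, hr, factor=2)
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 2
```

Unlike the element-wise addition used in the encoder, concatenation grows the channel count (1 + 2 = 3 here), so the subsequent 3 × 3 convolution in the text also serves to mix and reduce the concatenated channels.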
According to the text layout analysis method, the text image to be analyzed is first preprocessed, layout analysis is then performed through a semantic segmentation algorithm model, and the layout elements in the text image are determined. This achieves the technical effect that low-level features retain high resolution while more high-level semantic features are integrated, solves the problem in existing layout analysis technology that the top-down skip-connection style of feature fusion causes low-level semantic features to cover high-level semantic features so that the latter become gradually blurred, and improves the recognition effect of layout analysis.
Example four
The embodiment of the invention provides a text layout analysis method, which specifically comprises the following steps:
step 410, acquiring a text image to be analyzed, and preprocessing the text image;
step 420, inputting the text image into a semantic segmentation algorithm model for layout analysis to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for fusing, by element-wise addition, the high-level semantic features of different stages in the residual network model with the high-resolution semantic features in the high-resolution network branch;
and the decoding stage is used for up-sampling the high-level semantic features extracted at the last stage of the encoding stage and fusing them, by concatenation, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
optionally, the down-sampling may be performed in a manner that a bottleeck module in the network element in Resnet-50 is replaced, which is specifically as follows:
as shown in fig. 5, the 1 st bottleneck residual module in the first 3 network elements in Resnet-50 down samples the input text image features, including: performing channel expansion on the input text image features by adopting a 1 × 1 convolutional layer with the step length of 1, performing 2-time down-sampling operation on the input text image features after the channel expansion by adopting a maximum pooling layer or an average pooling layer, and extracting the input text image features after first processing; meanwhile, the input text image features are subjected to channel dimensionality reduction sequentially through a 1 × 1 convolutional layer with the step length of 1 and then connected with a corresponding BN normalization layer and a Relu activation layer, a 3 × 3 convolutional layer with the step length of 2 is subjected to 2-time down-sampling and then connected with the corresponding BN normalization layer and the Relu activation layer, channel expansion is performed through the 1 × 1 convolutional layer with the step length of 1 and then connected with the BN normalization layer, and the input text image features after second processing are extracted; and accessing the first processed input text image characteristic and the second processed input text image characteristic into a Relu activation layer in a residual error mode, and extracting the down-sampled characteristic.
According to the text layout analysis method, the text image to be analyzed is first preprocessed, layout analysis is then performed through a semantic segmentation algorithm model, and the layout elements in the text image are determined. This achieves the technical effect that low-level features retain high resolution while more high-level semantic features are integrated, solves the problem in existing layout analysis technology that the top-down skip-connection style of feature fusion causes low-level semantic features to cover high-level semantic features so that the latter become gradually blurred, and improves the recognition effect of layout analysis.
Example five
The embodiment of the invention provides a text layout analysis device, which comprises:
the preprocessing module is used for acquiring a text image to be analyzed and preprocessing the text image;
the layout analysis module is used for inputting the text image into a semantic segmentation algorithm model for layout analysis so as to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for carrying out feature fusion on high semantic features of different stages in the residual error network model and high-resolution semantic features in the high-resolution network branches in an element addition mode;
and the decoding stage is used for upsampling the high semantic features extracted in the last stage of the encoding stage and fusing them, in a splicing (concatenation) manner, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
The preprocessing module is used for preprocessing the text image, and specifically comprises: performing random data enhancement operation on the text image in a deep network model training stage; and carrying out image normalization processing and Gaussian bilateral filtering processing on the training and testing text images.
The training process of the semantic segmentation algorithm model specifically comprises the following steps: preprocessing a sample image; inputting the preprocessed sample image into the semantic segmentation algorithm model for training; wherein the preprocessing of the sample image comprises a random combination of at least one of the following data enhancement methods: random rotation, random scaling, random clipping, random flipping, random contrast/brightness enhancement, random RGB-grayscale-RGB color space conversion, random addition of different gaussian or salt-and-pepper noise, image normalization, and gaussian bilateral filtering.
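As a rough illustration, the random-combination preprocessing described above might look like the following sketch. The op names mirror the patent's list, but the implementations (horizontal flip, a single 90-degree rotation, [0, 1] normalization) are simplified stand-ins, not the patent's actual pipeline:

```python
# Hypothetical sketch of the random data-enhancement pipeline; images are
# plain nested lists (H x W) of grey values for illustration only.
import random

def hflip(img):
    """Random flipping stand-in: mirror each row."""
    return [row[::-1] for row in img]

def rot90(img):
    """Random rotation stand-in: one 90-degree turn."""
    return [list(row) for row in zip(*img[::-1])]

def normalize(img):
    """Image normalization: scale grey values into [0, 1]."""
    return [[p / 255.0 for p in row] for row in img]

def augment(img, rng):
    """Apply a random combination of at least one enhancement op,
    then always normalize, as the training-stage preprocessing does."""
    ops = [hflip, rot90]
    for op in rng.sample(ops, k=rng.randint(1, len(ops))):
        img = op(img)
    return normalize(img)

rng = random.Random(0)
out = augment([[0, 128], [255, 64]], rng)
```

Gaussian bilateral filtering and the remaining ops would slot in as further entries in `ops`; the structure (sample a subset, apply in sequence, normalize last) is the point of the sketch.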
Wherein, the encoding stage is composed of a residual network model and a DenseASPP model, and high semantic features of different stages are extracted through the residual network model as follows: the residual network model Resnet-50 contains 4 network units, and each network unit extracts the high semantic features of the corresponding stage. Specifically, the 4 network units of the different stages each comprise several bottleneck residual modules; the 1st bottleneck residual module in each of the first 3 network units downsamples the input text image features to extract high semantic features at a different resolution from the unit's input, the subsequent bottleneck residual modules in each unit extract further high semantic features, and the high semantic features extracted by the current network unit are fed both to the next network unit and to the corresponding feature fusion unit as the high semantic features output by that stage; the bottleneck residual modules in the fourth network unit adopt dilated/hole convolution to expand the receptive field while keeping the fourth-stage feature resolution; finally, the high semantic features output by the fourth network unit are input into the DenseASPP model for a multi-scale feature fusion operation to extract the high semantic features.
After all downsampling, the feature resolution of the Resnet-50 network is finally reduced to 1/32 of the original input image resolution.
Specifically, the encoding stage performs feature fusion, by element addition, on the high semantic features of different stages in the residual network model and the high-resolution semantic features in the high-resolution network branch, as follows: the high-resolution features of the high-resolution network branch pass through a 3 × 3 convolution layer and a BN (batch normalization) regularization layer to obtain the high-resolution features to be fused; meanwhile, the high semantic features output by the first stage of the residual network model pass through a 1 × 1 convolutional layer and a BN regularization layer, which reduce their channel count to match that of the high-resolution features to be fused, and are upsampled by bilinear interpolation so that their size matches the high-resolution features to be fused, giving the processed first-stage high-resolution fusion semantic features. These are fused with the high-resolution features to be fused by element addition to obtain the fused high-resolution semantic features; after a Relu activation layer, the fused high-resolution semantic features pass sequentially through a 3 × 3 convolution layer, a BN regularization layer and a Relu activation layer, and the result serves as the high-resolution semantic features input to the second stage. Repeating these steps yields the high semantic features output by the fourth stage of the residual network model and the high-resolution features output by the third feature fusion unit. The high semantic features output by the fourth stage are input into the DenseASPP model for processing, upsampled, and fused with the high-resolution semantic features output by the third feature fusion unit in a splicing manner, and the fused output image features are determined through a 3 × 3 convolutional layer and a BN + Relu regularization-activation layer.
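The per-stage element-addition fusion above can be sketched on toy single-channel feature maps. Nearest-neighbour upsampling stands in for the bilinear interpolation, and the 1 × 1 convolution/BN channel alignment is omitted, so this illustrates only the spatial alignment plus element addition, not the full fusion unit:

```python
def upsample2x(feat):
    # Nearest-neighbour 2x upsampling: a simplified stand-in for the
    # bilinear interpolation used before the element addition.
    out = []
    for row in feat:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse_add(high_res, low_res):
    # Bring the lower-resolution stage features up to the size of the
    # high-resolution branch, then fuse by element-wise addition.
    up = upsample2x(low_res)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(high_res, up)]

high = [[1, 1, 1, 1]] * 4          # 4x4 high-resolution branch features
low = [[2, 2], [2, 2]]             # 2x2 stage features, half resolution
fused = fuse_add(high, low)        # every element becomes 1 + 2 = 3
```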
Optionally, before performing feature fusion on the high semantic features of different stages in the residual network model and the high semantic features in the high-resolution network branch in an element addition manner, the encoding stage further includes: passing the text image features through 3 sequentially connected operations of 3 × 3 convolutional layer + BN normalization layer + Relu activation layer, downsampling the feature map to 1/4 of the original resolution of the text image.
Optionally, the decoding stage, which upsamples the high semantic features extracted in the last stage of the encoding stage and fuses them in a splicing manner with the high-resolution features output by the last feature fusion unit of the encoding stage, further includes: performing 2 operations of 3 × 3 convolutional layer + BN normalization layer + Relu activation layer on the fused output image features, and performing a 4-times bilinear interpolation up-sampling operation to restore the resolution of the output feature map to the resolution of the text image.
Wherein, the 1st bottleneck residual module in each of the first 3 network units of Resnet-50 downsamples the input text image features as follows: a 1 × 1 convolutional layer with stride 1 performs channel expansion on the input text image features, and a maximum pooling layer or an average pooling layer then performs a 2× downsampling operation on the channel-expanded features, extracting the first processed input text image features; meanwhile, the input text image features pass sequentially through a 1 × 1 convolutional layer with stride 1 (channel dimensionality reduction) connected to a BN normalization layer and a Relu activation layer, a 3 × 3 convolutional layer with stride 2 (2× downsampling) connected to a BN normalization layer and a Relu activation layer, and a 1 × 1 convolutional layer with stride 1 (channel expansion) connected to a BN normalization layer, extracting the second processed input text image features; finally, the first and second processed input text image features are combined in residual fashion and fed into a Relu activation layer, extracting the downsampled features.
Optionally, after the text image is input into the semantic segmentation algorithm model for layout analysis to determine the layout elements in the text image, the method further includes: binarizing the pixels of the determined layout elements, then applying morphological operations such as dilation and erosion together with minimum-area and length/width/height threshold conditions, so as to determine the category and the region bounding-box coordinates of each layout element.
According to the text layout analysis method, the text image to be analyzed is first preprocessed and then analyzed by a semantic segmentation algorithm model to determine the layout elements in the text image. This achieves the technical effect that low-layer features retain high resolution while integrating more high-level semantic features, solves the problem in existing layout analysis techniques that the top-down layer-connected feature integration lets low-layer semantic features cover the high-layer semantic features and gradually blur them, and improves the recognition performance of layout analysis.
EXAMPLE six
Fig. 6 is a schematic structural diagram of an apparatus/terminal/server according to embodiment 6 of the present invention, as shown in fig. 6, the apparatus/terminal/server includes a processor 610, a memory 620, an input device 630, and an output device 640; the number of the processors 610 in the device/terminal/server may be one or more, and one processor 610 is taken as an example in fig. 6; the processor 610, the memory 620, the input device 630 and the output device 640 in the apparatus/terminal/server may be connected by a bus or other means, and fig. 6 illustrates the example of connection by a bus.
The memory 620, as a computer-readable storage medium, may be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to text layout analysis (e.g., preprocessing module, layout analysis module in a text layout analysis apparatus) in embodiments of the present invention. The processor 610 executes various functional applications of the device/terminal/server and data processing by executing software programs, instructions and modules stored in the memory 620, that is, implements the text layout analysis method described above.
The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 620 may further include memory located remotely from the processor 610, which may be connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 630 may be used to receive an input text image and generate key signal inputs related to user settings and function control of the device/terminal/server. The output device 640 may include a display device such as a display screen.
EXAMPLE seven
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions are executed by a computer processor to perform a text layout analysis method, and the method includes:
acquiring a text image to be analyzed, and preprocessing the text image;
inputting the text image into a semantic segmentation algorithm model for layout analysis to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for carrying out feature fusion on high semantic features of different stages in the residual error network model and high-resolution semantic features in the high-resolution network branches in an element addition mode;
and the decoding stage is used for upsampling the high semantic features extracted in the last stage of the encoding stage and fusing them, in a splicing manner, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the operations of the method described above, and may also perform related operations in the text layout analysis method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the above text layout analysis apparatus, the included units and modules are divided merely according to functional logic; the division is not limited thereto as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for convenience of distinguishing them from each other and do not limit the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (13)

1. A text layout analysis method is characterized by comprising the following steps:
acquiring a text image to be analyzed, and preprocessing the text image;
inputting the text image into a semantic segmentation algorithm model for layout analysis to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for carrying out feature fusion on high semantic features of different stages in the residual error network model and high-resolution semantic features in the high-resolution network branches in an element addition mode;
and the decoding stage is used for upsampling the high semantic features extracted in the last stage of the encoding stage and fusing them, in a splicing manner, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
2. The method of claim 1, wherein preprocessing the text image comprises:
performing random data enhancement operation on the text image in a deep network model training stage;
and carrying out image normalization processing and Gaussian bilateral filtering processing on the training and testing text images.
3. The method according to claim 1, wherein the encoding stage is composed of a residual network model and a DenseASPP model, and then extracting high semantic features of different stages through the residual network model comprises:
the residual network model Resnet-50 contains 4 network elements;
each network unit is used for extracting high semantic features of corresponding stages;
the extracting of the high semantic features of the corresponding stage comprises the following steps:
the network units in 4 different stages in Resnet-50 respectively comprise a plurality of bottleneck residual error modules;
the 1 st bottleneck residual error module in the previous 3 network units of Resnet-50 can carry out down-sampling on the input text image characteristics so as to update the resolution of the input characteristics of the current network unit, wherein the subsequent bottleneck residual error module in each unit extracts high semantic characteristics, and the high semantic characteristics extracted by the current network unit are respectively input into the next network unit and the characteristic fusion unit and serve as the high semantic characteristics output at the first stage;
a bottleneck residual error module in a fourth network unit in Resnet-50 adopts expansion/hole convolution operation to expand the receptive field while keeping the fourth stage characteristic resolution;
and inputting the high semantic features output by the fourth network unit in Resnet-50 into a DenseASPP model for multi-scale feature fusion operation so as to extract the high semantic features.
4. The method of claim 3, wherein the residual network model Resnet-50 comprises 4 network elements, and the final Resnet-50 network down-samples the feature resolution to 1/32 of the original input image resolution.
5. The method according to claim 1, wherein the encoding stage is configured to perform feature fusion in an element-addition manner on high semantic features of different stages in the residual network model and high semantic features in the high-resolution network branches, and includes:
after the high-resolution features of the high-resolution network branches pass through a 3 x 3 convolution layer and a BN (batch normalization) regularization layer, obtaining the high-resolution features to be fused;
meanwhile, enabling the high semantic features output in the first stage in a residual network model to pass through a 1 x 1 convolutional layer and a BN regularization layer, reducing the dimension of the number of channels of the high semantic features to be consistent with the number of channels of the high-resolution features to be fused, and keeping the size of the high semantic features consistent with the size of the high-resolution semantic features to be fused through bilinear interpolation upsampling, to obtain the processed first-stage high-resolution fusion semantic features;
fusing the processed first-stage high-resolution fusion semantic features and the high-resolution semantic features to be fused in an element addition mode to obtain fusion high-resolution semantic features; after passing through the Relu activation layer, the fused high-resolution semantic features are sequentially input into a 3 x 3 convolution layer, a BN regularization layer and the Relu activation layer for processing, and then the processed high-resolution semantic features are used as high-resolution semantic features input in the second stage; the steps are circulated, high semantic features output by a fourth stage in the residual error network model and high-resolution semantic features output by a third feature fusion unit are obtained;
and inputting the high semantic features output by the fourth stage into a DenseASPP model for processing, performing feature fusion with the high-resolution semantic features output by the third feature fusion unit in a splicing manner after upsampling, and determining fused output image features through a 3 x 3 convolution layer, a BN regularization layer and a Relu activation layer.
6. The method according to claim 1, wherein the encoding stage, before performing feature fusion by adding elements of the high semantic features of different stages in the residual network model and the high semantic features in the high-resolution network branches, further comprises:
and passing the text image features through 3 sequentially connected operations of 3 × 3 convolutional layer + BN normalization layer + Relu activation layer, downsampling the feature map to 1/4 of the original resolution of the text image.
7. The method according to claim 1, wherein the decoder stage is configured to perform feature fusion in a splicing manner on the high semantic features extracted in the last stage of the encoding stage after upsampling the high semantic features and the high resolution features output by the last feature fusion unit in the encoding stage, and further includes:
and performing 2 operations of 3 × 3 convolutional layer + BN normalization layer + Relu activation layer on the fused output image features, and performing a 4-times bilinear interpolation up-sampling operation to restore the resolution of the output feature map to the resolution of the text image.
8. The method as claimed in claim 3, wherein the 1 st bottleneck residual module in the first 3 network elements of Resnet-50 down samples the input text image features, comprising:
performing channel expansion on the input text image features by adopting a 1 × 1 convolutional layer with the step length of 1, performing 2-time down-sampling operation on the input text image features after the channel expansion by adopting a maximum pooling layer or an average pooling layer, and extracting the input text image features after first processing;
meanwhile, the input text image features are subjected to channel dimensionality reduction sequentially through a 1 × 1 convolutional layer with the step length of 1 and then connected with a corresponding BN normalization layer and a Relu activation layer, a 3 × 3 convolutional layer with the step length of 2 is subjected to 2-time down-sampling and then connected with the corresponding BN normalization layer and the Relu activation layer, channel expansion is performed through the 1 × 1 convolutional layer with the step length of 1 and then connected with the BN normalization layer, and the input text image features after second processing are extracted;
and accessing the first processed input text image characteristic and the second processed input text image characteristic into a Relu activation layer in a residual error mode, and extracting the down-sampled characteristic.
9. The method according to claim 1, further comprising a training process of the semantic segmentation algorithm model, specifically comprising:
preprocessing an input text image;
inputting the preprocessed text image into the semantic segmentation algorithm model for training;
wherein the preprocessing of the input text image comprises a random combination of at least one of the following data enhancement methods:
random rotation, random scaling, random clipping, random flipping, random contrast/brightness enhancement, random RGB-grayscale-RGB color space conversion, random addition of different gaussian or salt-and-pepper noise, image normalization, and gaussian bilateral filtering.
10. The method of claim 1, wherein after inputting the text image into a semantic segmentation algorithm model for layout analysis to determine layout elements in the text image, further comprising:
and binarizing the pixels of the determined layout elements, then applying morphological operations such as dilation and erosion together with minimum-area and length/width/height threshold conditions, so as to determine the category and the region bounding-box coordinates of each layout element.
11. A text layout analysis apparatus, comprising:
the preprocessing module is used for acquiring a text image to be analyzed and preprocessing the text image;
the layout analysis module is used for inputting the text image into a semantic segmentation algorithm model for layout analysis so as to determine layout elements in the text image;
wherein the semantic segmentation algorithm model comprises an encoding stage and a decoding stage;
the encoding stage is used for carrying out feature fusion on high semantic features of different stages in the residual error network model and high-resolution semantic features in the high-resolution network branches in an element addition mode;
and the decoding stage is used for upsampling the high semantic features extracted in the last stage of the encoding stage and fusing them, in a splicing manner, with the high-resolution semantic features output by the last feature fusion unit of the encoding stage, so as to determine the layout elements in the text image.
12. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the text layout analysis method of any of claims 1-10 when executing the program.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method for text layout analysis according to any one of claims 1 to 10.
CN202010635621.2A 2020-07-03 2020-07-03 Text layout analysis method, device, equipment and medium Pending CN111914654A (en)


Publications (1)

Publication Number Publication Date
CN111914654A true CN111914654A (en) 2020-11-10



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032998A (en) * 2019-03-18 2019-07-19 华南师范大学 Character detecting method, system, device and the storage medium of natural scene picture
CN110837811A (en) * 2019-11-12 2020-02-25 腾讯科技(深圳)有限公司 Method, device and equipment for generating semantic segmentation network structure and storage medium
US20200074271A1 (en) * 2018-08-29 2020-03-05 Arizona Board Of Regents On Behalf Of Arizona State University Systems, methods, and apparatuses for implementing a multi-resolution neural network for use with imaging intensive applications including medical imaging
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韩慧慧; 李帷韬; 王建平; 焦点; 孙百顺: "Semantic segmentation with an encoder-decoder structure" (编码―解码结构的语义分割), Journal of Image and Graphics (中国图象图形学报), no. 02, 16 February 2020 (2020-02-16) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634289A (en) * 2020-12-28 2021-04-09 华中科技大学 Rapid feasible domain segmentation method based on asymmetric void convolution
CN112634289B (en) * 2020-12-28 2022-05-27 华中科技大学 Rapid feasible domain segmentation method based on asymmetric void convolution
CN113807218A (en) * 2021-09-03 2021-12-17 科大讯飞股份有限公司 Layout analysis method, layout analysis device, computer equipment and storage medium
CN113807218B (en) * 2021-09-03 2024-02-20 科大讯飞股份有限公司 Layout analysis method, device, computer equipment and storage medium
CN115205164A (en) * 2022-09-15 2022-10-18 腾讯科技(深圳)有限公司 Training method of image processing model, video processing method, device and equipment
CN115205164B (en) * 2022-09-15 2022-12-13 腾讯科技(深圳)有限公司 Training method of image processing model, video processing method, device and equipment
CN116665228A (en) * 2023-07-31 2023-08-29 恒生电子股份有限公司 Image processing method and device
CN116665228B (en) * 2023-07-31 2023-10-13 恒生电子股份有限公司 Image processing method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination