CN116704537B - Lightweight pharmacopoeia picture and text extraction method - Google Patents

Info

Publication number
CN116704537B
CN116704537B (application CN202211539551.6A)
Authority
CN
China
Prior art keywords
pharmacopoeia
input
stage
characteristic
low
Prior art date
Legal status
Active
Application number
CN202211539551.6A
Other languages
Chinese (zh)
Other versions
CN116704537A (en)
Inventor
李朋 (Li Peng)
于硕 (Yu Shuo)
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202211539551.6A
Publication of CN116704537A
Application granted
Publication of CN116704537B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19127 Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of visual document understanding and discloses a lightweight pharmacopoeia picture and text extraction method comprising two key steps. 1) Constructing a pharmacopoeia-feature lightweight focusing module: low-rank neural network layers are first built from the principal components of full-rank network weights, and a focusing strategy is then designed to extract key information from the input features. 2) Constructing a pharmacopoeia document information recognition and extraction network: eight pharmacopoeia-feature lightweight focusing modules are connected in series to form a network backbone, and a multi-stage encoder is constructed to extract pharmacopoeia data feature embeddings; eight further pharmacopoeia-feature lightweight focusing modules are then connected in series to form a backbone for a multi-stage decoder that converts the pharmacopoeia data information into text of a specific format, realizing pharmacopoeia electronization; finally, a cross-entropy loss measures the difference between the pharmacopoeia data text extracted by the decoder and the original pharmacopoeia data text, and the network parameters are optimized by minimizing this loss.

Description

Lightweight pharmacopoeia picture and text extraction method
Technical Field
The invention belongs to the technical field of visual document understanding, and relates to a lightweight pharmacopoeia picture and text extraction method.
Background
Currently, the great variety and large scale of pharmaceutical agents make pharmacopoeias difficult to manage and maintain. Meanwhile, the vigorous development of the biomedical industry and the rapid growth of novel pharmaceutical preparations further increase the difficulty of pharmacopoeia management. Realizing pharmacopoeia electronization with information technology is a promising way to reform pharmacopoeia management. However, pharmacopoeia electronization still faces great challenges, mainly the low speed, low degree of structuring and low degree of integration of data acquisition in the pharmacopoeia data-collection process, which make it difficult to efficiently manage and exploit the effective knowledge in pharmacopoeias. There is therefore an urgent need for new methods that advance pharmacopoeia electronization more effectively.
Information extraction means recognizing and extracting character information from document image data; it is a key task of visual document understanding and arises widely in data electronization. Conventional information extraction methods generally rely on optical character recognition (OCR): document materials are scanned, the scanned images are analyzed to obtain text content, and character recognition and layout recovery are then performed with algorithms such as image classification or template matching. In recent years, as acquired data have grown in volume and complexity, the usability of traditional OCR-based information extraction has gradually declined, and researchers have developed information extraction methods based on deep OCR. Among current deep-OCR-based information extraction methods, those combining convolutional neural networks (CNN), recurrent neural networks (RNN) and attention mechanisms are mainstream and have achieved good results in the recognition and extraction of pharmacopoeia-related documents.
However, most current OCR-dependent information extraction methods have inherent disadvantages. As a preprocessing step, OCR generally requires expensive training expenditure, incurs additional inference cost in scenarios pursuing high-quality extraction output, and may propagate its internal errors to the rest of the information extraction pipeline, degrading overall performance. In addition, current information extraction methods rely on attention mechanisms to extract key information from the data, whose computational cost is too high to meet the demand of pharmacopoeia electronization for efficient data acquisition.
In summary, the invention provides a lightweight pharmacopoeia picture text extraction method: the idea of weight principal-component approximation is used to design a low-rank pharmacopoeia-feature lightweight focusing module for efficient key-information extraction, and an effective pharmacopoeia document information recognition and extraction network is then built on the encoder-decoder idea to realize accurate and efficient electronization of pharmacopoeia data.
Disclosure of Invention
In order to solve the problems, the invention provides a lightweight pharmacopoeia picture and text extraction method, which comprises the following steps:
step 1, constructing a pharmacopoeia characteristic light focusing module
The construction of the pharmacopoeia characteristic light focusing module comprises the construction of a low-rank neural network layer and the realization of a focusing strategy;
Construction of the low-rank neural network layer: according to the tensor CP decomposition principle, the weighted computation of a neural network layer is performed with the principal components of the network weights, constructing a low-rank neural network layer;
Specifically, the low-rank neural network layer involves the layer's input feature z, output feature z′, an activation function σ, a bias vector b, importance factors λ_r, and K weight vectors w_r^(1), …, w_r^(K) of the layer. The pharmacopoeia-feature lightweight focusing module multiplies the feature z by the K weight vectors in turn, performs a weighted summation with the importance factors λ_r over the value range of the tensor rank r, superimposes the bias vector b, and finally obtains the result z′ through the activation function σ. The low-rank fully connected layer is computed as
z′ = σ( Σ_{r=1}^{R} λ_r ( w_r^(1) ∘ w_r^(2) ∘ … ∘ w_r^(K) ) z + b ),
where W = Σ_{r=1}^{R} λ_r w_r^(1) ∘ w_r^(2) ∘ … ∘ w_r^(K) is the equivalent network weight tensor, ∘ denotes the vector outer product, and R is the set value range of the tensor rank r;
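The low-rank fully connected computation above can be sketched in numpy for the K = 2 case (a matrix weight represented by rank-R factor pairs). This is an illustrative sketch, not the patented implementation; the function name and factor layout are assumptions.

```python
import numpy as np

def low_rank_linear(z, U, V, lam, b, sigma=np.tanh):
    """Low-rank fully connected layer sketch (K = 2 factor vectors).

    The d_out x d_in weight matrix is never formed explicitly; it is
    represented by rank-R CP factors: W = sum_r lam[r] * outer(U[r], V[r]).
    U: (R, d_out), V: (R, d_in), lam: (R,), b: (d_out,).
    """
    # Multiply z by each pair of weight vectors, weight the terms by the
    # importance factors lam[r], sum over the rank range, add the bias b,
    # and apply the activation sigma.
    out = sum(lam[r] * U[r] * (V[r] @ z) for r in range(len(lam)))
    return sigma(out + b)

# The layer matches sigma(W z + b) with the reconstructed equivalent W:
rng = np.random.default_rng(0)
R, d_in, d_out = 4, 8, 6
U, V = rng.normal(size=(R, d_out)), rng.normal(size=(R, d_in))
lam, b, z = rng.normal(size=R), rng.normal(size=d_out), rng.normal(size=d_in)
W = sum(lam[r] * np.outer(U[r], V[r]) for r in range(R))
assert np.allclose(low_rank_linear(z, U, V, lam, b), np.tanh(W @ z + b))
```

The factorized form stores R(d_in + d_out) + R values instead of d_in · d_out, which is the source of the parameter reduction claimed for the module.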
The low-rank convolution layer is computed as
z′(i1, i2, i3) = σ( Σ_{j1, j2, j3} W(j1, j2, j3, i3) · z(i1 + j1 − 1, i2 + j2 − 1, j3) + b(i3) ),
with the convolution kernel in CP-factorized form, W(j1, j2, j3, i3) = Σ_{r=1}^{R} λ_r w_r^(1)(j1) w_r^(2)(j2) w_r^(3)(j3) w_r^(4)(i3),
where z′(i1, i2, i3) and z(j1, j2, j3) are the output elements and input elements of the low-rank convolution layer, respectively; i1, i2, i3 are the subscripts of elements in the output feature, whose value ranges are the dimensions of the output feature; j1, j2, j3 are the subscripts of the convolution kernel, whose value ranges are the dimensions of the convolution kernel;
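A CP-factorized convolution can be sketched for a single channel: a rank-R k×k kernel stored as 1-D factors is applied as a row pass followed by a column pass, costing O(k) per pixel per rank term instead of O(k²). This is a hedged numpy sketch under the single-channel assumption; the function name is illustrative, not the patent's layer.

```python
import numpy as np

def cp_conv2d_single(x, w1, w2, lam):
    """Rank-R separable 2D correlation on a single channel (valid padding).

    The k x k kernel is never materialised; it is stored as CP factors,
    K = sum_r lam[r] * outer(w1[r], w2[r]).
    """
    out = None
    for r in range(len(lam)):
        # row pass with the horizontal factor w2[r] (correlation is
        # convolution with the reversed vector), then column pass with w1[r]
        rows = np.array([np.convolve(row, w2[r][::-1], mode="valid") for row in x])
        cols = np.array([np.convolve(c, w1[r][::-1], mode="valid") for c in rows.T]).T
        out = lam[r] * cols if out is None else out + lam[r] * cols
    return out
```

The two 1-D passes per rank term reproduce exactly the correlation with the reconstructed 2-D kernel, which can be checked against a naive sliding-window reference.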
Implementation of the focusing strategy: the pharmacopoeia-feature lightweight focusing module comprises a low-rank fully connected layer f that maps the input feature, L low-rank convolution layers c that extract a multi-layer representation of the input feature, L gating factors G_l, and low-rank fully connected layers h and q that map the modulation feature and the query feature, respectively. The module cascades the low-rank convolution layers to map the input feature into a multi-layer representation; it then fuses the multi-layer representation with a gating mechanism to obtain the multi-layer integrated feature of the input; finally, two low-rank fully connected layers map the input feature and the multi-layer integrated feature into a query feature and a modulation feature, respectively, and element-wise multiplication of the query feature and the modulation feature yields the key information of the input feature;
Specifically: given the input feature z, the pharmacopoeia-feature lightweight focusing module obtains an initial representation z_0 = f(z) of the input via the low-rank fully connected layer; the cascade of L low-rank convolution layers then yields the multi-layer representations z_l = c(z_{l−1}), l = 1, 2, …, L; the L gating factors G_l are multiplied element-wise with the corresponding representations z_l and the results are superimposed to obtain the multi-layer integrated feature; finally, a low-rank fully connected layer maps the multi-layer integrated feature to a modulation feature and another maps the original input feature z to a query feature in a common feature space, and the element-wise product of the two yields the key information of the input feature, i.e. the focused feature Z. The computation is:
z_0 = f(z);  z_l = c(z_{l−1}), l = 1, …, L;  z_sum = Σ_{l=1}^{L} G_l ⊙ z_l;  Z = q(z) ⊙ h(z_sum),
where ⊙ denotes element-wise multiplication. By using the low-rank fully connected layer and low-rank convolution layers to extract the key information of the input feature, the pharmacopoeia-feature lightweight focusing module effectively reduces the number of model parameters while preserving the extraction quality, improving the module's operating efficiency;
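The focusing strategy above can be sketched with the sub-layers passed in as callables (plain functions stand in for the low-rank layers; names are illustrative assumptions):

```python
import numpy as np

def focusing_module(z, f, convs, gates, h, q):
    """Sketch of the focusing strategy.

    f, h, q stand in for the three low-rank fully connected layers;
    convs for the L cascaded low-rank convolution layers; gates are the
    L gating factors G_l.
    """
    reps = [f(z)]                    # initial representation z_0 = f(z)
    for c in convs:                  # multi-layer representations z_l = c(z_{l-1})
        reps.append(c(reps[-1]))
    # gated fusion of z_1 .. z_L into the multi-layer integrated feature
    z_sum = sum(g * r for g, r in zip(gates, reps[1:]))
    # query feature (from z) modulated element-wise by h(z_sum)
    return q(z) * h(z_sum)
```

With identity stand-ins for f, h and q and two toy "convolutions", the data flow is easy to trace by hand, e.g. z = [1, 2] with convs [v ↦ v + 1, v ↦ 2v] and unit gates gives z_sum = 3(z + 1) and output z ⊙ z_sum.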
step 2, constructing a pharmacopoeia document information identification extraction network
The pharmacopoeia document information recognition and extraction network comprises an encoder and a decoder. The encoder comprises five computation stages: the first stage converts the input H×W×3 pharmacopoeia image into a sequence of HW/16 vectors of length 48; the second stage comprises 2 pharmacopoeia-feature lightweight focusing modules, takes the HW/16 × 48 two-dimensional sequence as input, and converts it into HW/16 output features of length 128; the third stage comprises 2 pharmacopoeia-feature lightweight focusing modules and converts the HW/16 × 128 output of the second stage into HW/64 output features of length 256; the fourth stage comprises 14 pharmacopoeia-feature lightweight focusing modules and converts the HW/64 × 256 features into HW/256 output features of length 512; the fifth stage comprises 2 pharmacopoeia-feature lightweight focusing modules and converts the HW/256 × 512 output of the fourth stage into HW/1024 output features of length 1024. The decoder comprises four computation stages: the first stage takes the encoder output features (HW/1024 × 1024) as input and converts them into output features of length 1024; the second, third and fourth stages perform the same computation as the first. The specific construction process is as follows:
construction of an encoder: the encoder comprises five stages connected end to end, refines the input pharmacopoeia image data to be processed stage by stage, and extracts characteristic information contained in the pharmacopoeia image data;
The first stage is a block-division stage. Given input pharmacopoeia image data x to be processed, with height H, width W and 3 channels, the block-division stage divides the input image into non-overlapping blocks of size 4×4×3; each block has dimension 4×4×3 = 48 and the number of blocks is (H/4)×(W/4) = HW/16, i.e. the input pharmacopoeia image data to be processed is converted into an HW/16 × 48 two-dimensional sequence;
The second stage takes the output of the first stage as input and passes sequentially through a focusing-feature extraction step, a cyclic-shift step, and a second focusing-feature extraction step; it comprises a low-rank fully connected layer and two pharmacopoeia-feature lightweight focusing modules. Specifically: first, a low-rank fully connected layer maps each 48-dimensional block to 128 dimensions, yielding an HW/16 × 128 two-dimensional linear embedding sequence; a pharmacopoeia-feature lightweight focusing module then extracts the focused features of this sequence; the original block-division boundary is then cyclically shifted by half a block along the block diagonal direction to realize information interaction between blocks; finally, a second pharmacopoeia-feature lightweight focusing module extracts the focused features under the new block division as the feature embedding output by this stage;
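The half-block cyclic shift can be sketched with `np.roll` on a 2-D feature map (a hedged sketch assuming block size 4 and a diagonal shift of 2; function names are illustrative, and the shifted-window bookkeeping of the real network is omitted):

```python
import numpy as np

def cyclic_shift(feat, block=4):
    """Shift the map by half a block along the diagonal, so that a fixed
    block grid covers new cross-block neighbourhoods."""
    s = block // 2
    return np.roll(feat, shift=(-s, -s), axis=(0, 1))

def inverse_cyclic_shift(feat, block=4):
    """Undo the shift, restoring the original block-division boundary."""
    s = block // 2
    return np.roll(feat, shift=(s, s), axis=(0, 1))
```

Because `np.roll` wraps elements around, the shift is lossless and exactly invertible, which is what lets the stage alternate between the two block divisions without discarding border content.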
The third stage takes the output of the second stage as input and comprises a low-rank fully connected layer and two pharmacopoeia-feature lightweight focusing modules. First, adjacent 2×2 blocks in the input are spliced, so that the number of blocks is reduced from HW/16 to HW/64 while the block dimension increases to 512; a low-rank fully connected layer then reduces each block dimension to 256; finally, the same focusing-feature extraction, cyclic shift, and focusing-feature extraction flow as in the second stage computes the HW/64 × 256 feature embedding;
The fourth and fifth stages follow the same flow as the third stage; the feature embedding output by the fifth stage is the final output of the encoder;
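The shape bookkeeping of the five encoder stages can be summarized in a short sketch (the function name and 2×2-merging assumption for stages three to five follow the description above; dimensions are from the text):

```python
def encoder_shapes(H, W):
    """Return (sequence length, feature dimension) after each encoder stage
    for an H x W x 3 input: 4x4 patches in stage 1, then feature dims
    128/256/512/1024, with 2x2 block merging from stage 3 onward."""
    n = (H // 4) * (W // 4)          # stage 1: HW/16 blocks of dimension 48
    shapes = [(n, 48)]
    for i, d in enumerate([128, 256, 512, 1024]):
        if i > 0:                    # stages 3-5 merge 2x2 adjacent blocks
            n //= 4
        shapes.append((n, d))
    return shapes
```

For a 224×224 input this yields 3136 × 48, 3136 × 128, 784 × 256, 196 × 512 and 49 × 1024, matching the HW/16 through HW/1024 sequence lengths stated above.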
construction of the decoder: the decoder takes the output of the encoder as input and comprises four stages connected end to end, converts the key information extracted by the encoder into text data conforming to a specific format, and realizes the identification and extraction of the pharmacopoeia document information;
The first stage takes the output of the encoder as input and comprises two pharmacopoeia-feature lightweight focusing modules and two low-rank fully connected layers. First, two low-rank fully connected layers map the position information and the input features into the same dimension space so that they can be combined; two consecutive pharmacopoeia-feature lightweight focusing modules then refine the position-aware input features; finally, two consecutive low-rank fully connected layers expand the refined feature dimension to 4 times and then restore the original dimension, a scaling process that effectively fuses the feature's internal information and generates the stage's output features;
The second, third and fourth stages take the output of the previous stage as input and further integrate the feature's internal information with two consecutive pharmacopoeia-feature lightweight focusing modules and two consecutive low-rank fully connected layers; the output features of the fourth stage are mapped through a low-rank fully connected layer to the same dimension as the encoder output, yielding the text data conforming to the specific format;
step 3, calculating network model loss
The prediction loss of the pharmacopoeia image data feature extraction process is measured, and minimizing it drives the optimization of the parameters of the pharmacopoeia document information recognition and extraction network. Specifically, the prediction loss L_ce measures the difference between the pharmacopoeia data text predicted by the network decoder and the original pharmacopoeia data text, forcing the encoder and decoder to accurately learn the pharmacopoeia image data information. The prediction loss is calculated as:
L_ce = −(1/N) Σ_{i=1}^{N} y_i log ŷ_i,
where y_i and ŷ_i are the i-th original pharmacopoeia data text and the predicted pharmacopoeia data text, respectively, and N is the total number of pharmacopoeia data.
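A cross-entropy prediction loss of this shape can be sketched as follows (a hedged numpy sketch over token indices; the function name and the (N, V) probability layout are assumptions, not the patent's exact formulation):

```python
import numpy as np

def prediction_loss(probs, targets):
    """Cross-entropy L_ce = -(1/N) * sum_i log p_i(y_i).

    probs: (N, V) predicted distributions over the text vocabulary;
    targets: (N,) indices of the original pharmacopoeia text tokens.
    """
    n = len(targets)
    # pick each row's probability of the true token; the small epsilon
    # guards log(0) for badly calibrated predictions
    picked = probs[np.arange(n), targets]
    return -np.mean(np.log(picked + 1e-12))
```

A confident correct prediction contributes nearly zero loss while an uncertain one contributes -log of its assigned probability, which is what pushes the encoder and decoder toward accurate text reconstruction.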
Drawings
Fig. 1 is a flow chart of a lightweight pharmacopoeia picture text extraction method;
figure 2 is a frame diagram of a lightweight pharmacopoeia picture text extraction method.
Detailed Description
Embodiments of the present invention will be further described with reference to the accompanying drawings.
Fig. 2 is a frame diagram of the lightweight pharmacopoeia picture text extraction method. The original pharmacopoeia image data is first fed into the encoder of the pharmacopoeia document information recognition and extraction network, and the key information contained in the input image is extracted by the pharmacopoeia-feature lightweight focusing modules in the encoder. The key information extracted by the encoder is then converted by the decoder of the network and mapped into text data conforming to a specific format, realizing the recognition and extraction of the pharmacopoeia document information. Finally, the model loss is computed with the prediction loss function to guide the optimization learning of all network parameters and improve the accuracy of the network's extraction.
The method comprises the following steps:
step 1, constructing a pharmacopoeia characteristic light focusing module
The construction of the pharmacopoeia characteristic light focusing module comprises the construction of a low-rank neural network layer and the realization of a focusing strategy;
Construction of the low-rank neural network layer: according to the tensor CP decomposition principle, the weighted computation of a neural network layer is performed with the principal components of the network weights, constructing a low-rank neural network layer;
Specifically, the low-rank neural network layer involves the layer's input feature z, output feature z′, an activation function σ, a bias vector b, importance factors λ_r, and K weight vectors w_r^(1), …, w_r^(K) of the layer. The pharmacopoeia-feature lightweight focusing module multiplies the feature z by the K weight vectors in turn, performs a weighted summation with the importance factors λ_r over the value range of the tensor rank r, superimposes the bias vector b, and finally obtains the result z′ through the activation function σ. The low-rank fully connected layer is computed as
z′ = σ( Σ_{r=1}^{R} λ_r ( w_r^(1) ∘ w_r^(2) ∘ … ∘ w_r^(K) ) z + b ),
where W = Σ_{r=1}^{R} λ_r w_r^(1) ∘ w_r^(2) ∘ … ∘ w_r^(K) is the equivalent network weight tensor, ∘ denotes the vector outer product, and R is the set value range of the tensor rank r.
The low-rank convolution layer is computed as
z′(i1, i2, i3) = σ( Σ_{j1, j2, j3} W(j1, j2, j3, i3) · z(i1 + j1 − 1, i2 + j2 − 1, j3) + b(i3) ),
with the convolution kernel in CP-factorized form, W(j1, j2, j3, i3) = Σ_{r=1}^{R} λ_r w_r^(1)(j1) w_r^(2)(j2) w_r^(3)(j3) w_r^(4)(i3),
where z′(i1, i2, i3) and z(j1, j2, j3) are the output elements and input elements of the low-rank convolution layer, respectively. i1, i2, i3 are the subscripts of elements in the output feature, whose value ranges are the dimensions of the output feature. j1, j2, j3 are the subscripts of the convolution kernel, whose value ranges are the dimensions of the convolution kernel.
Implementation of the focusing strategy: the pharmacopoeia-feature lightweight focusing module comprises a low-rank fully connected layer f that maps the input feature, L low-rank convolution layers c that extract a multi-layer representation of the input feature, L gating factors G_l, and low-rank fully connected layers h and q that map the modulation feature and the query feature, respectively. The module cascades the low-rank convolution layers to map the input feature into a multi-layer representation; it then fuses the multi-layer representation with a gating mechanism to obtain the multi-layer integrated feature of the input; finally, two low-rank fully connected layers map the input feature and the multi-layer integrated feature into a query feature and a modulation feature, respectively, and element-wise multiplication of the query feature and the modulation feature yields the key information of the input feature;
Specifically: given the input feature z, the pharmacopoeia-feature lightweight focusing module obtains an initial representation z_0 = f(z) of the input via the low-rank fully connected layer; the cascade of L low-rank convolution layers then yields the multi-layer representations z_l = c(z_{l−1}), l = 1, 2, …, L; the L gating factors G_l are multiplied element-wise with the corresponding representations z_l and the results are superimposed to obtain the multi-layer integrated feature; finally, a low-rank fully connected layer maps the multi-layer integrated feature to a modulation feature and another maps the original input feature z to a query feature in a common feature space, and the element-wise product of the two yields the key information of the input feature, i.e. the focused feature Z. The computation is:
z_0 = f(z);  z_l = c(z_{l−1}), l = 1, …, L;  z_sum = Σ_{l=1}^{L} G_l ⊙ z_l;  Z = q(z) ⊙ h(z_sum),
where ⊙ denotes element-wise multiplication and h and q are the low-rank fully connected layers generating the modulation feature and the query feature, respectively. By using the low-rank fully connected layer and low-rank convolution layers to extract the key information of the input feature, the pharmacopoeia-feature lightweight focusing module effectively reduces the number of model parameters while preserving the extraction quality, improving the module's operating efficiency;
step 2, constructing a pharmacopoeia document information identification extraction network
The pharmacopoeia document information recognition and extraction network comprises an encoder and a decoder. The encoder comprises five computation stages: the first stage converts the input H×W×3 pharmacopoeia image into a sequence of HW/16 vectors of length 48; the second stage comprises 2 pharmacopoeia-feature lightweight focusing modules, takes the HW/16 × 48 two-dimensional sequence as input, and converts it into HW/16 output features of length 128; the third stage comprises 2 pharmacopoeia-feature lightweight focusing modules and converts the HW/16 × 128 output of the second stage into HW/64 output features of length 256; the fourth stage comprises 14 pharmacopoeia-feature lightweight focusing modules and converts the HW/64 × 256 features into HW/256 output features of length 512; the fifth stage comprises 2 pharmacopoeia-feature lightweight focusing modules and converts the HW/256 × 512 output of the fourth stage into HW/1024 output features of length 1024. The decoder comprises four computation stages: the first stage takes the encoder output features (HW/1024 × 1024) as input and converts them into output features of length 1024; the second, third and fourth stages perform the same computation as the first. The specific construction process is as follows.
Construction of an encoder: the encoder comprises five stages connected end to end, refines the input pharmacopoeia image data to be processed stage by stage, and extracts characteristic information contained in the pharmacopoeia image data;
The first stage is a block-division stage. Given input pharmacopoeia image data x to be processed, with height H, width W and 3 channels, the block-division stage divides the input image into non-overlapping blocks of size 4×4×3; each block has dimension 4×4×3 = 48 and the number of blocks is (H/4)×(W/4) = HW/16, i.e. the input pharmacopoeia image data to be processed is converted into an HW/16 × 48 two-dimensional sequence;
The second stage takes the output of the first stage as input and passes sequentially through a focusing-feature extraction step, a cyclic-shift step, and a second focusing-feature extraction step; it comprises a low-rank fully connected layer and two pharmacopoeia-feature lightweight focusing modules. Specifically: first, a low-rank fully connected layer maps each 48-dimensional block to 128 dimensions, yielding an HW/16 × 128 two-dimensional linear embedding sequence; a pharmacopoeia-feature lightweight focusing module then extracts the focused features of this sequence; the original block-division boundary is then cyclically shifted by half a block along the block diagonal direction to realize information interaction between blocks; finally, a second pharmacopoeia-feature lightweight focusing module extracts the focused features under the new block division as the feature embedding output by this stage;
The third stage takes the output of the second stage as input and comprises a low-rank fully connected layer and two pharmacopoeia-feature lightweight focusing modules. First, adjacent 2×2 blocks in the input are spliced, so that the number of blocks is reduced from HW/16 to HW/64 while the block dimension increases to 512; a low-rank fully connected layer then reduces each block dimension to 256; finally, the same focusing-feature extraction, cyclic shift, and focusing-feature extraction flow as in the second stage computes the HW/64 × 256 feature embedding;
The fourth and fifth stages follow the same flow as the third stage; the feature embedding output by the fifth stage is the final output of the encoder;
construction of the decoder: the decoder takes the output of the encoder as input and comprises four stages connected end to end, converts the key information extracted by the encoder into text data conforming to a specific format, and realizes the identification and extraction of the pharmacopoeia document information;
The first stage takes the output of the encoder as input and comprises two pharmacopoeia-feature lightweight focusing modules and two low-rank fully connected layers. First, two low-rank fully connected layers map the position information and the input features into the same dimension space so that they can be combined; two consecutive pharmacopoeia-feature lightweight focusing modules then refine the position-aware input features; finally, two consecutive low-rank fully connected layers expand the refined feature dimension to 4 times and then restore the original dimension, a scaling process that effectively fuses the feature's internal information and generates the stage's output features;
The second, third and fourth stages take the output of the previous stage as input and further integrate the feature's internal information with two consecutive pharmacopoeia-feature lightweight focusing modules and two consecutive low-rank fully connected layers; the output features of the fourth stage are mapped through a low-rank fully connected layer to the same dimension as the encoder output, yielding the text data conforming to the specific format;
step 3, calculating network model loss
Measure the prediction loss in the pharmacopoeia image data feature extraction process, and drive the optimization of the pharmacopoeia document information recognition and extraction network parameters by minimizing this loss; specifically, the prediction loss L_ce (cross entropy) measures the difference between the pharmacopoeia data text predicted by the decoder of the pharmacopoeia document information recognition and extraction network and the original pharmacopoeia data text, forcing the encoder and decoder to accurately learn the pharmacopoeia image data information; the prediction loss is calculated as follows:

L_ce = -(1/N) · Σ_{i=1}^{N} y_i · log(ŷ_i)

where y_i and ŷ_i are the i-th original pharmacopoeia data text and the i-th predicted pharmacopoeia data text respectively, and N is the total number of pharmacopoeia data.
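The prediction loss above is a standard cross entropy; a minimal sketch, assuming the decoder emits a probability row over the vocabulary for each target token (function name and toy values are hypothetical):

```python
import numpy as np

def cross_entropy_loss(probs, targets):
    """Mean cross entropy L_ce between predicted token distributions and
    ground-truth token ids.
    probs: (N, V) array of predicted probability rows.
    targets: length-N sequence of true token indices."""
    n = len(targets)
    # average negative log-probability assigned to the correct tokens
    return -sum(np.log(probs[i, t]) for i, t in enumerate(targets)) / n
```

Minimizing this quantity pushes the predicted distribution toward the original pharmacopoeia text, which is exactly the pressure the loss places on the encoder and decoder.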
Table 1 The pharmacopoeia document information recognition and extraction network structure of the present invention
In Table 1, LRLinear(48, 128, 20, True) is a low-rank fully-connected layer with input dimension 48, output dimension 128, rank 20 and bias, representing the linear embedding mapping of the second stage of the encoder that maps 48-dimensional blocks to 128 dimensions. LRConv(8, (3, 3), 1, True) is a low-rank convolution layer with 8 input channels, 8 output channels, a 3×3 convolution kernel, rank 1, stride 1, one layer of zero padding and bias, representing the multi-layer representation extraction process of focusing module a in the second stage of the encoder. The low-rank fully-connected layers marked (f), (h) and (q) in the table are, respectively, the layer mapping the input features, the layer mapping the modulation features and the layer mapping the query features in the pharmacopoeia feature lightweight focusing module, and the low-rank convolution layers marked (c1) and (c2) are the 2 low-rank convolution layers extracting the multi-layer representation of the input features. Each focusing module in Table 1 has two consecutive low-rank convolution layers, reflecting that the L value of the focusing module is set to 2. In the table, z1 is the output of low-rank convolution layer c1, z2 is the output of low-rank convolution layer c2, g1 and g2 are the gating factors corresponding to z1 and z2 respectively, and z_sum is the multi-layer integration feature obtained by the weighted sum of z1·g1 and z2·g2. GERU consists of a low-rank fully-connected layer, an activation function and another low-rank fully-connected layer: the input features are first expanded to 4 times their original dimension, an activation function is applied, and the dimension is then reduced back to the original, promoting the fusion of internal feature information.
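The parameter saving behind Table 1's low-rank layers can be illustrated with a sketch of LRLinear (a hypothetical implementation; only the factored-storage idea is taken from the table):

```python
import numpy as np

rng = np.random.default_rng(0)

class LRLinear:
    """Sketch of a low-rank fully-connected layer: instead of a dense
    d_out x d_in weight matrix, store two thin factors U (d_out x rank)
    and V (rank x d_in), so the full product W = U @ V is never
    materialized."""
    def __init__(self, d_in, d_out, rank, bias=True):
        self.U = rng.standard_normal((d_out, rank)) / np.sqrt(rank)
        self.V = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)
        self.b = np.zeros(d_out) if bias else None

    def n_params(self):
        return self.U.size + self.V.size + (self.b.size if self.b is not None else 0)

    def __call__(self, z):
        out = self.U @ (self.V @ z)   # factored product, O(rank*(d_in+d_out))
        return out + self.b if self.b is not None else out

# LRLinear(48, 128, 20, True): 20*(48+128) + 128 = 3648 parameters,
# versus 48*128 + 128 = 6272 for a dense layer with bias.
```

This is the mechanism by which the network keeps its total parameter count low while preserving the mapping dimensions listed in Table 1.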
Verification result
In the experiments of the present invention, the public CORD consolidated receipt dataset was selected, and a dataset composed of entries randomly selected from Part I of the Chinese Pharmacopoeia 2020 edition was constructed, to verify the validity of the present invention; the specific information of the datasets is shown in Table 2.
CORD consolidated receipt dataset: consists of 1000 Latin-script receipt images, which contain relatively complex nested structures in addition to some common fields.
Chinese pharmacopoeia dataset: 1000 pictures randomly selected from Part I of the 2020 edition, where each document picture contains information such as prescription, preparation method, description and identification.
Table 2 Specific information of the datasets

Data set | Number of samples | Language
CORD | 1000 | Latin
Chinese pharmacopoeia | 1000 | Chinese
Evaluation criteria used in the present invention: parameter count required for information extraction (Params, unit: M), field-level F1-score (F1), and Tree Edit Distance (TED) based accuracy (ACC).
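Of these metrics, the field-level F1 can be sketched as follows. This uses one common convention (a predicted field counts as correct only on an exact (key, value) match); the patent does not spell out its matching rule, so the helper and its values are illustrative.

```python
def field_f1(pred_fields, gold_fields):
    """Field-level F1 over sets of (key, value) pairs extracted from a
    document: precision over predicted fields, recall over gold fields,
    harmonic mean of the two."""
    pred, gold = set(pred_fields), set(gold_fields)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)          # exact (key, value) matches
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)
```

For example, predicting one of two gold fields correctly and one incorrectly yields precision = recall = 0.5 and thus F1 = 0.5.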
To verify the effect of the present invention, 4 general visual document understanding models were selected for comparison: the spatial-information-based bidirectional encoder representations model (BROS), the improved general document understanding pre-training model (LayoutLMv2), the spatial-dependency semi-structured document information extraction parsing model (SPADE), and the end-to-end weakly supervised document generation parsing model (WYVERN).
The comparative results of the Params, F1 and ACC performance of the proposed method on the CORD dataset and the Chinese pharmacopoeia dataset are shown in Tables 3, 4 and 5.
Table 3 Comparison of the number of parameters required by each method

Method | BROS | LayoutLMv2 | SPADE | WYVERN | Present invention
Params/M | 141 | 190 | 156 | 170 | 31
Table 4 Comparison of the performance of the methods on the CORD dataset

Method | F1/% | ACC/%
BROS | 83.7 | 80.3
LayoutLMv2 | 88.9 | 87.0
SPADE | 83.1 | 84.5
WYVERN | 62.8 | 70.5
Present invention | 89.6 | 88.5
Table 5 comparison of the performance of the methods on the chinese pharmacopoeia dataset
From Tables 3, 4 and 5 it can be observed that, on all three evaluation metrics (Params, F1 and ACC), the method of the present invention outperforms the baseline methods on both the standard CORD dataset and the real Chinese pharmacopoeia dataset. Specifically, on the CORD dataset, whose information structure is complex, the ACC obtained by the present method is superior to the other baselines, showing that the method not only effectively extracts the key information in a document but also applies well to complex information structures. On the Chinese pharmacopoeia dataset, the F1 score of the present invention only slightly leads LayoutLMv2, but on the ACC metric the present invention has a clear advantage over the other baselines, indicating a stronger extraction capability on real pharmacopoeia data. In addition, the Chinese pharmacopoeia data are written in Chinese, with its complex character set, and the good scores on this dataset reflect an efficient and accurate information extraction capability on documents with complex character sets. Meanwhile, the number of parameters required during computation and inference is far lower than that of the other methods, effectively reducing the computation and inference cost.

Claims (1)

1. A lightweight pharmacopoeia picture and text extraction method, characterized by comprising the following steps:
step 1, constructing a pharmacopoeia feature lightweight focusing module
the construction of the pharmacopoeia feature lightweight focusing module comprises the construction of a low-rank neural network layer and the realization of a focusing strategy;
construction of the low-rank neural network layer: according to the tensor CP decomposition principle, the principal components of the network weights are used to perform the weighting calculations within the neural network layer, constructing a low-rank neural network layer;
specifically, the low-rank neural network layer comprises an input feature z, an output feature z′, an activation function σ, a bias vector b, importance factors λ_r, and K weight vectors w_r^(1), …, w_r^(K) of the layer; the pharmacopoeia feature lightweight focusing module multiplies the input feature z by the K weight vectors in turn, performs a weighted summation with the importance factors λ_r over the value range R of the tensor rank, adds the bias vector b, and finally obtains the output feature z′ through the activation function σ; the specific formula of the low-rank fully-connected layer calculation process is as follows:

z′ = σ(Wz + b),  W = Σ_{r=1}^{R} λ_r · w_r^(1) ∘ w_r^(2) ∘ … ∘ w_r^(K)

where W is the equivalent network weight tensor, ∘ denotes the vector outer product, and R is the preset value range of the tensor rank;
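The fully-connected case of this construction (K = 2 weight vectors per rank component, so the equivalent weight is W = Σ_r λ_r · u_r ∘ v_r) can be checked numerically with a short sketch. Names are illustrative and the activation σ is omitted:

```python
import numpy as np

def low_rank_linear(z, lam, U, V, b):
    """Apply z' = W z + b with W = sum_r lam[r] * outer(U[:, r], V[:, r])
    without ever forming W: project z onto the R right factors, weight
    the projections by the importance factors, then expand through the
    R left factors."""
    return U @ (lam * (V.T @ z)) + b
```

By construction this is identical to multiplying by the explicit rank-R sum of outer products, while touching only R·(d_in + d_out) weights.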
the specific formula of the low-rank convolution layer calculation process is as follows:

z′_{i1,i2,i3} = σ( Σ_{j1,j2,j3} Σ_{r=1}^{R} λ_r · w_r^(1)(j1) · w_r^(2)(j2) · w_r^(3)(j3) · w_r^(4)(i3) · z_{i1+j1, i2+j2, j3} + b_{i3} )

where z′_{i1,i2,i3} and z_{i1+j1,i2+j2,j3} are the output elements and input elements of the low-rank convolution layer respectively; i1, i2, i3 are the subscripts of the elements in the output feature, whose value ranges are the dimensions of the output feature; j1, j2, j3 are the subscripts of the elements in the convolution kernel, whose value ranges are the dimensions of the convolution kernel;
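A direct (unoptimized) sketch of this low-rank convolution, reconstructing the CP-factored kernel for clarity. Stride 1 with no padding is used and the activation σ is omitted; all names are illustrative:

```python
import numpy as np

def cp_conv2d(x, lam, w1, w2, w3, w4, bias):
    """Convolve x (H x W x c_in) with a kernel stored only through its
    CP factors: K[j1, j2, c, o] = sum_r lam[r]*w1[j1,r]*w2[j2,r]*w3[c,r]*w4[o,r].
    The kernel is rebuilt here for readability; an efficient version
    would apply the factors as a chain of cheap 1-D and pointwise
    convolutions instead."""
    K = np.einsum('r,ar,br,cr,dr->abcd', lam, w1, w2, w3, w4)
    kh, kw, cin, cout = K.shape
    H, W, _ = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1, cout))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # correlate the kernel with the local patch (valid mode)
            out[i, j] = np.einsum('abc,abcd->d', x[i:i+kh, j:j+kw, :], K) + bias
    return out
```

Only the rank-R factors are stored, so for a 3×3 kernel with rank 1 (as in Table 1's LRConv) the weight count drops from kh·kw·c_in·c_out to roughly kh + kw + c_in + c_out.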
implementation of the focusing strategy: the pharmacopoeia feature lightweight focusing module comprises a low-rank fully-connected layer f mapping the input features, L low-rank convolution layers c extracting the multi-layer representation of the input features, L gating factors G_l, and low-rank fully-connected layers h and q mapping the modulation features and the query features respectively; the pharmacopoeia feature lightweight focusing module cascades the several low-rank convolution layers to map the input features into a multi-layer representation; the multi-layer representation is then fused by a gating mechanism to obtain the multi-layer integration feature of the input features; finally, the input features and the multi-layer integration feature are mapped into query features and modulation features by the two low-rank fully-connected layers respectively, and the key information of the input features is obtained by the element-wise multiplication of the query features and the modulation features;
specifically: given the input feature z, the pharmacopoeia feature lightweight focusing module obtains an initial representation z_0 = f(z) of the input feature through the low-rank fully-connected layer mapping; the cascade of L low-rank convolution layers then yields the multi-layer representation z_l = c(z_{l-1}), l = 1, 2, …, L; the L gating factors G_l and the multi-layer representations z_l are multiplied element-wise and summed to obtain the multi-layer integration feature z_sum = Σ_{l=1}^{L} G_l ⊙ z_l; finally, the multi-layer integration feature and the original input feature z are mapped by low-rank fully-connected layers into the modulation feature and the query feature in a common feature space, and the key information of the input features, namely the focus feature Z, is obtained by the element-wise multiplication of the two: Z = q(z) ⊙ h(z_sum);
the pharmacopoeia feature lightweight focusing module extracts the key information of the input features using the low-rank fully-connected layers and low-rank convolution layers, effectively reducing the number of model parameters while guaranteeing the feature extraction effect and improving the operating efficiency of the module;
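The control flow of the focusing strategy can be sketched end to end. The maps f, h, q and the conv layers are passed in as plain callables and the gating factors as scalars, so what is shown is the dataflow of the module, not any particular low-rank layer:

```python
import numpy as np

def focusing_module(z, f, convs, gates, h, q):
    """Focusing strategy sketch: initial map z_0 = f(z), cascaded
    multi-layer representations z_l = c(z_{l-1}), gated accumulation
    z_sum = sum_l G_l * z_l, and focus feature Z = q(z) * h(z_sum)
    (elementwise products throughout)."""
    rep = f(z)                      # initial representation z_0
    z_sum = np.zeros_like(rep)
    for c, g in zip(convs, gates):  # L conv layers and gating factors
        rep = c(rep)                # multi-layer representation z_l
        z_sum = z_sum + g * rep     # gated accumulation into z_sum
    return q(z) * h(z_sum)          # query (*) modulation -> focus feature
```

With identity maps for f, h and q and two toy "conv" layers, the output is simply the input modulated by the gated sum of the two intermediate representations.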
step 2, constructing a pharmacopoeia document information identification extraction network
The pharmacopoeia document information recognition and extraction network comprises an encoder and a decoder; the encoder comprises five calculation stages: the first stage converts the pharmacopoeia image input of size H×W×3 into a (H/4)×(W/4) two-dimensional sequence of length 48; the second stage comprises 2 pharmacopoeia feature lightweight focusing modules and takes the (H/4)×(W/4) two-dimensional sequence as input, converting it into (H/4)×(W/4) output features of length 128; the third stage comprises 2 pharmacopoeia feature lightweight focusing modules and takes the (H/4)×(W/4) two-dimensional features of length 128 as input, converting them into (H/8)×(W/8) output features of length 256; the fourth stage comprises 14 pharmacopoeia feature lightweight focusing modules and takes the (H/8)×(W/8) features as input, converting them into (H/16)×(W/16) output features of length 512; the fifth stage comprises 2 pharmacopoeia feature lightweight focusing modules and takes the (H/16)×(W/16) features as input, converting them into (H/32)×(W/32) output features of length 1024; the decoder comprises four calculation stages: the first stage takes the (H/32)×(W/32) encoder output features of length 1024 as input and converts them into output features of the same size and length 1024; the second, third and fourth stages perform the same calculation as the first stage; the specific construction process is as follows:
construction of the encoder: the encoder comprises five stages connected end to end, refining the input pharmacopoeia image data to be processed stage by stage and extracting the feature information contained in the pharmacopoeia image data;
the first stage is the block division stage; given x as the input pharmacopoeia image data to be processed, with height H, width W and 3 channels, the block division stage splits the input image into non-overlapping blocks of size 4×4×3, each block having dimension 4×4×3 = 48 and the number of blocks being (H/4)×(W/4); i.e., the input pharmacopoeia image data to be processed is converted into a (H/4)×(W/4) two-dimensional sequence;
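The block division stage can be sketched in a few lines (a hypothetical helper, using the 4×4×3 blocks of the text):

```python
import numpy as np

def patch_partition(img, p=4):
    """Split an H x W x C image into non-overlapping p x p blocks and
    flatten each block into a p*p*C-dimensional vector, yielding a
    sequence of (H/p)*(W/p) tokens (H and W assumed divisible by p)."""
    H, W, C = img.shape
    blocks = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return blocks.reshape((H // p) * (W // p), p * p * C)
```

An 8×8×3 image, for instance, becomes 4 tokens of dimension 48, matching the 48-dimensional blocks fed to the encoder's linear embedding.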
the second stage takes the output of the first stage as input and passes sequentially through a focus feature extraction step, a cyclic shift step and another focus feature extraction step; it comprises one low-rank fully-connected layer and two pharmacopoeia feature lightweight focusing modules; specifically: first, a low-rank fully-connected layer maps each 48-dimensional block to 128 dimensions, obtaining a (H/4)×(W/4) two-dimensional linearly embedded sequence; a pharmacopoeia feature lightweight focusing module then extracts the focus features of the (H/4)×(W/4) sequence; the original block division boundary is then cyclically shifted by half a block distance along the block diagonal direction to realize information interaction between blocks; finally, a second pharmacopoeia feature lightweight focusing module extracts the focus features under the new block division as the feature embedding output by the second stage;
the third stage takes the output of the second stage as input and comprises one low-rank fully-connected layer and two pharmacopoeia feature lightweight focusing modules; first, adjacent 2×2 blocks in the input are spliced, so that the number of blocks is reduced from (H/4)×(W/4) to (H/8)×(W/8) while the block dimension increases to 512; the dimension of each block is then reduced to 256 by a low-rank fully-connected layer; finally, the same focus feature extraction - cyclic shift - focus feature extraction process as in the second stage is used to compute the (H/8)×(W/8) feature embedding;
the fourth and fifth stages follow the same flow as the third stage; the feature embedding output by the fifth stage is the final output of the encoder;
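The cyclic shift between the two focus feature extractions can be sketched as rolling the feature map by half a block along both spatial axes, so that the new block partition straddles the old block boundaries. The helper and the shift direction are illustrative choices:

```python
import numpy as np

def cyclic_shift(feat, p=4):
    """Roll a 2-D feature map by half a block (p // 2 positions) along
    both spatial axes, i.e. along the block diagonal; elements pushed
    past one edge wrap around to the other."""
    s = p // 2
    return np.roll(feat, shift=(-s, -s), axis=(0, 1))
```

After this shift, blocks cut at the new boundaries mix rows and columns that belonged to four different blocks of the original partition, which is what enables the inter-block information interaction described above.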
construction of the decoder: the decoder takes the output of the encoder as input and comprises four stages connected end to end; it converts the key information extracted by the encoder into text data conforming to a specific format, realizing the recognition and extraction of the pharmacopoeia document information;
the first stage takes the output of the encoder as input and comprises two pharmacopoeia feature lightweight focusing modules and two low-rank fully-connected layers; first, two low-rank fully-connected layers map the position information and the input features into the same dimensional space so they can be combined; then two consecutive pharmacopoeia feature lightweight focusing modules refine the input features carrying the position information; finally, two consecutive low-rank fully-connected layers expand the refined feature dimension to 4 times and then restore the original dimension, and this scaling process effectively fuses the internal information of the features to generate the output features of the stage;
the second, third and fourth stages each take the output of the previous stage as input and further integrate the internal feature information using two consecutive pharmacopoeia feature lightweight focusing modules and two consecutive low-rank fully-connected layers; the output features of the fourth stage are mapped through a low-rank fully-connected layer to the same dimension as the encoder output, i.e., the text data conforming to the specific format;
step 3, calculating network model loss
Measure the prediction loss in the pharmacopoeia image data feature extraction process, and drive the optimization of the pharmacopoeia document information recognition and extraction network parameters by minimizing this loss; specifically, the prediction loss L_ce (cross entropy) measures the difference between the pharmacopoeia data text predicted by the decoder of the pharmacopoeia document information recognition and extraction network and the original pharmacopoeia data text, forcing the encoder and decoder to accurately learn the pharmacopoeia image data information; the prediction loss is calculated as follows:

L_ce = -(1/N) · Σ_{i=1}^{N} y_i · log(ŷ_i)

where y_i and ŷ_i are the i-th original pharmacopoeia data text and the i-th predicted pharmacopoeia data text respectively, and N is the total number of pharmacopoeia data.
CN202211539551.6A 2022-12-02 2022-12-02 Lightweight pharmacopoeia picture and text extraction method Active CN116704537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211539551.6A CN116704537B (en) 2022-12-02 2022-12-02 Lightweight pharmacopoeia picture and text extraction method


Publications (2)

Publication Number Publication Date
CN116704537A CN116704537A (en) 2023-09-05
CN116704537B true CN116704537B (en) 2023-11-03

Family

ID=87842135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211539551.6A Active CN116704537B (en) 2022-12-02 2022-12-02 Lightweight pharmacopoeia picture and text extraction method

Country Status (1)

Country Link
CN (1) CN116704537B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985369A (en) * 2020-08-07 2020-11-24 西北工业大学 Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN113920210A (en) * 2021-06-21 2022-01-11 西北工业大学 Image low-rank reconstruction method based on adaptive graph learning principal component analysis method
CN114418886A (en) * 2022-01-19 2022-04-29 电子科技大学 Robustness denoising method based on deep convolution self-encoder

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant