CN109816118B - Method and terminal for creating structured document based on deep learning model - Google Patents


Info

Publication number: CN109816118B
Application number: CN201910074243.2A
Authority: CN (China)
Prior art keywords: document, deep learning model, information, picture
Legal status: Active (an assumption by Google Patents, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN109816118A
Inventors: 黄征, 陈凯, 周曲, 周异, 何建华
Current and original assignees: Xiamen Shangji Network Technology Co ltd; Shanghai Shenyao Intelligent Technology Co ltd
Events: application filed by Xiamen Shangji Network Technology Co ltd and Shanghai Shenyao Intelligent Technology Co ltd; priority to CN201910074243.2A; publication of CN109816118A; application granted; publication of CN109816118B

Classifications

  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a terminal for creating a structured document based on a deep learning model, and belongs to the field of data processing. The method comprises: presetting a training sample set, in which each sample comprises a document picture and a corresponding annotation document that records the position information and category information of each key field in the document picture; training a preset first deep learning model with the training sample set to obtain a second deep learning model; analyzing a first document picture with the second deep learning model to obtain the position information and category information of each key field in the first document picture; and creating a structured document corresponding to the first document picture from the position information and category information of those key fields. The accuracy of converting document pictures into structured documents is thereby improved.

Description

Method and terminal for creating structured document based on deep learning model
Technical Field
The invention relates to a method and a terminal for creating a structured document based on a deep learning model, and belongs to the field of data processing.
Background
Document structuring is the process of extracting key field information (such as the payer, payment date and payee on a receipt) from the large amount of text in a document and storing it according to a defined structure. Once a large number of documents have been structured, intelligent services such as efficient document retrieval and document analysis can be provided. The core of document structuring, and also its main technical difficulty, is extracting the key field information from a large body of text, which involves locating the required key field in the document and recognizing the text at that location.
For document structuring applications with high volume and high accuracy requirements, such as invoice reimbursement and bank settlement, many critical tasks of the document structuring system are still performed manually. The workflow of a manual document structuring system is shown in FIG. 1: fields are located manually, the field text is read manually, and the recognized text is typed into the corresponding fields of the archived structured document. Although manual field location and manual text recognition are highly accurate, a manual document structuring system has many shortcomings: recognition is slow, labor costs are high, performance is easily degraded by fatigue and other factors, extra time is needed for text entry, and text entry easily introduces additional errors. It is therefore ill-suited to building a large-scale, efficient and economical document structuring system.
With the rapid development of information processing technology, and of deep learning in particular, text localization and text recognition have improved greatly in recent years; in some domains, recognition accuracy approaches the manual level, which has helped many applications reach deployment. Deep learning has also been applied to document structuring systems. A current document structuring scheme using deep learning, whose workflow is shown in FIG. 2, comprises the following basic steps: determine the fixed positions of the different key fields by analyzing templates and statistics over a large number of documents; preprocess the document to be structured, converting it into a digital image if it is not already one; normalize and align the positions of the key field content; crop the image region corresponding to each key field from the document at its fixed position; recognize the text with deep learning OCR; and automatically store the recognized text in the corresponding fields of the structured document.
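The fixed-position pipeline described above can be sketched in a few lines; the field names, box coordinates and OCR stub below are illustrative assumptions, not taken from the patent.

```python
# Sketch of the prior-art pipeline: every key field is cropped from the
# same hard-coded box in every document image. Field names and box
# coordinates are hypothetical examples.
FIXED_BOXES = {
    "invoice_code": (2, 1, 8, 3),   # (x1, y1, x2, y2) in pixels, assumed
    "total_amount": (2, 4, 8, 6),
}

def crop(image, box):
    """Cut a rectangular region out of an image stored as rows of pixels."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def structure_document(image, ocr):
    """Crop each field at its fixed position, then run OCR on the crop."""
    return {field: ocr(crop(image, box)) for field, box in FIXED_BOXES.items()}
```

If the printed content shifts even slightly, the crop no longer covers the field, which is exactly the failure mode analyzed in the next paragraph.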
This existing deep learning scheme reduces the field localization task to cropping the field image from a fixed position in the image, recognizes the text with deep learning OCR, fully automates the key tasks, and greatly improves computational efficiency. However, the scheme only works when the position of the field to be cropped is fixed across all documents, which limits its range of use. In practice, if the invoice printing system prints the key field content at a different position, or the content length changes, the key field content shifts outside the configured region and errors result. In many bill recognition applications, large numbers of bills are digitized by scanning or by photographing with a mobile phone, which easily displaces the bill within the image; different bills may also use different layouts, so the same field does not necessarily appear at the same position in every image. For such application scenarios, where position offsets arise easily, this document structuring scheme converts images into structured documents with low accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: how to improve the accuracy of converting document pictures into structured documents.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention provides a method for creating a structured document based on a deep learning model, which comprises the following steps:
s1, presetting a training sample set; each sample in the training sample set comprises a document picture and an annotated document corresponding to the document picture; the annotation document records the position information and the category information of each key field in the document picture;
s2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model;
s3, analyzing a first document picture by the second deep learning model to obtain position information and category information of each key field in the first document picture;
and S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture.
Preferably, S4 specifically is:
s41, obtaining position information of a key field to obtain current position information;
s42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture;
s43, identifying characters in the key field picture to obtain text information;
s44, adding the category information of the key field and the text information to a preset structured document;
and S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed.
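Steps S41 to S45 above amount to a loop over the detected key fields. A minimal sketch, assuming detections arrive as (category, box) pairs and text recognition is an external function:

```python
def create_structured_document(image, detections, recognize):
    """S41-S45: for each key field, take its position information (S41),
    crop that region from the first document picture (S42), recognize the
    text in the crop (S43), and add category + text to the structured
    document (S44), until every key field has been traversed (S45)."""
    structured = {}
    for category, (x1, y1, x2, y2) in detections:
        field_picture = [row[x1:x2] for row in image[y1:y2]]  # S42: crop
        structured[category] = recognize(field_picture)       # S43-S44
    return structured
```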
Preferably, S2 is specifically:
s21, distributing a unique number for each category of information;
s22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
s23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
s24, comparing the information set with the current labeled document, and calculating to obtain an error value; the information set and the category information in the current markup document are represented by the number;
s25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value, and obtaining the second deep learning model.
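Steps S21 and S24 can be illustrated with a small sketch. The concrete error formula (a category mismatch penalty plus a scaled coordinate difference) is an assumed example; the patent only requires that categories be number-coded and that an error value be computed.

```python
def assign_numbers(categories):
    """S21: allocate a unique number to each category of information."""
    return {c: i for i, c in enumerate(sorted(categories))}

def error_value(predicted, annotated, number):
    """S24: compare the model's information set with the annotation
    document; each item is (category, box), and categories are compared
    via their numbers. The weighting here is illustrative."""
    err = 0.0
    for (p_cat, p_box), (a_cat, a_box) in zip(predicted, annotated):
        err += 0.0 if number[p_cat] == number[a_cat] else 1.0       # class
        err += sum(abs(p - a) for p, a in zip(p_box, a_box)) / 100  # position
    return err
```

S25 and S26 would then back-propagate this error and repeat until it falls below the preset threshold.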
Preferably, the first deep learning model is used for object detection.
The invention also provides a terminal for creating a structured document based on a deep learning model, comprising one or more processors and a memory, wherein the memory stores a program configured to be executed by the one or more processors to perform the following steps:
s1, presetting a training sample set; each sample in the training sample set comprises a document picture and an annotated document corresponding to the document picture; the annotation document records the position information and the category information of each key field in the document picture;
s2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model;
s3, analyzing a first document picture by the second deep learning model to obtain position information and category information of each key field in the first document picture;
and S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture.
Preferably, S4 specifically is:
s41, acquiring position information of a key field to obtain current position information;
s42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture;
s43, identifying characters in the key field picture to obtain text information;
s44, adding the category information of the key field and the text information to a preset structured document;
and S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed.
Preferably, S2 is specifically:
s21, distributing a unique number for each category of information;
s22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
s23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
s24, comparing the information set with the current labeled document, and calculating to obtain an error value; the information set and the category information in the current markup document are represented by the number;
s25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value, and obtaining the second deep learning model.
Preferably, the first deep learning model is used for object detection.
The invention has the following beneficial effects:
1. The invention provides a method and a terminal for creating a structured document based on a deep learning model, unlike the prior art, which reduces the field localization task to cropping the field image from a fixed position in the image. With the document structuring method of the invention, a key field may appear at any position in the document picture, so the category and text content of each key field can still be correctly recognized and matched in application scenarios where document pictures are digitized by scanning or photographing and key field positions shift easily within the picture; the accuracy of converting document pictures into structured documents is thereby improved. Moreover, for document pictures with different layouts but the same substantive content, the positions of all categories of key fields can be recognized with a single model; there is no need, as in the prior art, to match each layout against its own dedicated set of key field positions, which saves considerable resources and improves both the efficiency and the accuracy of converting document pictures into structured documents.
2. Furthermore, the text corresponding to a key field's category information is recognized from that key field's position information, and the category information and text information belonging to the same key field are associated and stored in the structured document, which supports efficient document retrieval, document analysis and other intelligent services.
3. Furthermore, because the output of the deep learning model is numeric, representing the category information by numeric numbers in the annotation document avoids errors in converting the model's output into the corresponding information category and improves the accuracy of comparing the model's recognition result with the reference result, thereby improving the category recognition accuracy of the second deep learning model obtained by training on the training sample set.
4. Furthermore, because the first deep learning model performs target detection, the second deep learning model obtained after training on the training sample set can recognize the key fields in a document picture wherever they are located, and thereby obtain their position information. This differs from the prior art, which analyzes and counts key field positions over a large number of templates and extracts key fields by framing fixed document positions with fixed boxes, a localization approach easily disturbed by document deformation, scanning distortion, over-long key field content or line wrapping. Applying the idea of deep learning target detection to locating the key fields of a document gives high accuracy, flexibility and a wider range of application.
Drawings
FIG. 1 is a flow diagram of a manual document structuring method;
FIG. 2 is a flow diagram of a prior art document structuring method;
FIG. 3 is a flowchart of a method for creating a structured document based on a deep learning model according to an embodiment of the present invention;
FIG. 4 is an example training sample;
FIG. 5 is an example character fragment picture of the total amount key field;
FIG. 6 is a block diagram of a specific embodiment of a terminal for creating a structured document based on a deep learning model according to the present invention;
description of reference numerals:
1. a processor; 2. a memory.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Referring to fig. 3 to fig. 6,
the first embodiment of the invention is as follows:
as shown in FIG. 3, the invention provides a method for creating a structured document based on a deep learning model, which comprises the following steps:
s1, presetting a training sample set; each sample in the training sample set comprises a document picture and an annotated document corresponding to the document picture; and the annotation document records the position information and the category information of each key field in the document picture.
For example, 1000 bill pictures are collected and processed as samples; some are used as training samples and some as test samples. Each bill contains a number of fields, among them the key fields of interest. Each sample comprises a document picture and a document in which the key fields are annotated. The annotation document records the position of each key field in the document picture and the category information of that field. Annotation may be done purely manually, or by deep learning pre-annotation followed by manual correction. FIG. 4 shows a sample of a general quota invoice with the positions and categories of four key fields (invoice type, invoice code, invoice number and total amount) annotated. The samples used for training and testing can be supplemented continuously.
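One such annotation document could be recorded as below. The patent does not specify a file format; this JSON layout, the file name and the coordinates are illustrative assumptions, while the category names follow Table 1 of this embodiment.

```python
import json

# Hypothetical annotation document for one training sample: the document
# picture plus position (box) and category of each key field.
annotation = {
    "picture": "invoice_0001.png",          # assumed file name
    "key_fields": [
        {"category": "BillTittle",  "box": [150, 10, 500, 38]},
        {"category": "InvoiceCode", "box": [120, 40, 360, 70]},
        {"category": "InvoiceNo",   "box": [420, 40, 560, 70]},
        {"category": "TotalAmount", "box": [120, 300, 300, 330]},
    ],
}
serialized = json.dumps(annotation)  # stored alongside the picture
```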
And S2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model. The method specifically comprises the following steps:
s21, distributing a unique number for each category of information;
s22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
preferably, the first deep learning model is used for object detection.
For example, there are well-established deep learning models for object detection, such as Faster R-CNN, SSD and YOLO, which can detect whether a given object, such as a cat, a dog or an airplane, appears in an image. This embodiment adopts an existing deep learning network for target detection as the first deep learning model to be trained, but uses it, innovatively, to detect the different key fields. Different key fields belong to different categories, and the content of the same key field may vary.
Because the first deep learning model performs target detection, the second deep learning model obtained after training on the training sample set can recognize the key fields in a document picture wherever they are located, and thereby obtain their position information. This differs from the prior art, which analyzes and counts key field positions over a large number of templates and extracts key fields by framing fixed document positions with fixed boxes, a localization approach easily disturbed by document deformation, scanning distortion, over-long key field content or line wrapping. Applying the idea of deep learning target detection to locating the key fields of a document gives high accuracy, flexibility and a wider range of application.
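The patent only requires that the first model be an object detector, so one plausible post-processing detail can be sketched: a detector may emit several candidate boxes per key field, and keeping the highest-scoring box per category (each field normally occurs once per document) is one reasonable filtering step. The IoU helper and the score threshold are assumptions, not part of the patent.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def best_per_category(detections, min_score=0.5):
    """Keep the most confident box for each key field category."""
    best = {}
    for category, box, score in detections:
        if score >= min_score and (category not in best or score > best[category][1]):
            best[category] = (box, score)
    return best
```

In a real system the detector itself would be a trained network such as Faster R-CNN; only the output filtering is sketched here.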
S23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
s24, comparing the information set with the current labeled document, and calculating to obtain an error value; the information set and the category information in the current markup document are represented by the number;
because the output of the deep learning model is digital, the class information is represented by using the digital number in the labeled document, errors in the process of converting the output result of the deep learning model into the corresponding information class are avoided, the accuracy of comparing the difference between the recognition result of the deep learning model and the standard result is improved, and the accuracy of recognizing the information class of the second deep learning model obtained by training the training sample set is improved.
S25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value, and obtaining the second deep learning model.
In this embodiment, the deep learning model structure adopts a convolutional neural network, a Long Short-Term Memory (LSTM) network and a CTC structure. The convolutional neural network has multiple stages, each containing a number of convolution modules (which extract image features), pooling layers (which reduce the feature map size), and so on.
For example, before the training samples are input into the first deep learning model, each key field of interest is assigned a unique number. The first deep learning model detects the key fields in the input training sample and outputs the position of each detected key field together with its number. During training, a training sample is input directly into the first deep learning model and can be represented in the computer as a 3-dimensional matrix, e.g. I_(w0, h0, c0), where w0 is the width of the document picture in pixels, h0 its height, and c0 its number of color channels: a color picture has three channels (red, green and blue), while a grayscale picture has only one. The key field positions and number-coded category information recorded in the sample's annotation document are then compared with the output of the first deep learning model, a weighted combined localization-and-classification error is calculated and fed back into the first deep learning model, the network parameters are adjusted, and learning continues. The model under training is tested on the test sample set, and training stops once its localization and classification errors have fallen sufficiently and it localizes and classifies well, yielding the trained second deep learning model.
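The I_(w0, h0, c0) representation above can be sketched with nested lists. The width-first axis order follows the paragraph's notation; real frameworks usually store (height, width, channels) or (channels, height, width) instead.

```python
def picture_matrix(w0, h0, c0, fill=0):
    """A document picture as a 3-D matrix I_(w0, h0, c0):
    w0 columns, h0 rows per column, c0 color values per pixel."""
    return [[[fill] * c0 for _ in range(h0)] for _ in range(w0)]

color = picture_matrix(640, 480, 3)  # red, green and blue channels
gray = picture_matrix(640, 480, 1)   # a single grayscale channel
```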
And S3, analyzing the first document picture by the second deep learning model to obtain the position information and the category information of each key field in the first document picture.
And S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture. The method comprises the following specific steps:
s41, obtaining the position information of a key field to obtain the current position information.
The current position information consists of the four vertex coordinates of the smallest rectangle that completely contains the key field.
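As a small sketch, the four vertex coordinates of the smallest axis-aligned rectangle enclosing a set of points (for example, the corners of the detected field region) can be computed as follows; taking a point set as input is an illustrative assumption.

```python
def enclosing_vertices(points):
    """Four vertex coordinates of the smallest axis-aligned rectangle
    that completely contains the given (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    # Vertices in clockwise order starting from the top-left corner.
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
```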
And S42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture.
Wherein, a key field corresponds to a key field picture.
And S43, identifying characters in the key field picture to obtain text information.
Before S43, a third deep learning model needs to be trained to recognize the characters in key field pictures; the third deep learning model then recognizes the characters in the key field picture to obtain the text information. Specifically:
collecting a certain number of character fragment pictures (for example, 100000 pictures), and processing the pictures to be used as samples for deep learning character recognition, wherein a part of the samples are used as training samples, and a part of the samples are used as test samples. Each picture corresponds to a key field. Each character fragment sample comprises a character fragment picture and an annotation document corresponding to the character fragment picture. And recording the character content of the character fragment pictures in the label document corresponding to the character fragment pictures. The marking of the character segment samples can adopt a purely manual method or a method of adopting deep learning pre-marking and then using manual correction. Fig. 5 shows a sample of a character fragment image of a total amount key field, and the character content recorded in the markup document corresponding to the character fragment is 4500.00. The training sample can be continuously supplemented. A third depth model for character recognition is trained using a training sample set.
Before the training samples are input into the deep learning model for training, the character labels are converted into numeric labels: each Chinese character, English letter, digit and punctuation mark of interest is mapped to a unique numeric number. The deep learning task is to detect each character in the input training picture and output the number corresponding to the detected character, i.e. to classify each detected character.
During training, character fragment pictures are input directly into the deep learning network and can be represented in the computer as 3-dimensional matrices. The numeric labels of a training sample are compared with the output of the deep learning model to compute the recognition error and adjust the network parameters. The convolution modules of the network extract the features of the training picture and output a feature map with a certain number of channels, e.g. F_(w1, h1, c1), where w1, h1 and c1 are the width, height and channel count of the feature map after the convolution module. After the multi-stage convolution modules and pooling layers, the feature map output by the convolutional network (denoted F_(wn, hn, cn)) is fed as input into the Long Short-Term Memory (LSTM) network. The feature information of each column in the width direction of the feature map (one pixel wide, spanning the height and channel dimensions) is input to the LSTM network column by column, and for each column the network outputs the probabilities of all possible characters plus one extra character representing "no character". The CTC module processes the LSTM output and emits the integer codes of the recognized valid characters, which are then mapped back into the valid characters recognized by the deep learning model.
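The CTC step described above, with per-column labels, an extra "no character" label, merging of repeats and mapping back to characters, can be sketched as a greedy decode; the label values and the character mapping below are illustrative.

```python
BLANK = 0  # the extra character meaning "no character" at this column

def ctc_decode(column_labels, to_char):
    """Greedy CTC decoding: take the most probable label of each
    feature-map column, merge consecutive repeats, drop blanks, and
    map the remaining integer codes back to characters."""
    chars, prev = [], None
    for label in column_labels:
        if label != prev and label != BLANK:
            chars.append(to_char[label])
        prev = label
    return "".join(chars)

# Columns emitting 4, 4, blank, 5, 5, blank, 5 decode to "455":
decoded = ctc_decode([1, 1, 0, 2, 2, 0, 2], {1: "4", 2: "5"})
```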
The valid characters recognized by the deep learning model are compared with the training sample's annotation document, the recognition error of the network is computed and fed back into the deep learning model, the model parameters are adjusted, and learning continues; training stops once the recognition error has fallen sufficiently and the network recognizes well, yielding the third deep learning model.
The trained third deep learning model is then used to recognize the characters in the key field picture and obtain the text information.
And S44, adding the category information and the text information of the key field to a preset structured document.
The structured document of this embodiment comprises a category field and a text content field; each record in the structured document stores the information of one key field in the document picture.
For example, converting the bill shown in FIG. 4 into a structured document gives Table 1:

TABLE 1

  Category     | Text content
  -------------|----------------------------------------------------------
  BillTittle   | Xiamen city XX fast moving limited company quota invoice
  InvoiceCode  | 1350214543xx
  InvoiceNo    | 00369040
  TotalAmount  | One-hundred-yuan whole
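In code, each record of Table 1 pairs a category with its text content; a minimal sketch of that storage, where the list-of-records layout is an illustrative assumption and the values are taken from Table 1:

```python
# Table 1 as category / text-content records.
structured_document = [
    {"category": "BillTittle",  "text": "Xiamen city XX fast moving limited company quota invoice"},
    {"category": "InvoiceCode", "text": "1350214543xx"},
    {"category": "InvoiceNo",   "text": "00369040"},
    {"category": "TotalAmount", "text": "One-hundred-yuan whole"},
]

def lookup(document, category):
    """Retrieve the text stored for one key field category."""
    return next(r["text"] for r in document if r["category"] == category)
```

A record layout like this is what makes the retrieval and analysis services mentioned earlier straightforward: each key field is queryable by its category.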
And S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed.
This embodiment provides a method and a terminal for creating a structured document based on a deep learning model, unlike the prior art, which reduces the field localization task to cropping the field image from a fixed position in the image. With the document structuring method of this embodiment, a key field may appear at any position in the document picture, so the category and text content of each key field can still be correctly recognized and matched in application scenarios where document pictures are digitized by scanning or photographing and key field positions shift easily within the picture; the accuracy of converting document pictures into structured documents is thereby improved. Moreover, for document pictures with different layouts but the same substantive content, the positions of all categories of key fields can be recognized with a single model; there is no need, as in the prior art, to match each layout against its own dedicated set of key field positions, which saves considerable resources and improves both the efficiency and the accuracy of the conversion. Compared with the existing manual scheme and fixed-position text recognition scheme, the invention greatly improves the speed and accuracy of creating structured documents, reduces the cost of a structured document creation system, and makes it easier to scale such a system up to support more users.
The second embodiment of the invention is as follows:
as shown in fig. 6, the present invention further provides a terminal for creating a structured document based on a deep learning model, which includes one or more processors 1 and a memory 2, wherein the memory 2 stores a program and is configured to be executed by the one or more processors 1 to perform the following steps:
s1, presetting a training sample set; each sample in the training sample set comprises a document picture and a labeled document corresponding to the document picture; and the annotation document records the position information and the category information of each key field in the document picture.
For example, 1000 bill pictures are collected and processed as samples; some are used as training samples and some as test samples. Each bill contains a number of fields, among them the key fields of interest. Each sample comprises a document picture and a document in which the key fields are annotated. The annotation document records the position of each key field in the document picture and the category information of that field. Annotation may be done purely manually, or by deep learning pre-annotation followed by manual correction. FIG. 4 shows a sample of a general quota invoice with the positions and categories of four key fields (invoice type, invoice code, invoice number and total amount) annotated. The samples used for training and testing can be supplemented continuously.
And S2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model. The method specifically comprises the following steps:
S21, assigning a unique number to each category of information;
S22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
Preferably, the first deep learning model is used for object detection.
For example, there are well-established deep learning models for object detection, such as Faster R-CNN, SSD and YOLO, which can detect whether a given object, such as a cat, a dog or an airplane, is present in an image. This embodiment adopts an existing deep learning network model for object detection as the first deep learning model to be trained, but innovatively uses it to detect the different key fields. Different key fields belong to different categories, and the content of the same key field may vary.
Because the first deep learning model is used for object detection, the second deep learning model obtained after training on the training sample set can identify a key field wherever it is located in the document picture, and thereby obtain its position information. This differs from the prior art, which analyzes and counts key-field positions over a large number of templates and extracts key fields by framing fixed positions of the document with fixed boxes; such positioning is easily affected by document deformation, scanning distortion, overly long key-field content, line crossing and other factors. Applying the idea of deep learning object detection to the positioning of document key fields yields high accuracy, high flexibility and a wider range of application.
S23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
S24, comparing the information set with the current annotation document, and calculating an error value; the category information in both the information set and the current annotation document is represented by the numbers assigned in S21;
The output of a deep learning model is numerical. Representing the category information in the annotation document by a number therefore avoids errors when converting the model's output into the corresponding information category, improves the accuracy of comparing the model's recognition result against the reference result, and improves the category-recognition accuracy of the second deep learning model obtained by training on the training sample set.
S25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value, and obtaining the second deep learning model.
In this embodiment, the deep learning model structure adopts a convolutional neural network, a long short-term memory (LSTM) network and a connectionist temporal classification (CTC) structure. The convolutional neural network has a plurality of stages, each of which contains a number of convolution modules (which extract image features), pooling layers (which reduce the feature map size), and so on.
For example, before the training samples are input into the first deep learning model, each key field of interest is assigned a unique number. The first deep learning model detects the key fields in the input training sample and outputs the position of each detected key field together with its corresponding number. During training, a training sample is input directly into the first deep learning model and can be represented in a computer as a 3-dimensional matrix, for example I_(w0, h0, c0), where w0 is the width (in pixels) of the document picture in the input training sample, h0 is its height, and c0 is its number of colour channels: a colour picture has three channels (red, green and blue), while a greyscale picture has only one. The position information and the numbered category information of the key fields in the sample's annotation document are then compared with the output of the first deep learning model, and a weighted combined positioning-and-classification error is calculated. This error is propagated back into the first deep learning model, the parameters of the deep learning network are adjusted, and learning continues. The trained first deep learning model is tested on the test sample set; training stops once its positioning and classification errors have dropped sufficiently and it has good positioning and classification capability, yielding the trained second deep learning model.
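The 3-dimensional representation I_(w0, h0, c0) described above can be illustrated with NumPy; note that NumPy conventionally orders the axes as (height, width, channels), and the arrays below are synthetic stand-ins rather than pictures loaded from files:

```python
import numpy as np

# A colour document picture carries three channels (red, green, blue);
# a greyscale picture carries one. Sizes here are arbitrary examples.
colour_picture = np.zeros((600, 800, 3), dtype=np.uint8)  # h0=600, w0=800, c0=3
grey_picture = np.zeros((600, 800, 1), dtype=np.uint8)    # single channel

h0, w0, c0 = colour_picture.shape
```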
And S3, analyzing the first document picture by the second deep learning model to obtain the position information and the category information of each key field in the first document picture.
And S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture. The method specifically comprises the following steps:
S41, obtaining the position information of a key field to obtain the current position information.
The current position information consists of the four vertex coordinates of the minimal rectangle that completely contains the key field.
And S42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture.
Each key field corresponds to one key field picture.
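A minimal sketch of S42, assuming the four vertices form an axis-aligned rectangle: the key-field picture is cut out of the document picture by array slicing (a rotated rectangle would first need an affine warp; the helper name is ours, not the patent's):

```python
import numpy as np

def crop_key_field(picture, vertices):
    """Cut the sub-image spanned by four (x, y) rectangle vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    # array rows index the y axis, columns the x axis
    return picture[min(ys):max(ys), min(xs):max(xs)]

document = np.zeros((100, 200), dtype=np.uint8)  # synthetic 100x200 picture
field_picture = crop_key_field(
    document, [(10, 20), (40, 20), (40, 35), (10, 35)])
```

The resulting key-field picture covers rows 20-35 and columns 10-40 of the document picture.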
And S43, identifying characters in the key field picture to obtain text information.
Before S43, a third deep learning model needs to be trained for recognizing the characters in a key field picture; it is this model that recognizes the characters in the key field picture to obtain the text information. The training specifically comprises the following steps:
A certain number of character-fragment pictures (for example, 100000) are collected and processed to serve as samples for deep learning character recognition; one part is used for training and the other part for testing. Each picture corresponds to one key field. Each character-fragment sample comprises a character-fragment picture and an annotation document corresponding to it; the annotation document records the character content of the picture. Labeling of the character-fragment samples can be done purely manually, or by deep learning pre-labeling followed by manual correction. Fig. 5 shows a sample character-fragment image of a total-amount key field; the character content recorded in its annotation document is 4500.00. The training samples can be supplemented continuously. The third deep learning model for character recognition is trained using this training sample set.
Before the training samples are input into the deep learning model for training, the character labels are converted into numerical labels: each Chinese character, English letter, digit and punctuation mark of interest is mapped to a unique number. The deep learning model detects each character in the input training picture and outputs the number corresponding to the detected character, i.e. it classifies the detected characters.
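The character-to-number mapping can be sketched as a simple lookup table. The vocabulary below is a tiny illustrative subset; a real system would enumerate every Chinese character, letter, digit and punctuation mark of interest:

```python
# Map each character of interest to a unique number, plus one reserved
# label for the CTC "no character" output (vocabulary is illustrative).
vocabulary = list("0123456789.") + ["<none>"]
char_to_id = {ch: i for i, ch in enumerate(vocabulary)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# The label "4500.00" from the Fig. 5 example, encoded and decoded back.
encoded = [char_to_id[c] for c in "4500.00"]
decoded = "".join(id_to_char[i] for i in encoded)
```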
During training, a character-fragment picture is input directly into the deep learning network and can be represented in a computer as a 3-dimensional matrix; the number labels of the training sample are compared with the output of the deep learning model to calculate the recognition error and adjust the network parameters. After passing through the convolution modules of the deep learning network, features of the training picture are extracted and a feature map with a certain number of channels is output, e.g. F_(w1, h1, c1), where w1, h1 and c1 are respectively the width, height and number of channels of the feature map after the convolution module. After the multi-stage convolution modules and pooling layers, the feature map output by the convolutional network (denoted F_(wn, hn, cn)) is fed as input into a long short-term memory (LSTM) network. The feature information of each column (one pixel wide) along the width of the feature map, covering the height and channel dimensions, is input into the LSTM network column by column, and for each column the network outputs the probabilities of all possible characters plus one additional label representing "no character". The output of the LSTM network is processed by the CTC module, which outputs the integer codes of the recognized valid characters; these are converted by the mapping into the valid characters recognized by the deep learning model.
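The CTC post-processing of the per-column LSTM outputs can be sketched as a greedy decode: take the most probable label for each column, merge consecutive repeats, and drop the "no character" label. The integer labels below are illustrative:

```python
NO_CHAR = -1  # the extra "no character" label emitted per column

def ctc_greedy_collapse(column_labels):
    """Merge consecutive duplicate labels, then drop the no-character label."""
    collapsed = []
    previous = None
    for label in column_labels:
        if label != previous and label != NO_CHAR:
            collapsed.append(label)
        previous = label
    return collapsed

# Eight feature-map columns collapsing to four character codes 4, 5, 0, 0:
codes = ctc_greedy_collapse([4, 4, NO_CHAR, 5, 5, 0, NO_CHAR, 0])
```

The "no character" label between the two 0 columns is what lets CTC keep genuinely repeated characters while still merging duplicates within one character.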
The valid characters recognized by the deep learning model are compared with the annotation document of the training sample, and the recognition error of the deep learning network is calculated. The error is propagated back into the deep learning model, the model parameters are adjusted, and learning continues. Training stops once the recognition error of the deep learning network has dropped sufficiently and the network has good recognition capability, yielding the third deep learning model.
Alternatively, a traditional recognition model can also be used to recognize the characters in the key field picture to obtain the text information.
And S44, adding the category information and the text information of the key field to a preset structured document.
The structured document of this embodiment comprises a category field and a text content field; each record in the structured document stores the information of one key field in the document picture.
For example, converting the ticket shown in FIG. 4 into a structured document is shown in Table 2:
TABLE 2

| Categories  | Text content                                             |
|-------------|----------------------------------------------------------|
| BillTittle  | Xiamen city XX fast moving limited company quota invoice |
| InvoiceCode | 1350214543xx                                             |
| InvoiceNo   | 00369040                                                 |
| TotalAmount | One-hundred-yuan whole                                   |
And S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed.
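The S41 to S45 loop above amounts to accumulating one (category, text content) record per key field. A minimal sketch using the Table 2 values (the record layout is ours, not prescribed by the patent):

```python
# Build the structured document for the Fig. 4 bill: one record per key
# field, each holding the field's category and its recognized text.
structured_document = []
for category, text in [
    ("BillTittle", "Xiamen city XX fast moving limited company quota invoice"),
    ("InvoiceCode", "1350214543xx"),
    ("InvoiceNo", "00369040"),
    ("TotalAmount", "One-hundred-yuan whole"),
]:
    structured_document.append({"category": category, "text": text})
```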
This embodiment provides a method and a terminal for creating a structured document based on a deep learning model. Unlike the prior art, the field positioning task is not simplified into cropping, from a fixed position in the image, the sub-image corresponding to a field. In the document structuring method provided by the invention, a key field may be located at any position on the document picture. The category and the text content of a key field can therefore still be correctly identified and matched in application scenarios where the document picture is stored in a computer by scanning or photographing and the position of the key field easily shifts within the picture, which improves the accuracy of converting the document picture into a structured document. Moreover, for document pictures with different layouts but substantially the same content, the positions of all categories of key fields can be recognized with the same model; there is no need, as in the prior art, to match each layout against its own set of dedicated key-field position information. This greatly saves resources and improves both the efficiency and the accuracy of converting document pictures into structured documents. Compared with existing manual schemes and fixed-position character recognition schemes, the method and the device can greatly increase the speed and accuracy of structured document creation, reduce the cost of a structured document creation system, and make the system easier to scale up to support more users.
The above description is only an embodiment of the present invention and is not intended to limit the scope of the invention; all equivalent structural or process modifications made on the basis of this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims (4)

1. A method for creating a structured document based on a deep learning model, characterized by comprising the following steps:
S1, presetting a training sample set; collecting 1000 bill pictures and processing them to serve as samples; each sample in the training sample set comprises a document picture and an annotation document corresponding to the document picture; the annotation document records the position information and the category information of each key field in the document picture;
S2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model;
S3, analyzing a first document picture by the second deep learning model to obtain position information and category information of each key field in the first document picture;
S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture;
the S4 specifically comprises the following steps:
S41, obtaining position information of a key field to obtain current position information;
S42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture;
S43, identifying characters in the key field picture to obtain text information;
S44, adding the category information of the key field and the text information to a preset structured document;
S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed;
the S2 specifically comprises the following steps:
S21, assigning a unique number to each category of information;
S22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
S23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
S24, comparing the information set with the current annotation document, and calculating to obtain an error value; the category information in both the information set and the current annotation document is represented by the numbers;
S25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value to obtain the second deep learning model, wherein the structure of the second deep learning model adopts a convolutional neural network, a long short-term memory network and a CTC structure.
2. The deep learning model-based method for creating a structured document according to claim 1, wherein the first deep learning model is used for object detection.
3. A terminal for creating a structured document based on a deep learning model, comprising one or more processors and a memory, the memory storing a program configured to be executed by the one or more processors to perform the following steps:
S1, presetting a training sample set; collecting 1000 bill pictures and processing them to serve as samples; each sample in the training sample set comprises a document picture and an annotation document corresponding to the document picture; the annotation document records the position information and the category information of each key field in the document picture;
S2, training a preset first deep learning model by using the training sample set to obtain a second deep learning model;
S3, analyzing a first document picture by the second deep learning model to obtain position information and category information of each key field in the first document picture;
S4, creating a structured document corresponding to the first document picture according to the position information and the category information of each key field in the first document picture;
the S4 specifically comprises the following steps:
S41, obtaining position information of a key field to obtain current position information;
S42, intercepting an image corresponding to the current position information on the first document picture to obtain a key field picture;
S43, identifying characters in the key field picture to obtain text information;
S44, adding the category information of the key field and the text information to a preset structured document;
S45, repeatedly executing S41 to S44 until each key field corresponding to the first document picture is traversed;
the S2 specifically comprises the following steps:
S21, assigning a unique number to each category of information;
S22, the first deep learning model identifies a sample in the training sample set to obtain an information set; the information set comprises position information and category information;
S23, acquiring the annotation document corresponding to the sample to obtain the current annotation document;
S24, comparing the information set with the current annotation document, and calculating to obtain an error value; the category information in both the information set and the current annotation document is represented by the numbers;
S25, adjusting parameters of the first deep learning model according to the error value;
and S26, repeatedly executing S22 to S25 until the error value is smaller than a preset threshold value to obtain the second deep learning model, wherein the structure of the second deep learning model adopts a convolutional neural network, a long short-term memory network and a CTC structure.
4. The deep learning model-based terminal for creating structured documents according to claim 3, wherein the first deep learning model is used for target detection.
CN201910074243.2A 2019-01-25 2019-01-25 Method and terminal for creating structured document based on deep learning model Active CN109816118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910074243.2A CN109816118B (en) 2019-01-25 2019-01-25 Method and terminal for creating structured document based on deep learning model

Publications (2)

Publication Number Publication Date
CN109816118A CN109816118A (en) 2019-05-28
CN109816118B true CN109816118B (en) 2022-12-06

Family

ID=66604985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910074243.2A Active CN109816118B (en) 2019-01-25 2019-01-25 Method and terminal for creating structured document based on deep learning model

Country Status (1)

Country Link
CN (1) CN109816118B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11854246B2 (en) 2020-06-09 2023-12-26 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, device and storage medium for recognizing bill image

Families Citing this family (14)

Publication number Priority date Publication date Assignee Title
CN110516125B (en) * 2019-08-28 2020-05-08 拉扎斯网络科技(上海)有限公司 Method, device and equipment for identifying abnormal character string and readable storage medium
CN110888926B (en) * 2019-10-22 2022-10-28 北京百度网讯科技有限公司 Method and device for structuring medical text
CN112699906B (en) * 2019-10-22 2023-09-22 杭州海康威视数字技术股份有限公司 Method, device and storage medium for acquiring training data
CN110826488B (en) * 2019-11-06 2022-07-26 思必驰科技股份有限公司 Image identification method and device for electronic document and storage equipment
CN111539416A (en) * 2020-04-28 2020-08-14 深源恒际科技有限公司 End-to-end method for text detection target extraction relation based on deep neural network
US11443082B2 (en) 2020-05-27 2022-09-13 Accenture Global Solutions Limited Utilizing deep learning and natural language processing to convert a technical architecture diagram into an interactive technical architecture diagram
CN111652117B (en) * 2020-05-29 2023-07-04 上海深杳智能科技有限公司 Method and medium for segmenting multiple document images
CN112232336A (en) * 2020-09-02 2021-01-15 深圳前海微众银行股份有限公司 Certificate identification method, device, equipment and storage medium
CN112949574B (en) * 2021-03-29 2022-09-27 中国科学院合肥物质科学研究院 Deep learning-based cascading text key field detection method
CN112990091A (en) * 2021-04-09 2021-06-18 数库(上海)科技有限公司 Research and report analysis method, device, equipment and storage medium based on target detection
CN113127595B (en) * 2021-04-26 2022-08-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
CN113221792B (en) * 2021-05-21 2022-09-27 北京声智科技有限公司 Chapter detection model construction method, cataloguing method and related equipment
CN113743361A (en) * 2021-09-16 2021-12-03 上海深杳智能科技有限公司 Document cutting method based on image target detection
CN116886955B (en) * 2023-07-24 2024-04-16 北京泰策科技有限公司 Video analysis method and system based on ffmpeg and yolov5

Citations (6)

Publication number Priority date Publication date Assignee Title
CN103854019A (en) * 2012-11-29 2014-06-11 北京千橡网景科技发展有限公司 Method and device for extracting fields in image
CN107133621A (en) * 2017-05-12 2017-09-05 江苏鸿信系统集成有限公司 The classification of formatting fax based on OCR and information extracting method
CN108108387A (en) * 2016-11-23 2018-06-01 谷歌有限责任公司 Structured document classification and extraction based on masterplate
CN108133212A (en) * 2018-01-05 2018-06-08 东华大学 A kind of quota invoice amount identifying system based on deep learning
CN109034159A (en) * 2018-05-28 2018-12-18 北京捷通华声科技股份有限公司 image information extracting method and device
CN109086756A (en) * 2018-06-15 2018-12-25 众安信息技术服务有限公司 A kind of text detection analysis method, device and equipment based on deep neural network

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20070172130A1 (en) * 2006-01-25 2007-07-26 Konstantin Zuev Structural description of a document, a method of describing the structure of graphical objects and methods of object recognition.
RU2651144C2 (en) * 2014-03-31 2018-04-18 Общество с ограниченной ответственностью "Аби Девелопмент" Data input from images of the documents with fixed structure

Non-Patent Citations (2)

Title
A novel text structure feature extractor for Chinese scene text detection and recognition; Xiaohang Ren; 2017-04-24; full text *
Research on semantic information extraction methods in semi-structured documents; Li Yi; China Master's Theses Full-text Database (中国优秀硕士论文电子期刊网); 2005-07-15; full text *

Also Published As

Publication number Publication date
CN109816118A (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN109800761B (en) Method and terminal for creating paper document structured data based on deep learning model
CN109816118B (en) Method and terminal for creating structured document based on deep learning model
CN109902622B (en) Character detection and identification method for boarding check information verification
CN107633239B (en) Bill classification and bill field extraction method based on deep learning and OCR
EP3437019B1 (en) Optical character recognition in structured documents
CN113837151B (en) Table image processing method and device, computer equipment and readable storage medium
CN115424282A (en) Unstructured text table identification method and system
CN116052193B (en) RPA interface dynamic form picking and matching method and system
CN113591866A (en) Special job certificate detection method and system based on DB and CRNN
CN111027456A (en) Mechanical water meter reading identification method based on image identification
CN110796145B (en) Multi-certificate segmentation association method and related equipment based on intelligent decision
CN114694130A (en) Method and device for detecting telegraph poles and pole numbers along railway based on deep learning
CN110647824A (en) Value-added tax invoice layout extraction method based on computer vision technology
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium
CN111914706A (en) Method and device for detecting and controlling quality of character detection output result
US20230154217A1 (en) Method for Recognizing Text, Apparatus and Terminal Device
CN111008635A (en) OCR-based multi-bill automatic identification method and system
CN115713775A (en) Method, system and computer equipment for extracting form from document
CN113420116B (en) Medical document analysis method, device, equipment and medium
CN112257525A (en) Logistics vehicle card punching identification method, device, equipment and storage medium
CN113610043A (en) Industrial drawing table structured recognition method and system
CN116306576B (en) Book printing error detection system and method thereof
CN117372510B (en) Map annotation identification method, terminal and medium based on computer vision model
CN116994282B (en) Reinforcing steel bar quantity identification and collection method for bridge design drawing
Jiao et al. Research on automatic identification algorithm of invoice information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant