CN112541490A

CN112541490A - Archive image information structured construction method and device based on deep learning

Info

Publication number: CN112541490A
Application number: CN202011398958.2A
Authority: CN
Inventors: 曹孟君; 曾智; 胡磊; 陈韵; 卢强; 孙颖; 邹瑶; 刘小保; 黎浩云; 才翔; 宋莎
Original assignee: Guangzhou Urban Construction Archives; Guangzhou Urban Planning Technology Development Service Co ltd
Current assignee: Guangzhou Urban Construction Archives; Guangzhou Urban Planning Technology Development Service Co ltd
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2021-03-23

Abstract

The invention discloses a method and device for the structured construction of archive image information based on deep learning. The method comprises the following steps: step S1, acquiring an archive image and preprocessing it to obtain an archive image sample; step S2, manually localizing and labeling text in the archive image samples, extracting keywords, performing text recognition, constructing an end-to-end deep learning model in a multi-learning manner, and training it with the training samples to obtain a final archive image information construction model; step S3, inputting the archive image to be recognized, performing character localization, recognition and keyword extraction with the trained archive image information construction model, saving the output in a preset format, and extracting it into a labeling library file; and step S4, extracting the labeling library file with a labeling library tool and storing it in a database to complete the structured construction.

Description

Archive image information structured construction method and device based on deep learning

Technical Field

The invention relates to the technical field of image processing, and in particular to a method and device for the structured construction of archive image information based on deep learning.

Background

Archives are a faithful record of how things develop. They contain a large amount of information about change over space and time (history, current state and future planning), and archive information is also an important source of information resources. In the current era of big data, informatization, digitization and networking are the prevailing trend in every industry. Archive informatization is an important part of integrating archive work into modern life, connecting it to smart cities and the big-data society, and letting it contribute to national economic and social development; it has practical significance for adapting to changes in the external environment and supporting the coordinated development of various industries. How to carry out archive informatization is therefore a challenge for decision makers.

Most existing archives are at an early stage of informatization. The usual approach is to scan or photograph paper archives and store the resulting images as electronic files, but these are not text files in the true sense: the computer only knows what the archive looks like and does not know the content it contains, so a user can only view a single archive in its original form through the existing directory index and cannot search or operate on its key content. In practice, because archives are numerous and varied and archive management is not standardized, retrieval is inefficient, archives are hard to find, not all related archives can be obtained, and utilization is low. Breaking down the information silos between archives, so that information travels more and people make fewer trips, and making archives convenient for the public is therefore a demand of the present era.

In summary, archive informatization in the prior art generally has the following problems:

1. most archives exist only as image files, and their key information has not been extracted;

2. only a directory index is established for the archives, and the index is not extended with key information, which reduces query efficiency;

3. a simple image-based full-text OCR database lacks key information extraction, which lowers search precision;

4. retrieval is slow, and it is difficult to find all required archives in one pass, so frequent repeated operations are needed;

5. no effective association is established between archives, so not all associated archive information can be found in the existing way;

6. spatial location information for archives is missing, which makes them difficult to locate and reduces search efficiency;

7. field-level statistics compiled from archives are time-consuming, labor-intensive and inaccurate, which affects subsequent analysis, utilization and decision making.

Disclosure of Invention

To overcome the defects of the prior art, the invention aims to provide a method and device for the structured construction of archive image information based on deep learning. It addresses two problems of the traditional approach: a large amount of digitization work is done manually, at high cost and with low efficiency and accuracy, and a single directory database or simple full-text retrieval cannot quickly find all related archive information. The invention automatically extracts the key information of archives, constructs multiple objects, and allows related archive information to be queried quickly and efficiently without repeated operations, thereby improving the field-based informatization and utilization of archives and providing data support for users' analysis, utilization and decision making.

To achieve the above and other objects, the invention provides a method for the structured construction of archive image information based on deep learning, comprising the following steps:

step S1, acquiring an archive image and preprocessing it to obtain an archive image sample;

step S2, manually localizing and labeling text in the archive image samples, extracting keywords, performing text recognition, constructing an end-to-end deep learning model in a multi-learning manner, and training it with the training samples to obtain a final archive image information construction model;

step S3, inputting the archive image to be recognized, performing character localization, recognition and keyword extraction with the trained archive image information construction model, saving the output in a preset format, and extracting it into a labeling library file;

and step S4, extracting the labeling library file with a labeling library tool and storing it in a database to complete the structured construction.

Preferably, the preprocessing includes, but is not limited to, sharpening, graying, binarization, denoising and tilt correction of the acquired archive image.

Preferably, in step S2, the deep learning model uses an end-to-end, deep-learning-based CRNN OCR technique whose architecture comprises a convolutional neural network (CNN), a recurrent neural network (RNN) and connectionist temporal classification (CTC).

Preferably, the deep learning model comprises:

a convolutional network layer, which uses a CNN (convolutional neural network) model to extract image pixel features and to extract a feature sequence, namely the image feature vectors, from the input image;

a recurrent network layer, which uses an RNN model, specifically a bidirectional LSTM, for sequence processing; it learns the contextual sequence information of the feature sequence obtained from the convolutional network layer and predicts the label distribution of the feature sequence;

and a sequence recognition layer, which uses the CTC algorithm to infer the connection characteristics between characters and converts the label distribution obtained from the recurrent network layer into the final recognition result through operations such as de-duplication and merging.

Preferably, in step S2, the training process of the deep learning model includes: acquiring training set samples and preprocessing the data; inputting the preprocessed data into the network model to obtain predicted values; computing the CTC loss function; and updating the parameters with an optimizer to search for the optimum. This process is repeated until the set number of epochs is reached or the loss function no longer decreases, at which point training is finished.

Preferably, in step S3, the archive image to be recognized is passed through the archive image information construction model to generate a plurality of text localization regions; the content of the archive image is then recognized automatically by OCR, and its key information is extracted and saved in SVG format.

Preferably, in step S4, the labeling library tool is used for extraction and database storage to perform the structured construction; project full blocks, unit full blocks and space full blocks of the archives are established in combination with external key information; and the project, unit and space elements of the urban construction field are integrated to break down the information barriers between archives and establish associations.

Preferably, in step S4, after the archive has been labeled in SVG and the labeling library file has been created, part of the information is extracted from the electronic organization code certificate and the archive catalog database; the unit full-block database, project full-block database and spatial database of the archives are created; quality inspection is then performed; when the inspection passes, the results are added to the results catalog and submitted; otherwise the full-block data is combed again until the inspection passes.

Preferably, in step S4, the file number is used as the key element for archive association in the project object, and the project full block, the full-block volume and the archive-relation directed graph are computed and generated from it; the unit name is used as the key element for archive association in the unit object, and the unit full block and its full-block volume are computed and generated from the unit name and the external key information; coordinates, upward-integrated standard diagram numbers and administrative regions are used as the key elements for archive association in the space object, and the space full block and its full-block volume are computed and generated from them; and the project, unit and space full blocks are associated with the archives and the business key information, so that related project and unit full-block information can be linked quickly and presented spatially.

To achieve the above object, the invention further provides a device for the structured construction of archive image information based on deep learning, including:

an archive image preprocessing unit, used for acquiring an archive image and preprocessing it;

a model construction and training unit, used for manually localizing and labeling text in the archive image samples, extracting keywords, performing text recognition, constructing an end-to-end deep learning model in a multi-learning manner, and training it with the training samples to obtain a final archive image information construction model;

a to-be-recognized archive image processing unit, used for inputting the archive image to be recognized, performing character localization, recognition and keyword extraction with the trained archive image information construction model, saving the output in a preset format, and extracting it into a labeling library file;

and a structured construction unit, used for extracting the labeling library file with a labeling library tool and storing it in a database to complete the structured construction.

Compared with the prior art, the method and device of the invention build a deep-learning-based neural network model (character localization, character recognition, keyword extraction) through extensive training and achieve end-to-end automatic recognition and output. This upgrades traditional archive digitization, so that a scanned item is no longer just a picture but is also processed by OCR (Optical Character Recognition), digitized and spatialized. It solves two technical problems of the traditional approach: digitization is done largely by hand, at high cost and with low efficiency and accuracy, and a single directory database or simple full-text retrieval cannot quickly find all related archive information. The invention automatically extracts the key information of archives and constructs multiple objects (projects, units and spaces), so that related archive information can be queried quickly and efficiently without repeated operations. This improves the informatization and utilization of archives within a field (such as urban construction) and provides data support for users' analysis, utilization and decision making.

Drawings

FIG. 1 is a flowchart of the steps of the method for the structured construction of archive image information based on deep learning according to the invention;

FIG. 2 is a system architecture diagram of the device for the structured construction of archive image information based on deep learning according to the invention;

FIG. 3 is a flowchart of the structured construction of archive image information according to an embodiment of the invention;

FIG. 4 is a flowchart of full-block data combing according to an embodiment of the invention.

Detailed Description

Other advantages and effects of the invention will be readily apparent to those skilled in the art from this disclosure, which describes embodiments of the invention through specific examples in conjunction with the accompanying drawings. The invention can also be implemented or applied in other, different embodiments, and the details in this description can be modified in various ways without departing from the spirit and scope of the invention.

Fig. 1 is a flowchart of the steps of the method for the structured construction of archive image information based on deep learning according to the invention. The method comprises the following steps:

Step S1, acquiring an archive image and preprocessing it. That is, an image acquisition device such as a camera captures an image of the archive, and the captured image is preprocessed (for example by sharpening, graying, binarization, denoising and tilt correction) to obtain an archive image sample.

Specifically, step S1 further includes:

Step S100, binarizing the archive image: a thresholding method is used. Exploiting the difference between target and background in the image, each pixel is assigned to one of two levels, with a suitable threshold chosen to decide whether the pixel is target or background. This yields a binary image in preparation for subsequent character recognition.
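The thresholding in step S100 can be sketched as follows. The text only says that "a proper threshold is selected"; choosing it with Otsu's method (maximizing the between-class variance) is an assumption made here for illustration, as are the function names and the convention that dark pixels are the text target.

```python
from typing import List


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # pixels with value <= t
    sum_low = 0  # sum of their gray values
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        w_high = total - w_low
        if w_low == 0 or w_high == 0:
            continue  # one class is empty, variance undefined
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between = w_low * w_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(gray: List[List[int]]) -> List[List[int]]:
    """Map pixels above the threshold to 255 (background), others to 0 (target).

    Assumes dark text on a light page, which the source does not state.
    """
    t = otsu_threshold(gray)
    return [[255 if value > t else 0 for value in row] for row in gray]


if __name__ == "__main__":
    page = [[10, 12, 200], [11, 210, 205]]
    print(binarize(page))
```

A fixed or locally adaptive threshold would fit the description equally well; Otsu's method is used only because it needs no manually tuned parameter.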

Step S101, denoising the binarized image: wavelet-domain threshold denoising is used with a threshold shrinkage method. With a suitable threshold, coefficients smaller than the threshold are set to zero and wavelet coefficients larger than the threshold are kept; estimated coefficients are then obtained by mapping through the threshold function; finally, the inverse transform is applied to the estimated coefficients to achieve denoising and reconstruction.
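The three stages of step S101 (transform, threshold, inverse transform) can be illustrated on a one-dimensional signal such as a single image row. This is a minimal sketch under several assumptions not stated in the source: a single-level Haar wavelet, hard thresholding of the detail coefficients only, and a hand-picked threshold. A real image would need a two-dimensional, usually multi-level, transform.

```python
import math
from typing import List


def haar_denoise(signal: List[float], threshold: float) -> List[float]:
    """Single-level Haar transform, hard-threshold the details, then invert."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)

    # Forward transform: pairwise averages (approximation) and differences (detail).
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]

    # Threshold function: coefficients below the threshold become zero,
    # coefficients above it are kept unchanged (hard thresholding).
    kept = [d if abs(d) > threshold else 0.0 for d in detail]

    # Inverse transform from the estimated coefficients.
    out: List[float] = []
    for a, d in zip(approx, kept):
        out.extend([(a + d) / s, (a - d) / s])
    return out


if __name__ == "__main__":
    noisy = [4.0, 4.2, 4.0, 3.8, 9.0, 1.0]
    print(haar_denoise(noisy, threshold=0.5))
```

Small fluctuations within a pair are smoothed to their mean, while the large jump in the last pair survives because its detail coefficient exceeds the threshold.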

Step S102, correcting the tilt of the denoised archive image: depending on the text, the rotation angle can be obtained from the mode of the line directions found by the Hough transform, from a fitted minimum bounding rectangle, or by other Hough-based methods, and the tilt is then corrected.
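The Hough-based option in step S102 can be sketched as a vote over candidate angles. The function name, the angle range, the step size and the scoring rule (the fullest accumulator bin per angle) are illustrative assumptions; the source does not specify them.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def estimate_skew_deg(points: Iterable[Tuple[int, int]],
                      max_deg: float = 10.0, step: float = 1.0) -> float:
    """Estimate the dominant text-line angle in degrees by Hough-style voting.

    points are (x, y) foreground pixels in image coordinates (y grows downward).
    """
    pts = list(points)
    best_angle, best_votes = 0.0, -1
    n_steps = int(round(2 * max_deg / step))
    for i in range(n_steps + 1):
        angle = -max_deg + i * step
        a = math.radians(angle)
        # A line with direction `angle` has its normal at angle + 90 degrees, so
        # rho = x*cos(a + 90) + y*sin(a + 90) = -x*sin(a) + y*cos(a).
        bins = Counter(round(-x * math.sin(a) + y * math.cos(a)) for x, y in pts)
        votes = max(bins.values()) if bins else 0
        if votes > best_votes:
            best_angle, best_votes = angle, votes
    return best_angle


if __name__ == "__main__":
    slope = math.tan(math.radians(5.0))
    # Two parallel "text baselines" tilted by 5 degrees.
    pixels = [(x, round(x * slope)) for x in range(100)]
    pixels += [(x, round(x * slope) + 20) for x in range(100)]
    print(estimate_skew_deg(pixels))
```

Once the angle is known, the image would be rotated by the opposite angle to deskew it; that rotation step is omitted here.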

Step S2, manually localizing and labeling text in the archive image samples, extracting keywords, performing text recognition, constructing an end-to-end deep learning model in a multi-learning manner, and training it with the training samples to obtain a final archive image information construction model.

The invention adopts an end-to-end, deep-learning-based CRNN (Convolutional Recurrent Neural Network) OCR (Optical Character Recognition) technique with a CNN (Convolutional Neural Network) + RNN (Recurrent Neural Network) + CTC (Connectionist Temporal Classification) structure. The input is a normalized grayscale image of size W × 32 × 1. A feature map is extracted by the CNN layers and then cut by columns (each column contains 512-dimensional features) to obtain a sequence of feature vectors (Map-to-Sequence). The feature vectors are fed to a BiLSTM (bidirectional LSTM) layer for fusion and extraction, and the CTC loss function is then computed to achieve an approximate soft alignment between the variable-length input and output sequences, yielding the text sequence. The model specifically comprises:

a convolutional network layer, which uses a CNN model to extract image pixel features and a feature sequence, namely the image feature vectors, from the input image;

a recurrent network layer, which uses an RNN model to extract the sequential features of the image; specifically, a bidirectional LSTM performs sequence processing, learns the contextual sequence information of the feature sequence obtained from the convolutional network layer, and predicts the label distribution of the feature sequence;

and a sequence recognition layer (or transcription layer), which uses the CTC algorithm to infer the connection characteristics between characters and converts the label distribution obtained from the recurrent network layer into the final recognition result through processing such as de-duplication and merging.

For example, when an image is input, the following processing steps feed its features to the recurrent network layer:

first, the image is scaled to a size of W × 32 × 1 (W represents an arbitrary width);

secondly, after passing through the convolutional network layer (CNN), the image becomes a feature vector of size (W/4) × 1 × 512;

then, at the recurrent network layer (RNN), T = W/4 and D = 512 are set, and after the LSTM the vector becomes T × nclass. The RNN uses a two-layer bidirectional LSTM structure with 256 hidden nodes per layer; T is the number of time steps and nclass is the number of classes. A probability distribution is then obtained with the softmax function, and the class with the highest probability is returned as the predicted value.

Finally, at the sequence recognition layer, redundancy is removed from the predicted values through the blank mechanism and related operations to obtain the complete recognition result, which is then output.
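The shape bookkeeping and the final blank-based de-duplication described above can be traced in a few lines. This is a sketch only: `crnn_shapes` simply restates the sizes given in the text (using integer division for W/4, which is an assumption), and `ctc_greedy_decode` implements the standard greedy (best-path) CTC decoding rule of merging repeats and dropping blanks; the function names, the toy alphabet and the choice of index 0 for the blank are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def crnn_shapes(width: int, nclass: int) -> Dict[str, Tuple[int, ...]]:
    """Tensor sizes at each stage for an input of size width x 32 x 1."""
    t = width // 4  # number of time steps T
    return {
        "input": (width, 32, 1),
        "cnn_out": (t, 1, 512),   # one 512-dimensional column per time step
        "rnn_in": (t, 512),       # T x D after Map-to-Sequence
        "rnn_out": (t, nclass),   # per-step class scores after the BiLSTM
    }


def ctc_greedy_decode(probs: Sequence[Sequence[float]],
                      alphabet: Sequence[str], blank: int = 0) -> str:
    """Take the arg-max class per time step, merge repeats, drop blanks."""
    best_path: List[int] = [
        max(range(len(row)), key=row.__getitem__) for row in probs
    ]
    chars: List[str] = []
    previous = None
    for k in best_path:
        if k != previous and k != blank:
            chars.append(alphabet[k])
        previous = k
    return "".join(chars)


if __name__ == "__main__":
    print(crnn_shapes(width=100, nclass=37))
    alphabet = ["-", "a", "b"]  # index 0 stands for the CTC blank
    # Per-step softmax outputs whose best path is a, a, blank, a, b, b.
    probs = [
        [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7],
    ]
    print(ctc_greedy_decode(probs, alphabet))
```

The blank between the two runs of "a" is what lets the decoder output a doubled letter; without it, adjacent identical predictions collapse into one character.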

The model training process is as follows: acquire training set samples; preprocess the data (for example, convert it to tensor form); input it into the CRNN network model to obtain predicted values; compute the CTC loss function; and update the parameters with an optimizer (for example, Adadelta) to search for the optimum. This process is repeated until the set number of epochs is reached or the loss function no longer decreases, at which point training is finished. One epoch is one pass in which all the data is fed through the network, completing one forward computation and one backward propagation; the number of epochs is the number of such passes.
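The two stopping conditions named above (a fixed number of epochs, or a loss that no longer decreases) can be sketched independently of any deep learning framework. In this sketch `run_epoch` is a hypothetical callable standing in for one full pass of forward computation, CTC loss, backpropagation and optimizer update; `patience` and `min_delta` are assumed parameters that make "no longer decreases" concrete.

```python
from typing import Callable, List


def train(run_epoch: Callable[[], float], max_epochs: int,
          patience: int = 3, min_delta: float = 1e-4) -> List[float]:
    """Run epochs until max_epochs or until the loss stops improving.

    Returns the loss recorded after each epoch that was actually run.
    """
    history: List[float] = []
    best = float("inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for _ in range(max_epochs):
        loss = run_epoch()
        history.append(loss)
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # the loss is no longer decreasing
    return history


if __name__ == "__main__":
    # Stand-in for real training: replay a fixed sequence of epoch losses.
    fake_losses = iter([5.0, 3.0, 2.0, 2.0, 2.0, 2.0, 1.0])
    print(train(lambda: next(fake_losses), max_epochs=10, patience=2))
```

With a real model, `run_epoch` would iterate over the training set and return the mean CTC loss for that epoch.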

Step S3, inputting the archive image to be recognized, performing character localization, recognition and keyword extraction with the trained archive image information construction model, saving the output in SVG format, and extracting it into a labeling library file.

After the archive image information construction model has been trained, an archive image to be recognized is obtained and input into the model, which processes it to generate a plurality of text localization regions. The content of the archive image is then recognized automatically by OCR (Optical Character Recognition), and its key information is extracted and saved in SVG format, so that the unstructured image and the semi-structured information of the archive can be stored together and used with high quality.

Generally, structured information is data that can be digitized and conveniently managed with computer and database technology, while information that cannot be fully digitized, such as document files, pictures and drawing data, is called unstructured information. From a data perspective, structured data is relational-model data, that is, data managed in relational database tables; semi-structured data is data that does not follow the relational model but has a largely fixed structural schema, such as log files and XML documents; and unstructured data is data with no fixed schema, such as Word, PDF, PPT and Excel documents and pictures and videos in various formats. In the invention, therefore, the unstructured image is an archive picture or document that has not been digitized, and the semi-structured information is archive information that has been partially digitized but has not been stored in a database and has no association relationships.
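One way to picture the SVG output described above is an SVG file that references the scanned image and carries one labeled box per recognized text region. The source does not define the SVG layout, so everything about the structure below is an assumption for illustration: the `<g>` grouping, the `data-field` attribute, the field names and the file name `scan_0001.png` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SVG_NS = "http://www.w3.org/2000/svg"


def regions_to_svg(image_href: str, width: int, height: int,
                   regions: List[Dict]) -> str:
    """Serialize recognized text regions as an SVG overlay on the scan.

    Each region is a dict with keys "field", "text" and "box" = (x, y, w, h).
    """
    svg = ET.Element("svg", {"xmlns": SVG_NS,
                             "width": str(width), "height": str(height)})
    ET.SubElement(svg, "image", {"href": image_href,
                                 "width": str(width), "height": str(height)})
    for region in regions:
        x, y, w, h = region["box"]
        # One group per text region; data-field names the extracted keyword type.
        group = ET.SubElement(svg, "g", {"data-field": region["field"]})
        ET.SubElement(group, "rect", {"x": str(x), "y": str(y),
                                      "width": str(w), "height": str(h),
                                      "fill": "none", "stroke": "red"})
        label = ET.SubElement(group, "text", {"x": str(x), "y": str(y)})
        label.text = region["text"]
    return ET.tostring(svg, encoding="unicode")


if __name__ == "__main__":
    print(regions_to_svg("scan_0001.png", 800, 600, [
        {"field": "document_number", "text": "2020-001", "box": (10, 20, 200, 30)},
    ]))
```

Keeping the image reference and the recognized text in one file is what allows the unstructured scan and its semi-structured annotations to be stored together.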

Step S4, extracting the labeling library file obtained in step S3 with a labeling library tool (a tool that extracts the labeled information in the SVG into a database) and storing it in the database to perform the structured construction; establishing project full blocks, unit full blocks and space full blocks of the archives in combination with external key information; and integrating the three key elements of the urban construction field (project, unit and space) to break down the information barriers between archives and establish associations, thereby improving the efficiency of archive retrieval and utilization.
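The labeling library tool is described only as a tool that extracts the labeled information in the SVG into a database. A minimal sketch of such a tool, using SQLite from the Python standard library, might look like the following. The SVG layout (one `<g data-field="...">` per region containing a `<rect>` and a `<text>`), the table schema and all names are assumptions, not the tool used in the source.

```python
import sqlite3
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical labeled SVG for one archive image.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="800" height="600">
  <g data-field="document_number">
    <rect x="10" y="20" width="200" height="30"/><text>2020-001</text>
  </g>
  <g data-field="unit_name">
    <rect x="10" y="60" width="300" height="30"/><text>Example Unit</text>
  </g>
</svg>"""


def load_svg_labels(conn: sqlite3.Connection, archive_id: str,
                    svg_text: str) -> int:
    """Extract labeled regions from an SVG and store them; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS labels ("
        "archive_id TEXT, field TEXT, value TEXT, "
        "x REAL, y REAL, w REAL, h REAL)"
    )
    rows = []
    for group in ET.fromstring(svg_text).iter(SVG_NS + "g"):
        field = group.get("data-field")
        rect = group.find(SVG_NS + "rect")
        text = group.find(SVG_NS + "text")
        if field is None or rect is None or text is None:
            continue  # skip groups that are not complete labels
        rows.append((archive_id, field, text.text or "",
                     float(rect.get("x")), float(rect.get("y")),
                     float(rect.get("width")), float(rect.get("height"))))
    conn.executemany("INSERT INTO labels VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(load_svg_labels(connection, "A-001", SAMPLE_SVG))
    print(connection.execute("SELECT field, value FROM labels").fetchall())
```

Once the labels are rows in a table, they can be joined with external key information to build the project, unit and space records.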

The key information includes, but is not limited to, the electronic organization code certificate and the archive catalog database. That is, after the archive has been labeled in SVG through the above steps and the labeling library file has been created, part of the information is extracted from the electronic organization code certificate and the archive catalog database, and the unit full-block database, project full-block database and spatial database of the archives are created. Quality inspection is then performed; when the inspection passes, the results are added to the results catalog and submitted; otherwise the full-block data is combed again until the inspection passes.

In the embodiment of the invention, archives are integrated into archive object full blocks for the three objects of the urban construction field (projects, units and spaces), and are associated and used accordingly. Specifically: the document number is used as the key element for archive association in the project object, and the project full block, the full-block volume and the archive-relation directed graph are computed and generated from it; the unit name is used as the key element for archive association in the unit object, and the unit full block and its full-block volume are computed and generated from the unit name and the external key information (such as the organization code certificate); coordinates, upward-integrated standard diagram numbers and administrative regions are used as the key elements for archive association in the space object, and the space full block and its full-block volume are computed and generated from them. The full blocks of the three objects (project, unit and space) are associated through the archives and the business key information (such as file number and character number), so that related project and unit full-block information can be linked quickly and presented spatially.
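The association by key elements can be pictured as three indexes over the same set of archives, one per object. The sketch below is illustrative only: the record fields (`archive_id`, `document_number`, `unit_name`, `admin_region`) are hypothetical names, only the administrative region is used as the spatial key, and the directed graph and volume generation mentioned in the text are not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_indexes(records: List[Dict[str, str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Group archive ids by the key element of each object type."""
    indexes: Dict[str, Dict[str, Set[str]]] = {
        "project": defaultdict(set),  # keyed by document number
        "unit": defaultdict(set),     # keyed by unit name
        "space": defaultdict(set),    # keyed by administrative region
    }
    for rec in records:
        indexes["project"][rec["document_number"]].add(rec["archive_id"])
        indexes["unit"][rec["unit_name"]].add(rec["archive_id"])
        indexes["space"][rec["admin_region"]].add(rec["archive_id"])
    return indexes


def related(record: Dict[str, str],
            indexes: Dict[str, Dict[str, Set[str]]]) -> Set[str]:
    """Return ids of other archives sharing a project, unit or space key."""
    hits = (
        indexes["project"].get(record["document_number"], set())
        | indexes["unit"].get(record["unit_name"], set())
        | indexes["space"].get(record["admin_region"], set())
    )
    return hits - {record["archive_id"]}


if __name__ == "__main__":
    archives = [
        {"archive_id": "A1", "document_number": "D-1", "unit_name": "Unit X", "admin_region": "R1"},
        {"archive_id": "A2", "document_number": "D-1", "unit_name": "Unit Y", "admin_region": "R2"},
        {"archive_id": "A3", "document_number": "D-2", "unit_name": "Unit Y", "admin_region": "R3"},
    ]
    idx = build_indexes(archives)
    print(sorted(related(archives[1], idx)))
```

A single lookup then returns every archive linked through any of the three objects, which is the kind of one-pass retrieval the full blocks are meant to enable.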

Fig. 2 is a system architecture diagram of the device for the structured construction of archive image information based on deep learning according to the invention. The device comprises:

The archive image preprocessing unit 201, configured to acquire an archive image and preprocess it. That is, an image acquisition device such as a camera captures an image of the archive, and the captured image is preprocessed (for example by sharpening, graying, binarization, denoising and tilt correction) to obtain an archive image sample.

The archive image preprocessing unit 201 is specifically configured to:

Binarize the archive image: a thresholding method is used. Exploiting the difference between target and background in the image, each pixel is assigned to one of two levels, with a suitable threshold chosen to decide whether the pixel is target or background. This yields a binary image in preparation for subsequent character recognition.

Denoise the binarized image: wavelet-domain threshold denoising is used with a threshold shrinkage method. With a suitable threshold, coefficients smaller than the threshold are set to zero and wavelet coefficients larger than the threshold are kept; estimated coefficients are then obtained by mapping through the threshold function; finally, the inverse transform is applied to the estimated coefficients to achieve denoising and reconstruction.

Correct the tilt of the denoised archive image: depending on the text, the rotation angle can be obtained from the mode of the line directions found by the Hough transform, from a fitted minimum bounding rectangle, or by other Hough-based methods, and the tilt is then corrected.

The model construction and training unit 202 is configured to manually localize and label text in the archive image samples, extract keywords, perform text recognition, construct an end-to-end deep learning model in a multi-learning manner, and train it with the training samples to obtain a final archive image information construction model.

The invention adopts an end-to-end, deep-learning-based CRNN (Convolutional Recurrent Neural Network) OCR (Optical Character Recognition) technique with a CNN (Convolutional Neural Network) + RNN (Recurrent Neural Network) + CTC (Connectionist Temporal Classification) structure. The input is a normalized grayscale image of size W × 32 × 1. A feature map is extracted by the CNN layers and then cut by columns (each column contains 512-dimensional features) to obtain a sequence of feature vectors (Map-to-Sequence). The feature vectors are fed to a BiLSTM (bidirectional LSTM) layer for fusion and extraction, and the CTC loss function is then computed to achieve an approximate soft alignment between the variable-length input and output sequences, yielding the text sequence.

The model training process is as follows: acquire training set samples; preprocess the data (for example, convert it to tensor form); input it into the CRNN network model to obtain predicted values; compute the CTC loss function; and update the parameters with an optimizer (for example, Adadelta) to search for the optimum. This process is repeated until the set number of epochs is reached or the loss function no longer decreases, at which point training is finished.

The to-be-recognized archive image processing unit 203 is configured to input the archive image to be recognized, perform character localization, recognition and keyword extraction with the trained archive image information construction model, save the output in SVG format, and extract it into a labeling library file.

After the archive image information construction model has been trained, an archive image to be recognized is obtained and input into the model, which processes it to generate a plurality of text localization regions. The content of the archive image is then recognized automatically by OCR (Optical Character Recognition), and its key information is extracted and saved in SVG format, so that the unstructured image and the semi-structured information of the archive can be stored together and used with high quality.

The structured construction unit 204 is configured to extract the labeling library file with a labeling library tool and store it in a database to perform the structured construction, to establish project full blocks, unit full blocks and space full blocks of the archives in combination with external key information, and to integrate the three key elements of the urban construction field (the project, unit and space objects) so as to break down the information barriers between archives and establish associations, thereby improving the efficiency of archive retrieval and utilization.

Examples

Fig. 3 is a flowchart of the structured construction of archive image information according to an embodiment of the invention. As shown in Fig. 3, the flow in this embodiment is as follows:

S1 Preprocessing of the archive image

Binarization: a thresholding method is used. Exploiting the difference between target and background in the image, each pixel is assigned to one of two levels, with a suitable threshold chosen to decide whether the pixel is target or background. This yields a binary image in preparation for subsequent character recognition.

Noise removal: wavelet-domain threshold denoising is used with a threshold shrinkage method. With a suitable threshold, coefficients smaller than the threshold are set to zero and wavelet coefficients larger than the threshold are kept; estimated coefficients are then obtained by mapping through the threshold function; finally, the inverse transform is applied to the estimated coefficients to achieve denoising and reconstruction.

Tilt correction: depending on the text, the rotation angle can be obtained from the mode of the line directions found by the Hough transform, from a fitted minimum bounding rectangle, or by other Hough-based methods, and the tilt is then corrected.

S2 construction of neural network model

The invention constructs a neural network model from training samples. It adopts an end-to-end, deep-learning-based CRNN OCR technology, in which the CRNN model has a CNN (convolutional neural network) + RNN (recurrent neural network) + CTC (connectionist temporal classification) structure, specifically as follows:

first, the convolutional layer uses a CNN model to extract image pixel features, producing a feature sequence (that is, image feature vectors) from the input image;

then, the recurrent layer uses an RNN model to extract image sequence features; specifically, a bidirectional LSTM processes the feature sequence obtained from the convolutional layer and predicts the label (ground-truth) distribution of that sequence;

finally, the sequence recognition layer (or transcription layer) uses the CTC algorithm to capture the connection characteristics between characters, and converts the label distribution obtained from the recurrent layer into the final recognition result through processing such as de-duplication and merging, which is then output.
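The de-duplication and merging performed by the transcription layer can be illustrated with greedy CTC decoding: take the most likely label for each frame, collapse consecutive repeats, and drop the blank label. This is a minimal sketch; greedy decoding is one common choice and is assumed here, and the function name and the convention that index 0 is the blank are hypothetical.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Decode per-frame label scores into text with the greedy CTC rule.

    frame_scores: one list of scores per frame, indexed by label.
    alphabet: label index -> character; alphabet[blank] is a placeholder.
    """
    # Most likely label for every frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded = []
    previous = None
    for idx in best_path:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)
```

A blank between two identical labels is what allows a genuinely doubled character to survive: the best path `a a - a b b -` decodes to `aab`.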

S3 Input and output of the model

An archive picture to be recognized is input; character positioning, recognition and keyword extraction are performed by the established CRNN model; the output content is stored in SVG format and extracted into a labeling library file.
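How recognized regions might be written out as SVG can be sketched with the Python standard library. The source does not specify the SVG layout, so the structure below (one group per region holding a rectangle and a text element, with the keyword type in a `data-keyword` attribute) and the function name `regions_to_svg` are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def regions_to_svg(width, height, regions):
    """Serialize recognized text regions as an SVG string.

    regions: dicts with "box" (x, y, w, h), "text" and "keyword" entries.
    The attribute layout is hypothetical, not taken from the source.
    """
    svg = ET.Element(
        "svg", {"xmlns": SVG_NS, "width": str(width), "height": str(height)}
    )
    for region in regions:
        x, y, w, h = region["box"]
        group = ET.SubElement(svg, "g", {"data-keyword": region["keyword"]})
        # Outline of the located text area.
        ET.SubElement(
            group,
            "rect",
            {"x": str(x), "y": str(y), "width": str(w), "height": str(h),
             "fill": "none", "stroke": "red"},
        )
        # Recognized text, anchored at the bottom-left corner of the box.
        label = ET.SubElement(group, "text", {"x": str(x), "y": str(y + h)})
        label.text = region["text"]
    return ET.tostring(svg, encoding="unicode")
```

A labeling library tool could later parse such a file and read each group's keyword type, box and text back into database records.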

S4 Creation of archive object blocks and submission of results

After SVG labeling has been carried out on the archive and the labeling library file has been established, partial information is extracted from the organization code certificate among the electronic licenses and from the archive catalogue database, and a unit block database, a project block database and a spatial database of the archive are established. Quality inspection is then carried out: if it passes, the results are added to the result catalogue and submitted; otherwise, the block data are combed again until the quality inspection passes.

Fig. 4 is a flow chart of block data combing according to an embodiment of the present invention. In the embodiment of the present invention, the three objects (project, unit and space) are described as follows:

(I) In the unit block:

1. The unit name: all unit names in the labeling library.

2. When the "unit name" field is the same as the "unit name" field in the SQL results of the electronic license information table, the tool copies the content of the "code" field from those SQL results into this field.

3. The ID is an automatically assigned number, and the block number is equal to the ID.

4. The file hierarchy: a function selectable in the tool's header interface. The value range of this field is: project level, file level and electronic file level. The value is assigned uniformly after selection (single choice).

5. The file number: the tool obtains the file number for the unit name from the labels.

6. When the "marking" field in the labeling library table TABL_BIAOZHU_INFO is consistent with the unit name in the unit block, the content of the "Svg_name" field in table TABL_PIC_INFO is copied into this field, with "Map_ID" associated to the "Pic_ID" field.

7. The originating document number: obtained, through file-number association, from the "message number" field of table All_basic_info.

8. The storage unit: obtained, through file-number association, from the "branch name" field of table All_basic_info.
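The lookup that links a labeling record to its SVG file via "Map_ID" and "Pic_ID" can be sketched as a join. The table names TABL_BIAOZHU_INFO and TABL_PIC_INFO and the fields Map_ID, Pic_ID and Svg_name come from the description; everything else is assumed for illustration, including the column name `label_text` standing in for the "marking" field, the placement of Map_ID in the labeling table, the use of SQLite, and the demo data.

```python
import sqlite3


def svg_names_for_unit(conn, unit_name):
    """Return the SVG file names whose labels match the given unit name."""
    rows = conn.execute(
        """
        SELECT p.Svg_name
        FROM TABL_BIAOZHU_INFO AS b
        JOIN TABL_PIC_INFO AS p ON p.Pic_ID = b.Map_ID
        WHERE b.label_text = ?
        ORDER BY p.Svg_name
        """,
        (unit_name,),
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection():
    """Build a tiny in-memory database with made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE TABL_PIC_INFO (Pic_ID INTEGER PRIMARY KEY, Svg_name TEXT);
        CREATE TABLE TABL_BIAOZHU_INFO (Map_ID INTEGER, label_text TEXT);
        INSERT INTO TABL_PIC_INFO VALUES (1, 'a001.svg'), (2, 'a002.svg');
        INSERT INTO TABL_BIAOZHU_INFO VALUES
            (1, 'Unit A'), (2, 'Unit B'), (2, 'Unit A');
        """
    )
    return conn
```

Calling `svg_names_for_unit(demo_connection(), "Unit A")` returns both SVG names, since the unit is labeled on both pictures in the demo data.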

(II) In the project block:

1. The root file of a project block is the file number corresponding to the construction land planning license number or the construction project planning license number in the block; the block description defaults to that construction land planning license number or construction project planning license number; and the ID is the block ID. For the project block, the file number is obtained from the labeling library and the archive catalogue information table, the in-volume file name is obtained from the labeling library, and the storage unit is obtained from the archive catalogue information table.

2. Taking the "related file number" of the designated block's root file as the starting point, the records with that "related file number" are found in the labeling library, the contents of their "file number" and "in-volume file name" fields are obtained, and the originating document number corresponding to each "file number" is found in the archive catalogue information table through file-number association. The "direction" field of all records obtained in this process is labeled "1".

3. Taking the "related file number" of the designated root file as the starting point, the other document numbers of the file are obtained from the labeling library as "originating document numbers", together with the "in-volume file name"; the "originating document numbers" are associated with the archive catalogue information table to find the corresponding "file numbers". The "direction" field of all records obtained in this process is labeled "-1".

4. The first record of the designated project block is then taken, its "originating document number" and the corresponding "file number" are each used as a new "related file number", and the "direction" "1" and "direction" "-1" searches are repeated until neither search returns any result from the labeling library.

5. Steps 2 to 4 are repeated to construct the next block.
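Read as a graph problem, the steps above expand outward from a root file along reference links in both directions until nothing new is found. The sketch below is one interpretation rather than the procedure of the invention: labeling library records are simplified to `(file_number, related_file_number)` pairs, and the function name `build_project_block` is hypothetical.

```python
from collections import deque


def build_project_block(root, references):
    """Collect every file reachable from the root file, with link direction.

    references: list of (file_number, related_file_number) pairs, a
    simplified stand-in for labeling library records.
    Returns (found_file, reached_from, direction) tuples, where direction
    is 1 for files that refer to the current file and -1 for files that
    the current file refers to.
    """
    block = []
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for file_no, related_no in references:
            if related_no == current and file_no not in seen:
                found, direction = file_no, 1
            elif file_no == current and related_no not in seen:
                found, direction = related_no, -1
            else:
                continue
            seen.add(found)
            block.append((found, current, direction))
            queue.append(found)
    return block
```

With the references `[("B", "A"), ("C", "B"), ("A", "Z"), ("X", "Y")]` and root `"A"`, the block contains B, Z and C, while the unrelated pair X and Y is left for another block. The loop ends when a pass over the records yields no new file, which mirrors the stopping condition in step 4.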

(III) In the space block: a space block is established from the vectorization of graphics and the spatialization of addresses in the archive image, so that the archive can be displayed, located and queried by spatial position. The standard topographic map sheet number and the administrative region name are extracted according to the spatial overlay of the archive's vectorized graphics and recorded in the corresponding fields of the space block.

In summary, the deep-learning-based method and device for structured construction of archive image information of the present invention establish a deep-learning neural network model through extensive training (character positioning, character recognition and keyword extraction) to achieve end-to-end automatic recognition and output. This upgrades traditional archive digitization, so that a scanned item is no longer merely a picture but is digitized through OCR (Optical Character Recognition) and spatialized. It also solves the technical problems of the traditional approach, in which digitization is carried out largely by hand at high cost, with low efficiency and low accuracy, and in which a single catalogue database or simple full-text retrieval cannot quickly find all related archive information. The invention realizes automatic extraction of key archive information, construction of multi-object blocks (project, unit and space), and fast, efficient querying of related archive information without repeated operations. It thereby improves the level of information-based utilization of archives in a given field (such as the urban construction field) and provides data support for users' analysis, utilization and decision making.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims

1. A file image information structuring construction method based on deep learning comprises the following steps:

2. The method as claimed in claim 1, wherein the preprocessing includes, but is not limited to, sharpening, graying, binarization, image denoising and tilt correction of the acquired archive image.

3. The method as claimed in claim 1, wherein in step S2 the deep learning model employs an end-to-end, deep-learning-based CRNN OCR technology whose architecture comprises a convolutional neural network (CNN), a recurrent neural network (RNN) and connectionist temporal classification (CTC).

4. The method as claimed in claim 3, wherein the deep learning model comprises:

5. The method as claimed in claim 4, wherein in step S2 the training process of the deep learning model comprises: acquiring training set samples and performing data preprocessing; inputting the preprocessed data into the network model to obtain predicted values; calculating the CTC loss function; and updating the parameters with an optimizer to search for the optimal values, repeating this process until the set number of epochs is reached or the loss function no longer decreases, whereupon the training is finished.

6. The method as claimed in claim 5, wherein in step S3 the archive image to be recognized is passed through the archive image information construction model to generate a plurality of text positioning areas, the archive image content is then automatically recognized by OCR, and the key information of the archive image content is extracted and stored in SVG format.

7. The method as claimed in claim 6, wherein in step S4 structured construction is performed by extracting records with a labeling library tool and storing them in a database; a project block, a unit block and a space block of the archive are established in association with external key information; and the three key elements of the urban construction field (project, unit and space) are integrated to break down information barriers between archives and establish associations among them.

8. The method as claimed in claim 7, wherein in step S4, after SVG labeling is performed on the archive and a labeling library file is established, partial information is extracted from the organization code certificate among the electronic licenses and from the archive catalogue database; a unit block database, a project block database and a spatial database of the archive are established; quality inspection is then performed; when the quality inspection passes, the results are added to the result catalogue and submitted; otherwise, the block data are combed again until the quality inspection passes.

9. The method as claimed in claim 8, wherein in step S4 the file number is used as the key element for file association in the project object, and the elements of the project block, the block volume and the file-relation directed graph are calculated and generated from it; the unit name is used as the key element for file association in the unit object, and the elements of the unit block and its block volume are calculated and generated from the unit name and the external key information; coordinates, upward integrated standard diagram numbers and administrative regions are used as the key elements for file association in the space object, and the elements of the space block and its block volume are calculated and generated from them; and the project, unit and space blocks are associated with the file and business key information, so that related project and unit block information can be quickly linked and presented spatially.

10. A deep-learning-based device for structured construction of archive image information, comprising: