CN117593755B - Method and system for recognizing gold text image based on skeleton model pre-training - Google Patents


Info

Publication number
CN117593755B
CN117593755B (application number CN202410071885.8A)
Authority
CN
China
Prior art keywords
image
text
skeleton model
text image
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410071885.8A
Other languages
Chinese (zh)
Other versions
CN117593755A (en)
Inventor
李春桃
徐昊
韩育浩
曹伟
刁晓蕾
史大千
石立达
张骞
戚睿华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202410071885.8A priority Critical patent/CN117593755B/en
Publication of CN117593755A publication Critical patent/CN117593755A/en
Application granted granted Critical
Publication of CN117593755B publication Critical patent/CN117593755B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V30/287 Character recognition specially adapted to Kanji, Hiragana or Katakana characters
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/82 Image or video recognition using neural networks
    • G06V30/164 Image preprocessing; Noise filtering
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/262 Post-processing using context analysis, e.g. lexical, syntactic or semantic context
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for recognizing gold text images based on skeleton model pre-training. The method comprises the following steps: collecting a gold text image and preprocessing it to obtain a preprocessed image to be recognized; pre-training the skeleton model with unlabeled ancient text images to obtain a trained skeleton model; and inputting the preprocessed image to be recognized into the trained skeleton model, the character information reasoner and the character recognizer to obtain the recognition result of the character to be recognized. The invention addresses the poor data quality and class imbalance of gold text datasets and realizes gold text image recognition.

Description

Method and system for recognizing gold text image based on skeleton model pre-training
Technical Field
The invention relates to the technical field of digital humanities and image processing, and in particular to a method and a system for recognizing gold text images based on skeleton model pre-training.
Background
Gold text comprises the inscriptions cast or engraved on bronze ware by ancient Chinese people in the pre-Qin period, and it reflects the economy, culture, calligraphic style, politics and customs of the society of that time. The study of gold text therefore has very important historical, cultural, academic and artistic value. In recent years artificial intelligence has swept through many fields, yet research targeting gold text remains extremely scarce. On the one hand, an artificial intelligence-based gold text recognition method can provide a more accurate and reliable basis for gold text research and interpretation, facilitating the work of gold text scholars and palaeographers; on the other hand, computer-based methods can better preserve and propagate the wisdom of the ancients. There is therefore a need for an artificial intelligence-based gold text image recognition algorithm.
In practice, however, designing an artificial intelligence-based gold text recognition algorithm faces many difficulties. Deep learning models have achieved great success in character recognition, but gold text images contain a large number of character classes while the total amount of data is small, which makes learning very hard for a deep model. In addition, the sample distribution across character classes is unbalanced and the available datasets are of poor quality, so deep learning models struggle to learn the important features of minority classes.
Data acquisition is also a major difficulty for current researchers. Because the bronzes were buried underground for a long time, gold text images are noisy and of poor quality, which poses a great challenge to the recognition task. Existing image denoising methods can only remove simple noise such as salt-and-pepper noise and can hardly handle the complex noise in gold text images, so a denoising method tailored to gold text images must be designed.
In summary, the deficiencies and drawbacks of the prior art are: 1. the amount of gold text image data is limited; 2. the data quality of gold text images is poor and the noise is severe; 3. gold text data exhibits a strong long-tail effect.
Disclosure of Invention
The invention provides a method and a system for recognizing gold text images based on skeleton model pre-training, wherein the method comprises the following steps:
S1, acquiring a gold text image, and preprocessing the gold text image to obtain a preprocessed image to be recognized;
S2, pre-training the skeleton model by using an unlabeled ancient text image dataset to obtain a trained skeleton model;
and S3, inputting the preprocessed image to be recognized into the trained skeleton model, the character information reasoner and the character recognizer to obtain a recognition result of the character to be recognized.
Optionally, in step S1, the preprocessing mainly includes noise reduction, component splitting, and expansion of the gold text dataset.
Optionally, the noise reduction method specifically includes:
extracting shallow features of the gold text image with a 3×3 convolution layer followed by a leaky rectified linear unit, recovering the detail of the gold text image to obtain an initial feature map;
and denoising the deep features of the initial feature map with a character enhancement model containing gold rubbing image noise reduction modules.
Optionally, the gold rubbing image noise reduction method specifically comprises the following steps:
passing the initial feature map through a maximum pooling layer and an average pooling layer and splicing the results along the channel dimension to obtain a spliced feature map;
performing multi-scale convolution on the spliced feature map to obtain multi-scale convolution feature maps;
and rescaling and splicing the multi-scale convolution feature maps, then passing the spliced result through a pooling layer to complete the learning of the font features in the gold text image, thereby realizing the noise reduction of the gold text image.
Optionally, in step S2, pre-training the skeleton model with the unlabeled ancient text image dataset comprises:
performing data enhancement on the unlabeled ancient text image dataset to obtain an enhanced unlabeled ancient text image dataset;
and pre-training the skeleton model with the enhanced unlabeled ancient text image dataset to obtain a trained skeleton model.
Optionally, the method for acquiring the enhanced unlabeled ancient text image dataset specifically comprises:
performing component detection on an unlabeled ancient text image x; if a detected component is found in the gold component library, replacing it with the corresponding component to realize data enhancement and obtain the enhanced views x̃1 and x̃2; if no matching component is found, applying random data enhancement to obtain x̃1 and x̃2.
Here D_YOLO(·) denotes a YOLO-based component detector, whose output gives the upper-left, lower-left, upper-right and lower-right coordinate positions of a component in the image x, and Mix(·) is an image blending operation.
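The component-based enhancement described above could be sketched as follows. This is a minimal illustration, not the patent's implementation: `detector`, `component_library`, `mix` and `random_augment` are hypothetical stand-ins for the YOLO-based detector D_YOLO, the gold component library, the Mix blending operation and the random-augmentation fallback.

```python
import random

def component_augment(image, detector, component_library, mix, random_augment):
    """Produce two augmented views of an unlabeled ancient text image.

    If the detector finds a component that exists in the gold component
    library, a replacement component is blended in; otherwise a generic
    random augmentation is applied (as the patent describes).
    """
    views = []
    for _ in range(2):  # two views x~1, x~2 for contrastive pre-training
        boxes = detector(image)  # [(label, (x_tl, y_tl, x_br, y_br)), ...]
        replaceable = [b for b in boxes if b[0] in component_library]
        if replaceable:
            label, box = random.choice(replaceable)
            patch = random.choice(component_library[label])
            views.append(mix(image, patch, box))  # image blending operation
        else:
            views.append(random_augment(image))
    return views
```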
Optionally, pre-training the skeleton model with the enhanced unlabeled ancient text image dataset to obtain a trained skeleton model comprises:
obtaining, for each unlabeled ancient text image x, the enhanced views x̃1 and x̃2 produced by the data enhancement above;
passing x̃1 and x̃2 through the skeleton model f(·) to obtain the corresponding characterization vectors h1 = f(x̃1) and h2 = f(x̃2);
taking z as the characterization vector after average pooling, and measuring two image vectors with the cosine similarity sim(z_i, z_j) = z_iᵀ z_j / (‖z_i‖ ‖z_j‖); the contrastive loss for a positive pair (z_i, z_j) is defined as
L(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k≠i] exp(sim(z_i, z_k)/τ) ],
wherein 1[·] is an indicator function, τ is the temperature parameter, and ‖·‖ denotes L2 normalization.
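A NumPy sketch of this contrastive objective (the standard SimCLR-style NT-Xent form, which matches the cosine-similarity, indicator-function and temperature terms described above; the exact loss in the patent may differ in detail):

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.1):
    """Contrastive loss over a batch of paired views.

    z1, z2: (N, d) characterization vectors of the two augmented views.
    Vectors are L2-normalised so dot products are cosine similarities;
    the diagonal is masked out, implementing the indicator 1[k != i].
    """
    z = np.concatenate([z1, z2], axis=0)                  # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # L2 normalisation
    sim = (z @ z.T) / tau                                  # cosine sim / temperature
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                         # exclude k == i
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # index of each view's pair
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```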
Optionally, inputting the preprocessed image to be recognized into the trained skeleton model, the character information reasoner and the character recognizer to obtain the recognition result specifically comprises:
inputting the preprocessed image to be recognized into the trained skeleton model to extract the character depth features of the gold text image;
inputting the character depth features into the character information reasoner to generate a candidate radical set and a candidate font structure set for the gold text image;
and inputting the candidate radical set and the candidate font structure set into the character recognizer to recognize the gold text image.
Optionally, inputting the candidate radical set and the candidate font structure set into the character recognizer to recognize the gold text image specifically comprises:
building, with a query list selector, a list to be searched Q from the candidate radical set and the candidate font structure set, where each element pairs one radical combination with one font structure, giving a set of M×N elements;
transmitting the list Q to a search strategy selector for element-by-element analysis to obtain a candidate list set, and returning the candidate list set to the query list selector for reordering;
and having the recognition result memory select elements from the sorted candidate query list in turn, query the knowledge graph with them, and store the retrieval results, thereby realizing recognition of the gold text image.
The invention also discloses a system for recognizing gold text images based on skeleton model pre-training, comprising a data preprocessing module, a skeleton model pre-training module and an image recognition module;
the data preprocessing module is used for collecting the gold text image and preprocessing it to obtain a preprocessed image to be recognized;
the skeleton model pre-training module is used for pre-training the skeleton model with the unlabeled ancient text image dataset to obtain a trained skeleton model;
the image recognition module is used for inputting the preprocessed image to be recognized into the trained skeleton model, the character information reasoner and the character recognizer to obtain the recognition result of the character to be recognized.
Compared with the prior art, the invention has the beneficial effects that:
the invention designs a noise reduction model aiming at the noise of the gold character image, learns the font information of the gold character by introducing the attention deep learning frame of the font information, is used for extracting the characteristics and the inherent font of the character, and improves the reconstruction function of the image. The invention saves the manual labeling cost under the condition of ensuring that the high-quality data set is obtained.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the embodiments are briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a step diagram of the method for recognizing gold text images based on skeleton model pre-training according to an embodiment of the present invention;
FIG. 2 is a comparison of the images of the present invention before and after noise reduction;
FIG. 3 is a flow chart of a character enhancement model according to an embodiment of the present invention;
FIG. 4 is a flowchart of the gold rubbing image noise reduction module according to an embodiment of the present invention;
FIG. 5 is a component labeling flow chart of an embodiment of the invention;
FIG. 6 is a data set expansion flow chart of an embodiment of the present invention;
FIG. 7 is a diagram of characters versus components for an embodiment of the present invention;
FIG. 8 is a flowchart of a skeletal model pre-training process in accordance with an embodiment of the present invention;
FIG. 9 is a workflow diagram of the character depth feature extractor according to an embodiment of the present invention, wherein (a) is the character depth feature extraction network and (b) is the dual attention layer;
FIG. 10 is a flowchart of the character recognizer according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
A method for recognizing gold text images based on skeleton model pre-training, as shown in fig. 1, comprises:
S1, acquiring a gold text image, and preprocessing the gold text image to obtain a preprocessed image to be recognized.
In S1, the preprocessing mainly comprises noise reduction, component splitting and expansion of the gold text dataset.
The noise reduction method specifically comprises the following steps: shallow features are extracted with a 3×3 convolution layer followed by a leaky rectified linear unit, recovering the detail of the character image; the deep features of the character image are then denoised by a character enhancement model containing gold rubbing image noise reduction modules. The character enhancement model adopts a U-Net-structured encoder-decoder, in which both the encoder and the decoder contain several gold rubbing image noise reduction modules. The U-Net structure effectively captures both local and global information of the input feature map, improving model performance; at each stage of the encoder and decoder the feature map is processed accordingly to extract a richer feature representation.
Within each module, the feature map is passed through a maximum pooling layer and an average pooling layer and the results are spliced to obtain a spliced feature map; the spliced feature map is convolved by a multi-scale convolution layer, the convolved outputs are rescaled and spliced, and the result is passed through a pooling layer, realizing the noise reduction of the gold text image. The denoised image is shown in fig. 2.
As shown in fig. 4, the gold rubbing image noise reduction module passes the input feature map through a maximum pooling layer and an average pooling layer and splices the two results along the channel dimension. The maximum pooling layer outputs the maximum value of each region, which helps extract the salient features, while the average pooling layer outputs the mean of each region, which captures the overall features. Splicing the two pooling results helps the model extract both the local and the global features of the gold characters. The spliced feature map is then convolved by multi-scale convolution layers with 1×1, 3×3, 5×5 and 7×7 kernels; kernels of different scales capture features at different scales in the input image. Larger kernels capture a wider range of global features, while smaller kernels are good at capturing local features and details. By combining convolutions of different scales, the model learns a richer feature representation. The multi-scale convolution outputs are rescaled, spliced together, and passed through a pooling layer again to complete the learning of the font features in the gold text image.
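A minimal PyTorch sketch of this module. The channel counts, stride-2 pooling and the final pooling type are illustrative assumptions; the patent specifies only the pooling-splice, the four kernel sizes and the final pooling step:

```python
import torch
import torch.nn as nn

class RubbingDenoiseBlock(nn.Module):
    """Sketch of the gold rubbing image noise reduction module: max- and
    average-pooling results spliced along channels, parallel 1x1/3x3/5x5/7x7
    convolutions spliced again, then a final pooling layer."""

    def __init__(self, channels):
        super().__init__()
        self.maxpool = nn.MaxPool2d(2)  # regional maxima -> salient strokes
        self.avgpool = nn.AvgPool2d(2)  # regional means -> overall appearance
        self.branches = nn.ModuleList([
            nn.Conv2d(2 * channels, channels, k, padding=k // 2)
            for k in (1, 3, 5, 7)       # multi-scale kernels
        ])
        self.pool = nn.MaxPool2d(2)     # final pooling (type assumed)

    def forward(self, x):
        y = torch.cat([self.maxpool(x), self.avgpool(x)], dim=1)  # channel splice
        z = torch.cat([b(y) for b in self.branches], dim=1)       # multi-scale splice
        return self.pool(z)
```

With same-padding branches, an input of shape (B, C, H, W) comes out as (B, 4C, H/4, W/4).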
The formulas are as follows:
y = Concat(MaxPool(x), AvgPool(x))
z = Concat(Conv1x1(y), Conv3x3(y), Conv5x5(y), Conv7x7(y))
Final = Pool(z)
wherein MaxPool() and AvgPool() are the maximum pooling and average pooling operations, Concat() is the splicing operation, and y is the feature map obtained from the input feature map x after maximum pooling and average pooling. Conv1x1, Conv3x3, Conv5x5 and Conv7x7 are the results of passing y through convolution kernels of 1×1, 3×3, 5×5 and 7×7, respectively; splicing them yields the feature map z, and z produces the final output Final through the Pool() pooling operation.
In the forward propagation process, the input features first pass through the gold rubbing image noise reduction modules of the encoder section and are then passed to the decoder section. In the encoder stage, downsampling reduces the size of the feature map to capture a more abstract feature representation; in the decoder stage, upsampling enlarges the feature map so that it can be fused with the corresponding feature map from the encoder section.
The fusion operation is formulated as:
F_i = ONB(F_{i-1})            (encoder)
F_i = ONB(F_{i-1}) + F_skip   (decoder)
wherein F_i denotes the depth feature map output by the i-th gold rubbing image noise reduction module and ONB() is the noise reduction operation of that module. When the feature map is in the encoder section, features are extracted by the module directly; when it is in the decoder section, fusion is realized through a skip connection that adds the upsampled decoder feature map to the downsampled feature map F_skip of the corresponding encoder stage, enabling more accurate image reconstruction.
In the decoder section, the feature map is processed by several gold rubbing image noise reduction modules in sequence. Passing the feature map to the next module further fuses the high-level feature information from the encoder section, while the upsampled decoder feature maps recover richer spatial detail. This gradually restores the spatial resolution of the image while preserving the feature map information, finally realizing the extraction of the depth features of the gold characters and the learning of the character font structure.
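The encoder-decoder flow with additive skip connections can be sketched abstractly; the block and upsampling callables are placeholders, and a symmetric encoder/decoder depth is assumed:

```python
def unet_forward(x, encoder_blocks, decoder_blocks, upsample):
    """Sketch of the U-Net-style fusion: encoder outputs are stored and
    added back to the upsampled decoder features via skip connections,
    mirroring F_i = ONB(F_{i-1}) + F_skip in the decoder."""
    skips = []
    for enc in encoder_blocks:            # downsampling path
        x = enc(x)
        skips.append(x)
    for dec, skip in zip(decoder_blocks, reversed(skips)):
        x = dec(upsample(x)) + skip       # jump connection with matching encoder map
    return x
```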
After noise reduction, the gold text characters are split into components. Three classes of labels are defined: component class labels, component coordinate labels and structure class labels. Component coordinate labeling draws a bounding box around each component, and the structure classes comprise twelve categories: single structure, up-down structure, left-right structure, up-middle-down structure, left-middle-right structure, surrounding structure, up-down surrounding structure, left-right surrounding structure, triangular structure, left-part up-down structure, right-part up-down structure and lower-part left-right structure. Gold characters may be divided into single-component characters containing one component and multi-component characters containing two or more components; a single-component character may itself evolve into a component.
When labeling the components, the component number k of a character is first obtained from the character decomposition dictionary using a component number acquisition function, and one of two labeling schemes is chosen according to k. (1) For a single-component character (k = 1), the character class label is used as the component class label and the structure class label is set to the single structure. The component coordinates are identified with the YOLO object detection model: the detector locates the single gold component in the gold sample image and returns the component class together with (x_min, y_min) and (x_max, y_max), the coordinates of the upper-left and lower-right corners of the bounding box, realizing automatic labeling of single-component characters. (2) For multi-component characters (k > 1), as shown in fig. 5, an expert labeling procedure is adopted. Two experts label the class and position of every component in the character image; if their annotations agree, the annotation is accepted, and if they disagree, a senior expert is invited to review the component labeling and determine the final annotation. After the component classes and positions are labeled, the two experts label the structure class of the character; if their judgments differ, the senior expert relabels the structure class to determine the final result.
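The automatic rule for single-component characters could be sketched as below. The function and dictionary names are hypothetical; the detector stands in for the YOLO model and is assumed to return the two corner coordinates and a class result as described above:

```python
def label_single_component(char, image, decomposition_dict, detector):
    """Automatic labeling for a single-component character: the character
    class doubles as the component class, the structure class is 'single',
    and a YOLO-style detector supplies the bounding-box coordinates."""
    components = decomposition_dict[char]          # component number k from the dictionary
    if len(components) != 1:
        raise ValueError("multi-component characters are labeled by experts")
    (x_min, y_min), (x_max, y_max), _cls = detector(image)  # upper-left / lower-right corners
    return {
        "component_class": char,                   # character class as component class
        "structure_class": "single",
        "bbox": (x_min, y_min, x_max, y_max),
    }
```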
Through the labeling scheme, the manual labeling cost is saved under the condition of ensuring that a high-quality data set is obtained.
The gold characters are expanded with a splicing-and-synthesis strategy based on the gold font structures and the components in the character decomposition dictionary. As shown in fig. 6, synthesizing new gold character images increases the diversity of the training data and thereby improves model performance on gold character and radical recognition tasks. For the eleven structure types other than the single structure, a detailed procedure realizes the splicing and synthesis of gold characters and thus the expansion of the gold dataset.
As shown in fig. 7, first, various structure types other than the individual structure are analyzed using a character structure template function, and the number of parts required for each structure template and the relative position information between the parts are acquired. The structural templates are helpful for knowing the composition rule of the golden characters and provide guidance for subsequent splicing and synthesis.
Then, according to the number of the components in the corresponding structural templates, a corresponding number of component lists are randomly acquired from the component dictionary. To increase the diversity of the synthesized gold character samples, the list of parts is randomly manipulated, such as enlarged, reduced, rotated, warped, etc., using an enhanced processing function. The parts are scaled up or scaled down, which is helpful for simulating the golden character images with different sizes; the direction of the component can be changed by rotating the component, so that the character characteristics of the model for learning at different angles can be improved; the distortion of the component image can simulate the deformation of the golden character caused by factors such as historical environment. The operations can simulate the golden character images with different sizes, angles and deformations, and are helpful for improving the generalization capability of the golden recognition model. After the random selection and enhancement processing of the components is completed, the position and the size of each component in the synthesized character image are determined according to the relative position information provided by the character structure template function. The step ensures that the structure of the synthesized character image accords with the character form rule of the gold, and is convenient for the model to learn the composition characteristics of the characters.
Finally, the synthesized character image is produced by a character generation function using the enhanced component list and the relative position information of the structure template. These synthetic character images are added to the synthetic character image set, expanding the gold text image dataset. Compared with the original gold character image dataset, the expanded dataset has fewer image categories but a larger total number of samples, so the data are more concentrated; the maximum and minimum per-class sample counts increase significantly, which alleviates the long-tail problem in the gold text image training task.
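The splicing step for one structure template could be sketched as follows, for the up-down structure only. This is an illustrative assumption about how components are placed: the template is hard-coded as two equal vertical slots, and a nearest-neighbour resample keeps the sketch dependency-light:

```python
import numpy as np

def synthesize_up_down(top, bottom, canvas_hw=(64, 64)):
    """Paste two component images into the up-down structure template:
    each component is resized into its slot at the relative position the
    template prescribes, producing one synthetic character image."""
    h, w = canvas_hw
    canvas = np.full((h, w), 255, dtype=np.uint8)  # white background
    for comp, (r0, r1) in ((top, (0, h // 2)), (bottom, (h // 2, h))):
        sh, sw = r1 - r0, w
        ys = np.arange(sh) * comp.shape[0] // sh   # nearest-neighbour row indices
        xs = np.arange(sw) * comp.shape[1] // sw   # nearest-neighbour column indices
        canvas[r0:r1, :] = comp[np.ix_(ys, xs)]    # resample component into its slot
    return canvas
```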
S2, pre-training the skeleton model by using a label-free ancient text image dataset to obtain a trained skeleton model;
S2, pre-training a skeleton model by using the label-free ancient text image dataset; the specific method for obtaining the trained skeleton model comprises the following steps:
carrying out data enhancement on the label-free ancient text image dataset to obtain an enhanced label-free ancient text image dataset;
and performing skeleton model pre-training by using the enhanced label-free ancient text image dataset to obtain a trained skeleton model.
Because the sample size of the gold dataset is small, the skeleton model is pre-trained before the classification model is trained, so that the classification skeleton model can extract more compact gold image representations. Unlike fine-tuning directly on the gold dataset, the skeleton model pre-training in the present invention is performed on large-scale unlabeled data. In a specific implementation, the invention pre-trains on a large unlabeled ancient text image dataset, which includes various script data such as oracle bone script, gold script, and Warring States script.
As shown in fig. 3, data enhancement is realized by replacing gold text components with corresponding components from the component library, yielding data-enhanced images; the skeleton model is then pre-trained with the data-enhanced label-free ancient text image dataset to obtain the trained skeleton model. The method of replacing components from the component library is as follows. Owing to the particularity of gold text images, the invention provides a component-based data enhancement method. For an image $x$, component identification is first performed (with the YOLO-based component detection algorithm described above); if components present in the component library exist in the character, the corresponding components are replaced to realize data enhancement, yielding enhanced views $\tilde{x}_1$ and $\tilde{x}_2$. For ancient text images without components, random data enhancement $\mathrm{RandAug}$ is used to obtain $\tilde{x}_1$ and $\tilde{x}_2$. The formula is as follows:

$\tilde{x} = \mathrm{Mix}\big(f_{\det}(x),\ \mathrm{Rand}(\mathrm{Index}(B))\big)$
where $f_{\det}$ denotes the YOLO-based component detector, whose outputs denote the upper-left, lower-left, upper-right, and lower-right coordinate positions of the component in image $x$. $\mathrm{Mix}(\cdot,\cdot)$ is an image mixing operation with two inputs: the first is the four corner coordinates of the component in image $x$, and the second is a component image of the same class as that component drawn from the component library $B$. The purpose of this function is to replace a component in the original ancient character image with a same-class but differently shaped component taken from $B$. $\mathrm{Rand}(\cdot)$ is a random function whose purpose is to randomly select from $B$ a component of the current class but with a different shape. $\mathrm{Index}(\cdot)$ is an index function whose purpose is to index all components in $B$ of the same class as the current component.
Because of the complexity of ancient text images, some ancient characters are not composed of multiple components but consist of a single body or a special structure that is not in the component library $B$. In this case, the invention uses the random data enhancement method $\mathrm{RandAug}$, composed of 5 image enhancement operations: center cropping, random horizontal flipping, color jitter, random grayscale conversion, and Gaussian filtering. In a specific implementation, each operation is applied randomly with a different probability.
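The five-operation random augmentation described above can be sketched as follows; the per-operation probabilities, crop fraction, and jitter strength are illustrative assumptions, not values fixed by the invention:

```python
import numpy as np

rng = np.random.default_rng(0)

def center_crop(img, frac=0.8):
    h, w = img.shape[:2]
    ch, cw = int(h * frac), int(w * frac)
    y, x = (h - ch) // 2, (w - cw) // 2
    return img[y:y + ch, x:x + cw]

def hflip(img):
    return img[:, ::-1]

def color_jitter(img, strength=0.2):
    # Random global brightness perturbation
    return np.clip(img * (1 + rng.uniform(-strength, strength)), 0, 255)

def to_gray(img):
    g = img.mean(axis=2, keepdims=True)          # average the channels
    return np.repeat(g, img.shape[2], axis=2)

def gaussian_filter(img, k=(0.25, 0.5, 0.25)):
    # Separable 3-tap Gaussian blur applied along height, then width
    pad = np.pad(img, ((1, 1), (0, 0), (0, 0)), mode="edge")
    img = k[0] * pad[:-2] + k[1] * pad[1:-1] + k[2] * pad[2:]
    pad = np.pad(img, ((0, 0), (1, 1), (0, 0)), mode="edge")
    return k[0] * pad[:, :-2] + k[1] * pad[:, 1:-1] + k[2] * pad[:, 2:]

# Assumed per-operation probabilities
AUGS = [(center_crop, 0.5), (hflip, 0.5), (color_jitter, 0.8),
        (to_gray, 0.2), (gaussian_filter, 0.5)]

def rand_aug(img):
    # Apply each of the 5 operations independently with its own probability
    for fn, p in AUGS:
        if rng.random() < p:
            img = fn(img)
    return img
```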
As shown in fig. 8, the pre-training method is specifically as follows:
Let $h \in \mathbb{R}^{d}$ be the vector after mean pooling, where $d$ is the pooled dimension. We define the two data-enhanced samples $\tilde{x}_1$ and $\tilde{x}_2$ obtained from the same original image $x$ as a positive sample pair; the data-enhanced samples of the other images in the same batch are treated as negative samples. The present invention uses cosine similarity to measure two image vectors. The contrastive loss of a positive sample pair $(\tilde{x}_1, \tilde{x}_2)$ with representations $(z_i, z_j)$ is defined as:

$\ell_{i,j} = -\log \dfrac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$

where $\mathbb{1}_{[k \neq i]}$ is an indicator function, $\tau$ is a temperature parameter, and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity of the L2-normalized vectors.
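The contrastive pre-training objective described here matches the SimCLR-style NT-Xent loss; a minimal NumPy sketch, assuming the batch is laid out as the two augmented views stacked in order:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    # z1, z2: (N, d) embeddings of the two augmented views of N images
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # L2 normalization
    sim = z @ z.T / tau                                    # cosine / temperature
    np.fill_diagonal(sim, -np.inf)                         # indicator k != i
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # i <-> i + n
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

When the two views of each image agree, the positive-pair similarity dominates the denominator and the loss is low; unrelated views drive it up.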
And S3, inputting the preprocessed image to be recognized into the trained skeleton model, the word information reasoner and the word recognizer to obtain a recognition result of the word to be recognized.
Inputting the preprocessed image to be recognized into the trained skeleton model, the text information reasoner, and the character recognizer to obtain a recognition result of the character to be recognized specifically comprises the following steps:
inputting the preprocessed image to be recognized into the trained skeleton model to extract the character depth features of the gold text image;
inputting the character depth features into the text information reasoner to generate a candidate radical set and a candidate font structure set of the golden text image;
and inputting the candidate radical set and the candidate font structure set into the character recognizer to recognize the golden text image.
Character depth feature extractor
The character image to be recognized is input into a Character Image Deep Feature Extractor (CIDFE) network. The CIDFE network is composed of a plurality of feature extraction sublayers (CIDFE Blocks).
As shown in fig. 9, a set of text depth feature extraction blocks is designed in the CIDFE to extract depth features from the input image. Each CIDFE Block is composed of multiple Dual Attention Layers (DAL) and a Batch Normalization (BN) layer. As shown in fig. 9 b, the DAL is a common module in the field of computer vision; in the present invention it is mainly applied to solve the problem of boundary division between radicals. In the invention, residual connections (i.e., the output of the previous layer is added to the output of the current layer and used as the input of the next layer) are used between the multiple DALs in each CIDFE Block, which further enhances long-distance feature capture, allows a single Block to be made deeper, and builds a network with stronger feature extraction capability. Since the radicals and structures in characters are closely related, characters contain rich semantic information. The CIDFE network formed by stacking CIDFE Blocks can effectively learn the radical and font-structure features contained in characters, and thus extract deep global feature representations that are supplied to the downstream text reasoning module to infer possible radical and structure information.
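A toy shape-level sketch of one CIDFE Block under stated assumptions — the dual attention here is reduced to simple channel and spatial gating, and the BN layer to per-channel standardization; the real DAL and BN are learned modules:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_attention(x):
    # x: (C, H, W). Channel gate from global average pooling; spatial gate
    # from the channel-mean map — a stand-in for the learned DAL.
    ch = sigmoid(x.mean(axis=(1, 2)))[:, None, None]   # (C, 1, 1)
    sp = sigmoid(x.mean(axis=0))[None, :, :]           # (1, H, W)
    return x * ch * sp

def cidfe_block(x, depth=3):
    # Residual connections between stacked DALs: the previous layer's output
    # is added to the current layer's output and fed to the next layer.
    for _ in range(depth):
        x = x + dual_attention(x)
    # Stand-in for the BN layer: per-channel standardization
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True) + 1e-5
    return (x - mu) / sd
```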
Word information inference device
The character depth features extracted by the CIDFE network are input into the text information reasoning module. The module comprises two main parts, a radical reasoner and a font structure reasoner, both of which take as input the deep global features extracted by the CIDFE network. From these deep global features, the text information reasoning module finally generates the possible radicals and font structures of the input character picture.
The radical reasoner is a model composed of four convolution layers and a fully connected layer whose input dimension is determined by the grid division, where $S$ represents the number of grids into which the input character image is divided and $K$ represents the number of anchor boxes in each grid. The model is used to infer the radical information in the character image, including the radical category, the position coordinates, and the confidence of radical detection. The training of the CIDFE (character image depth feature extractor) is constrained by predicting the radical class and position.
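The grid-and-anchor output of the radical reasoner can be decoded in the usual YOLO style; the tensor layout (x, y, w, h, confidence, followed by radical-class scores) is an assumption for illustration:

```python
import numpy as np

def decode_radicals(pred, conf_thresh=0.5):
    # pred: (S, S, K, 5 + C) — per grid cell and anchor box:
    # (x, y, w, h, confidence, C radical-class scores)
    detections = []
    s1, s2, k_anchors, _ = pred.shape
    for i in range(s1):
        for j in range(s2):
            for k in range(k_anchors):
                x, y, w, h, conf = pred[i, j, k, :5]
                if conf >= conf_thresh:
                    cls = int(np.argmax(pred[i, j, k, 5:]))
                    detections.append({"class": cls,
                                       "box": (float(x), float(y),
                                               float(w), float(h)),
                                       "confidence": float(conf)})
    return detections
```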
The font structure reasoner is used to predict the font structure of the character image. It uses the shallow features (i.e., the output features of the first CIDFE Block) together with the deep features (the features extracted by the full CIDFE, including radical location information) to capture global and local structural information. The reasoner is composed of five convolution layers and a fully connected layer, which further process the concatenated shallow and deep features to generate candidate font structures of the character to be recognized. The font structure reasoner constrains the training of the CIDFE by predicting the font structure.
Character identifier
In the second step, a set of candidate radical combinations and font structures is obtained by the radical reasoner and the font structure reasoner. The set includes the radical combinations, the font structures, and the model's output confidence for this information. This radical information and structure information about the character is then input into the character recognition module to complete recognition of the character image. The core of the character recognition module is a recognition control unit, which cooperates with the text knowledge graph to complete the recognition of the character image. The recognition control unit comprises three core components: a query list selector, a retrieval policy selector, and a recognition result memory.
Query list selector and retrieval policy selector: after a picture passes through the text information reasoning module, M groups of radical combinations and N font structures are obtained. The query list selector forms a list to be retrieved $q$ by selecting one radical group and one font structure, finally obtaining a set $Q$ with $M \times N$ elements. The query list selector then delivers the retrieval list set $Q$ to the retrieval policy selector for analysis. The retrieval policy selector analyzes the combinations sent by the query list selector one by one; this analysis is mainly done by the aggregate information analysis unit in the retrieval policy selector. As shown in fig. 10, the specific analysis procedure is as follows:
In the first stage, the aggregate information analysis unit first calculates a confidence for each element $q$ in the retrieval list set $Q$. Since $q$ consists of a radical list and a font structure, the overall confidence $C_q$ of $q$ is likewise divided into two parts: the radical list confidence $C_r$ and the font structure confidence $C_s$. The radical list confidence is $C_r = \frac{1}{n}\sum_{i=1}^{n} c_i$, where $n$ is the number of radicals in the list and $c_i$ is the confidence of the radical at the $i$-th position (output by the radical reasoner in the second step). The font structure confidence $C_s$ is the structure confidence output by the font structure reasoner in the second step. Having obtained $C_r$ and $C_s$, $C_q = \alpha C_r + (1-\alpha) C_s$, where $\alpha$ is an adjustable parameter; since the radical information of a character is generally considered more important, $\alpha$ typically takes a value greater than 0.5.
In the second stage, the aggregate information analysis unit analyzes the $C_r$ and $C_s$ of each element. The analysis rules are as follows: given thresholds $T_r$ and $T_s$, if an element's $C_r$ is above $T_r$ and its $C_s$ is above $T_s$, the retrieval policy selector selects a full-match retrieval strategy for this candidate retrieval list. If an element's $C_r$ is above $T_r$ but its $C_s$ is below $T_s$, the structure information of this candidate list is considered untrustworthy, and the retrieval policy selector selects a radical-priority fuzzy retrieval strategy for it. When an element's $C_r$ is below $T_r$ and its $C_s$ is also below $T_s$, the element $q$ is considered untrustworthy, and the retrieval policy selector selects a low-priority retrieval strategy for this candidate retrieval list.
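The two-stage confidence scoring and strategy assignment can be sketched as follows; the mean-based radical confidence, the weight alpha, and the thresholds are illustrative assumptions:

```python
def list_confidence(radical_confs, struct_conf, alpha=0.6):
    # C_r: mean radical confidence; C_q: weighted combination (alpha > 0.5
    # because the radical information is considered more important)
    c_r = sum(radical_confs) / len(radical_confs)
    c_q = alpha * c_r + (1 - alpha) * struct_conf
    return c_r, c_q

def select_strategy(c_r, c_s, t_r=0.7, t_s=0.7):
    if c_r >= t_r and c_s >= t_s:
        return "full_match"
    if c_r >= t_r:          # structure untrusted, radicals trusted
        return "radical_priority_fuzzy"
    return "low_priority"   # neither part of the list is trusted
```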
In the third stage, after the aggregate information analysis unit has assigned a strategy to each candidate list, the whole candidate list set $Q$ is returned to the query list selector. At this point the work of the retrieval policy selector ends.
Each query list in the new candidate query list set includes, in addition to the basic radical combination information and font structure information, the newly added confidence information and retrieval strategy information. The query list selector reorders the candidate query lists based on this information. The ordering rules are as follows: candidate query lists with the full-match retrieval strategy are arranged in descending order of their overall confidence $C_q$; candidate query lists with the radical-priority fuzzy retrieval strategy are arranged in descending order of their radical list confidence $C_r$; candidate query lists with the low-priority retrieval strategy are arranged in descending order of $C_q$.
At this point, the query list selector sequentially selects elements from the sorted candidate query list set and queries them in the knowledge graph. If a query succeeds, the confidence of the candidate list is taken as the confidence of the query result. Candidate query lists with the full-match retrieval strategy are queried strictly according to both the radical combination and the font structure; if the knowledge graph retrieval succeeds, the result is marked with the full-match flag and the (result confidence, result value, result flag value) triple is added to the recognition result memory. For a candidate query list with the radical-priority fuzzy retrieval strategy, the structure information is ignored during the knowledge graph query, the first K candidate characters are queried directly from the knowledge graph according to the radical combination, marked with the fuzzy flag, and all added to the recognition result memory. A candidate query list with the low-priority retrieval strategy is retrieved only if, after the first two strategies have been processed, the number of elements in the recognition result memory has still not reached the specified number (the low-priority strategy also queries the knowledge graph with the strict full-match method); otherwise it is ignored. If a knowledge graph retrieval fails, the next query retrieval is performed. The recognition results in the memory are fed back to the user, realizing the recognition of the gold text. The whole ancient character recognition process of the present invention then ends.
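A toy end-to-end sketch of the recognition control unit, with the knowledge graph reduced to a dictionary; the sort keys, flag strings, and result quota are illustrative assumptions:

```python
# Toy knowledge graph: (radicals, structure) -> character
KG = {(("口", "犬"), "left-right"): "吠",
      (("日", "月"), "left-right"): "明"}

def query(kg, radicals, structure=None, top_k=2):
    # Full match uses both keys; fuzzy match ignores the structure
    if structure is not None:
        hit = kg.get((radicals, structure))
        return [hit] if hit else []
    return [ch for (r, _), ch in kg.items() if r == radicals][:top_k]

def recognize(candidates, kg, quota=3):
    # candidates: dicts with radicals / structure / strategy / confidences
    order = {"full_match": 0, "radical_priority_fuzzy": 1, "low_priority": 2}
    key = lambda c: (order[c["strategy"]], -c.get("c_q", c.get("c_r", 0)))
    results = []
    for c in sorted(candidates, key=key):
        if c["strategy"] == "low_priority" and len(results) >= quota:
            continue   # low priority runs only if the quota is not yet met
        structure = (None if c["strategy"] == "radical_priority_fuzzy"
                     else c["structure"])
        for ch in query(kg, c["radicals"], structure):
            results.append((c.get("c_q", 0), ch, c["strategy"]))
    return results
```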
Example two
A system for recognition of a golden text image based on pre-training of a skeletal model, the system comprising: the system comprises a data preprocessing module, a skeleton network pre-training module and an image recognition module;
the data preprocessing module is used for collecting the gold text image, preprocessing the gold text image and obtaining a preprocessed image to be recognized;
the noise reduction method is specifically as follows: shallow features are extracted with a 3×3 convolution layer with a leaky rectified linear unit (Leaky ReLU) to recover the detailed parts of the character image; a text enhancement model with gold rubbing image noise reduction modules denoises the depth features of the character image. The text enhancement model adopts a U-Net structured encoder-decoder, where both the encoder part and the decoder part contain several gold rubbing image noise reduction modules. The U-Net structure can effectively capture local and global information of the input feature map, improving model performance. At each stage of the encoder and decoder, the feature map is processed accordingly to extract a richer feature representation.
Splicing the feature images after passing through the maximum pooling layer and the average pooling layer to obtain spliced feature images; convolving the spliced feature images through a multi-scale convolution layer, splicing the convolved data after scale adjustment, and passing the spliced feature images through a pooling layer; and the noise reduction processing of the gold text image is realized. The denoised image is shown in figure 2.
As shown in fig. 4, the ancient rubbing image noise reduction module passes the input feature map through a max pooling layer and an average pooling layer and concatenates the results along the channel dimension. The max pooling layer selects the maximum value of each region as output, helping to extract salient features, while the average pooling layer takes the average value of each region as output, extracting overall features. Concatenating the two pooling results helps the model better extract both local and global features of the gold characters. The concatenated feature map is then convolved by a multi-scale convolution layer: convolution kernels of 1x1, 3x3, 5x5, and 7x7 are applied separately, since kernels of different scales capture different scale features in the input image. Larger kernels capture a wider range of global features, while smaller kernels are good at capturing local features and details. By combining convolutions of different scales, the model learns a richer feature representation. The multi-scale convolution outputs are scale-adjusted and concatenated together, and the learning of font features in the gold character image is completed through another pooling layer.
The formulas are as follows:

$y = \mathrm{Concat}\big(\mathrm{MaxPool}(x), \mathrm{AvgPool}(x)\big)$

$z = \mathrm{Concat}\big(\mathrm{Conv}_{1\times1}(y), \mathrm{Conv}_{3\times3}(y), \mathrm{Conv}_{5\times5}(y), \mathrm{Conv}_{7\times7}(y)\big)$

$\mathrm{Final} = \mathrm{Pool}(z)$

where MaxPool() and AvgPool() are max pooling and average pooling, respectively, and Concat() is the concatenation operation. $y$ is the feature map obtained after the feature map $x$ passes through max pooling and average pooling, and Conv1x1, Conv3x3, Conv5x5, Conv7x7 are the results of passing $y$ through 1x1, 3x3, 5x5, and 7x7 convolution kernels. The multi-scale convolution results are concatenated to obtain the feature map $z$, and $z$ yields the final output Final through the Pool() pooling operation.
In the forward propagation process, the input features first pass through the paleo-text rubbing image noise reduction module of the encoder section and then pass to the decoder section. At the encoder stage, the feature map may be scaled down by a downsampling operation to capture a more abstract representation of the features. At the decoder stage, the feature map is scaled up by an upsampling operation to fuse with the corresponding feature map from the encoder section.
The formula of the fusion operation is:

$F_{i+1} = \begin{cases} \mathrm{ONB}(F_i), & F_i \text{ in the encoder part} \\ \mathrm{ONB}\big(\mathrm{Up}(F_i) + F_{enc(i)}\big), & F_i \text{ in the decoder part} \end{cases}$

where $F_i$ represents the depth feature map of the ancient text image at the $i$-th ancient rubbing image noise reduction module and ONB() is the noise reduction operation of that module. When $F_i$ is in the encoder part, features are extracted by the noise reduction module. When $F_i$ is in the decoder part, the fusion operation is realized through a skip connection: the upsampled feature map $\mathrm{Up}(F_i)$ of the decoder part is added to the feature map $F_{enc(i)}$ of the corresponding encoder stage, realizing more accurate image reconstruction.
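A shape-level sketch of the encoder-decoder flow with skip-connection fusion, with ONB() passed in as a callable (the identity function below stands in for the real noise reduction module):

```python
import numpy as np

def unet_forward(x, onb, depth=2):
    # x: (C, H, W); onb: the rubbing-image noise reduction operation ONB()
    skips = []
    for _ in range(depth):                           # encoder
        x = onb(x)
        skips.append(x)                              # kept for the skip path
        x = x[:, ::2, ::2]                           # 2x downsampling
    for skip in reversed(skips):                     # decoder
        x = x.repeat(2, axis=1).repeat(2, axis=2)    # 2x upsampling
        x = onb(x + skip)                            # skip-connection fusion
    return x
```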
In the decoder part, the feature map is sequentially processed by a plurality of paleo-text rubbing image noise reduction modules. The purpose of transferring the feature map to the next paleo-text rubbing image denoising module is to further fuse the advanced feature information from the encoder portion while recovering richer spatial details using the upsampled feature map of the decoder portion. The operation is helpful to gradually restore the spatial resolution of the image while maintaining the feature map information, and finally, the extraction of the depth features of the gold characters and the learning of the character font structure are realized.
After noise reduction is achieved, the gold text components are split. Three types of labels are defined: component class labels, component coordinate labels, and structure class labels. Component coordinate labeling completes a bounding box containing the component. The structure classes comprise twelve categories: single structure, up-down structure, left-right structure, up-middle-down structure, left-middle-right structure, surrounding structure, up-down surrounding structure, left-right surrounding structure, triangular structure, left-part up-down structure, right-part up-down structure, and lower-part left-right structure. Gold characters can be divided into single-component characters comprising one component and multi-component characters comprising two or more components; a single-component character may itself evolve into a component.
When labeling the components, the component number k of the character is first obtained from the character decomposition dictionary using a component-number acquisition function. According to k, labeling follows one of two schemes. (1) For a single-component character with a component number of one, the character class label is set as the component class label and the structure class label is set to single structure. The component coordinates are identified with the YOLO target detection model: the detection model locates the single gold component in the gold sample image and returns (x_min, y_min), (x_max, y_max) and the class result for the component, where (x_min, y_min) and (x_max, y_max) are the coordinate points of the upper-left and lower-right corners of the rectangular bounding box, realizing automatic labeling of single-component characters. (2) For multi-component characters with more than one component, as shown in fig. 5, an information labeling method is adopted. Specifically, the category and position information of each component in the character image is labeled by two experts; if the two experts' labels agree, that labeling is used, and if they disagree, a senior expert is invited to review the component labels and determine the final label information. After the component categories and positions in the character image are labeled, the structure category of the character is labeled by the two experts; if their judgments disagree, the senior expert re-labels the structure category to determine the final result.
Through the labeling scheme, the manual labeling cost is saved under the condition of ensuring that a high-quality data set is obtained.
The gold characters are expanded with a splicing-and-synthesis strategy according to the font structures of the gold script and the components in the character decomposition dictionary. Synthesizing new gold character images increases the diversity of the training data, thereby improving the model's performance on gold character and radical recognition tasks. For the eleven structure types other than the single structure, a set of detailed procedures is designed to realize splicing and synthesis of gold characters, thereby expanding the gold dataset.
As shown in fig. 7, first, various structure types other than the individual structure are analyzed using a character structure template function, and the number of parts required for each structure template and the relative position information between the parts are acquired. The structural templates are helpful for knowing the composition rule of the golden characters and provide guidance for subsequent splicing and synthesis.
Then, a matching number of components is randomly drawn from the component dictionary according to the number of components in the corresponding structure template. To increase the diversity of the synthesized gold character samples, the component list is randomly manipulated with an enhancement processing function, e.g. enlarged, reduced, rotated, or warped. Enlarging or reducing components helps simulate gold character images of different sizes; rotating a component changes its orientation, improving the model's ability to learn character features at different angles; warping the component image simulates deformation of gold characters caused by factors such as the historical environment. These operations simulate gold character images of different sizes, angles, and deformations, helping to improve the generalization ability of the gold recognition model. After the random selection and enhancement of the components is completed, the position and size of each component in the synthesized character image are determined from the relative position information provided by the character structure template function. This step ensures that the structure of the synthesized character image conforms to the morphological rules of the gold script and makes it easier for the model to learn the compositional characteristics of the characters.
Finally, the final synthesized character image is produced by a character generating function from the enhanced component list and the relative position information of the structure template. These synthetic character images are added to the synthetic character image set, thereby expanding the gold text image dataset. Compared with the original gold character image dataset, the expanded dataset has reduced class imbalance, a larger total number of samples, and a more balanced distribution of data. Both the maximum and the minimum per-class sample counts increase markedly, which alleviates the long-tail problem in the gold image training task.
The skeleton model pre-training module is used for pre-training the skeleton model with the label-free ancient text image dataset to obtain the trained skeleton model;
data enhancement is realized by replacing gold text components with corresponding components from the component library, yielding data-enhanced images; the skeleton model is then pre-trained with the data-enhanced label-free ancient text image dataset to obtain the trained skeleton model. The method of replacing components from the component library is as follows. Owing to the particularity of gold text images, the invention provides a component-based data enhancement method. For an image $x$, component identification is first performed (with the YOLO-based component detection algorithm described above); if components present in the component library exist in the character, the corresponding components are replaced to realize data enhancement, yielding enhanced views $\tilde{x}_1$ and $\tilde{x}_2$. For ancient text images without components, random data enhancement $\mathrm{RandAug}$ is used to obtain $\tilde{x}_1$ and $\tilde{x}_2$. The formula is as follows:

$\tilde{x} = \mathrm{Mix}\big(f_{\det}(x),\ \mathrm{Rand}(\mathrm{Index}(B))\big)$
where $f_{\det}$ denotes the YOLO-based component detector, whose outputs denote the upper-left, lower-left, upper-right, and lower-right coordinate positions of the component in image $x$. $\mathrm{Mix}(\cdot,\cdot)$ is an image mixing operation with two inputs: the first is the four corner coordinates of the component in image $x$, and the second is a component image of the same class as that component drawn from the component library $B$. The purpose of this function is to replace a component in the original ancient character image with a same-class but differently shaped component taken from $B$. $\mathrm{Rand}(\cdot)$ is a random function whose purpose is to randomly select from $B$ a component of the current class but with a different shape. $\mathrm{Index}(\cdot)$ is an index function whose purpose is to index all components in $B$ of the same class as the current component.
Because of the complexity of ancient text images, some ancient characters are not composed of multiple components but consist of a single body or a special structure that is not in the component library $B$. In this case, the invention uses the random data enhancement method $\mathrm{RandAug}$, composed of 5 image enhancement operations: center cropping, random horizontal flipping, color jitter, random grayscale conversion, and Gaussian filtering. In a specific implementation, each operation is applied randomly with a different probability.
The pre-training method is specifically as follows:
Let $h \in \mathbb{R}^{d}$ be the vector after mean pooling, where $d$ is the pooled dimension. We define the two data-enhanced samples $\tilde{x}_1$ and $\tilde{x}_2$ obtained from the same original image $x$ as a positive sample pair; the data-enhanced samples of the other images in the same batch are treated as negative samples. The present invention uses cosine similarity to measure two image vectors. The contrastive loss of a positive sample pair $(\tilde{x}_1, \tilde{x}_2)$ with representations $(z_i, z_j)$ is defined as:

$\ell_{i,j} = -\log \dfrac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$

where $\mathbb{1}_{[k \neq i]}$ is an indicator function, $\tau$ is a temperature parameter, and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity of the L2-normalized vectors.
The image recognition module is used for inputting the preprocessed image to be recognized into the trained skeleton model, the character information reasoner and the character recognizer to obtain a recognition result of the character to be recognized.
The image recognition module further includes: a character depth feature extraction sub-module, a character information reasoning sub-module and a character recognition sub-module;
the character depth feature extraction submodule is used for inputting the preprocessed image to be recognized into the trained skeleton model to extract the character depth features of the gold text image;
the character information reasoning sub-module is used for inputting the character depth characteristic into the character information reasoning device to generate a candidate radical set and a candidate font structure set of the golden character image;
The character recognition sub-module is used for inputting the candidate radical set and the candidate font structure set into the character recognizer to realize the recognition of the golden text image.
Word depth feature extraction submodule
The character image to be recognized is input into a Character Image Deep Feature Extractor (CIDFE) network. The CIDFE network is composed of a plurality of feature extraction sublayers (CIDFE Blocks).
As shown in fig. 9, a set of text depth feature extraction blocks is designed in the CIDFE to extract depth features from the input image. Each CIDFE Block is composed of multiple Dual Attention Layers (DAL) and a Batch Normalization (BN) layer. As shown in fig. 9 b, the DAL is a common module in the field of computer vision; in the present invention it is mainly applied to solve the problem of boundary division between radicals. In the invention, residual connections (i.e., the output of the previous layer is added to the output of the current layer and used as the input of the next layer) are used between the multiple DALs in each CIDFE Block, which further enhances long-distance feature capture, allows a single Block to be made deeper, and builds a network with stronger feature extraction capability. Since the radicals and structures in characters are closely related, characters contain rich semantic information. The CIDFE network formed by stacking CIDFE Blocks can effectively learn the radical and font-structure features contained in characters, and thus extract deep global feature representations that are supplied to the downstream text reasoning module to infer possible radical and structure information.
Word information reasoning sub-module
Character depth features extracted by the CIDFE network are input into the text information reasoning module. The text information reasoning module comprises two main parts, a radical reasoner and a font structure reasoner, both of which take as input the deep global features extracted by the CIDFE network. From these deep global features, the text information reasoning module finally generates the possible radicals and font structures of the input character picture.
The radical reasoner is a model composed of four convolution layers and a fully connected layer, whose input dimension is determined by the S × S grids into which the input character image is divided and the K anchor boxes in each grid. The model is used to infer the radical information in the character image, including the radical category, the position coordinates, and the confidence of radical detection. Training of the CIDFE (character image depth feature extractor) is constrained by predicting the radical class and position.
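A YOLO-style head of this kind emits, per grid cell and anchor box, a box, an objectness confidence, and class scores. The decoding step can be sketched as follows; the output layout `(S, S, K, 5 + num_classes)` and the confidence threshold are assumptions, since the patent does not fix them:

```python
import numpy as np

def decode_radical_predictions(pred, conf_thresh=0.5):
    """Decode an assumed radical-reasoner output of shape
    (S, S, K, 5 + num_classes): per anchor box, (x, y, w, h, confidence)
    followed by class scores. Returns one dict per confident radical."""
    S = pred.shape[0]
    K = pred.shape[2]
    radicals = []
    for gy in range(S):
        for gx in range(S):
            for k in range(K):
                box = pred[gy, gx, k]
                conf = box[4]
                if conf < conf_thresh:
                    continue                       # discard low-confidence anchors
                cls = int(np.argmax(box[5:]))      # radical category
                # cell-relative (x, y) shifted to image-relative coordinates
                x = (gx + box[0]) / S
                y = (gy + box[1]) / S
                radicals.append({"class": cls, "x": x, "y": y,
                                 "w": box[2], "h": box[3], "conf": float(conf)})
    return radicals
```

The per-radical confidences collected here are exactly what the retrieval stage below aggregates into a radical-list confidence.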
The font structure reasoner is used to predict the font structure of the character image. It uses shallow features (i.e., the output features of the first CIDFE Block) together with deep features (the features extracted by the full CIDFE network, which include radical position information) to capture global and local structural information. The reasoner is composed of five convolution layers and a fully connected layer, which further process the concatenated shallow and deep features to generate the candidate font structures of the character to be recognized. The font structure reasoner constrains the training of the CIDFE by predicting the font structure.
Character recognition sub-module
In the second step, a set of candidate radicals and font structures is obtained through the radical reasoner and the font structure reasoner. The set includes the radical combinations, the font structures, and the confidence of the model output for this information. The radical information and font structure information of the character are then input to the character recognition module to complete the recognition of the character image. The core of the character recognition module is a recognition control unit, which cooperates with the text knowledge graph to complete the recognition of the character image. The recognition control unit comprises three core components: a query list selector, a retrieval strategy selector, and a recognition result memory.
Query list selector and retrieval strategy selector: after a picture passes through the text information reasoning module, M groups of radical combinations and N font structures are obtained. The query list selector combines the M radical groups and the N font structures by selecting one radical group and one font structure to form a list to be retrieved I_i, finally obtaining a set I with M × N elements. The query list selector then delivers the retrieval list set I to the retrieval strategy selector for analysis. The retrieval strategy selector analyzes the combinations sent by the query list selector one by one. This analysis is mainly done by the aggregate information analysis unit in the retrieval strategy selector. The specific analysis process is as follows:
in the first stage, the aggregate information analysis unit first calculates a confidence for each element I_i in the retrieval list set. I_i consists of a radical list and a font structure, so the overall confidence C_i of I_i is likewise divided into two parts, the radical list confidence C_r and the font structure confidence C_s. The radical list confidence is C_r = (1/n) Σ c_i, where n is the number of radicals in the list and c_i is the confidence of the radical at the i-th position (output by the radical reasoner in the second step). The font structure confidence C_s is the structure confidence output by the font structure reasoner in the second step. After obtaining C_r and C_s, C_i = α · C_r + (1 − α) · C_s, where α is an adjustable parameter; since the radical information is generally considered more important, α typically takes a value greater than 0.5.
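The confidence combination can be sketched in a few lines. The averaging rule for the radical-list confidence is a reconstruction from the garbled source, and α = 0.6 is merely an assumed value satisfying the stated constraint α > 0.5:

```python
def list_confidence(radical_confs, structure_conf, alpha=0.6):
    """Confidence of one candidate retrieval list:
    C = alpha * C_radical + (1 - alpha) * C_structure, where C_radical
    averages the per-radical confidences. Returns (overall, C_radical)."""
    c_radical = sum(radical_confs) / len(radical_confs)  # mean radical confidence
    overall = alpha * c_radical + (1 - alpha) * structure_conf
    return overall, c_radical
```

For example, radical confidences of 0.8 and 0.6 with a structure confidence of 0.5 give C_r = 0.7 and an overall confidence of 0.62 at α = 0.6.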
In the second stage, the aggregate information analysis unit analyzes the C_r and C_s of each element I_i. The analysis rules are as follows, given a radical threshold T_r and a structure threshold T_s. If both the C_r and the C_s of an element I_i are above the given thresholds, the retrieval strategy selector selects a full-match retrieval strategy for this candidate retrieval list. If the C_r of an element I_i is above its threshold while its C_s is below the given threshold, this candidate list is considered trustworthy on its radicals, and the retrieval strategy selector selects a fuzzy retrieval strategy with radical priority for it. If both the C_r and the C_s of an element I_i are below the given thresholds, this I_i is considered untrustworthy, and the retrieval strategy selector selects a low-priority retrieval strategy for this candidate retrieval list.
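The three-case rule maps directly to a small decision function. The threshold values are assumptions (the source gives only symbolic thresholds), and the case "radicals untrusted but structure trusted" is not covered by the source text, so it is folded into the low-priority branch here:

```python
def select_strategy(c_radical, c_structure, t_radical=0.7, t_structure=0.7):
    """Assign a retrieval strategy to one candidate list from its two
    confidences. Threshold values 0.7 are illustrative assumptions."""
    if c_radical >= t_radical and c_structure >= t_structure:
        return "full_match"                 # both parts trusted
    if c_radical >= t_radical:
        return "radical_priority_fuzzy"     # radicals trusted, structure not
    return "low_priority"                   # radicals not trusted
```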
In the third stage, after the aggregate information analysis unit has assigned a strategy to each candidate list, the whole candidate list set I is returned to the query list selector. At this point the work of the retrieval strategy selector ends.
Each query list in the new candidate query list set includes, in addition to the basic radical combination information and font structure information, the newly added confidence information and retrieval strategy information. The query list selector reorders the candidate query lists based on this information. The ordering rules are as follows: candidate query lists with the full-match retrieval strategy are sorted in descending order of overall confidence C_i; candidate query lists with the radical-priority fuzzy retrieval strategy are sorted in descending order of radical list confidence C_r; candidate query lists with the low-priority retrieval strategy are sorted in descending order of C_i.
At this time, the query list selector sequentially takes elements from the sorted candidate query list set and queries them in the knowledge graph. If a query succeeds, the confidence of the candidate list becomes the confidence of the query result. A candidate query list with the full-match retrieval strategy is queried strictly according to its radical combination and font structure; if the knowledge graph retrieval succeeds, the result is marked as a (result confidence, result value, result flag value) triple and added to the recognition result memory. For a candidate query list with the radical-priority fuzzy retrieval strategy, the structure information is ignored during the knowledge graph query, the first K candidate characters are queried directly from the knowledge graph according to the radical combination, and all of them are marked and added to the recognition result memory. A candidate query list with the low-priority retrieval strategy is retrieved only if, after the first two strategies have been retrieved, the number of elements in the recognition result memory has still not reached the specified number (the low-priority strategy also queries the knowledge graph with the strict full-match method); otherwise it is ignored. If a knowledge graph retrieval fails, the next query retrieval is performed. The recognition results placed in the memory are fed back to the user, realizing the recognition of the gold text. The whole ancient character recognition process of the present invention then ends.
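The control loop of the recognition control unit can be sketched end to end. The knowledge graph is modeled here as a plain dict keyed by (radical tuple, structure), which is an illustrative stand-in for the actual graph store; `top_k` and `required` are assumed parameters:

```python
def recognize(candidates, kg, top_k=3, required=3):
    """Sketch of the recognition control loop: sort candidates by strategy
    priority then confidence, query a toy knowledge graph, and collect
    (confidence, value, strategy) triples in the recognition result memory."""
    order = {"full_match": 0, "radical_priority_fuzzy": 1, "low_priority": 2}
    candidates = sorted(candidates,
                        key=lambda c: (order[c["strategy"]], -c["conf"]))
    results = []  # the recognition result memory
    for c in candidates:
        key = (tuple(c["radicals"]), c["structure"])
        if c["strategy"] in ("full_match", "low_priority"):
            if c["strategy"] == "low_priority" and len(results) >= required:
                continue          # low priority: only if the quota is not yet met
            if key in kg:         # strict full-match lookup
                results.append((c["conf"], kg[key], c["strategy"]))
        else:
            # radical-priority fuzzy: ignore structure, take first K matches
            hits = [v for (rads, _s), v in kg.items()
                    if rads == tuple(c["radicals"])]
            for v in hits[:top_k]:
                results.append((c["conf"], v, c["strategy"]))
    return results
```

A failed lookup simply falls through to the next candidate, matching the "if the knowledge graph retrieval fails, the next query retrieval is performed" behavior described above.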
The above embodiments are merely illustrative of the preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, but various modifications and improvements made by those skilled in the art to which the present invention pertains are made without departing from the spirit of the present invention, and all modifications and improvements fall within the scope of the present invention as defined in the appended claims.

Claims (8)

1. A method for recognizing a golden text image based on a pre-training of a skeleton model, the method comprising:
s1, acquiring a gold text image, and preprocessing the gold text image to obtain a preprocessed image to be identified;
s2, pre-training the skeleton model by using a label-free ancient text image dataset to obtain a trained skeleton model;
s3, inputting the preprocessed image to be recognized into the trained skeleton model, the word information reasoner and the word recognizer to obtain a recognition result of the word to be recognized;
in the step S2, the method for pre-training the skeleton model by using the label-free ancient text image dataset to obtain the trained skeleton model comprises the following specific steps:
carrying out data enhancement on the unlabeled ancient text image dataset to obtain an enhanced unlabeled ancient text image dataset;
Performing skeleton model pre-training by using the enhanced label-free ancient text image dataset to obtain a trained skeleton model;
the method for acquiring the enhanced label-free ancient text image data set specifically comprises the following steps:
performing component recognition on the unlabeled ancient text image x with a YOLO-based component detector against the gold character component library; if the corresponding component is recognized in the component library, replacing the corresponding component to realize data enhancement and obtain the enhanced views x1 and x2; if no corresponding component is found by the query, using random data enhancement to obtain the enhanced x1 and x2;
wherein D(·) represents the YOLO-based component detector, Ind(·) represents an indicator function, and Lib represents the component library; the detector outputs the coordinate positions of the upper-left, lower-left, upper-right and lower-right corners of the component in the image x; Mix(·) is the image blending operation.
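The enhancement rule of claim 1 can be sketched as below. `detect` stands in for the YOLO-based component detector D(·), `component_library` for Lib, and `blend` for the Mix(·) operation; the box format, the 50/50 blending weight, and all names are illustrative assumptions:

```python
import random
import numpy as np

def blend(image, patch, box):
    """Mix(): paste `patch` into box = (y0, y1, x0, x1) with 50/50 blending."""
    out = image.copy()
    y0, y1, x0, x1 = box
    out[y0:y1, x0:x1] = 0.5 * out[y0:y1, x0:x1] + 0.5 * patch
    return out

def augment(image, detect, component_library, random_augs):
    """Sketch of the claim-1 rule: if the detected component exists in the
    component library, blend replacement components in to get two enhanced
    views x1, x2; otherwise fall back to random data enhancement."""
    component = detect(image)                 # e.g. (comp_id, box) or None
    if component is not None and component[0] in component_library:
        comp_id, box = component
        repls = component_library[comp_id]
        x1 = blend(image, repls[0], box)      # two views from library replacements
        x2 = blend(image, repls[-1], box)
    else:
        # component not found in the library: random augmentation fallback
        x1 = random.choice(random_augs)(image)
        x2 = random.choice(random_augs)(image)
    return x1, x2
```

The two views x1 and x2 produced here are exactly the positive pair consumed by the contrastive pre-training of claim 5.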
2. The method for recognizing a golden text image based on the pre-training of a skeleton model according to claim 1, wherein in S1, the preprocessing mainly comprises the steps of noise reduction, component splitting and expansion of a golden text data set for the golden text image.
3. The method for recognizing a golden text image based on the pre-training of a skeleton model according to claim 2, wherein the method for reducing noise specifically comprises the following steps:
extracting shallow features of the gold text image using a 3 × 3 convolution layer with a leaky rectified linear unit, and recovering the detail parts of the gold text image to obtain an initial feature map;
and using a character enhancement model with the gold-rubbing-image noise reduction method to reduce the noise of the depth features of the initial feature map.
4. The method for recognizing a golden text image based on the pre-training of a skeleton model according to claim 3, wherein the method for denoising the image with the golden text rubbing specifically comprises the following steps:
splicing the initial feature map in the dimension of the channel after passing through a maximum pooling layer and an average pooling layer to obtain a spliced feature map;
performing multi-scale convolution on the spliced feature images to obtain a scale convolution feature image;
and performing scale adjustment on the scale convolution feature images, then splicing, and performing pooling layer on the spliced scale convolution feature images to complete learning of the font features in the golden text images so as to realize noise reduction treatment on the golden text images.
5. The skeleton model pre-training-based golden text image recognition method according to claim 1, wherein the method for performing skeleton model pre-training by using the enhanced label-free golden text image dataset to obtain a trained skeleton model specifically comprises the following steps:
subjecting the unlabeled ancient text image x to data enhancement to obtain the enhanced unlabeled views x1 and x2;
passing the unlabeled ancient text image views x1 and x2 through the skeleton model f(·) to obtain the corresponding characterization vectors h1 = f(x1) and h2 = f(x2);
denoting by z_i the characterization vector after average pooling, measuring two image vectors by cosine similarity sim(z_i, z_j) = z_i · z_j / (‖z_i‖ ‖z_j‖), and defining the contrastive loss for the positive pair (z_i, z_j) as:
L(i, j) = −log [ exp(sim(z_i, z_j) / τ) / Σ_{k=1}^{2N} 1[k ≠ i] · exp(sim(z_i, z_k) / τ) ],
wherein 1[k ≠ i] is an indicator function, τ is the temperature parameter, and ‖·‖ denotes L2 normalization.
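This is the standard NT-Xent form of contrastive loss over a batch of 2N embeddings, and can be sketched as follows; the convention that rows 2k and 2k+1 are the two views of the same image, and the value τ = 0.5, are assumptions:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Contrastive loss over 2N embeddings, where rows 2k and 2k+1 are
    the two enhanced views of the same image. tau is the temperature."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2 normalization
    sim = z @ z.T / tau                               # pairwise cosine / tau
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                    # 1[k != i]: drop self-pairs
    # log-softmax over each row's similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = np.arange(n) ^ 1                            # index of each row's partner
    return -log_prob[np.arange(n), pos].mean()
```

Well-aligned view pairs drive the loss toward zero; unrelated pairs in the denominator push the embeddings of different characters apart.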
6. The method for recognizing the golden text image based on the pre-training of the skeleton model according to claim 1, wherein the method for inputting the pre-processed image to be recognized into the trained skeleton model, the text information reasoner and the text recognizer to obtain the recognition result of the text to be recognized specifically comprises the following steps:
inputting the preprocessing image to be identified into the trained skeleton model to extract character depth characteristics of the gold text image;
inputting the character depth features into the text information reasoner to generate a candidate radical set and a candidate font structure set of the golden text image;
and inputting the candidate radical set and the candidate font structure set into the character recognizer to recognize the golden text image.
7. The method for recognizing a golden text image based on pre-training of a skeleton model according to claim 6, wherein the method for inputting the candidate radical set and the candidate font structure set into the text recognizer to recognize the golden text image specifically comprises:
constructing, with a query list selector, combinations from the candidate radical set and the candidate font structure set by selecting one group of radicals and one font structure to form a list to be retrieved I_i, obtaining a set I of M × N elements;
transmitting the retrieval list set I to a retrieval strategy selector for analysis one by one to obtain a candidate list set, and returning the candidate list set I to the query list selector for reordering;
the recognition result memory sequentially selects elements from the sorted candidate query list to query in the knowledge graph to obtain a knowledge graph retrieval recognition result, and the retrieval recognition result is stored to realize the recognition of the gold text image.
8. A skeleton model pre-training based golden text image recognition system for implementing the skeleton model pre-training based golden text image recognition method of any one of claims 1-7, the system comprising: a data preprocessing module, a skeleton network pre-training module and an image recognition module;
The data preprocessing module is used for collecting the gold text image, preprocessing the gold text image and obtaining a preprocessed image to be recognized;
the skeleton model pre-training module is used for pre-training the skeleton model by using the label-free ancient text image dataset to obtain a trained skeleton model;
the image recognition module is used for inputting the preprocessed image to be recognized into the trained skeleton model, the character information reasoner and the character recognizer to obtain a recognition result of the character to be recognized.
CN202410071885.8A 2024-01-18 2024-01-18 Method and system for recognizing gold text image based on skeleton model pre-training Active CN117593755B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410071885.8A CN117593755B (en) 2024-01-18 2024-01-18 Method and system for recognizing gold text image based on skeleton model pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410071885.8A CN117593755B (en) 2024-01-18 2024-01-18 Method and system for recognizing gold text image based on skeleton model pre-training

Publications (2)

Publication Number Publication Date
CN117593755A CN117593755A (en) 2024-02-23
CN117593755B true CN117593755B (en) 2024-04-02

Family

ID=89916994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410071885.8A Active CN117593755B (en) 2024-01-18 2024-01-18 Method and system for recognizing gold text image based on skeleton model pre-training

Country Status (1)

Country Link
CN (1) CN117593755B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664996A (en) * 2018-04-19 2018-10-16 厦门大学 A kind of ancient writing recognition methods and system based on deep learning
CN111985462A (en) * 2020-07-28 2020-11-24 天津恒达文博科技股份有限公司 Ancient character detection, identification and retrieval system based on deep neural network
CN112613348A (en) * 2020-12-01 2021-04-06 浙江华睿科技有限公司 Character recognition method and electronic equipment
WO2021206338A1 (en) * 2020-04-09 2021-10-14 엘케이시스(주) Method and device for recognizing characters on cargo container
CN114092700A (en) * 2021-11-25 2022-02-25 吉林大学 Ancient character recognition method based on target detection and knowledge graph
CN116935411A (en) * 2023-09-18 2023-10-24 吉林大学 Radical-level ancient character recognition method based on character decomposition and reconstruction


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Ancient Script Research Based on Artificial Intelligence Technology" (基于人工智能技术的古文字研究); Li Chuntao et al.; Jilin University Journal of Social Sciences (吉林大学社会科学学报); March 2023; vol. 63, no. 2; full text *
"Bronze Inscription Image Recognition Based on Mixed Features of Histogram of Oriented Gradients and Gray-Level Co-occurrence Matrix" (基于方向梯度直方图和灰度共生矩阵混合特征的金文图像识别); Zhao Ruoqing, Wang Huiqin, Wang Ke, Wang Zhan, Liu Wenteng; Laser & Optoelectronics Progress (激光与光电子学进展); 25 June 2020 (12); full text *

Also Published As

Publication number Publication date
CN117593755A (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN111723585B (en) Style-controllable image text real-time translation and conversion method
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN112966684A (en) Cooperative learning character recognition method under attention mechanism
CN113673338B (en) Automatic labeling method, system and medium for weak supervision of natural scene text image character pixels
CN109740686A (en) A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features
CN112966691A (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN111563563B (en) Method for enhancing combined data of handwriting recognition
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
CN110674777A (en) Optical character recognition method in patent text scene
CN112950477A (en) High-resolution saliency target detection method based on dual-path processing
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
CN113516116B (en) Text detection method, system and medium suitable for complex natural scene
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN114973222A (en) Scene text recognition method based on explicit supervision mechanism
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN116977844A (en) Lightweight underwater target real-time detection method
Li A deep learning-based text detection and recognition approach for natural scenes
CN117726809A (en) Small sample semantic segmentation method based on information interaction enhancement
CN117593755B (en) Method and system for recognizing gold text image based on skeleton model pre-training
CN116597275A (en) High-speed moving target recognition method based on data enhancement
CN115205624A (en) Cross-dimension attention-convergence cloud and snow identification method and equipment and storage medium
CN112329389B (en) Chinese character stroke automatic extraction method based on semantic segmentation and tabu search
CN113837015A (en) Face detection method and system based on feature pyramid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant