CN116503876A - Training method and device of image recognition model, and image recognition method and device


Info

Publication number
CN116503876A
CN116503876A
Authority
CN
China
Prior art keywords
image
recognition model
image recognition
model
sample text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310464514.1A
Other languages
Chinese (zh)
Inventor
赵星然
李亚东
王洪彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310464514.1A
Publication of CN116503876A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V30/19153 Recognition using rules for classification or partitioning the feature space
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document

Abstract

Embodiments of this specification provide a training method and apparatus for an image recognition model, and an image recognition method and apparatus. The training method includes: acquiring a sample text image and a sample text label of the sample text image; determining, through the image recognition model, a first global feature of the visual dimension corresponding to the sample text image, and determining, through a text recognition model, a second global feature of the language dimension corresponding to the sample text label; and iteratively training the image recognition model according to the first global feature and the second global feature until a target image recognition model satisfying the training-end condition is obtained. The recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively. Because the training stage iterates on both global features, the model acquires both visual and language encoding capability, improving recognition efficiency and accuracy.

Description

Training method and device of image recognition model, and image recognition method and device
Technical Field
Embodiments of this specification relate to the field of artificial intelligence, and in particular to a training method for an image recognition model.
Background
With the continuous development of computer technology, image recognition has become increasingly important, and text recognition, a branch of image recognition, is applied ever more widely. Text recognition performs image recognition on a text image to extract the text it contains, for example the answer content in a test paper or the changed content in a contract. However, current image recognition only converts the image into text, so recognition accuracy is limited. There is therefore a need for an image recognition method with high recognition accuracy and high recognition speed.
Disclosure of Invention
In view of this, embodiments of this specification provide an image recognition model training method and an image recognition method. One or more embodiments of this specification also relate to an image recognition model training apparatus, an image recognition apparatus, a computing device, a computer-readable storage medium, and a computer program, to address the technical shortcomings of the prior art.
According to a first aspect of embodiments of the present disclosure, there is provided an image recognition model training method, including:
acquiring a sample text image and a sample text label of the sample text image;
determining a first global feature of the visual dimension corresponding to the sample text image through an image recognition model, and determining a second global feature of the language dimension corresponding to the sample text label through a text recognition model;
performing iterative training on the image recognition model according to the first global feature and the second global feature until a target image recognition model meeting the model training ending condition is obtained;
wherein the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
According to a second aspect of embodiments of the present specification, there is provided an image recognition method, including:
acquiring an image to be recognized, and inputting the image to be recognized into a target image recognition model obtained through training by the image recognition model training method described above;
and obtaining the recognition result of the image to be recognized output by the target image recognition model, wherein the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
According to a third aspect of embodiments of the present specification, there is provided an image recognition method including:
receiving a claim material image uploaded by a target user for a target item;
inputting the claim material image into a target image recognition model obtained through training of the image recognition model training method;
obtaining the claim text information of the claim material image output by the target image recognition model, wherein the claim text information fuses semantic features corresponding to visual dimensions and language dimensions respectively;
and determining a claim settlement result according to the claim text information and feeding the result back to the target user.
According to a fourth aspect of embodiments of the present specification, there is provided an image recognition model training apparatus, comprising:
an acquisition module configured to acquire a sample text image and a sample text label of the sample text image;
a determining module configured to determine a first global feature of the visual dimension corresponding to the sample text image by an image recognition model and a second global feature of the language dimension corresponding to the sample text label by a text recognition model;
the training module is configured to perform iterative training on the image recognition model according to the first global feature and the second global feature until a target image recognition model meeting a model training ending condition is obtained;
wherein the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
According to a fifth aspect of embodiments of the present specification, there is provided an image recognition apparatus comprising:
an input module configured to acquire an image to be recognized and input the image to be recognized into a target image recognition model obtained through training by the image recognition model training method described above;
an output module configured to obtain the recognition result of the image to be recognized output by the target image recognition model, wherein the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
According to a sixth aspect of embodiments of the present specification, there is provided an image recognition apparatus comprising:
the receiving module is configured to receive the claim material image uploaded by the target user aiming at the target item;
an input module configured to input the claim material image to a target image recognition model obtained by training of the image recognition model training method;
the obtaining module is configured to obtain the claim text information of the claim material image output by the target image recognition model, wherein the claim text information fuses semantic features corresponding to visual dimension and language dimension respectively;
and a feedback module configured to determine a claim settlement result according to the claim text information and feed the result back to the target user.
According to a seventh aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer executable instructions, and the processor is configured to execute the computer executable instructions, where the computer executable instructions when executed by the processor implement the steps of the image recognition model training method and the image recognition method described above.
According to an eighth aspect of embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the image recognition model training method, the image recognition method described above.
According to a ninth aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the image recognition model training method, the image recognition method described above.
This specification provides an image recognition model training method, including: acquiring a sample text image and a sample text label of the sample text image; determining a first global feature of the visual dimension corresponding to the sample text image through an image recognition model, and determining a second global feature of the language dimension corresponding to the sample text label through a text recognition model; and iteratively training the image recognition model according to the first global feature and the second global feature until a target image recognition model satisfying the training-end condition is obtained, wherein the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
In the embodiments of this specification, the first global feature of the visual dimension is determined through the image recognition model during training, the second global feature of the language dimension is determined through the text recognition model, and the image recognition model is iteratively trained on both global features. The recognition result of the trained target image recognition model therefore fuses the semantic features of the visual and language dimensions, giving the model both visual and language encoding capability and improving recognition accuracy. During subsequent image recognition, the recognition can be completed with the image recognition model alone, which improves recognition efficiency.
Drawings
FIG. 1 is a schematic diagram of an image recognition model training method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of an image recognition model training method according to one embodiment of the present disclosure;
FIG. 3 is a flowchart of a process of an image recognition model training method according to one embodiment of the present disclosure;
FIG. 4 is a flowchart of an image recognition method according to one embodiment of the present disclosure;
FIG. 5 is a flow chart of an image recognition method provided in one embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an image recognition model training apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of another image recognition apparatus according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth to provide a thorough understanding of this specification. However, this specification can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; this specification is therefore not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second", and similarly "second" as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Furthermore, it should be noted that user information (including, but not limited to, user equipment information and user personal information) and data (including, but not limited to, data for analysis, stored data, and presented data) involved in one or more embodiments of this specification are information and data authorized by the user or fully authorized by all parties; the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to grant or deny authorization.
First, terms related to one or more embodiments of the present specification will be explained.
OCR: optical character recognition optical character recognition is a technique for extracting and converting characters on an electronic document into computer-readable text data by computer vision techniques.
ABINet: autonomos, bidirectional and Iterative net, autonomous, bi-directional and iterative networks, ABINet is a natural scene recognition model with Autonomous, bi-directional and iterative properties.
Vision-LAN: vision-Language, visual Language network directly gives visual model Language ability, regards visual and Language model as a whole, because Language information is together with visual feature acquisition, does not need extra Language model, vision-LAN can self-adaptation consider Language information to strengthen visual feature, and then reaches higher discernment rate of accuracy.
At present, in application scenarios such as extracting information from pictures and comparing contracts, an image often needs to be recognized to accurately obtain the text content and text coordinates in the image, so that image information can be extracted or compared; the accuracy of the text coordinates is particularly important in this process. Conventional image recognition performs only visual prediction on the image, and recognition accuracy suffers from the lack of semantic information; ABINet and Vision-LAN were proposed to address this. However, ABINet and Vision-LAN each have the following problems:
ABINet consists of a visual branch and a language branch, and its overall flow has three steps: 1. the visual branch predicts high-order semantic features and a text recognition result; 2. the text recognition result predicted by the visual branch is sent to the pre-trained language branch for iterative error correction, yielding the high-order semantic features predicted by the language branch; 3. the high-order semantic features of the visual and language branches are fused, and the final prediction is obtained through a classifier. The iterative error correction of the language branch makes training and inference very time-consuming, and the fusion of the two models makes the memory footprint too large to meet lightweight project requirements.
Vision-LAN has a single visual branch, and its overall flow has two steps: 1. a weakly supervised model is trained for character-level segmentation; using the segmentation result, one character in the input is masked, and the training target is to predict the complete text while that character is occluded, so that the model learns global features; 2. at inference, no character needs to be masked. However, the Vision-LAN training process is overly complex and depends heavily on the weakly supervised segmentation model; its English vocabulary is small, making it unsuitable for long-text recognition; and it does not fully exploit language information, only realizing the conversion from image to text, where the difficulty of converting between the two modalities increases the difficulty of recognition.
Based on the above, this specification provides an image recognition model training method that endows the visual branch with language perception capability during training, while the inference stage keeps a single visual branch, adding no extra memory footprint or inference time. This specification also relates to an image recognition method and apparatus, an image recognition model training apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments.
Referring to fig. 1, fig. 1 is a schematic diagram of an image recognition model training method according to an embodiment of the present disclosure. The sample text image and the sample text label are the training samples for the image recognition model. The sample text image may be any image containing text, such as a poster or a contract; the sample text label is the real text contained in the sample text image, that is, after the sample text image is input into the trained image recognition model, the model's prediction should be the sample text label.
The image recognition model may be a pre-trained visual model that outputs a predicted text result. During prediction, a first global feature extracted from the sample text image can be obtained; this first global feature is a feature of the sample text image in the visual dimension and lacks the enhancement of pure language information. The sample text label is input into a pre-trained text recognition model, which outputs a second global feature; this second global feature is extracted from plain text and is pure language information. The first global feature of the image recognition model can therefore be supervised by the second global feature: the image recognition model is iteratively trained according to the first and second global features, so that the trained target image recognition model has the encoding and text prediction capabilities of both the language and visual modalities. At the image recognition stage, recognition is completed with the image recognition model alone, reducing computation and memory usage and improving recognition efficiency.
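To make this flow concrete, the following is a minimal PyTorch sketch of one training step under this scheme. It assumes a `visual_model` that returns the first global feature together with per-character logits, and a frozen pre-trained `text_model` that maps the label's token ids to the second global feature; these names, the shapes, and the MSE feature supervision with weighted fusion are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def training_step(visual_model, text_model, optimizer, images, label_ids,
                  alpha: float = 0.5, beta: float = 0.5):
    # Visual branch: first global feature (visual dimension) + text prediction.
    first_global, logits = visual_model(images)          # (B, D), (B, L, vocab)
    # Language branch: second global feature (language dimension). The text
    # recognition model only supervises the visual branch; it is not updated.
    with torch.no_grad():
        second_global = text_model(label_ids)            # (B, D)
    # Supervise the visual global feature with the pure-language global feature.
    mse = F.mse_loss(first_global, second_global)
    # Supervise the predicted text with the sample text label.
    ce = F.cross_entropy(logits.flatten(0, 1), label_ids.flatten())
    loss = alpha * ce + beta * mse                       # weighted fusion of the two losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference only `visual_model` is kept, which is what removes the extra memory and latency of a language branch.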
Referring to fig. 2, fig. 2 shows a flowchart of an image recognition model training method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 202: acquiring a sample text image and a sample text label of the sample text image.
In one embodiment of this specification, the execution subject may be a terminal device, which may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), or a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal), and fixed terminals such as a digital TV or a desktop computer; this is not particularly limited.
Specifically, there are multiple sample text images, and a sample text image can be understood as a text image in a natural scene, or an image region within such a text image. A text image in a natural scene may be a poster image, a logo image, an advertisement image, a street-view image, etc.; the shape and layout of the text in such images are complex and irregular, for example, the text in a poster image may be handwritten or dense. A sample text label can be understood as the text contained in the sample text image; for example, if the sample text image is a movie poster on which the words "I love me" are written, the sample text label corresponding to that image is the text "I love me". The model can be trained with the sample text images and the sample text labels, so as to support various recognition tasks.
In a specific embodiment of this specification, a sample text image and its corresponding sample text label are obtained: the sample text image is a contract document, and the sample text label is the contract content text in that document. The user wants to train an image recognition model that extracts the contract content text, so that contracts can be compared based on the extracted text.
Further, since a text image may contain regions without text, the text image may be preprocessed to facilitate recognition by the subsequent model. Specifically, acquiring a sample text image and a sample text label of the sample text image includes: acquiring an initial sample text image and a sample text label, and preprocessing the initial sample text image to obtain an intermediate sample text image; and determining a recognition object according to the model recognition information of the image recognition model, and selecting, from the intermediate sample text image, an image region containing the recognition object as the sample text image.
The initial sample text image can be understood as the original, unpreprocessed sample text image, which may contain redundant noise that affects the accuracy of subsequent image recognition. Preprocessing includes, but is not limited to, contrast enhancement, noise reduction, and image segmentation; the intermediate sample text image is the text image obtained after the initial sample text image is preprocessed. Since the intermediate sample text image may still include regions without text, it is further cropped so that the image region containing the recognition object is extracted. The model recognition information of the image recognition model can be understood as the model's recognition target information; it records what the model recognizes. For example, if the image recognition model recognizes character text in text images, its recognition object is the text in the image; if it recognizes logo text, its recognition object is the logo pattern in the image.
In practical application, the initial sample text image may be any text image in a natural scene. Such images are large and contain sizable regions without text, so the initial sample text image can first be preprocessed to eliminate redundant noise; the recognition object is then determined from the model recognition information of the image recognition model, the image region corresponding to the recognition object is cropped from the intermediate sample text image, and the cropped region is used as the sample text image. To reduce the computation spent on preprocessing, the initial sample text image may instead be cropped according to the recognition object first, and the cropped image region preprocessed afterwards.
In a specific embodiment of this specification, an initial sample text image and a sample text label are obtained: the initial sample text image is a contract document image, and the sample text label is the text content of the contract. The initial sample text image is preprocessed to eliminate interference such as shadow noise and horizontal-line noise, and the intermediate sample text image is obtained from the preprocessing result. The recognition object is determined to be characters according to the model recognition information of the image recognition model, the intermediate sample text image is cropped accordingly, the image region containing the contract text content is selected as the sample text image, and the image recognition model is trained with the sample text image and the sample text label.
In this way, preprocessing and cropping remove interference noise from the text image and discard invalid image regions that contain no recognition object, which facilitates the subsequent training of the image recognition model and improves training efficiency.
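As an illustration only, the preprocessing and cropping described above might look as follows with OpenCV; the specific operators (non-local-means denoising, histogram equalization, Otsu binarization, a bounding box over ink pixels) are our assumptions, since the embodiments do not prescribe particular algorithms.

```python
import cv2
import numpy as np

def preprocess(initial_image: np.ndarray) -> np.ndarray:
    """Initial sample text image -> intermediate sample text image."""
    gray = cv2.cvtColor(initial_image, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray)        # noise reduction
    return cv2.equalizeHist(denoised)                # contrast enhancement

def crop_to_recognition_object(intermediate: np.ndarray) -> np.ndarray:
    """Select the image region containing the recognition object (text)."""
    _, binary = cv2.threshold(intermediate, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = cv2.findNonZero(binary)                 # all ink pixels
    x, y, w, h = cv2.boundingRect(coords)
    return intermediate[y:y + h, x:x + w]            # sample text image
```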
Step 204: determining a first global feature of the visual dimension corresponding to the sample text image through an image recognition model, and determining a second global feature of the language dimension corresponding to the sample text label through a text recognition model.
The first global feature can be understood as a global semantic feature that the image recognition model extracts from the image. The image recognition model first performs visual feature extraction on the text image to be recognized, that is, it converts the image information into abstract features: the image information is what a person can perceive, while the abstract features, including the first global feature, are what the computer can recognize and process. The second global feature can be understood as a global semantic feature that the text recognition model extracts from text; it is obtained from plain text and is pure language information.
In practical application, the global semantic information of the characters can be encoded by a pre-trained language model to obtain the second global feature of the language dimension corresponding to the sample text label, and the global semantic information of the image can be encoded by the pre-trained visual model to obtain the first global feature of the visual dimension corresponding to the sample text image.
In a specific embodiment of this specification, the first global feature of the visual dimension corresponding to the contract text image is determined through the image recognition model, and the second global feature of the language dimension corresponding to the contract content text is determined through the text recognition model; the image recognition model can then be trained based on the first and second global features.
Further, to accurately extract the first global feature from the sample text image, it can be obtained through image feature extraction followed by visual encoding. Specifically, determining the first global feature of the visual dimension corresponding to the sample text image through the image recognition model includes: inputting the sample text image into the image recognition model; performing image feature extraction on the sample text image through an image convolution unit in the image recognition model to obtain the image features of the sample text image; and visually encoding the image features through an image transformation unit in the image recognition model to obtain the first global feature of the visual dimension corresponding to the sample text image.
The image convolution unit can be understood as a convolutional neural network (CNN); it extracts from the sample text image the image features representing the image information the sample carries, concretely a feature vector or feature map of the target image. After the image features are obtained, they can be visually encoded by the image transformation unit, which can be understood as a transformer network, yielding the first global feature of the visual dimension corresponding to the sample text image.
In practical application, the visual model, i.e., the image recognition model, can be built as CNN + transformer: the image convolution unit extracts the image features of the sample text image, and the image transformation unit then visually encodes those features to obtain the first global feature of the visual dimension corresponding to the sample text image.
In a specific embodiment of this specification, the image convolution unit in the image recognition model extracts the image features of the contract text image, and the image transformation unit visually encodes those features to obtain the first global feature of the visual dimension corresponding to the contract text image.
In this way, the image convolution unit and the image transformation unit of the image recognition model produce the first global feature of the visual dimension corresponding to the sample text image, from which the model's predicted text can later be decoded.
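A compact sketch of such a CNN + transformer image recognition encoder is shown below; the layer sizes and the mean pooling of the encoded sequence into the first global feature are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageRecognitionEncoder(nn.Module):
    """Image convolution unit (CNN) + image transformation unit (transformer)."""
    def __init__(self, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.cnn = nn.Sequential(                    # image convolution unit
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(image)                      # image features, (B, D, H', W')
        seq = feats.flatten(2).transpose(1, 2)       # (B, H'*W', D)
        encoded = self.transformer(seq)              # visual encoding
        return encoded.mean(dim=1)                   # first global feature, (B, D)
```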
Further, to extract image features accurately, image feature extraction may comprise appearance feature extraction and position feature extraction for the recognition object in the sample text image. Specifically, performing image feature extraction on the sample text image through the image convolution unit in the image recognition model to obtain the image features of the sample text image includes: performing appearance feature extraction and position feature extraction on the text characters in the sample text image through the image convolution unit, obtaining the character appearance features and character position features corresponding to the text characters; and fusing the character appearance features and the character position features to obtain character fusion features, which serve as the image features of the sample text image.
The text characters in the sample text image can be understood as its text content. Appearance feature extraction extracts the appearance of each character, i.e., features describing what the character looks like; position feature extraction extracts features describing each character's position, where the position of any one character means: given that the sample text image contains multiple characters, which one of those characters it is.
In practical application, after the character appearance features and character position features are obtained, they can be fused into the image features of the sample text image; the fusion may be performed by an encoder, and the resulting character fusion features serve as the image features.
In a specific embodiment of this specification, the image convolution unit in the image recognition model performs appearance feature extraction and position feature extraction on the contract text characters in the contract text image to obtain their character appearance features and character position features, and an encoder fuses the two into character fusion features, which serve as the image features of the contract text image.
In this way, the character appearance features and character position features are extracted separately and fused, the character fusion features obtained from the fusion serve as the image features of the sample text image, and these image features can subsequently be visually encoded to obtain the visual features of the sample text image.
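One plausible realization of this fusion, sketched under our own assumptions rather than taken from the patent, treats the per-position features from the CNN as the character appearance features and a learned positional embedding as the character position features, fusing them by addition:

```python
import torch
import torch.nn as nn

class CharacterFeatureFusion(nn.Module):
    def __init__(self, d_model: int = 256, max_positions: int = 512):
        super().__init__()
        # Character position features: which character in the sequence this is.
        self.position = nn.Embedding(max_positions, d_model)

    def forward(self, appearance: torch.Tensor) -> torch.Tensor:
        # appearance: (B, L, D) per-position character appearance features.
        idx = torch.arange(appearance.size(1), device=appearance.device)
        return appearance + self.position(idx)       # character fusion features
```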
Further, to obtain the second global feature of the language dimension corresponding to the sample text label, the sample text must be embedded. Specifically, determining the second global feature of the language dimension corresponding to the sample text label through the text recognition model includes: determining the text vector sequence corresponding to the sample text label; and encoding the text vector sequence through a text transformation unit in the text recognition model to obtain the second global feature of the language dimension corresponding to the sample text label.
The text vector sequence can be understood as the sequence obtained after vectorizing the sample text label; it contains the token vector corresponding to each word in the sample text label. By inputting the text vector sequence into the text recognition model, the second global feature output by the model, i.e., the global feature of the text, is obtained.
In practical applications, the text data faced by natural language processing is unstructured and of variable length, while machine learning algorithms consume fixed-length inputs and outputs, so machine learning cannot process raw text directly; the text must first be converted into numbers such as vectors. In implementation, the text can be vectorized and encoded with, for example, a one-hot model, a bag-of-words model, or an n-gram model.
In a specific embodiment of this specification, the sample text label is vectorized to obtain its text vector sequence, and the text vector sequence is input into the text recognition model to obtain the second global feature of the language dimension output by the model.
In this way, after the sample text label is vectorized and the resulting text vector sequence is input into the text recognition model, the second global feature output by the text recognition model can be obtained.
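The following sketch shows one way the label could be vectorized and encoded; the character-level vocabulary and the mean pooling are illustrative assumptions, and any of the vectorization schemes named above could stand in for the embedding.

```python
import torch
import torch.nn as nn

class TextRecognitionEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # one token vector per character
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        vectors = self.embed(token_ids)              # text vector sequence, (B, L, D)
        encoded = self.transformer(vectors)          # text transformation unit
        return encoded.mean(dim=1)                   # second global feature, (B, D)

# Usage with a hypothetical character vocabulary `vocab`:
# token_ids = torch.tensor([[vocab[c] for c in sample_text_label]])
# second_global = text_encoder(token_ids)
```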
Step 206: performing iterative training on the image recognition model according to the first global feature and the second global feature until a target image recognition model meeting the model training ending condition is obtained; the recognition result of the target image recognition model is fused with semantic features respectively corresponding to the visual dimension and the language dimension.
After the first global feature and the second global feature are obtained, the first global feature of the image recognition model can be supervised by the second global feature of the text recognition model, so that the image recognition model gains the encoding and text prediction capabilities of both the language and visual modalities.
In practical application, the image recognition model is iteratively trained with the first and second global features until a target image recognition model satisfying the training-end condition is obtained; the recognition result output by that model then fuses the visual-dimension and language-dimension semantic features, so the target image recognition model has both language and visual encoding capabilities.
In a specific embodiment of the present disclosure, the image recognition model is iteratively trained according to the first global feature of the visual dimension and the second global feature of the language dimension until a target image recognition model satisfying the model training end condition is obtained.
The model training end condition specifically includes at least one of the following: a loss value comparison condition, an iteration round comparison condition, and a model verification comparison condition.
The model training end condition may be any one of these three conditions, or a combination of them. The loss value comparison condition compares the loss value of each training iteration with a preset loss threshold, and training stops once the threshold is met. The iteration round comparison condition presets a number of training rounds, and training stops when that number is reached. The model verification comparison condition evaluates the model obtained after each training round: validation data is input into the model, its predictions are compared with the ground truth of the validation data, and whether the model of the current round reaches the required validation pass rate determines whether its training is complete.
In a specific embodiment of this specification, the image recognition model is iteratively trained with the first and second global features; when the loss value computed from the two global features falls below the preset threshold of 0.5, the training-end condition is satisfied and the trained target image recognition model is obtained.
In another specific embodiment of this specification, the image recognition model is iteratively trained with the first and second global features; with a preset training round count of 10, training stops after 10 iterations, and the trained image recognition model is taken as the target image recognition model.
In another specific embodiment of this specification, the image recognition model is iteratively trained with the first and second global features; after each training round, the validation data in the validation set is input into the image recognition model of the current round, and the model's validation pass rate is computed from its predictions. If the pass rate exceeds the preset threshold, training stops and the trained target image recognition model is obtained.
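The three conditions can be combined as in the sketch below; the concrete thresholds echo the example embodiments above (loss below 0.5, ten rounds), while the validation pass rate of 0.95 is an assumed value, not one stated in the text.

```python
def training_finished(loss: float, epoch: int, val_pass_rate: float,
                      loss_threshold: float = 0.5,   # loss value comparison condition
                      max_epochs: int = 10,          # iteration round comparison condition
                      min_pass_rate: float = 0.95    # model verification comparison condition (assumed)
                      ) -> bool:
    return (loss < loss_threshold
            or epoch >= max_epochs
            or val_pass_rate > min_pass_rate)
```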
Further, the model loss value of the image recognition model can be computed from the first and second global features, and the model trained according to that loss. Specifically, iteratively training the image recognition model according to the first global feature and the second global feature includes: computing the model loss value of the image recognition model from the first and second global features; adjusting the model parameters of the image recognition model according to the model loss value to obtain an intermediate image recognition model; and, when the intermediate image recognition model does not satisfy the training-end condition, taking it as the image recognition model and returning to the step of acquiring a sample text image and a sample text label of the sample text image.
The model loss value represents the difference, i.e., the error, between the predicted value and the true value; during training the model parameters are adjusted according to this loss. After each round of adjustment, the intermediate image recognition model with updated parameters is obtained, and whether it satisfies the training-end condition is checked: if so, the intermediate model is taken as the trained target image recognition model; if not, it becomes the image recognition model for the next training round and training continues.
In a specific embodiment of this specification, the model loss value computed from the first and second global features is 0.7; the model parameters of the image recognition model are adjusted accordingly to obtain an intermediate image recognition model. Since the training-end condition requires a loss below the preset threshold of 0.3, the intermediate model is taken as the image recognition model and the next training round is executed.
In this way, the image recognition model is trained with the first and second global features, so that the trained model has both visual and language encoding capabilities, improving image recognition accuracy.
Furthermore, to improve training efficiency, the model's prediction can be supervised by the sample text label in addition to supervising the first global feature with the second. Specifically, iteratively training the image recognition model according to the first and second global features includes: computing a first model loss value of the image recognition model from the first and second global features; decoding the first global feature through the image recognition model to obtain the predicted recognition text of the sample text image, and computing a second model loss value from the sample text label and the predicted recognition text; adjusting the model parameters of the image recognition model according to the first and second model loss values to obtain an intermediate image recognition model; and, when the intermediate image recognition model does not satisfy the training-end condition, taking it as the image recognition model and returning to the step of acquiring a sample text image and a sample text label of the sample text image.
The predicted recognition text can be understood as the recognition result finally output by the image recognition model; for example, after image recognition is performed on a contract text image, the predicted recognition text output is the contract content text. The first model loss value can be computed from the first and second global features, and the second model loss value from the sample text label and the predicted recognition text; the model parameters of the image recognition model are then adjusted according to both loss values, improving training efficiency.
In practical applications, the first model loss value may be calculated with a mean squared error (MSE, Mean Squared Error) loss function, and the second model loss value with a cross-entropy (CE) loss function. After the MSE and CE loss values are obtained, a fused loss value can be computed by weighting, and the image recognition model adjusted with the fused loss value. The weighted loss value is calculated as in Equation 1:

ΔL = α * L_CE + β * L_MSE    (Equation 1)

where ΔL is the weighted loss value, L_CE is the CE loss value, α is the weight coefficient of the CE loss value, L_MSE is the MSE loss value, and β is the weight coefficient of the MSE loss value.
In a specific embodiment of this specification, the first model loss value computed from the first and second global features is 0.6, and the second model loss value computed from the sample text label and the predicted recognition text is 0.5; the weighted loss value computed from the two is 0.55. The model parameters of the image recognition model are adjusted with this weighted loss value to obtain an intermediate image recognition model, it is determined whether the intermediate model satisfies the training-end condition, and if not, training of the image recognition model continues.
In this way, the model parameters are adjusted with the first and second model loss values, improving model training efficiency.
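Plugging the embodiment's numbers into Equation 1 reproduces the weighted loss value of 0.55; the equal weights α = β = 0.5 are inferred from those numbers and are an assumption, not a stated parameter choice.

```python
alpha, beta = 0.5, 0.5       # assumed weights consistent with the embodiment
l_ce, l_mse = 0.5, 0.6       # second and first model loss values from the embodiment
delta_l = alpha * l_ce + beta * l_mse
assert abs(delta_l - 0.55) < 1e-9
```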
In implementation, once the trained target image recognition model is obtained, image recognition can be performed with that model alone, reducing memory usage and increasing prediction speed, while the recognition result still fuses the visual-dimension and language-dimension semantic features, improving prediction accuracy. Referring to Table 1, Table 1 lists recognition performance data of the target image recognition model according to an embodiment of this specification.
TABLE 1
The image recognition model training method provided by this specification includes: acquiring a sample text image and a sample text label of the sample text image; determining a first global feature of the visual dimension corresponding to the sample text image through an image recognition model, and a second global feature of the language dimension corresponding to the sample text label through a text recognition model; and iteratively training the image recognition model according to the two global features until a target image recognition model satisfying the training-end condition is obtained, whose recognition result fuses semantic features of the visual and language dimensions. Because the first global feature of the visual dimension and the second global feature of the language dimension are both used in training, the trained target image recognition model has both visual and language encoding capability, which improves recognition accuracy; and because subsequent recognition needs only the image recognition model, recognition efficiency improves as well.
The application of the image recognition model training method to contract comparison is described further below with reference to fig. 3. Fig. 3 is a flowchart of a processing procedure of an image recognition model training method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 302: acquiring an initial sample text image and a sample text label, and preprocessing the initial sample text image to obtain an intermediate sample text image.
In one implementation, an initial sample contract text image and its corresponding sample text label are obtained, and image enhancement preprocessing is performed on the initial sample contract text image to obtain an intermediate sample contract text image.
Step 304: determining the recognition object according to the model recognition information of the image recognition model, and selecting the image region containing the recognition object in the intermediate sample text image as the sample text image.
In one implementation, the recognition object is determined to be text characters according to the model recognition information of the image recognition model, and the image region containing text characters is selected from the intermediate sample contract text image as the sample contract text image.
Step 306: inputting the sample text image into an image recognition model, and carrying out appearance feature extraction processing and position feature extraction processing on text characters in the sample text image through an image convolution unit in the image recognition model to obtain character appearance features and character position features corresponding to the text characters.
In one implementation manner, a sample contract text image is input into an image recognition model, and appearance feature extraction processing and position feature extraction processing are performed on text characters in the sample contract text image through an image convolution unit in the image recognition model, so that character appearance features and character position features corresponding to the text characters are obtained.
Step 308: fusing the character appearance features and the character position features to obtain character fusion features, which serve as the image features of the sample text image.
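A minimal PyTorch sketch of steps 306 and 308 is given below. It assumes a small convolutional backbone stands in for the image convolution unit, a learned positional embedding provides the character position features, and fusion is element-wise addition; none of these concrete choices are fixed by the embodiment, and the class name is a placeholder.

```python
# Illustrative sketch of steps 306-308. The backbone, the learned position
# embedding, and additive fusion are assumptions; the embodiment does not
# fix a concrete architecture for the image convolution unit.
import torch
import torch.nn as nn

class ImageConvolutionUnit(nn.Module):
    def __init__(self, d_model: int = 256, max_positions: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(                   # character appearance features
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),             # collapse the height axis
        )
        self.position = nn.Embedding(max_positions, d_model)  # character position features

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 1, height, width) grayscale text-line crops
        feats = self.backbone(images).squeeze(2).transpose(1, 2)  # (batch, width', d_model)
        pos = self.position(torch.arange(feats.size(1), device=feats.device))
        return feats + pos                               # character fusion features
```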
Step 310: performing visual coding processing on the image features through an image transformation unit in the image recognition model to obtain the first global feature of the visual dimension corresponding to the sample text image.
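Continuing the sketch, the image transformation unit of step 310 could be a standard Transformer encoder; mean pooling of the encoder output as the first global feature is an assumption, since the embodiment does not specify the pooling.

```python
# Illustrative sketch of step 310: a Transformer encoder as the image
# transformation unit, with mean pooling (an assumption) producing the
# first global feature of the visual dimension.
import torch
import torch.nn as nn

class ImageTransformationUnit(nn.Module):
    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, seq, d_model) from the image convolution unit
        encoded = self.encoder(image_features)
        return encoded.mean(dim=1)   # first global feature, (batch, d_model)
```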
Step 312: determining a text vector sequence corresponding to the sample text label, and encoding the text vector sequence through a text transformation unit in the text recognition model to obtain the second global feature of the language dimension corresponding to the sample text label.
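For step 312, the sketch below assumes a character-level embedding produces the text vector sequence and a Transformer encoder serves as the text transformation unit; the vocabulary size and the pooling are likewise illustrative.

```python
# Illustrative sketch of step 312: character-level embedding plus a
# Transformer encoder as the text transformation unit; vocabulary size
# and mean pooling are assumptions.
import torch
import torch.nn as nn

class TextTransformationUnit(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, nhead: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, label_ids: torch.Tensor) -> torch.Tensor:
        # label_ids: (batch, seq) token ids of the sample text label
        encoded = self.encoder(self.embed(label_ids))
        return encoded.mean(dim=1)   # second global feature, (batch, d_model)
```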
Step 314: calculating a model loss value corresponding to the image recognition model according to the first global feature and the second global feature, and adjusting model parameters of the image recognition model according to the model loss value to obtain an intermediate image recognition model.
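The embodiment leaves the loss function open; one plausible reading of step 314 is an alignment loss that pulls the visual global feature toward the language global feature, sketched below with a cosine-distance loss and an Adam optimizer, both of which are assumptions. The second global feature is detached because only the parameters of the image recognition model are adjusted.

```python
# Illustrative sketch of step 314. The cosine-distance alignment loss and
# the Adam optimizer are assumptions; the embodiment only states that a
# model loss value is computed from the two global features.
import torch
import torch.nn.functional as F

def alignment_loss(first_global: torch.Tensor,
                   second_global: torch.Tensor) -> torch.Tensor:
    # pull the visual global feature toward the language global feature
    return (1.0 - F.cosine_similarity(first_global, second_global, dim=-1)).mean()

# One parameter update yields the intermediate image recognition model:
#   optimizer = torch.optim.Adam(image_recognition_model.parameters(), lr=1e-4)
#   loss = alignment_loss(first_global, second_global.detach())  # only the image model is trained
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```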
Step 316: when the intermediate image recognition model does not meet the model training ending condition, taking the intermediate image recognition model as the image recognition model, and returning to the step of acquiring a sample text image and a sample text label of the sample text image.
In one implementation, when the intermediate image recognition model meets the model training ending condition, the intermediate image recognition model is taken as the target image recognition model. Text content in contract text images is subsequently recognized through the target image recognition model, so that the contract comparison task can proceed.
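Putting the steps together, the outer loop of steps 302 to 316 might be organized as follows; the loss threshold, the maximum round count, and the data-loading helpers are illustrative assumptions.

```python
# Illustrative sketch of the outer loop of steps 302-316; the threshold,
# round limit, and data loader are assumptions, and a non-empty loader is
# assumed so that `loss` is defined when the condition is checked.
import torch.nn.functional as F

def train_until_target(image_model, text_model, optimizer, data_loader,
                       loss_threshold: float = 0.05, max_rounds: int = 100):
    for _ in range(max_rounds):                          # iteration-round condition
        for sample_images, sample_labels in data_loader:
            first_global = image_model(sample_images)
            second_global = text_model(sample_labels).detach()
            loss = (1.0 - F.cosine_similarity(first_global,
                                              second_global, dim=-1)).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < loss_threshold:                 # loss-value condition
            break
    return image_model                                   # target image recognition model
```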
According to the training method of the image recognition model provided in this specification, in the training stage the first global feature corresponding to the visual dimension is determined through the image recognition model, the second global feature corresponding to the language dimension is determined through the text recognition model, and the image recognition model is iteratively trained based on the two global features, so that the recognition result of the trained target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively. The target image recognition model thereby has both visual and language encoding capability, which improves recognition accuracy. In subsequent image recognition, only the target image recognition model is needed to complete the recognition processing, which improves image recognition efficiency.
Referring to fig. 4, fig. 4 shows a flowchart of an image recognition method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 402: acquiring an image to be recognized, and inputting the image to be recognized into a target image recognition model obtained through training by the image recognition model training method.
The image to be recognized may be, for example, a contract text image or a poster text image. The target image recognition model can recognize text in the image to be recognized, thereby enabling information extraction or text comparison.
Step 404: obtaining the recognition result of the image to be recognized output by the target image recognition model, where the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
The recognition result of the image to be recognized may be understood as the prediction result output by the target image recognition model. Since the global feature of the target image recognition model fuses the semantic feature of the visual dimension and the semantic feature of the language dimension, its recognition result fuses semantic features corresponding to both dimensions.
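In the application stage, only the trained target image recognition model is loaded and run, as in the sketch below; the file names and the decoding step are placeholders, not part of the disclosure.

```python
# Illustrative application-stage sketch of steps 402-404; file names and
# the decoding step are placeholders, not part of the disclosure.
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

model = torch.load("target_image_recognition_model.pt")  # hypothetical path
model.eval()

image = Image.open("image_to_be_recognized.png").convert("L")
with torch.no_grad():
    logits = model(to_tensor(image).unsqueeze(0))        # (1, seq_len, vocab_size)
    prediction_ids = logits.argmax(dim=-1)               # recognition result ids
# a decode step (not shown) would map prediction_ids back to characters
```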
The image recognition method includes: acquiring an image to be recognized, and inputting the image to be recognized into a target image recognition model obtained through training by the image recognition model training method; and obtaining the recognition result of the image to be recognized output by the target image recognition model, where the recognition result fuses semantic features corresponding to the visual dimension and the language dimension respectively. In the model application stage, only the target image recognition model is needed to output a recognition result fusing semantic features of both dimensions, which saves model prediction time and improves model prediction accuracy.
Referring to fig. 5, fig. 5 shows a flowchart of another image recognition method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 502: receiving a claim material image uploaded by a target user for a target item.
The target user may be understood as a user participating in the target item, and the target item may be a claim settlement item. When the target user has a claim requirement, a claim material image may be uploaded through a terminal for the claim settlement item. The claim material image may be understood as an image generated by photographing or scanning claim evidence, and may include characters, patterns, and the like.
Step 504: inputting the claim material image into a target image recognition model obtained through training by the image recognition model training method.
After receiving the claim material image uploaded by the target user, a project server deployed by the project provider can input the claim material image into the target image recognition model, which recognizes the claim material image, thereby extracting text content or pattern content from the claim material image.
Step 506: obtaining claim text information of the claim material image output by the target image recognition model, where the claim text information fuses semantic features corresponding to the visual dimension and the language dimension respectively.
The claim text information may be understood as the text content extracted from the claim material image by the target image recognition model, for example, the text content of a certificate issued by a hospital.
Step 508: determining a claim settlement result according to the claim text information and feeding the result back to the target user.
The claim settlement result may be understood as the settlement outcome for the target user, and includes information such as whether the claim is approved, the settlement amount, and the settlement method. After the claim text information is obtained, the claim settlement result can be checked against the claim text information, either manually by project staff or automatically by another checking program. Once the claim settlement result is determined, it can be sent to the terminal of the target user to inform the target user of the result.
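An end-to-end flow for steps 502 to 508 might be organized as in the following sketch; the recognize() method, the check rule, and the settlement values are all hypothetical, since the embodiment permits either manual review by project staff or an automated checking program.

```python
# Hypothetical end-to-end flow for steps 502-508. The recognize() method,
# the check rule, and the settlement values are placeholders.
def settle_claim(claim_material_image, target_model, notify_user) -> dict:
    claim_text = target_model.recognize(claim_material_image)  # steps 504-506
    approved = "hospital certificate" in claim_text.lower()    # assumed check rule
    result = {
        "settled": approved,                     # whether the claim is settled
        "amount": 1000.0 if approved else 0.0,   # illustrative settlement amount
        "method": "bank transfer" if approved else None,  # illustrative method
    }
    notify_user(result)                          # step 508: feed back to the target user
    return result
```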
According to the image recognition method, a claim material image uploaded by a target user for a target item is received; the claim material image is input into the target image recognition model obtained through training by the image recognition model training method; claim text information of the claim material image output by the target image recognition model is obtained, where the claim text information fuses semantic features corresponding to the visual dimension and the language dimension respectively; and a claim settlement result is determined according to the claim text information and fed back to the target user. In the practical application stage of the model, only the target image recognition model is needed to output a recognition result fusing semantic features of both dimensions, which saves model prediction time and improves model prediction accuracy.
Corresponding to the method embodiment, the present disclosure further provides an embodiment of an image recognition model training device, and fig. 6 shows a schematic structural diagram of an image recognition model training device provided in one embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:
an acquisition module 602 configured to acquire a sample text image and a sample text label of the sample text image;
a determining module 604 configured to determine a first global feature of the visual dimension corresponding to the sample text image through an image recognition model, and a second global feature of the language dimension corresponding to the sample text label through a text recognition model;
a training module 606 configured to iteratively train the image recognition model according to the first global feature and the second global feature until a target image recognition model satisfying a model training end condition is obtained;
where the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
Optionally, the determining module 604 is further configured to: input the sample text image into the image recognition model; perform image feature extraction processing on the sample text image through an image convolution unit in the image recognition model to obtain image features of the sample text image; and perform visual coding processing on the image features through an image transformation unit in the image recognition model to obtain the first global feature of the visual dimension corresponding to the sample text image.
Optionally, the determining module 604 is further configured to: performing appearance feature extraction processing and position feature extraction processing on text characters in the sample text image through an image convolution unit in the image recognition model to obtain character appearance features and character position features corresponding to the text characters; and fusing the character appearance characteristics and the character position characteristics to obtain character fusion characteristics which are used as image characteristics of the sample text image.
Optionally, the determining module 604 is further configured to: determining a text vector sequence corresponding to the sample text label; and encoding the text vector sequence through a text transformation unit in the text recognition model to obtain a second global feature of the language dimension corresponding to the sample text label.
Optionally, the training module 606 is further configured to: calculating a model loss value corresponding to the image recognition model according to the first global feature and the second global feature; adjusting model parameters of the image recognition model according to the model loss value to obtain an intermediate image recognition model; and when the intermediate image recognition model does not meet the model training ending condition, taking the intermediate image recognition model as the image recognition model, and returning to execute the step of acquiring the sample text image and the sample text label of the sample text image.
Optionally, the training module 606 is further configured to: calculating a first model loss value corresponding to the image recognition model according to the first global feature and the second global feature; decoding the first global feature through the image recognition model to obtain a predictive recognition text corresponding to the sample text image, and calculating a second model loss value corresponding to the image recognition model according to the sample text label and the predictive recognition text; adjusting model parameters of the image recognition model according to the first model loss value and the second model loss value to obtain an intermediate image recognition model; and when the intermediate image recognition model does not meet the model training ending condition, taking the intermediate image recognition model as the image recognition model, and returning to execute the step of acquiring the sample text image and the sample text label of the sample text image.
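The two-loss variant above might combine an alignment loss over the global features with a supervised loss over the predicted recognition text; the cross-entropy choice and the weighting in the sketch below are assumptions.

```python
# Illustrative combination of the two loss values; cross-entropy for the
# decoding loss and the 0.5 weighting are assumptions.
import torch.nn.functional as F

def combined_loss(first_global, second_global, decoder_logits, label_ids,
                  weight: float = 0.5):
    # first model loss value: align the visual and language global features
    loss_align = (1.0 - F.cosine_similarity(first_global, second_global,
                                            dim=-1)).mean()
    # second model loss value: supervise the predicted recognition text
    # decoder_logits: (batch, seq, vocab); label_ids: (batch, seq)
    loss_decode = F.cross_entropy(decoder_logits.transpose(1, 2), label_ids)
    return loss_align + weight * loss_decode
```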
Optionally, the obtaining module 602 is further configured to: acquiring an initial sample text image and a sample text label, and preprocessing the initial sample text image to obtain an intermediate sample text image; and determining an identification object according to the model identification information of the image identification model, and selecting an image area containing the identification object from the intermediate sample text image as a sample text image.
Optionally, the training module 606 is further configured to: the model training ending conditions include at least one of: loss value comparison conditions, iteration round comparison conditions and model verification comparison conditions.
The image recognition model training device provided in this specification includes: an acquisition module configured to acquire a sample text image and a sample text label of the sample text image; a determining module configured to determine a first global feature of the visual dimension corresponding to the sample text image through an image recognition model, and a second global feature of the language dimension corresponding to the sample text label through a text recognition model; and a training module configured to iteratively train the image recognition model according to the first global feature and the second global feature until a target image recognition model meeting a model training ending condition is obtained, where the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively. In the training stage, the first global feature corresponding to the visual dimension is determined through the image recognition model, the second global feature corresponding to the language dimension is determined through the text recognition model, and the image recognition model is iteratively trained based on the two global features, so that the recognition result of the trained target image recognition model fuses semantic features of both dimensions. The target image recognition model thereby has both visual and language encoding capability, which improves recognition accuracy. In subsequent image recognition, only the target image recognition model is needed to complete the recognition processing, which improves image recognition efficiency.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the image recognition model training device embodiment is described relatively simply because it is substantially similar to the image recognition model training method embodiment; for relevant details, refer to the description of the method embodiment.
Corresponding to the above method embodiments, the present disclosure further provides an image recognition device embodiment, and fig. 7 shows a schematic structural diagram of an image recognition device provided in one embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
an input module 702 configured to acquire an image to be recognized and input the image to be recognized into a target image recognition model obtained through training by the image recognition model training method;
and an output module 704 configured to obtain the recognition result of the image to be recognized output by the target image recognition model, where the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
The image recognition device includes an input module configured to acquire an image to be recognized and input it into a target image recognition model obtained through training by the image recognition model training method, and an output module configured to obtain the recognition result of the image to be recognized output by the target image recognition model, where the recognition result fuses semantic features corresponding to the visual dimension and the language dimension respectively. In the practical application stage of the model, only the target image recognition model is needed to output a recognition result fusing semantic features of both dimensions, which saves model prediction time and improves model prediction accuracy.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the image recognition device embodiment is described relatively simply because it is substantially similar to the image recognition method embodiment; for relevant details, refer to the description of the image recognition method embodiment.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of an image recognition apparatus, and fig. 8 shows a schematic structural diagram of another image recognition apparatus provided in one embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
a receiving module 802 configured to receive an image of claim material uploaded by a target user for a target item;
an input module 804 configured to input the claim material image into a target image recognition model obtained through training by the image recognition model training method;
an obtaining module 806 configured to obtain claim text information of the claim material image output by the target image recognition model, where the claim text information fuses semantic features corresponding to the visual dimension and the language dimension respectively;
and a feedback module 808 configured to determine a claim settlement result according to the claim text information and feed the result back to the target user.
The image recognition device includes a receiving module configured to receive a claim material image uploaded by a target user for a target item; an input module configured to input the claim material image into a target image recognition model obtained through training by the image recognition model training method; an obtaining module configured to obtain claim text information of the claim material image output by the target image recognition model, where the claim text information fuses semantic features corresponding to the visual dimension and the language dimension respectively; and a feedback module configured to determine a claim settlement result according to the claim text information and feed the result back to the target user. In the practical application stage of the model, only the target image recognition model is needed to output a recognition result fusing semantic features of both dimensions, which saves model prediction time and improves model prediction accuracy.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, this image recognition device embodiment is described relatively simply because it is substantially similar to the image recognition method embodiment; for relevant details, refer to the description of the image recognition method embodiment.
Fig. 9 illustrates a block diagram of a computing device 900 according to an embodiment of the present specification. Components of the computing device 900 include, but are not limited to, a memory 910 and a processor 920. The processor 920 is coupled to the memory 910 via a bus 930, and a database 950 is used to store data.
The computing device 900 also includes an access device 940 that enables the computing device 900 to communicate via one or more networks 960. Examples of these networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 940 may include one or more of any type of wired or wireless network interface, such as a Network Interface Card (NIC), an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.
In one embodiment of the present specification, the above-described components of the computing device 900 and other components not shown in fig. 9 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device illustrated in fig. 9 is for exemplary purposes only and is not intended to limit the scope of the present specification. Those skilled in the art may add or replace other components as desired.
The computing device 900 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smart phone), a wearable computing device (e.g., smart watch, smart glasses, etc.), another type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, Personal Computer). The computing device 900 may also be a mobile or stationary server.
The processor 920 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the image recognition model training method and the image recognition method described above.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the computing device embodiment is described relatively simply because it is substantially similar to the image recognition model training method and image recognition method embodiments; for relevant details, refer to the descriptions of those method embodiments.
An embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the image recognition model training method and the image recognition method described above.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the computer-readable storage medium embodiment is described relatively simply because it is substantially similar to the image recognition model training method and image recognition method embodiments; for relevant details, refer to the descriptions of those method embodiments.
An embodiment of the present disclosure further provides a computer program, where the computer program when executed in a computer causes the computer to perform the steps of the image recognition model training method and the image recognition method described above.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the computer program embodiment is described relatively simply because it is substantially similar to the image recognition model training method and image recognition method embodiments; for relevant details, refer to the descriptions of those method embodiments.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be added to or removed from as appropriate under the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer-readable medium excludes electrical carrier signals and telecommunication signals.
It should be noted that the foregoing describes specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that of the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results; in some embodiments, multitasking and parallel processing are also possible and may be advantageous. Further, those skilled in the art will appreciate that the embodiments described in this specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by this specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of this specification disclosed above are intended only to help illustrate this specification. The alternative embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations are possible in light of the content of this specification. These embodiments were chosen and described in order to better explain the principles and practical applications of this specification, so that those skilled in the art can well understand and use it. This specification is limited only by the claims and their full scope and equivalents.

Claims (14)

1. An image recognition model training method, comprising:
acquiring a sample text image and a sample text label of the sample text image;
determining a first global feature of the visual dimension corresponding to the sample text image through an image recognition model, and determining a second global feature of the language dimension corresponding to the sample text label through a text recognition model;
Performing iterative training on the image recognition model according to the first global feature and the second global feature until a target image recognition model meeting the model training ending condition is obtained;
wherein the recognition result of the target image recognition model fuses semantic features corresponding to the visual dimension and the language dimension respectively.
2. The method of claim 1, wherein determining a first global feature of the visual dimension corresponding to the sample text image through an image recognition model comprises:
inputting the sample text image into an image recognition model;
performing image feature extraction processing on the sample text image through an image convolution unit in the image recognition model to obtain image features of the sample text image;
and performing visual coding processing on the image features through an image transformation unit in the image recognition model to obtain the first global feature of the visual dimension corresponding to the sample text image.
3. The method of claim 2, wherein performing image feature extraction processing on the sample text image through the image convolution unit in the image recognition model to obtain the image features of the sample text image comprises:
Performing appearance feature extraction processing and position feature extraction processing on text characters in the sample text image through an image convolution unit in the image recognition model to obtain character appearance features and character position features corresponding to the text characters;
and fusing the character appearance characteristics and the character position characteristics to obtain character fusion characteristics which are used as image characteristics of the sample text image.
4. The method of claim 1, wherein determining a second global feature of the language dimension corresponding to the sample text label through a text recognition model comprises:
determining a text vector sequence corresponding to the sample text label;
and encoding the text vector sequence through a text transformation unit in the text recognition model to obtain a second global feature of the language dimension corresponding to the sample text label.
5. The method of claim 1, wherein iteratively training the image recognition model according to the first global feature and the second global feature comprises:
calculating a model loss value corresponding to the image recognition model according to the first global feature and the second global feature;
adjusting model parameters of the image recognition model according to the model loss value to obtain an intermediate image recognition model;
And when the intermediate image recognition model does not meet the model training ending condition, taking the intermediate image recognition model as the image recognition model, and returning to execute the step of acquiring the sample text image and the sample text label of the sample text image.
6. The method of claim 1, wherein iteratively training the image recognition model according to the first global feature and the second global feature comprises:
calculating a first model loss value corresponding to the image recognition model according to the first global feature and the second global feature;
decoding the first global feature through the image recognition model to obtain a predictive recognition text corresponding to the sample text image, and calculating a second model loss value corresponding to the image recognition model according to the sample text label and the predictive recognition text;
adjusting model parameters of the image recognition model according to the first model loss value and the second model loss value to obtain an intermediate image recognition model;
and when the intermediate image recognition model does not meet the model training ending condition, taking the intermediate image recognition model as the image recognition model, and returning to execute the step of acquiring the sample text image and the sample text label of the sample text image.
7. The method of claim 1, wherein acquiring a sample text image and a sample text label of the sample text image comprises:
acquiring an initial sample text image and a sample text label, and preprocessing the initial sample text image to obtain an intermediate sample text image;
and determining an identification object according to the model identification information of the image identification model, and selecting an image area containing the identification object from the intermediate sample text image as a sample text image.
8. The method of any one of claims 1 to 7, wherein the model training ending condition comprises at least one of:
a loss value comparison condition, an iteration round comparison condition, and a model verification comparison condition.
9. An image recognition method, comprising:
acquiring an image to be recognized, and inputting the image to be recognized into a target image recognition model obtained through training by the method of any one of claims 1 to 8;
and obtaining the recognition result of the image to be recognized, which is output by the target image recognition model, wherein the recognition result of the target image recognition model fuses semantic features corresponding to visual dimensions and language dimensions respectively.
10. An image recognition method, comprising:
receiving a claim material image uploaded by a target user for a target item;
inputting the claim material image into a target image recognition model trained by the method of any one of claims 1-8;
obtaining the claim text information of the claim material image output by the target image recognition model, wherein the claim text information fuses semantic features corresponding to visual dimensions and language dimensions respectively;
and determining a claim settlement result according to the claim settlement text information and feeding back to the target user.
11. An image recognition model training apparatus comprising:
an acquisition module configured to acquire a sample text image and a sample text label of the sample text image;
a determining module configured to determine a first global feature of the visual dimension corresponding to the sample text image by an image recognition model and a second global feature of the language dimension corresponding to the sample text label by a text recognition model;
the training module is configured to perform iterative training on the image recognition model according to the first global feature and the second global feature until a target image recognition model meeting a model training ending condition is obtained;
The recognition result of the target image recognition model is fused with semantic features respectively corresponding to the visual dimension and the language dimension.
12. An image recognition apparatus comprising:
an input module configured to acquire an image to be recognized and input the image to be recognized into a target image recognition model obtained through training by the method of any one of claims 1 to 8;
the output module is configured to obtain the recognition result of the image to be recognized, which is output by the target image recognition model, wherein the recognition result of the target image recognition model fuses semantic features corresponding to visual dimensions and language dimensions respectively.
13. A computing device, comprising:
a memory and a processor;
wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the method of any one of claims 1 to 10.
14. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the method of any one of claims 1 to 10.
CN202310464514.1A 2023-04-24 2023-04-24 Training method and device of image recognition model, and image recognition method and device Pending CN116503876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310464514.1A CN116503876A (en) 2023-04-24 2023-04-24 Training method and device of image recognition model, and image recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310464514.1A CN116503876A (en) 2023-04-24 2023-04-24 Training method and device of image recognition model, and image recognition method and device

Publications (1)

Publication Number Publication Date
CN116503876A true CN116503876A (en) 2023-07-28

Family

ID=87317813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310464514.1A Pending CN116503876A (en) 2023-04-24 2023-04-24 Training method and device of image recognition model, and image recognition method and device

Country Status (1)

Country Link
CN (1) CN116503876A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292384A (en) * 2023-08-30 2023-12-26 北京瑞莱智慧科技有限公司 Character recognition method, related device and storage medium
CN117057443A (en) * 2023-10-09 2023-11-14 杭州海康威视数字技术股份有限公司 Prompt learning method of visual language model and electronic equipment
CN117057443B (en) * 2023-10-09 2024-02-02 杭州海康威视数字技术股份有限公司 Prompt learning method of visual language model and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination