CN113435535A - Training method of image recognition model, image recognition method and device - Google Patents

Training method of image recognition model, image recognition method and device

Info

Publication number
CN113435535A
Authority
CN
China
Prior art keywords
image
image recognition
recognition model
content area
background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110791423.XA
Other languages
Chinese (zh)
Inventor
巩佳超
戴宇荣
于冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110791423.XA priority Critical patent/CN113435535A/en
Publication of CN113435535A publication Critical patent/CN113435535A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method for an image recognition model, an image recognition method, and a device. The image recognition method includes the following steps: acquiring an image to be recognized; and inputting the image to be recognized into a first image recognition model to obtain a prediction of whether the image is a three-segment image and the predicted upper and lower boundary positions of the content area in the image. The middle segment of a three-segment image is the content area, and the portions above and below the middle segment are background areas.

Description

Training method of image recognition model, image recognition method and device
Technical Field
The present disclosure relates generally to the field of image technology, and more particularly, to a training method and apparatus for an image recognition model, and an image recognition method and apparatus.
Background
With the development of electronic technology, users can process images (and videos) in various ways as needed; for example, a user can add upper and lower backgrounds to an image (or video) that was captured or obtained from a third party to produce a desired work.
Disclosure of Invention
Exemplary embodiments of the present disclosure provide a training method for an image recognition model, and an image recognition method and apparatus, which can determine whether an image is a three-segment image and the upper and lower boundary positions of the content area in the image.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for an image recognition model, including: acquiring a training sample, wherein the training sample includes an annotated image and the annotation includes an identifier of whether the image is a three-segment image and the upper and lower boundary positions of the content area; inputting the acquired training sample into a first image recognition model to obtain a prediction of whether the training sample is a three-segment image and the predicted upper and lower boundary positions of the content area in the training sample; determining a first loss function of the first image recognition model according to the prediction of whether the training sample is a three-segment image and the annotation of the training sample; determining a second loss function of the first image recognition model according to the annotation of the training sample and the predicted upper and lower boundary positions of the content area; and training the first image recognition model by adjusting its model parameters according to the first loss function and the second loss function; wherein the middle segment of a three-segment image is the content area, and the portions above and below the middle segment are background areas.
Optionally, the method further comprises: inputting a positive sample or characteristics of the positive sample in the obtained training samples into a second image recognition model to obtain a predicted background type of the positive sample, wherein the positive sample is a training sample with a mark which is a three-section image, and the labeling of the positive sample further comprises: a background type for indicating a type of a background region in the positive sample; determining a third loss function of a second image recognition model according to the predicted background type and the mark of the positive sample; and training the second image recognition model by adjusting the model parameters of the second image recognition model according to the third loss function.
Optionally, the feature of the positive sample is a feature obtained by performing feature extraction on the input positive sample by using a first image recognition model.
Optionally, the background type includes at least one of the following types: a solid-color background, a solid-color background with text or patterns, a ground-glass (blurred) background, a ground-glass background with text or patterns, a top-middle-bottom repeated background in which the background region shows the same picture as the content region, and other types.
Optionally, the method further comprises: inputting the obtained training sample or the characteristics of the training sample into a third image recognition model to obtain the predicted upper boundary position and lower boundary position of the content area in the training sample; determining a fourth loss function of the third image recognition model according to the label of the training sample and the upper boundary position and the lower boundary position of the content area predicted by the third image recognition model; training the third image recognition model by adjusting the model parameters of the third image recognition model according to the fourth loss function; wherein the accuracy of the upper boundary position and the lower boundary position of the content area predicted by the third image recognition model is higher than that of the upper boundary position and the lower boundary position of the content area predicted by the first image recognition model.
Optionally, the upper and lower boundary positions of the content area predicted by the first image recognition model are expressed as the ratio of the pixel row of the content area's upper boundary to the image height and the ratio of the pixel row of its lower boundary to the image height; and/or the upper and lower boundary positions of the content area predicted by the third image recognition model are expressed as a probability map indicating, for each pixel in the image, the probability that it lies on the upper or lower boundary of the content area, where a higher probability means the pixel is closer to the upper or lower boundary.
According to a second aspect of the embodiments of the present disclosure, there is provided an image recognition method, including: acquiring an image to be recognized; and inputting the image to be recognized into a first image recognition model to obtain a prediction of whether the image is a three-segment image and the predicted upper and lower boundary positions of the content area in the image; wherein the middle segment of a three-segment image is the content area, and the portions above and below the middle segment are background areas.
Optionally, the method further comprises: and inputting the image or the characteristics of the image into a second image recognition model to obtain the predicted background type of the background area in the image.
Optionally, the feature of the image is a feature obtained by performing feature extraction on the input image by using the first image recognition model.
Optionally, the background type includes at least one of the following types: a solid-color background, a solid-color background with text or patterns, a ground-glass (blurred) background, a ground-glass background with text or patterns, a top-middle-bottom repeated background in which the background region shows the same picture as the content region, and other types.
Optionally, the method further comprises: inputting the image or the characteristics of the image into a third image recognition model to obtain the predicted upper boundary position and lower boundary position of the content area in the image; wherein the accuracy of the upper boundary position and the lower boundary position of the content area predicted by the third image recognition model is higher than that of the upper boundary position and the lower boundary position of the content area predicted by the first image recognition model.
Optionally, the upper and lower boundary positions of the content area predicted by the first image recognition model are expressed as the ratio of the pixel row of the content area's upper boundary to the image height and the ratio of the pixel row of its lower boundary to the image height; and/or the upper and lower boundary positions of the content area predicted by the third image recognition model are expressed as a probability map indicating, for each pixel in the image, the probability that it lies on the upper or lower boundary of the content area, where a higher probability means the pixel is closer to the upper or lower boundary.
Optionally, the step of inputting the image or the feature of the image into a second image recognition model to obtain the predicted background type of the background area in the image includes: when a user instruction requesting to identify a background type is received, inputting the image or the characteristics of the image into a second image identification model to obtain a predicted background type of a background area in the image; and/or the second image recognition model is trained by using the training method.
Optionally, the step of inputting the image or the feature of the image into a third image recognition model to obtain the predicted upper boundary position and lower boundary position of the content area in the image includes: when a user instruction requesting to accurately position the boundary of a content area is received, inputting the image or the characteristics of the image into a third image recognition model to obtain the predicted upper boundary position and lower boundary position of the content area in the image; and/or the third image recognition model is trained by using the training method.
Optionally, the first image recognition model is trained using a training method as described above.
Optionally, the method further comprises: and evaluating and/or processing the image according to the predicted background type of the background area in the image.
Optionally, the method further comprises: and evaluating and/or processing the image according to the upper boundary position and the lower boundary position of the content area in the image predicted by the first image recognition model.
Optionally, the method further comprises: and evaluating and/or processing the image according to the upper boundary position and the lower boundary position of the content area in the image predicted by the third image recognition model.
Optionally, the step of evaluating the image comprises: evaluating the picture quality of the image; and/or the step of image processing the image comprises: and carrying out color enhancement processing on the image.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus of an image recognition model, including: a sample acquisition unit configured to acquire a training sample, wherein the training sample comprises an image with an annotation, the annotation comprising: whether the image is the mark of the three-section image or not, and the upper boundary position and the lower boundary position of the content area; a first prediction unit configured to input an acquired training sample into a first image recognition model, and obtain a prediction result of whether the training sample is a three-segment image, and predicted upper and lower boundary positions of a content area in the training sample; a first loss function calculation unit configured to determine a first loss function of a first image recognition model according to a prediction result of whether the training sample is a three-segment image and an annotation of the training sample; a second loss function calculation unit configured to determine a second loss function of the first image recognition model according to the labels of the training samples and the upper boundary position and the lower boundary position of the predicted content area; a first training unit configured to train the first image recognition model by adjusting model parameters of the first image recognition model according to a first loss function and a second loss function; the middle section of the three-section image is a content area, and the parts above the middle section and below the middle section are background areas.
Optionally, the apparatus further comprises: a second prediction unit, configured to input a positive sample or features of the positive sample in the acquired training samples into a second image recognition model, so as to obtain a predicted background type of the positive sample, where the positive sample is a training sample with an identifier that is a three-segment image, and the labeling of the positive sample further includes: a background type for indicating a type of a background region in the positive sample; a third loss function calculation unit configured to determine a third loss function of a second image recognition model according to the predicted background type and the labeling of the positive sample; a second training unit configured to train the second image recognition model by adjusting model parameters of the second image recognition model according to a third loss function.
Optionally, the feature of the positive sample is a feature obtained by performing feature extraction on the input positive sample by using a first image recognition model.
Optionally, the background type includes at least one of the following types: a solid-color background, a solid-color background with text or patterns, a ground-glass (blurred) background, a ground-glass background with text or patterns, a top-middle-bottom repeated background in which the background region shows the same picture as the content region, and other types.
Optionally, the apparatus further comprises: a third prediction unit configured to input the obtained training sample or the feature of the training sample into a third image recognition model, and obtain an upper boundary position and a lower boundary position of a content area in the predicted training sample; a fourth loss function calculation unit configured to determine a fourth loss function of the third image recognition model according to the labels of the training samples and the upper boundary position and the lower boundary position of the content region predicted by the third image recognition model; a third training unit configured to train the third image recognition model by adjusting model parameters of the third image recognition model according to a fourth loss function; wherein the accuracy of the upper boundary position and the lower boundary position of the content area predicted by the third image recognition model is higher than that of the upper boundary position and the lower boundary position of the content area predicted by the first image recognition model.
Optionally, the upper and lower boundary positions of the content area predicted by the first image recognition model are expressed as the ratio of the pixel row of the content area's upper boundary to the image height and the ratio of the pixel row of its lower boundary to the image height; and/or the upper and lower boundary positions of the content area predicted by the third image recognition model are expressed as a probability map indicating, for each pixel in the image, the probability that it lies on the upper or lower boundary of the content area, where a higher probability means the pixel is closer to the upper or lower boundary.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an image recognition apparatus including: an image acquisition unit configured to acquire an image to be recognized; a first prediction unit configured to input an image to be recognized into a first image recognition model, and obtain a prediction result of whether the image is a three-segment image, and predicted upper and lower boundary positions of a content area in the image; the middle section of the three-section image is a content area, and the parts above the middle section and below the middle section are background areas.
Optionally, the apparatus further comprises: a second prediction unit configured to input the image or features of the image into a second image recognition model, resulting in a predicted background type of a background region in the image.
Optionally, the feature of the image is a feature obtained by performing feature extraction on the input image by using the first image recognition model.
Optionally, the background type includes at least one of the following types: a solid-color background, a solid-color background with text or patterns, a ground-glass (blurred) background, a ground-glass background with text or patterns, a top-middle-bottom repeated background in which the background region shows the same picture as the content region, and other types.
Optionally, the apparatus further comprises: a third prediction unit configured to input the image or features of the image into a third image recognition model, resulting in predicted upper and lower boundary positions of a content area in the image; wherein the accuracy of the upper boundary position and the lower boundary position of the content area predicted by the third image recognition model is higher than that of the upper boundary position and the lower boundary position of the content area predicted by the first image recognition model.
Optionally, the upper and lower boundary positions of the content area predicted by the first image recognition model are expressed as the ratio of the pixel row of the content area's upper boundary to the image height and the ratio of the pixel row of its lower boundary to the image height; and/or the upper and lower boundary positions of the content area predicted by the third image recognition model are expressed as a probability map indicating, for each pixel in the image, the probability that it lies on the upper or lower boundary of the content area, where a higher probability means the pixel is closer to the upper or lower boundary.
Optionally, the second prediction unit is configured to, when receiving a user instruction requesting to identify a background type, input the image or a feature of the image into a second image identification model, resulting in a predicted background type of a background region in the image; and/or the second image recognition model is trained using a training apparatus as described above.
Optionally, the third prediction unit is configured to, when receiving a user instruction requesting to accurately locate a boundary of a content area, input the image or a feature of the image into a third image recognition model, resulting in predicted upper and lower boundary positions of the content area in the image; and/or the third image recognition model is trained using a training apparatus as described above.
Optionally, the first image recognition model is trained using a training apparatus as described above.
Optionally, the apparatus further comprises: a processing unit configured to evaluate and/or image process the image according to a predicted background type of a background region in the image.
Optionally, the apparatus further comprises: a processing unit configured to evaluate and/or image process the image according to the upper and lower boundary positions of the content area in the image predicted by the first image recognition model.
Optionally, the apparatus further comprises: a processing unit configured to evaluate and/or image-process the image according to the upper and lower boundary positions of the content area in the image predicted by the third image recognition model.
Optionally, the processing of evaluating the image comprises: evaluating the picture quality of the image; and/or the processing of the image comprises: and carrying out color enhancement processing on the image.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a training method of an image recognition model as described above and/or an image recognition method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by at least one processor, cause the at least one processor to perform the training method of the image recognition model as described above and/or the image recognition method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement the training method of the image recognition model as described above and/or the image recognition method as described above.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
according to the exemplary embodiments of the present disclosure, it is possible to identify whether an image is a three-segment image, the upper and lower boundary positions of the content area in the image, and further the background type of the background area in the image;
in addition, adaptive image processing and image evaluation can be performed on three-segment images, meeting the special processing requirements of such images;
in addition, the corresponding image recognition models can be invoked on demand, which saves computation while meeting requirements, and each image recognition model can be updated independently, giving good extensibility.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 illustrates an example of a three-segment image according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a method of training an image recognition model according to an example embodiment of the present disclosure;
FIG. 3 shows a flow chart of a method of training an image recognition model according to another example embodiment of the present disclosure;
FIG. 4 shows a flow diagram of a method of training an image recognition model according to another example embodiment of the present disclosure;
FIG. 5 illustrates an example of a boundary probability map in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a flow chart of an image recognition method according to an exemplary embodiment of the present disclosure;
fig. 7 illustrates an example of an image recognition method according to an exemplary embodiment of the present disclosure;
FIG. 8 shows a block diagram of a training apparatus for an image recognition model according to an exemplary embodiment of the present disclosure;
fig. 9 illustrates a block diagram of a structure of an image recognition apparatus according to an exemplary embodiment of the present disclosure;
fig. 10 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "any combination of several of the items", and "all of the items". For example, "includes at least one of A and B" covers three parallel cases: (1) includes A; (2) includes B; (3) includes A and B. Similarly, "at least one of step one and step two is performed" covers three parallel cases: (1) step one is performed; (2) step two is performed; (3) both step one and step two are performed.
Fig. 1 shows an example of a three-segment image, in which the middle segment of the three-segment image is a content area, and the portions above and below the middle segment are background areas.
As an example, a user may add upper and lower backgrounds to an image (or video) that was captured or obtained from a third party to form a three-segment image, and then upload the three-segment image to a short-video platform as a new work. Such secondary processing can improve the viewing experience to some extent; for example, the backgrounds added above and below the original picture content may contain auxiliary description content (e.g., text) to enhance interactivity. In addition, when a user wants to combine several images of different sizes into one video, upper and lower backgrounds can be added so that the resulting three-segment images all have the same size and can be combined into a single video. When an uploaded image does not meet the required aspect ratio, upper and lower backgrounds can be added so that the resulting three-segment image meets the aspect ratio requirement. When a landscape video is converted into a portrait video, upper and lower backgrounds can be added to every frame, so that each frame of the portrait video is a three-segment image.
The present disclosure recognizes the following: in practical applications, a non-trivial proportion of content (e.g., in short videos) is three-segment, and such content seriously interferes with many image and video processing algorithms by disturbing their normal predictions. For example, in blur detection, if the content area of a user-uploaded three-segment image is sharp but the background area is blurred (e.g., a ground-glass background), the sharpness algorithm may misjudge the image as unclear, which in turn affects how the recommendation algorithm distributes it. In a color enhancement algorithm, if the upper and lower background regions differ greatly from the middle content region, the wrong color scheme may be applied, seriously affecting user experience. Being able to identify a three-segment image, and even to give its precise upper and lower boundary positions, provides significant guidance to subsequent image and video processing. The present disclosure therefore provides an image recognition method that can accurately identify three-segment images and give the boundary positions between the content area and the background areas. In addition, the image recognition method can determine the background type of a three-segment image.
With this image recognition method, whether content (e.g., an image or a video) uploaded by a user is three-segment can be accurately identified; in particular, the background type of the three-segment content can be recognized and accurate boundary positions can be given. This effectively guides various image (or video) processing algorithms to act correctly on user-uploaded content, helps the recommendation system provide more high-quality content to users, and preserves the viewing experience.
Of course, the image recognition method and/or the image recognition apparatus according to the present disclosure may be applied not only to the above-described scenes but also to other suitable scenes, and the present disclosure is not limited thereto.
Fig. 2 illustrates a flowchart of a training method of an image recognition model according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step S101, training samples are acquired.
Here, the training sample includes an image with an annotation, the annotation including: an identification of whether it is a three-segment image, and an upper boundary position and a lower boundary position of the content area.
In step S102, an acquired training sample is input to a first image recognition model, and a result of predicting whether the training sample is a three-segment image, and an upper boundary position and a lower boundary position of a content area in the training sample are obtained.
As an example, the first image recognition model may be constructed based on a deep learning algorithm, and it should be understood that other suitable types of machine learning algorithms may be used to construct the first image recognition model, which is not limited by this disclosure.
As an example, the model structure of the first image recognition model may be a ResNet18 network structure. It should be understood that other suitable types of network architectures are possible, and the present disclosure is not limited thereto.
As an example, the prediction result of whether the training sample is a three-segment image may be a binary classification result (i.e., a result of a three-segment image or not), or may be a probability of a three-segment image.
To adapt to images of different sizes and ensure the accuracy of the upper and lower boundary positions, the upper and lower boundary positions may be given in the form of a ratio, and as an example, the upper and lower boundary positions of the content area predicted by the first image recognition model may be expressed as: the ratio of the pixel position at which the upper boundary of the content area is located to the image height, and the ratio of the pixel position at which the lower boundary of the content area is located to the image height.
As an example, training samples can be uniformly resized to 320 × 320. Although this deforms the image structure to some extent, it is still possible to identify whether an image is a three-segment image and to find the corresponding boundary positions, while reducing the model's computational complexity.
In step S103, a first loss function of the first image recognition model is determined according to the prediction of whether the training sample is a three-segment image and the identifier, in the annotation of the training sample, of whether it is a three-segment image.
As an example, the first loss function may be a cross-entropy loss function. It should be understood that other suitable types of loss functions are possible, and the present disclosure is not limited thereto.
In step S104, a second loss function of the first image recognition model is determined according to the upper boundary position and the lower boundary position of the content area in the annotation of the training sample and the upper boundary position and the lower boundary position of the predicted content area.
As an example, the second loss function may be an L1 loss function. It should be understood that other suitable types of loss functions are possible, and the present disclosure is not limited thereto.
In step S105, the first image recognition model is trained by adjusting model parameters of the first image recognition model according to the first loss function and the second loss function.
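As a non-limiting sketch of steps S102 to S105, the following PyTorch-style code assumes a ResNet18 backbone with a classification head and a boundary-regression head, a cross-entropy first loss, and an L1 second loss; all class, function, and variable names are illustrative assumptions rather than part of the disclosure.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FirstRecognitionModel(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep the 512-d pooled feature
        self.backbone = backbone
        self.cls_head = nn.Linear(512, 2)      # three-segment vs. not
        self.box_head = nn.Linear(512, 2)      # upper/lower boundary as ratios of image height

    def forward(self, x):                      # x: (B, 3, 320, 320)
        feat = self.backbone(x)
        return self.cls_head(feat), torch.sigmoid(self.box_head(feat))

model = FirstRecognitionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
ce_loss, l1_loss = nn.CrossEntropyLoss(), nn.L1Loss()

def train_step(images, is_three_segment, boundary_ratios):
    """images: (B,3,320,320); is_three_segment: (B,) long; boundary_ratios: (B,2) in [0,1]."""
    logits, pred_ratios = model(images)
    loss1 = ce_loss(logits, is_three_segment)        # first loss (classification)
    loss2 = l1_loss(pred_ratios, boundary_ratios)    # second loss (boundary regression);
    loss = loss1 + loss2                             # in practice this term may be restricted
    optimizer.zero_grad()                            # to positive samples and the losses weighted
    loss.backward()
    optimizer.step()
    return loss.item()
```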
Fig. 3 shows a flowchart of a training method of an image recognition model according to another exemplary embodiment of the present disclosure. As shown in fig. 3, the training method of an image recognition model according to another exemplary embodiment of the present disclosure may further include steps S106, S107, and S108 in addition to the steps S101 to S105 shown in fig. 2. Steps S101 to S105 can be implemented with reference to the specific implementation described with reference to fig. 2, and are not described herein again.
In step S106, a positive sample in the acquired training sample or a feature of the positive sample is input into a second image recognition model, so as to obtain a predicted background type of the positive sample. It should be understood that fig. 3 only shows the case where the positive sample in the acquired training samples is input into the second image recognition model, resulting in the predicted background type of the positive sample.
Here, the positive sample is a training sample with a label that is a three-segment image, and the labeling of the positive sample further includes: a background type for indicating a type of background region in the positive sample.
As an example, the feature of the positive sample may be a feature obtained by performing feature extraction on the input positive sample by using a first image recognition model.
As an example, the second image recognition model may be constructed based on a deep learning algorithm, and it should be understood that other suitable types of machine learning algorithms may be used to construct the second image recognition model, which is not limited by the present disclosure.
As an example, the model structure of the second image recognition model may be a MobileNet V2 network structure. It should be understood that other suitable types of network architectures are possible, and the present disclosure is not limited thereto.
For example, when the model structure of the first image recognition model is a ResNet18 network, the model structure of the second image recognition model may be a MobileNet V2 with its first convolutional layer removed; the second image recognition model then reuses the first convolutional layer of the first image recognition model. In other words, the output of the first convolutional layer of the first image recognition model is used as the input of the second image recognition model, i.e., the image feature output by that convolutional layer is the sample feature fed into the second image recognition model. Reusing the image features output by an intermediate layer of the first image recognition model reduces computational complexity and computation.
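A minimal sketch of this feature reuse, assuming a 1×1 projection bridges the 64-channel output of ResNet18's first convolutional layer to the 32 channels expected by the remaining MobileNet V2 blocks; the bridge, the channel counts, and all names are assumptions of this sketch rather than details given in the disclosure.

```python
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet18(weights=None)
shared_stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu)   # first conv layer of the first model

mobilenet = models.mobilenet_v2(weights=None)
# Drop MobileNet V2's own first conv block (features[0]); a 1x1 projection adapts the
# 64 channels produced by the ResNet18 stem to the 32 channels the next block expects.
channel_bridge = nn.Conv2d(64, 32, kernel_size=1)
second_model_body = nn.Sequential(channel_bridge, *list(mobilenet.features.children())[1:])
background_head = nn.Linear(mobilenet.last_channel, 6)               # 6 background categories

def predict_background_type(image_batch):
    shared_feat = shared_stem(image_batch)        # image feature reused from the first model
    feat = second_model_body(shared_feat)
    pooled = feat.mean(dim=(2, 3))                # global average pooling
    return background_head(pooled)
```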
As an example, the background type may include at least one of the following types: a solid-color background, a solid-color background with text or patterns, a ground-glass (blurred) background, a ground-glass background with text or patterns, a top-middle-bottom repeated background in which the background region shows the same picture as the content region, and other types.
Regarding the acquisition of training samples, as an example, a large amount of data to be labeled can be pulled from online short videos to meet the model's data requirements; the data should cover different picture proportions and different content types so that the model can learn accurate features related to the three-segment type and generalize well. Data annotation is divided into three parts: three-segment label annotation, three-segment background type annotation, and upper/lower boundary position annotation. For the three-segment label, the annotator judges whether the image is three-segment; if so, it is labeled as a positive sample, otherwise as a negative sample. For the background type, the present disclosure classifies backgrounds into 6 categories according to actual requirements, as shown in Table 1. For the boundary positions, the annotator marks the boundaries between the central content area and the background areas, i.e., the upper and lower boundary positions described in this disclosure.
TABLE 1 Annotation specification for background types
Category 1: solid-color background
Category 2: solid-color background with text or patterns
Category 3: ground-glass (blurred) background
Category 4: ground-glass background with text or patterns
Category 5: top-middle-bottom repeated background (the background region shows the same picture as the content region)
Category 6: other
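For illustration only, the three annotation parts described above could be carried by a simple per-sample record such as the following; the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ThreeSegmentAnnotation:
    is_three_segment: bool                       # three-segment label (positive/negative)
    upper_boundary_row: Optional[int] = None     # pixel row of the content area's upper boundary
    lower_boundary_row: Optional[int] = None     # pixel row of the content area's lower boundary
    background_type: Optional[int] = None        # one of the 6 categories in Table 1 (positive samples only)
```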
In step S107, a third loss function of the second image recognition model is determined according to the predicted background type and the background type in the labeling of the positive sample.
As an example, the third loss function may be a cross-entropy loss function. It should be understood that other suitable types of loss functions are possible, and the present disclosure is not limited thereto.
In step S108, the second image recognition model is trained by adjusting model parameters of the second image recognition model according to the third loss function.
Fig. 4 shows a flowchart of a training method of an image recognition model according to another exemplary embodiment of the present disclosure. As shown in fig. 4, the training method of an image recognition model according to another exemplary embodiment of the present disclosure may further include steps S109, S110, and S111 in addition to the steps S101 to S105 shown in fig. 2. Steps S101 to S105 can be implemented with reference to the specific implementation described with reference to fig. 2, and are not described herein again.
In step S109, the obtained training sample or the feature of the training sample is input into a third image recognition model to obtain the predicted upper and lower boundary positions of the content area in the training sample. It should be understood that fig. 4 only shows the case where the obtained training sample itself is input into the third image recognition model to obtain the predicted upper and lower boundary positions of the content area in the training sample.
For example, the feature of the training sample may be a feature obtained by the first image recognition model performing feature extraction on the input training sample, or a feature obtained by further processing such an extracted feature.
As an example, the upper and lower boundary positions of the content area predicted by the third image recognition model may be expressed as a probability map indicating, for each pixel in the image, the probability that it lies on the upper or lower boundary of the content area, where a higher probability means the pixel is closer to the upper or lower boundary. In the present disclosure, the prediction of the upper and lower boundary positions is treated as a probability-map prediction task: for the annotated upper and lower boundary positions, the supervision information for a training image is a probability map in which, as shown in fig. 5, the higher the probability, the closer the pixel is to the upper or lower boundary.
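The following sketch shows one way such a supervision probability map could be constructed, assuming a Gaussian falloff around each annotated boundary row; the exact construction (including the sigma value) is not specified in the disclosure and is an assumption here.

```python
import numpy as np

def boundary_probability_map(height, width, upper_row, lower_row, sigma=5.0):
    """Build an (height, width) supervision map that peaks at the annotated boundary rows."""
    rows = np.arange(height, dtype=np.float32)
    prob_upper = np.exp(-((rows - upper_row) ** 2) / (2 * sigma ** 2))
    prob_lower = np.exp(-((rows - lower_row) ** 2) / (2 * sigma ** 2))
    per_row = np.maximum(prob_upper, prob_lower)       # higher value = closer to a boundary
    return np.repeat(per_row[:, None], width, axis=1)
```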
As an example, the third image recognition model may be constructed based on a deep learning algorithm, and it should be understood that other suitable types of machine learning algorithms may be used to construct the third image recognition model, which is not limited by this disclosure.
As an example, the model structure of the third image recognition model may be a UNet network structure. It should be understood that other suitable types of network architectures are possible, and the present disclosure is not limited thereto.
In step S110, a fourth loss function of the third image recognition model is determined according to the labels of the training samples and the upper boundary position and the lower boundary position of the content region predicted by the third image recognition model.
As an example, the fourth loss function may be a BCE loss function. It should be understood that other suitable types of loss functions are possible, and the present disclosure is not limited thereto.
In step S111, the third image recognition model is trained by adjusting model parameters of the third image recognition model according to the fourth loss function.
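A minimal training-step sketch for steps S109 to S111, assuming a UNet-style network that outputs one logit per pixel and is supervised with the probability map described above using binary cross-entropy; the function and variable names are illustrative.

```python
import torch.nn as nn

bce_loss = nn.BCEWithLogitsLoss()

def third_model_train_step(unet, optimizer, images, target_prob_maps):
    """images: (B, 3, H, W); target_prob_maps: (B, 1, H, W) with values in [0, 1]."""
    pred_logits = unet(images)                        # (B, 1, H, W) per-pixel boundary logits
    loss = bce_loss(pred_logits, target_prob_maps)    # fourth loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```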
Here, the accuracy of the upper and lower boundary positions of the content area predicted by the third image recognition model is higher than that of the upper and lower boundary positions of the content area predicted by the first image recognition model.
Fig. 6 illustrates a flowchart of an image recognition method according to an exemplary embodiment of the present disclosure.
Referring to fig. 6, in step S201, an image to be recognized is acquired.
As an example, the image to be recognized may be resized to a fixed size before being input into the model. For example, the fixed size may be 320 × 320.
In step S202, an image to be recognized is input into a first image recognition model, and a prediction result of whether the image is a three-segment image or not and an upper boundary position and a lower boundary position of a content area in the predicted image are obtained.
As an example, the upper and lower boundary positions of the content area predicted by the first image recognition model may be expressed as: the ratio of the pixel position at which the upper boundary of the content area is located to the image height, and the ratio of the pixel position at which the lower boundary of the content area is located to the image height.
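For illustration, the predicted ratios can be mapped back to pixel rows of the original-resolution image with a helper such as the following (a hypothetical utility, not part of the disclosure).

```python
def ratios_to_pixel_rows(upper_ratio, lower_ratio, original_height):
    """Map the first model's ratio outputs to pixel rows of the original-resolution image."""
    upper_row = int(round(upper_ratio * original_height))
    lower_row = int(round(lower_ratio * original_height))
    return upper_row, lower_row

# Example: ratios_to_pixel_rows(0.25, 0.75, 1280) -> (320, 960)
```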
As an example, the first image recognition model may be trained using a training method as described in the above exemplary embodiment.
As an example, the image recognition method according to an exemplary embodiment of the present disclosure may further include: and inputting the image or the characteristics of the image into a second image recognition model to obtain the predicted background type of the background area in the image.
As an example, the feature of the image may be a feature obtained by performing feature extraction on the input image by the first image recognition model.
As an example, the image or the features of the image may be input into a second image recognition model when a user instruction requesting to recognize a background type is received, resulting in a predicted background type of a background region in the image.
As an example, the background type may include at least one of the following types: a solid-color background, a solid-color background with text or patterns, a ground-glass (blurred) background, a ground-glass background with text or patterns, a top-middle-bottom repeated background in which the background region shows the same picture as the content region, and other types.
As an example, the second image recognition model may be trained using the training method described in the above exemplary embodiment.
As an example, the image recognition method according to an exemplary embodiment of the present disclosure may further include: inputting the image or the characteristics of the image into a third image recognition model to obtain the predicted upper boundary position and lower boundary position of the content area in the image; wherein the accuracy of the upper boundary position and the lower boundary position of the content area predicted by the third image recognition model is higher than that of the upper boundary position and the lower boundary position of the content area predicted by the first image recognition model.
As an example, the upper and lower boundary positions of the content area predicted by the third image recognition model may be expressed as a probability map indicating, for each pixel in the image, the probability that it lies on the upper or lower boundary of the content area, where a higher probability means the pixel is closer to the upper or lower boundary.
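One possible post-processing step, an assumption not specified in the disclosure, is to turn the probability map into two boundary rows by averaging it over columns and picking the most probable row in the upper and lower halves of the image:

```python
import numpy as np

def boundaries_from_probability_map(prob_map):
    """prob_map: (H, W) array of per-pixel boundary probabilities from the third model."""
    row_scores = prob_map.mean(axis=1)                 # average over columns -> one score per row
    mid = len(row_scores) // 2
    upper_row = int(np.argmax(row_scores[:mid]))       # most probable row in the upper half
    lower_row = mid + int(np.argmax(row_scores[mid:])) # most probable row in the lower half
    return upper_row, lower_row
```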
As an example, the image or the features of the image may be input into a third image recognition model when a user instruction requesting to accurately locate the boundary of the content area is received, resulting in predicted upper and lower boundary positions of the content area in the image.
As an example, the third image recognition model may be trained using the training method described in the above exemplary embodiment.
As an example, the image recognition method according to an exemplary embodiment of the present disclosure may further include: and evaluating and/or processing the image according to the predicted background type of the background area in the image. As an example, the picture quality of the image may be evaluated according to a predicted background type of a background region in the image. As an example, the image may be color enhanced according to a predicted background type of a background region in the image.
As an example, the image recognition method according to an exemplary embodiment of the present disclosure may further include: and evaluating and/or processing the image according to the upper boundary position and the lower boundary position of the content area in the image predicted by the first image recognition model. As an example, the picture quality of the image may be evaluated based on the upper and lower boundary positions of the content area in the image predicted by the first image recognition model. As an example, the image may be color-enhanced according to upper and lower boundary positions of a content area in the image predicted by the first image recognition model.
As an example, the image recognition method according to an exemplary embodiment of the present disclosure may further include: and evaluating and/or processing the image according to the upper boundary position and the lower boundary position of the content area in the image predicted by the third image recognition model. As an example, the picture quality of the image may be evaluated according to the upper and lower boundary positions of the content area in the image predicted by the third image recognition model. As an example, the image may be color-enhanced according to upper and lower boundary positions of a content area in the image predicted by the third image recognition model.
As an example, a three-segment image may be evaluated and/or image-processed differently from, and specifically for, a normal image (e.g., an image in which the entire image is the content area and there is no background area).
As an example, in evaluating the picture quality, the background area and the content area may be evaluated separately.
As an example, appropriate image processing can be performed taking into account the picture displayed in the background area and the picture displayed in the content area, as in the sketch below.
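A purely illustrative helper for such region-wise handling, assuming the boundary rows are already known; the downstream evaluation or enhancement functions that would consume these crops are not shown.

```python
def split_three_segment(image, upper_row, lower_row):
    """image: (H, W, C) array; returns (top background, content area, bottom background)."""
    top_background = image[:upper_row]
    content = image[upper_row:lower_row]
    bottom_background = image[lower_row:]
    return top_background, content, bottom_background
```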
The specific processing in the image recognition method according to the exemplary embodiment of the present disclosure has been described in detail in the embodiment of the above-described related training method of the image recognition model, and will not be explained in detail here.
Fig. 7 illustrates an example of an image recognition method according to an exemplary embodiment of the present disclosure.
As shown in fig. 7, the three-segment detection model may consist of three parts: the first image recognition model, the second image recognition model, and the third image recognition model. An input image is first fed into the first image recognition model, which outputs the probability that the image is a three-segment image together with coarse boundary positions. The user can choose whether to enable precise localization; when it is enabled, the image or its features are fed into the third image recognition model, which predicts more accurate boundary positions and corrects the prediction of the first image recognition model. The user can also choose whether to perform background classification; when it is enabled, the second image recognition model outputs the background class of the three-segment image. Users can thus enable functions as needed: if only three-segment detection is required, only the first image recognition model is used, which reduces the amount of computation and improves resource utilization; if precise three-segment boundary positions are required, the third image recognition model is enabled to ensure accurate predictions for downstream algorithms.
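The cascaded inference flow of fig. 7 can be sketched as follows, reusing the illustrative models from the earlier sketches; the dictionary-based interface and the flag names are assumptions of this sketch.

```python
def recognize(image, first_model, second_model=None, third_model=None,
              precise_boundaries=False, classify_background=False):
    """image: (1, 3, H, W) tensor; returns a dictionary of whatever the user requested."""
    logits, ratios = first_model(image)
    result = {
        "three_segment_prob": float(logits.softmax(dim=-1)[..., 1]),
        "boundaries": ratios,                          # coarse upper/lower boundary ratios
    }
    if precise_boundaries and third_model is not None:
        result["boundaries"] = third_model(image)      # refined probability-map prediction
    if classify_background and second_model is not None:
        result["background_type"] = second_model(image)
    return result
```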
Because the present disclosure performs image-level prediction, it can be applied to video by using a frame-extraction strategy: the three-segment information is computed for sampled video frames, and the per-frame information is then converted into video-level information by averaging or another scheme, which gives good extensibility.
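A sketch of this frame-extraction strategy, assuming frames are sampled at a fixed stride and the per-frame three-segment probabilities are simply averaged; as noted above, other aggregation schemes are possible.

```python
def video_three_segment_score(frames, first_model, stride=30):
    """frames: sequence of (1, 3, H, W) tensors; returns the mean three-segment probability."""
    sampled = frames[::stride]
    probs = [float(first_model(f)[0].softmax(dim=-1)[..., 1]) for f in sampled]
    return sum(probs) / max(len(probs), 1)
```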
Performance was verified on a constructed test set. The first part of the test set consists of 500 normal images and 500 three-segment images and is used to test the accuracy of three-segment classification; the three-segment images have accurately labeled boundary positions and are also used to test boundary localization (this experiment verifies the intersection over union between the predicted content area and the actual content area). The second part of the test set consists of 1000 images with various backgrounds and is used to measure background classification accuracy.
Table 2: Image recognition model effect
Fig. 8 illustrates a block diagram of a training apparatus of an image recognition model according to an exemplary embodiment of the present disclosure.
As shown in fig. 8, the training apparatus 10 of an image recognition model according to an exemplary embodiment of the present disclosure includes: a sample acquisition unit 101, a first prediction unit 102, a first loss function calculation unit 103, a second loss function calculation unit 104, and a first training unit 105.
Specifically, the sample acquisition unit 101 is configured to acquire a training sample, wherein the training sample comprises an image with an annotation, the annotation comprising: an identifier of whether the image is a three-segment image, and an upper boundary position and a lower boundary position of the content area.
The first prediction unit 102 is configured to input the acquired training sample into a first image recognition model, and obtain a prediction result of whether the training sample is a three-segment image, and predicted upper and lower boundary positions of a content area in the training sample.
The first loss function calculation unit 103 is configured to determine a first loss function of the first image recognition model according to the prediction result of whether the training sample is a three-segment image and the label of the training sample.
The second loss function calculation unit 104 is configured to determine a second loss function of the first image recognition model based on the labels of the training samples and the upper and lower boundary positions of the predicted content area.
The first training unit 105 is configured to train the first image recognition model by adjusting model parameters of the first image recognition model according to the first loss function and the second loss function, wherein the middle segment of the three-segment image is the content area, and the portions above and below the middle segment are background areas.
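For illustration only, a single training step of the first image recognition model could combine the two losses as sketched below in PyTorch-style Python. The concrete loss forms (binary cross-entropy for the classification loss and L1 for the boundary regression), the loss weight, and the model interface are assumptions, not the disclosed implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: one training step combining the first and second loss.
def train_step(model1, optimizer, images, is_three_segment, boundaries, alpha=1.0):
    # images: (N, C, H, W); is_three_segment: (N,) in {0, 1};
    # boundaries: (N, 2) upper/lower positions as ratios of the image height.
    logits, pred_boundaries = model1(images)

    # First loss: classification of whether the sample is a three-segment image.
    loss_cls = F.binary_cross_entropy_with_logits(logits, is_three_segment.float())

    # Second loss: regression of the upper and lower boundary positions,
    # computed only on positive (three-segment) samples.
    mask = is_three_segment.bool()
    if mask.any():
        loss_reg = F.l1_loss(pred_boundaries[mask], boundaries[mask])
    else:
        loss_reg = torch.zeros((), device=images.device)

    loss = loss_cls + alpha * loss_reg  # weighted combination (weight is an assumption)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```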
As an example, the training apparatus 10 of an image recognition model according to an exemplary embodiment of the present disclosure may further include: a second prediction unit (not shown), a third loss function calculation unit (not shown), and a second training unit (not shown).
Specifically, the second prediction unit is configured to input a positive sample, or features of the positive sample, among the acquired training samples into a second image recognition model, and obtain a predicted background type of the positive sample, where a positive sample is a training sample whose identifier indicates a three-segment image, and the annotation of the positive sample further includes: a background type indicating the type of the background region in the positive sample.
A third loss function calculation unit configured to determine a third loss function of the second image recognition model according to the predicted background type and the labeling of the positive sample.
A second training unit configured to train the second image recognition model by adjusting model parameters of the second image recognition model according to a third loss function.
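A hypothetical sketch of how the second image recognition model could be trained on positive samples with the third loss is given below; the feature-extraction call on the first model, the cross-entropy loss form, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: third loss for the background-type classifier (second model).
def train_background_step(model1, model2, optimizer, images, is_three_segment, bg_labels):
    mask = is_three_segment.bool()  # positive samples only
    if not mask.any():
        return None
    with torch.no_grad():
        features = model1.extract(images[mask])  # assumed feature-extraction API of model1
    logits = model2(features)                    # (P, num_background_types)
    loss = F.cross_entropy(logits, bg_labels[mask])  # third loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```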
As an example, the feature of the positive sample may be a feature obtained by performing feature extraction on the input positive sample by using a first image recognition model.
As an example, the background type may comprise at least one of the following types: a solid-color background type, a solid-color background with text or patterns type, a frosted-glass background type, a frosted-glass background with text or patterns type, a top-middle-bottom repeated background type, and other types, wherein the top-middle-bottom repeated background type is a type in which the picture displayed in the background area is the same as the picture displayed in the content area.
As an example, the training apparatus 10 of an image recognition model according to an exemplary embodiment of the present disclosure may further include: a third prediction unit (not shown), a fourth loss function calculation unit (not shown), and a third training unit (not shown).
Specifically, the third prediction unit is configured to input the obtained training samples or the features of the training samples into a third image recognition model, and obtain predicted upper boundary positions and lower boundary positions of the content areas in the training samples.
The fourth loss function calculation unit is configured to determine a fourth loss function of the third image recognition model according to the labels of the training samples and the upper and lower boundary positions of the content area predicted by the third image recognition model.
A third training unit configured to train the third image recognition model by adjusting model parameters of the third image recognition model according to a fourth loss function; wherein the accuracy of the upper boundary position and the lower boundary position of the content area predicted by the third image recognition model is higher than that of the upper boundary position and the lower boundary position of the content area predicted by the first image recognition model.
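Assuming the probability-map output described in the next paragraph, one hypothetical form of the fourth loss is a per-row boundary classification loss, sketched below; the row-wise logits layout and the cross-entropy formulation are illustrative assumptions rather than the disclosed loss.

```python
import torch.nn.functional as F

# Hypothetical sketch: fourth loss for the refinement model, assuming it outputs
# row-wise logits for the upper and lower boundaries of the content area.
def refinement_loss(pred_logits, boundaries, height):
    # pred_logits: (N, 2, H) logits over image rows for the upper/lower boundary.
    # boundaries: (N, 2) annotated boundary positions as ratios of the image height.
    target_rows = (boundaries * (height - 1)).round().long()  # (N, 2) row indices
    loss_upper = F.cross_entropy(pred_logits[:, 0, :], target_rows[:, 0])
    loss_lower = F.cross_entropy(pred_logits[:, 1, :], target_rows[:, 1])
    return loss_upper + loss_lower
```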
As an example, the upper and lower boundary positions of the content area predicted by the first image recognition model may be expressed as: the ratio of the pixel position of the upper boundary of the content area to the image height, and the ratio of the pixel position of the lower boundary of the content area to the image height; and/or the upper and lower boundary positions of the content area predicted by the third image recognition model may be expressed as: a probability map indicating, for each pixel in the image, the probability that the pixel lies on the upper or lower boundary of the content area, where the higher the probability corresponding to a pixel, the closer that pixel is to the upper or lower boundary.
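The two boundary representations can be related as in the following illustrative sketch (not from the original disclosure): the ratios predicted by the first image recognition model map directly to pixel rows, while the probability map predicted by the third image recognition model can be reduced to pixel rows by taking the most probable row for each boundary.

```python
import torch

# Hypothetical sketch relating the two boundary representations.
def ratios_to_rows(upper_ratio, lower_ratio, height):
    # First model: boundaries expressed as ratios of the image height.
    return (int(round(upper_ratio * (height - 1))),
            int(round(lower_ratio * (height - 1))))

def prob_map_to_rows(prob_map):
    # Third model: prob_map is (2, H), the per-row probability of being the upper
    # or lower boundary; a higher probability means the row is closer to the boundary.
    upper_row = int(torch.argmax(prob_map[0]).item())
    lower_row = int(torch.argmax(prob_map[1]).item())
    return upper_row, lower_row
```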
Fig. 9 illustrates a block diagram of a structure of an image recognition apparatus according to an exemplary embodiment of the present disclosure.
As shown in fig. 9, the image recognition apparatus 20 according to an exemplary embodiment of the present disclosure includes: an image acquisition unit 201, and a first prediction unit 202.
Specifically, the image acquisition unit 201 is configured to acquire an image to be recognized.
The first prediction unit 202 is configured to input an image to be recognized into a first image recognition model, and obtain a prediction result of whether the image is a three-segment image and predicted upper and lower boundary positions of a content area in the image, wherein the middle segment of the three-segment image is the content area, and the portions above and below the middle segment are background areas.
As an example, the image recognition apparatus 20 according to an exemplary embodiment of the present disclosure may further include: a second prediction unit (not shown) configured to input the image or features of the image into a second image recognition model, resulting in a predicted background type of a background region in the image.
As an example, the feature of the image may be a feature obtained by performing feature extraction on the input image by the first image recognition model.
As an example, the background type may comprise at least one of the following types: a solid-color background type, a solid-color background with text or patterns type, a frosted-glass background type, a frosted-glass background with text or patterns type, a top-middle-bottom repeated background type, and other types, wherein the top-middle-bottom repeated background type is a type in which the picture displayed in the background area is the same as the picture displayed in the content area.
As an example, the image recognition apparatus 20 according to an exemplary embodiment of the present disclosure may further include: a third prediction unit (not shown) configured to input the image or features of the image into a third image recognition model, resulting in predicted upper and lower boundary positions of a content area in the image; wherein the accuracy of the upper boundary position and the lower boundary position of the content area predicted by the third image recognition model is higher than that of the upper boundary position and the lower boundary position of the content area predicted by the first image recognition model.
As an example, the upper and lower boundary positions of the content area predicted by the first image recognition model may be expressed as: the ratio of the pixel position of the upper boundary of the content area to the image height, and the ratio of the pixel position of the lower boundary of the content area to the image height; and/or the upper and lower boundary positions of the content area predicted by the third image recognition model may be expressed as: a probability map indicating, for each pixel in the image, the probability that the pixel lies on the upper or lower boundary of the content area, where the higher the probability corresponding to a pixel, the closer that pixel is to the upper or lower boundary.
As an example, the second prediction unit may be configured to, when receiving a user instruction requesting identification of a background type, input the image or a feature of the image into a second image identification model, resulting in a predicted background type of a background region in the image; and/or, the second image recognition model may be trained using a training apparatus as described in the above exemplary embodiments.
As an example, the third prediction unit may be configured to, when receiving a user instruction requesting accurate positioning of a boundary of a content area, input the image or a feature of the image into a third image recognition model, resulting in predicted upper and lower boundary positions of the content area in the image; and/or, the third image recognition model may be trained using a training apparatus as described in the above exemplary embodiments.
As an example, the first image recognition model may be trained using a training apparatus as described in the above exemplary embodiments.
As an example, the image recognition apparatus 20 according to an exemplary embodiment of the present disclosure may further include: a processing unit (not shown).
As an example, the processing unit may be configured to evaluate and/or image process the image depending on a predicted background type of a background region in the image.
As an example, the processing unit may be configured to evaluate and/or image process the image according to an upper boundary position and a lower boundary position of a content area in the image predicted by the first image recognition model.
As an example, the processing unit may be configured to evaluate and/or image process the image according to the upper and lower boundary positions of the content area in the image predicted by the third image recognition model.
As an example, evaluating the image may include: evaluating the picture quality of the image; and/or processing the image may include: performing color enhancement processing on the image.
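As a purely illustrative sketch (the cropping-based evaluation and the callable quality/enhancement routines are assumptions, not the disclosed algorithm), region-aware evaluation and color enhancement based on the predicted boundaries might be applied as follows.

```python
# Hypothetical sketch: evaluate and enhance the content and background areas separately,
# given boundary rows predicted by the first or third image recognition model.
def process_three_segment(image, upper_row, lower_row, assess_quality, enhance_color):
    # image: (H, W, C) array; upper_row/lower_row delimit the middle content segment.
    content = image[upper_row:lower_row]
    backgrounds = [image[:upper_row], image[lower_row:]]

    # Evaluate picture quality separately for the content area and the background areas.
    scores = {
        "content_quality": assess_quality(content),
        "background_quality": [assess_quality(b) for b in backgrounds if b.size > 0],
    }

    # Apply color enhancement only to the content area, leaving the background untouched.
    image[upper_row:lower_row] = enhance_color(content)
    return image, scores
```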
With regard to the apparatus in the above-described embodiment, the specific manner in which the respective units perform operations has been described in detail in the embodiment related to the method, and will not be elaborated upon here.
Further, it should be understood that the respective units in the training apparatus 10 and the image recognition apparatus 20 of the image recognition model according to the exemplary embodiments of the present disclosure may be implemented as hardware components and/or software components. For example, the respective units may be implemented using a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), depending on the processing performed by each unit.
Fig. 10 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure. Referring to fig. 10, the electronic device 30 includes: at least one memory 301 and at least one processor 302, the at least one memory 301 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 302, perform a method of training an image recognition model and/or a method of image recognition as described in the above exemplary embodiments.
By way of example, the electronic device 30 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device 30 need not be a single electronic device; it can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device 30 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 30, the processor 302 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 302 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 302 may execute instructions or code stored in the memory 301, wherein the memory 301 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 301 may be integrated with the processor 302, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 301 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 301 and the processor 302 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 302 is able to read files stored in the memory.
In addition, the electronic device 30 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 30 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of the image recognition model and/or the image recognition method as described in the above exemplary embodiments. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, comprising instructions which, when executed by at least one processor, perform the training method of the image recognition model and/or the image recognition method as described in the above exemplary embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An image recognition method, comprising:
acquiring an image to be identified;
inputting an image to be recognized into a first image recognition model, and obtaining a prediction result of whether the image is a three-section image and a predicted upper boundary position and lower boundary position of a content area in the image;
the middle section of the three-section image is a content area, and the parts above the middle section and below the middle section are background areas.
2. The method of claim 1, further comprising:
inputting the image or features of the image into a second image recognition model to obtain a predicted background type of a background area in the image.
3. The method according to claim 2, wherein the feature of the image is a feature obtained by feature extraction of the input image by the first image recognition model.
4. The method of claim 2, wherein the context type comprises at least one of the following types: a solid background type, a type of solid background with letters or patterns, a frosted glass background type, a type of frosted glass background with letters or patterns, a top-middle-bottom repeated background type, and other types,
wherein the top-middle-bottom repeated background type is a type in which the picture displayed in the background area is the same as the picture displayed in the content area.
5. A training method of an image recognition model is characterized by comprising the following steps:
obtaining a training sample, wherein the training sample comprises an image with an annotation, and the annotation comprises: whether the image is the mark of the three-section image or not, and the upper boundary position and the lower boundary position of the content area;
inputting an obtained training sample into a first image recognition model to obtain a prediction result of whether the training sample is a three-section image or not and the predicted upper boundary position and lower boundary position of a content area in the training sample;
determining a first loss function of a first image recognition model according to the prediction result of whether the training sample is a three-section image and the label of the training sample;
determining a second loss function of the first image recognition model according to the label of the training sample and the upper boundary position and the lower boundary position of the predicted content area;
training the first image recognition model by adjusting model parameters of the first image recognition model according to the first loss function and the second loss function;
the middle section of the three-section image is a content area, and the parts above the middle section and below the middle section are background areas.
6. An image recognition apparatus characterized by comprising:
an image acquisition unit configured to acquire an image to be recognized;
a first prediction unit configured to input an image to be recognized into a first image recognition model, and obtain a prediction result of whether the image is a three-segment image, and predicted upper and lower boundary positions of a content area in the image;
the middle section of the three-section image is a content area, and the parts above the middle section and below the middle section are background areas.
7. An apparatus for training an image recognition model, comprising:
a sample acquisition unit configured to acquire a training sample, wherein the training sample comprises an image with an annotation, the annotation comprising: an identifier of whether the image is a three-section image, and an upper boundary position and a lower boundary position of the content area;
a first prediction unit configured to input an acquired training sample into a first image recognition model, and obtain a prediction result of whether the training sample is a three-segment image, and predicted upper and lower boundary positions of a content area in the training sample;
a first loss function calculation unit configured to determine a first loss function of a first image recognition model according to a prediction result of whether the training sample is a three-segment image and an annotation of the training sample;
a second loss function calculation unit configured to determine a second loss function of the first image recognition model according to the labels of the training samples and the upper boundary position and the lower boundary position of the predicted content area;
a first training unit configured to train the first image recognition model by adjusting model parameters of the first image recognition model according to a first loss function and a second loss function;
the middle section of the three-section image is a content area, and the parts above the middle section and below the middle section are background areas.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the image recognition method of any one of claims 1 to 4 and/or the training method of the image recognition model of claim 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the image recognition method of any one of claims 1 to 4 and/or the training method of the image recognition model of claim 5.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by at least one processor, implement the image recognition method of any one of claims 1 to 4 and/or the training method of the image recognition model of claim 5.
CN202110791423.XA 2021-07-13 2021-07-13 Training method of image recognition model, image recognition method and device Pending CN113435535A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110791423.XA CN113435535A (en) 2021-07-13 2021-07-13 Training method of image recognition model, image recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110791423.XA CN113435535A (en) 2021-07-13 2021-07-13 Training method of image recognition model, image recognition method and device

Publications (1)

Publication Number Publication Date
CN113435535A true CN113435535A (en) 2021-09-24

Family

ID=77760341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110791423.XA Pending CN113435535A (en) 2021-07-13 2021-07-13 Training method of image recognition model, image recognition method and device

Country Status (1)

Country Link
CN (1) CN113435535A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003264833A (en) * 2002-03-07 2003-09-19 Sony Corp Image processing apparatus and image processing method, program, and recording medium
US8548267B1 (en) * 2007-09-28 2013-10-01 Amazon Technologies, Inc. Processing a digital image of content using content aware despeckling
CN105279764A (en) * 2014-05-27 2016-01-27 北京三星通信技术研究有限公司 Eye image processing device and eye image processing method
CN105828156A (en) * 2016-03-22 2016-08-03 乐视网信息技术(北京)股份有限公司 Method and device of generating title background in video picture
CN110070533A (en) * 2019-04-23 2019-07-30 科大讯飞股份有限公司 A kind of evaluating method of object detection results, device, equipment and storage medium
CN112232345A (en) * 2020-10-10 2021-01-15 安徽淘云科技有限公司 Configuration information determining and image effective area extracting method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
段秉环;文鹏程;李鹏;: "基于改进PVANet的实时小目标检测方法", 计算机应用研究, no. 02 *

Similar Documents

Publication Publication Date Title
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
US10642892B2 (en) Video search method and apparatus
CN110390033B (en) Training method and device for image classification model, electronic equipment and storage medium
CN110033018B (en) Graph similarity judging method and device and computer readable storage medium
CN109919244B (en) Method and apparatus for generating a scene recognition model
CN110019943B (en) Video recommendation method and device, electronic equipment and storage medium
CN109145828B (en) Method and apparatus for generating video category detection model
CN109918513B (en) Image processing method, device, server and storage medium
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN112287994A (en) Pseudo label processing method, device, equipment and computer readable storage medium
CN114003758B (en) Training method and device of image retrieval model and retrieval method and device
CN111723815A (en) Model training method, image processing method, device, computer system, and medium
CN111522979B (en) Picture sorting recommendation method and device, electronic equipment and storage medium
CN114339362B (en) Video bullet screen matching method, device, computer equipment and storage medium
CN115205883A (en) Data auditing method, device, equipment and storage medium based on OCR (optical character recognition) and NLP (non-line language)
CN112989312B (en) Verification code identification method and device, electronic equipment and storage medium
CN103544170A (en) Method and device for assessing browsing quality
CN116883909A (en) Live sensitive image identification method, live sensitive image identification device, computer equipment and storage medium
CN116977260A (en) Target defect detection method and device, electronic equipment and storage medium
CN111382383A (en) Method, device, medium and computer equipment for determining sensitive type of webpage content
CN112801960B (en) Image processing method and device, storage medium and electronic equipment
CN113435535A (en) Training method of image recognition model, image recognition method and device
CN113408517B (en) Image display method and device and electronic equipment
CN115168575A (en) Subject supplement method applied to audit field and related equipment
CN113066024A (en) Training method of image blur detection model, image blur detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination