CN115147757A - Image recognition method and device, computer readable storage medium and computing device - Google Patents

Image recognition method and device, computer readable storage medium and computing device

Info

Publication number
CN115147757A
Authority
CN
China
Prior art keywords
training
target image
image
position information
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210705722.1A
Other languages
Chinese (zh)
Inventor
吴承昊
唐百川
王烨诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yilueming Digital Technology Co., Ltd.
Original Assignee
Shanghai Yilueming Digital Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yilueming Digital Technology Co., Ltd.
Priority to CN202210705722.1A
Publication of CN115147757A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/60 - Type of objects
    • G06V 20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition
    • G06V 30/14 - Image acquisition
    • G06V 30/1444 - Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition
    • G06V 30/19 - Recognition using electronic means
    • G06V 30/19007 - Matching; Proximity measures
    • G06V 30/19093 - Proximity measures, i.e. similarity or distance measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition
    • G06V 30/19 - Recognition using electronic means
    • G06V 30/191 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V 30/19107 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition
    • G06V 30/19 - Recognition using electronic means
    • G06V 30/191 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V 30/19147 - Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

An image recognition method and device, a computer-readable storage medium and a computing device are provided. The method comprises the following steps: acquiring a video and acquiring a target image from the video, wherein the video is obtained by screen-recording an operation interface of a target application program; performing optical character recognition on the target image to obtain text content contained in the target image and first position information, wherein the first position information indicates the position of the text content in the target image; inputting the target image into a trained recognition model to recognize the selected option boxes in the target image and obtain second position information, wherein the second position information indicates the position of each selected option box in the target image; and determining the first position information that matches the second position information, and taking the text content at the matched first position information as the recognition result of the target image. Therefore, when the operation information of a user is acquired from the recorded video, the acquisition efficiency can be effectively improved.

Description

Image recognition method and device, computer readable storage medium and computing device
Technical Field
The present invention relates to the field of image recognition, and in particular, to an image recognition method and apparatus, a computer-readable storage medium, and a computing device.
Background
With the development of mobile terminals and network technologies, videos of application program operation interfaces can be recorded through screen recording technology. However, how to acquire the operation information of users from the recorded videos has become an urgent problem to be solved.
Conventionally, Optical Character Recognition (OCR) is performed on images in the screen recording to obtain the text information in the images, and manual inspection and other means are then combined to determine the operations performed by the user in the recording, so as to obtain the operation information of the user from the recorded video.
However, this conventional method requires a lot of labor cost and is inefficient.
Disclosure of Invention
The invention solves the technical problem of how to effectively improve the acquisition efficiency when acquiring the operation information of a user from a recorded video.
In order to solve the above technical problem, an embodiment of the present invention provides an image recognition method, the method comprising: acquiring a video and acquiring a target image from the video, wherein the video is obtained by screen-recording an operation interface of a target application program; performing optical character recognition on the target image to obtain text content contained in the target image and first position information, wherein the first position information indicates the position of the text content in the target image; inputting the target image into a trained recognition model to recognize the selected option boxes in the target image and obtain second position information, wherein the second position information indicates the position of each selected option box in the target image; and determining the first position information that matches the second position information, and taking the text content at the matched first position information as the recognition result of the target image.
Optionally, before inputting the target image into the trained recognition model, the method further includes: acquiring training images, and dividing the training images into a training set and a test set; training an initial model by taking the training set as training samples to obtain a recognition model; and testing the recognition model through the test set, and obtaining the trained recognition model after the test is passed.
Optionally, the training images include a stitched image, and the method further includes: acquiring original images, stitching a plurality of the original images, and reducing the stitched result according to the size of an original image to obtain the stitched image; the original images are images obtained by screen-recording an operation interface of a training application program, and the operation interface of the training application program is provided with at least one option box.
Optionally, training the initial model with the training set as training samples to obtain the recognition model includes: step A, labeling a first part of the training images in the training set, and training the initial model with the labeled images as training samples to obtain an intermediate model; step B, acquiring the next part of the training images in the training set, inputting the next part of training images into the intermediate model to identify the option boxes in them, outputting the recognition results, and receiving an external correction of the recognition results, wherein the correction includes labeling the option boxes that are inaccurately identified in the recognition results; step C, training the intermediate model with the images labeled in step B as training samples; and repeatedly executing step B and step C until the training of the intermediate model is completed, and acquiring the trained intermediate model as the recognition model.
Optionally, the initial model adopts a YOLO algorithm, and labeling the first part of the training images in the training set includes: identifying the boxes in the first part of the training images through the initial model; clustering the identified boxes to obtain clustered boxes; selecting one or more classes of the clustered boxes as boxes to be labeled based on the similarity between each class of clustered boxes and the option box; and labeling the boxes to be labeled to obtain a labeling result.
Optionally, labeling the boxes to be labeled to obtain a labeling result further includes: outputting the labeling result and obtaining an external adjustment of the labeling result, wherein the adjustment includes deleting wrong labels in the labeling result and/or labeling boxes to be labeled that were not identified.
Optionally, before acquiring the target image from the video, the method further includes: and selecting a key frame from the video as the target image.
Optionally, after selecting a key frame from the video as the target image, the method further includes: calculating the similarity between different target images according to at least one of a histogram comparison method, a block histogram comparison method and a perceptual hash algorithm; and removing the duplicate of the target image according to the similarity.
An embodiment of the present invention further provides an image recognition apparatus, the apparatus including: a target image acquisition module, used for acquiring a video and acquiring a target image from the video, wherein the video is obtained by screen-recording an operation interface of a target application program; an optical character recognition module, used for performing optical character recognition on the target image to obtain text content contained in the target image and first position information, the first position information indicating the position of the text content in the target image; an option box recognition module, used for inputting the target image into a trained recognition model to recognize the selected option boxes in the target image and obtain second position information, the second position information indicating the position of each selected option box in the target image; and a result acquisition module, used for determining the first position information that matches the second position information and taking the text content at the matched first position information as the recognition result of the target image.
Embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of any of the methods.
An embodiment of the present invention further provides a computing device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of any one of the methods when executing the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides an image recognition method, which includes: acquiring a video and acquiring a target image from the video, wherein the video is obtained by screen-recording an operation interface of a target application program; performing optical character recognition on the target image to obtain text content contained in the target image and first position information, wherein the first position information indicates the position of the text content in the target image; inputting the target image into a trained recognition model to recognize the selected option boxes in the target image and obtain second position information, wherein the second position information indicates the position of each selected option box in the target image; and determining the first position information that matches the second position information, and taking the text content at the matched first position information as the recognition result of the target image. Compared with the prior art, the image recognition method of this embodiment performs full optical character recognition on the target image to obtain all the text content contained in the target image, and automatically recognizes all the selected option boxes contained in the target image through the trained recognition model. Based on the positional relationship between the option boxes and the text content, the text content corresponding to the selected option boxes is obtained as the recognition result of the target image and used as the acquired operation information of the user. Therefore, the efficiency of acquiring the operation information of the user from the recorded video can be effectively improved.
Further, in the model training stage, the training images are divided into a training set and a test set; the training set is used to train the model, and the test set is used to verify the recognition effect of the trained model. Therefore, the recognition effect of the trained recognition model can be verified during the training stage. If the training effect does not meet the requirement, more training samples can be obtained and the training steps of the embodiment repeated until a recognition model meeting the requirement is obtained; that model is then used as the trained recognition model.
Further, after the original images are obtained (for example, images taken directly from historically collected videos without any processing), stitched images, enhanced images, processed images and the like can be generated from the original images, which increases the number of training samples during model training, enriches the backgrounds and option-box sizes seen by image recognition, and helps avoid overfitting of the model.
Further, the training set is divided into a plurality of parts, model training is started after the first part of the training images in the training set is labeled, and the option boxes that are inaccurately recognized are then labeled based on the recognition results of the partially trained model. Compared with the conventional approach of manually labeling all the training images in the training set, the method provided by the embodiment of the invention can reduce the labeling cost (for example, time cost and labor cost). In addition, the training images in the training set are divided into a plurality of parts and each part is acquired in turn for model training, which helps the annotator find the capability boundary of the currently trained model and benefits the subsequent model-tuning process.
Further, the recognition model is a YOLO model; the characteristics of rectangular boxes (bounding boxes) can be output by the YOLO model, and all the boxes (for example, rectangular boxes) contained in the first part of the training images can be obtained at the output end of the YOLO model. Through clustering and similarity calculation, one or more classes of the clustered boxes are selected as the boxes to be labeled, and the boxes to be labeled are labeled automatically. Compared with the traditional approach of manually labeling the training images, the automatic labeling of this embodiment can greatly reduce the cost of labeling samples and improve the efficiency of model training. In addition, when the YOLO model is trained, anchor boxes are obtained adaptively by the model through clustering and similarity calculation, and hand-picked prior boxes are replaced by the adaptive anchor boxes, so that the YOLO model knows roughly how large a rectangular box to use when searching for the target to be recognized, which helps the model converge quickly.
Drawings
Fig. 1 is a schematic flowchart of an image recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a training method for a recognition model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention.
Detailed Description
As background art shows, in the prior art, a large amount of labor cost is required to obtain operation information of a user from a recorded video, and efficiency is low.
In order to solve the problem, an embodiment of the present invention provides an image recognition method, including: acquiring a video and acquiring a target image from the video, wherein the video is obtained by screen-recording an operation interface of a target application program; performing optical character recognition on the target image to obtain text content contained in the target image and first position information, wherein the first position information indicates the position of the text content in the target image; inputting the target image into a trained recognition model to recognize the selected option boxes in the target image and obtain second position information, wherein the second position information indicates the position of each selected option box in the target image; and determining the first position information that matches the second position information, and taking the text content at the matched first position information as the recognition result of the target image. By adopting the method of this embodiment, the efficiency of acquiring the operation information of a user from a recorded video can be effectively improved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In an embodiment, please refer to fig. 1, which is a schematic flow chart of an image recognition method according to an embodiment of the present invention. The image recognition method can be executed by a terminal or a server. The terminal may include devices such as a smart phone, a computer, a tablet computer and a smart watch, and one or more application programs (APPs) are installed on the terminal.
The image recognition method may include steps S101 to S104 as described below in detail.
Step S101, acquiring a video and acquiring a target image from the video, wherein the video is obtained by recording a screen of an operation interface of a target application program.
The target application program may include one or more APPs installed on the terminal. In a specific example, the server sends an instruction to the terminal, where the instruction may carry related information of the target application program to specify one or more APPs on the terminal as the target application program. In another specific example, a user enters an instruction on the terminal to authorize the terminal to target one or more APPs as target applications.
In a non-limiting example, a user authorizes a terminal to use one or more APPs as target applications, and when the terminal opens one of the target applications, a screen recording process on the terminal is triggered to record a screen on an interface of the opened target application to obtain a video.
Optionally, after the terminal records the screen to obtain the video, the image recognition method described in step S101 to step S104 may be directly performed on the video obtained by recording the screen. Or, the terminal may send a video obtained by screen recording to the server, and after the server obtains the video, the image recognition method described in step S101 to step S104 is executed.
Step S102, performing Optical Character Recognition (OCR) on the target image to obtain the text content and first position information contained in the target image, wherein the first position information is used for indicating the position of the text content in the target image.
Specifically, the text content contained in the target image is identified by OCR. Before, after, or at the same time as the text content is identified, the position of the identified text content in the target image is determined, and the information of this position is recorded as the first position information.
In one specific example, a coordinate system is established for the target image, and the first location information may represent coordinates of the recognized text content in the coordinate system of the target image.
In another specific example, the target image is divided into a plurality of regions, and the first position information may indicate that the recognized text content belongs to at least one region of the target image.
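For illustration only, the following is a minimal sketch of step S102, assuming pytesseract (with a Chinese language pack installed) as the OCR engine; the patent does not name a specific OCR library, so the library, the language setting and the function names below are assumptions.

```python
# Sketch of step S102: OCR the target image and keep, for each recognized text,
# its first position information (bounding box in image coordinates).
import pytesseract
from pytesseract import Output
from PIL import Image

def ocr_with_positions(image_path):
    """Return a list of (text, (x, y, w, h)) pairs for one target image."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, lang="chi_sim", output_type=Output.DICT)
    results = []
    for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                                data["width"], data["height"]):
        if text.strip():
            results.append((text, (x, y, w, h)))
    return results
```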
Step S103, inputting the target image into a trained recognition model to recognize the selected option boxes in the target image and obtain second position information, wherein the second position information is used for indicating the position of each selected option box in the target image.
The recognition model is used for recognizing the option boxes in the target image and may be a neural network model; for example, the recognition model may include a "You Only Look Once" (YOLO) model. YOLO adopts a Convolutional Neural Network (CNN) to realize end-to-end target detection.
An option box is a box provided in one or more interfaces of an APP for the user to select, and it can be a box of various shapes, such as a rectangular box or a circular box. When a user selects an option box, that is, when the option box is in the selected state, the option box may contain a mark such as a check mark or a dot; when an option box is not selected, it contains no such mark.
Before, after, or at the same time as the option boxes in the target image are identified, the position of each identified selected option box in the target image is determined, and the information of this position is recorded as the second position information. For a detailed description of the second position information, reference may be made to the first position information, which is not repeated here.
It should be noted that step S103 may be executed before, after, or simultaneously with step S102.
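As a non-authoritative illustration of step S103, the following sketch runs a YOLO detector on the target image and keeps only the boxes predicted as selected option boxes; the ultralytics package, the weight file name and the class label "selected" are assumptions rather than part of the patent.

```python
# Sketch of step S103: detect selected option boxes and return second position information.
from ultralytics import YOLO

model = YOLO("checkbox_yolo.pt")  # hypothetical weights trained on option-box samples

def detect_selected_boxes(image_path, conf=0.5):
    """Return (x1, y1, x2, y2) for each option box predicted as selected."""
    result = model(image_path, conf=conf)[0]
    selected = []
    for box in result.boxes:
        cls_name = result.names[int(box.cls)]
        if cls_name == "selected":  # the "selected option box" label (assumed name)
            selected.append(tuple(box.xyxy[0].tolist()))
    return selected
```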
And step S104, determining first position information matched with the second position information, and taking the text content of the matched first position information as the recognition result of the target image.
The first position information and the second position information match when, for example, the difference between them is small, that is, the two positions are close to each other. Alternatively, the first position information and the second position information match when one coordinate of the two is the same and the other coordinate satisfies a preset relative position relationship, for example, the ordinates are the same and the abscissa of the second position information is smaller than the abscissa of the first position information (that is, the option box and the text are in the same row and the option box is to the left of the text). Generally speaking, each option box in an interface of an APP has a certain positional relationship with its corresponding text content (the text content may include the description of the option box, such as a product name), that is, the position of each option box matches the position of the corresponding text content.
As a non-limiting example, when the text content is recognized by OCR, the option box associated with the text content in the target image and its third position information may also be recognized by OCR; when the option boxes are identified by the recognition model, the text content associated with each option box in the target image and its fourth position information may also be identified by the recognition model. Thus, after the first position information and the second position information are matched, the third position information and the fourth position information can be used for assistance. Specifically, after the first position information and the second position information are matched, it may further be determined whether the third position information corresponding to the first position information is the same as the second position information (that is, their distance is smaller than a preset error) and whether the fourth position information corresponding to the second position information is the same as the first position information (that is, their distance is smaller than the preset error); if both are the same, the first position information and the second position information are determined to match, otherwise they are determined not to match. With this scheme, the option boxes and the text are each recognized by both OCR and the recognition model, and the recognition results are matched twice, which improves the reliability of recognition. In a specific example, a two-dimensional coordinate system is constructed for the target image, and the first position information and the second position information are each expressed as two-dimensional coordinates including an abscissa and an ordinate. The matching of the first position information and the second position information means that the abscissa of the first position information is the same as the abscissa of the second position information, that is, the text content at the first position information and the selected option box at the second position information are aligned in the ordinate direction. Further, the ordinate direction may refer to the vertical direction of the display interface.
It should be noted that, the matching of the first location information and the second location information includes, but is not limited to, the above examples, and the condition for matching the first location information and the second location information may be set according to the requirement to meet the identification requirement.
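The matching rule of step S104 can be illustrated with the example criterion given above (option box and text in the same row, option box to the left of the text); the tolerance value and the data layout in the following sketch are assumptions.

```python
# Sketch of step S104: pair OCR results with detected selected option boxes.
def match_positions(text_items, box_positions, y_tol=10):
    """text_items: [(text, (x, y, w, h))]; box_positions: [(x1, y1, x2, y2)].
    Returns the text contents whose first position matches a selected option box."""
    results = []
    for text, (tx, ty, tw, th) in text_items:
        t_cy = ty + th / 2
        for (bx1, by1, bx2, by2) in box_positions:
            b_cy = (by1 + by2) / 2
            same_row = abs(t_cy - b_cy) <= y_tol       # ordinates are close
            box_left_of_text = bx2 <= tx               # option box left of the text
            if same_row and box_left_of_text:
                results.append(text)
                break
    return results
```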
The recognition result of the target image includes the text content at the first position information and the selected option box at the second position information matched with it. That is, the recognition result is the selected option box in the target image and the text content corresponding to that option box. According to the recognition result, it can be determined whether the user has selected any option box in the interface corresponding to the target image, and if an option box is selected, the text content corresponding to the selected option box is obtained. In this way, the operation of the user on the interface corresponding to the target image can be recognized. The recognition result can be used for subsequent big data analysis; since it contains the selections made by users for different options, big data analysis can be performed on the related user group based on the recognition results of multiple users to obtain analysis results such as user portraits.
Through the image recognition method described in fig. 1, full optical character recognition is performed on the target image to obtain all the text content contained in the target image, and all the selected option boxes contained in the target image are automatically recognized through the trained recognition model. Based on the positional relationship between the option boxes and the text content, the text content corresponding to the selected option boxes is obtained as the recognition result of the target image and used as the acquired operation information of the user. Therefore, the efficiency of acquiring the operation information of the user from the recorded video can be effectively improved.
Furthermore, through full optical character recognition and option box recognition, all the text content and the selected option boxes contained in the target image are retained, which effectively avoids information loss.
In one embodiment, before inputting the target image into the trained recognition model in step S103, the method may further include: acquiring training images, and dividing the training images into a training set and a test set; training an initial model by taking the training set as training samples to obtain a recognition model; and testing the recognition model through the test set, and obtaining the trained recognition model after the test is passed.
The training images comprise a plurality of images and are used for training the recognition model. Optionally, the training image is acquired from a historically acquired video, and the historically acquired video can be obtained by recording a screen of an operation interface of a target application program and can also be obtained by recording screens of operation interfaces of other APPs.
Optionally, dividing the training images into a training set and a test set includes: dividing the training images into a training set and a test set according to a certain proportion, for example 7:3 or 8:2.
If the training images are obtained from videos recorded from the operation interfaces of multiple APPs, the training set and the test set can also be divided according to the number of training images contained in each APP. For example, the training images are obtained from videos recorded from the operation interfaces of 3 APPs, denoted APP1, APP2 and APP3, and the ratio of the numbers of training images contained in APP1, APP2 and APP3 is 7. Dividing the training set and the test set according to the number of training images contained in each APP avoids the extreme cases that simple random sampling (SRS) can bring.
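A sketch of such a per-APP split is given below; the 8:2 ratio follows the example ratios above, and the grouping key (the source APP of each image) is an assumption about how the data is organized.

```python
# Sketch: split training images into train/test sets per source APP
# instead of one simple random split over all images.
import random
from collections import defaultdict

def split_by_app(images, ratio=0.8, seed=0):
    """images: [(image_path, app_name)] -> (train_list, test_list)."""
    groups = defaultdict(list)
    for path, app in images:
        groups[app].append(path)
    rng = random.Random(seed)
    train, test = [], []
    for app, paths in groups.items():
        rng.shuffle(paths)
        cut = int(len(paths) * ratio)
        train.extend(paths[:cut])
        test.extend(paths[cut:])
    return train, test
```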
In this embodiment, in the model training stage, the training images are divided into a training set and a test set, so that the recognition effect of the trained recognition model can be verified during the training stage. If the training effect does not meet the requirement, more training samples can be obtained and the training steps of this embodiment repeated until a recognition model meeting the requirement is obtained. The recognition model meeting the requirement is then used as the trained recognition model of step S103.
In one embodiment, when the initial model is trained with the training set as training samples to obtain the recognition model, two types of labels need to be added to the training samples: selected option boxes and unselected option boxes. The trained recognition model can thus recognize both selected and unselected option boxes. In step S103, only the selected option boxes and the second position information may be output; alternatively, the trained recognition model may also output the unselected option boxes and their position information. In addition, training with samples carrying both types of labels allows the recognition model to better learn the difference between selected and unselected option boxes, which improves the accuracy of model recognition.
In one embodiment, the training images include a stitched image, and the method may further include: acquiring original images, stitching a plurality of the original images, and reducing the stitched result according to the size of an original image to obtain the stitched image; the original images are images obtained by screen-recording an operation interface of a training application program, and the operation interface of the training application program is provided with at least one option box.
An original image is an image taken directly from a historically collected video without any processing. One specific example of stitching original images is: acquiring 4 (or 2, 6, 8, etc.) original images and stitching the 4 original images into one image, with the four original images located at the upper left, upper right, lower left and lower right of the stitched image, respectively.
In a non-limiting example, reducing the stitched image according to the size of the original image may include: reducing the size of the stitched image so that the resulting training image has the same size as an original image. In some pages of an APP the option boxes are distributed at the edge of the interface, that is, at the left or right side of the page. Image stitching additionally places option boxes at positions other than the interface edge, so that the trained recognition model can recognize option boxes at different positions in different images.
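The stitching described above can be sketched as follows, assuming all original screenshots share the same resolution; the 2x2 layout follows the four-image example.

```python
# Sketch: stitch four original images and shrink the result back to the original size.
from PIL import Image

def stitch_four(paths):
    imgs = [Image.open(p) for p in paths]
    w, h = imgs[0].size
    canvas = Image.new("RGB", (2 * w, 2 * h))
    positions = [(0, 0), (w, 0), (0, h), (w, h)]  # upper left, upper right, lower left, lower right
    for img, pos in zip(imgs, positions):
        canvas.paste(img.resize((w, h)), pos)
    # Reduce the stitched image according to the size of an original image.
    return canvas.resize((w, h))
```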
In one embodiment, the training images may further include enhanced images, and the method further includes: acquiring original images and performing mosaic enhancement on a plurality of the original images to obtain an enhanced image. Mosaic enhancement of a plurality of original images means randomly cropping the images and then stitching the randomly cropped images into one image, which serves as the enhanced image.
In this embodiment, mosaic enhancement is performed on a plurality of original images, and the data of several original images are processed in a single Batch Normalization computation, so that the required mini-batch size can be effectively reduced. In a single-GPU (Graphics Processing Unit) scenario, the image recognition method of the embodiment of the present invention can therefore still achieve equally good results.
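A rough sketch of the mosaic-style enhancement is given below; the output size and the random crop bounds are assumptions.

```python
# Sketch: randomly crop four originals and paste the crops into the four quadrants.
import random
from PIL import Image

def mosaic_enhance(paths, out_size=(640, 640), seed=None):
    rng = random.Random(seed)
    out_w, out_h = out_size
    half_w, half_h = out_w // 2, out_h // 2
    canvas = Image.new("RGB", out_size)
    quadrants = [(0, 0), (half_w, 0), (0, half_h), (half_w, half_h)]
    for path, pos in zip(paths, quadrants):
        img = Image.open(path)
        w, h = img.size
        # Random crop of at least half the source image in each dimension (assumed bound).
        cw, ch = rng.randint(w // 2, w), rng.randint(h // 2, h)
        left, top = rng.randint(0, w - cw), rng.randint(0, h - ch)
        crop = img.crop((left, top, left + cw, top + ch)).resize((half_w, half_h))
        canvas.paste(crop, pos)
    return canvas
```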
In one embodiment, the training images include stitched or enhanced images together with original images. Increasing the number of training images through image stitching enriches the backgrounds and option-box sizes seen by image recognition, improves the accuracy of the recognition model, and helps avoid model overfitting.
In one embodiment, the training images may further include processed images obtained by scaling, cropping or otherwise processing the original images. When the original images are scaled, they may all be scaled by the same preset factor, or different original images may be scaled randomly by randomly generated factors. When an original image is cropped, the region containing the option boxes may be preserved, or the image may be cropped randomly. This further increases the number of training images, enriches the backgrounds and option-box sizes seen by image recognition, and helps avoid model overfitting.
In one embodiment, when the recognition model is trained with the training images, the option boxes to be recognized in the training images need to be labeled. The original images can be labeled first and the stitched, enhanced or processed images generated afterwards, which effectively reduces the workload of labeling samples and improves the efficiency of model training.
In an embodiment, please refer to fig. 2, which is a schematic flow chart of a training method of a recognition model according to an embodiment of the present invention; that is, training the initial model with the training set as training samples to obtain the recognition model may include the following steps A to D.
Step A, labeling the first part of the training images in the training set, and training the initial model with the labeled images as training samples to obtain an intermediate model.
Optionally, the training images in the training set may be divided into a plurality of parts (denoted as the first part, the second part, and so on). Further, the training images may be divided into equal parts; for example, a training set containing 500 training images is divided into 10 parts of 50 training images each.
Labeling the first part of the training images in the training set means manually or automatically labeling the option boxes in the first part of the training images. Training the initial model with the labeled first part of the training images enables the resulting intermediate model to learn the ability to recognize the option boxes in an input image.
Step B, acquiring the next part of the training images in the training set, inputting the next part of training images into the intermediate model to identify the option boxes in them, outputting the recognition results, and receiving an external correction of the recognition results, wherein the correction includes labeling the option boxes that are inaccurately identified in the recognition results.
Step C, training the intermediate model with the images labeled in step B as training samples.
In a non-limiting specific example of steps B and C, after the intermediate model is obtained by training, the option boxes in the next part of the training images may be identified by the intermediate model, and the recognition results are obtained at the output end of the intermediate model. The option boxes that are inaccurately identified in the recognition results are manually labeled outside the intermediate model; the option boxes that are accurately identified need no processing. The intermediate model then continues to be trained with the manually labeled images as training samples, which is step C.
In one non-limiting example, after the intermediate model identifies a box in a training image, the identified box is labeled. The recognition result may include the training image and the option boxes labeled in it.
The training images of each part in the training set are acquired in sequence, and step B and step C are repeatedly executed until the training of the intermediate model is completed. Here, completion of the training of the intermediate model means that all the training images of the training set have been used for model training according to steps A to C, or that the recognition effect of the trained intermediate model is detected to meet the requirement (for example, the recognition accuracy of the intermediate model reaches a preset threshold, such as 99%, and has converged).
Step D, when the training of the intermediate model is completed, acquiring the trained intermediate model as the recognition model.
In this embodiment, model training is started after the first part of the training images in the training set has been labeled, and the option boxes that are inaccurately recognized in the recognition results are then manually labeled based on the recognition results of the partially trained model. Compared with manually labeling the training images of the whole training set in the traditional way, the scheme of this embodiment can reduce the labeling cost (which may include time cost and labor cost). In addition, the training images in the training set are divided into a plurality of parts and each part is acquired in turn for model training, which helps the annotator find the capability boundary of the currently trained model and benefits the subsequent model-tuning process.
In one embodiment, the initial model adopts the YOLO algorithm, and labeling the first part of the training images in the training set includes: identifying the boxes in the first part of the training images through the initial model; clustering the identified boxes to obtain clustered boxes; selecting one or more classes of the clustered boxes as the boxes to be labeled based on the similarity between each class of clustered boxes and the option box; and labeling the boxes to be labeled to obtain a labeling result.
Specifically, the boxes in the first part of the training images are identified through the initial model (that is, a model adopting the YOLO algorithm, referred to as the YOLO model). If the boxes to be recognized are rectangular, the characteristics of rectangular boxes (bounding boxes) can be output by the YOLO model, and all the boxes contained in the first part of the training images, that is, all the rectangular boxes contained in the training images, can be obtained at the output end of the YOLO model.
Optionally, selecting one or more classes of the clustered boxes as the boxes to be labeled based on the similarity between each class of clustered boxes and the option box may include: when the similarity between a class of clustered boxes and the option box is higher than a preset similarity threshold, taking that class of boxes as boxes to be labeled.
In a non-limiting example, clustering the identified boxes to obtain clustered boxes may include: clustering the identified boxes using the areas of the identified boxes as the clustering basis. For example, boxes whose areas fall in a first area interval are divided into a first category, boxes whose areas fall in a second area interval are divided into a second category, and so on.
In another non-limiting example, clustering the identified boxes to obtain clustered boxes may include: clustering the identified boxes using the side lengths of the identified boxes as the clustering basis. For example, boxes whose side lengths fall in a first side-length interval are divided into a first category, boxes whose side lengths fall in a second side-length interval are divided into a second category, and so on.
Optionally, the side lengths of the identified boxes may be used as the clustering basis and the identified boxes clustered through a K-means clustering algorithm. The K-means clustering algorithm is a typical distance-based clustering algorithm that uses distance as the evaluation index of similarity, that is, the closer two objects are, the greater their similarity.
It should be noted that, besides the side length or area of the identified boxes, other clustering bases may also be set. Clustering algorithms other than K-means, such as the mean-shift clustering algorithm or Density-Based Spatial Clustering of Applications with Noise (DBSCAN), can also be selected as required.
In a non-limiting example, selecting one or more classes of the clustered boxes as the boxes to be labeled based on the similarity between each class of clustered boxes and the option box may include: selecting, according to the area of the option box, one or more classes of clustered boxes with similar areas as the boxes to be labeled; or selecting, according to the side lengths of the option box, one or more classes of clustered boxes with similar side lengths as the boxes to be labeled.
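The clustering and similarity-based selection can be sketched as follows, using K-means from scikit-learn on the box side lengths; the number of clusters, the reference option-box size and the distance threshold are assumptions.

```python
# Sketch: cluster the (width, height) of all boxes output by the initial YOLO model,
# then keep the boxes in clusters whose centre is close to a reference option-box size.
import numpy as np
from sklearn.cluster import KMeans

def select_boxes_to_label(boxes, ref_size=(24, 24), k=5, max_dist=10.0):
    """boxes: array-like of (width, height) for every detected box."""
    dims = np.asarray(boxes, dtype=float)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(dims)
    ref = np.asarray(ref_size, dtype=float)
    # Keep clusters whose centre lies within max_dist of the reference option-box size.
    keep = [i for i, c in enumerate(kmeans.cluster_centers_)
            if np.linalg.norm(c - ref) <= max_dist]
    mask = np.isin(kmeans.labels_, keep)
    return dims[mask]
```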
In this embodiment, the characteristics of rectangular boxes (bounding boxes) can be output by the YOLO model, and all the boxes (here rectangular boxes) contained in the first part of the training images can be obtained at the output end of the YOLO model. Through clustering and similarity calculation, one or more classes of the clustered boxes are selected as the boxes to be labeled, and the boxes to be labeled are labeled automatically. Compared with the traditional approach of manually labeling the training images, the automatic labeling of this embodiment can greatly reduce the cost of labeling samples and improve the efficiency of model training. In addition, when the YOLO model is trained, anchor boxes are obtained adaptively by the model through clustering and similarity calculation, and hand-picked prior boxes are replaced by the adaptive anchor boxes, so that the YOLO model knows roughly how large a rectangular box to use when searching for the target to be recognized, which helps the model converge quickly.
In one embodiment, labeling the boxes to be labeled to obtain a labeling result further includes: outputting the labeling result and obtaining an external adjustment of the labeling result, wherein the adjustment includes deleting wrong labels in the labeling result and/or labeling boxes to be labeled that were not identified.
Externally obtaining the adjustment of the labeling result may include: manually checking the labeling result, and manually adjusting it when the check finds the labeling inaccurate. Inaccurate labeling can be divided into two cases, wrong labels and missed labels, and a different adjustment can be adopted for each case: for a wrongly labeled box, the wrong label can be deleted directly; for a missed label, the missed box (that is, a box to be labeled that was not identified) can be determined manually and labeled.
Further, the cause of the wrong or missed labels can be determined. For example, if the wrong or missed labels are caused by the step of selecting one or more classes of the clustered boxes as the boxes to be labeled based on the similarity between each class of clustered boxes and the option box, the basis for selecting the boxes to be labeled can be set again, for example by adjusting the similarity threshold.
Optionally, after the adjustment of the labeling result is obtained from the outside, model training is performed on the intermediate model according to the adjusted labeling result.
In the embodiment, when the initial model is subjected to model training, automatic labeling and supervised training are combined, and the accuracy of the model training can be ensured on the basis of reducing the cost of sample labeling.
In an embodiment, referring to fig. 1 again, before acquiring the target image from the video in step S101, the method may further include: and selecting key frames from the video as the target images.
The key frame may include, but is not limited to, examples 1 to 3 below.
Example 1, the key frame may refer to a key frame (generally referred to as an I frame) when a plurality of image frames in a video are inter-frame compression encoded.
Example 2, N image frames are further included between two adjacent key frames in the video, where N is a natural number, and a value of N may be a preset value.
Example 3, the key frame refers to a last image of each operation interface in the video in the time sequence. Optionally, whether each image frame corresponds to the same operation interface may be determined by detecting similarity between different image frames, or whether each image frame corresponds to the same operation interface may be determined by detecting whether header lines of different image frames are the same.
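Key-frame selection in the sense of example 3 can be sketched by decoding the video with OpenCV and keeping the last frame of each operation interface, detected here by a simple frame-difference threshold; the threshold value is an assumption.

```python
# Sketch: keep the last frame of each operation interface in the screen-recorded video.
import cv2

def select_key_frames(video_path, diff_thresh=12.0):
    cap = cv2.VideoCapture(video_path)
    key_frames, prev = [], None
    ok, frame = cap.read()
    while ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None and cv2.absdiff(gray, prev).mean() > diff_thresh:
            key_frames.append(prev_frame)   # last frame before the interface changed
        prev, prev_frame = gray, frame
        ok, frame = cap.read()
    if prev is not None:
        key_frames.append(prev_frame)       # last frame of the final interface
    cap.release()
    return key_frames
```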
In the embodiment, the other image frames except the key frame in the video are deleted by selecting the key frame, and compared with the case that all the images in the video are used as the target images, the scheme in the embodiment can effectively reduce the number of the target images to be processed, save the computing resources and improve the efficiency of image identification.
In one embodiment, after the selecting a key frame from the video as the target image, the method may further include: calculating the similarity between different target images according to at least one of a histogram comparison method, a block histogram comparison method and a perceptual hash algorithm; and carrying out duplicate removal on the target image according to the similarity.
In practical applications, each frame of image in the video may be distorted, that is, for two images with identical contents, the color of the same pixel of the two images may be slightly different. One of the simplest examples is that after subtracting the two images, an afterimage may appear. The condition of picture distortion can be well ignored through the perceptual hash algorithm, and the perceptual hash algorithm is very sensitive to the change of the actual content of the image, so that the perceptual hash algorithm has a better effect on the duplication elimination of the target image under the condition of image distortion.
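A sketch of perceptual-hash de-duplication is given below, assuming the imagehash package; the Hamming-distance threshold is an assumption.

```python
# Sketch: drop target images whose perceptual hash is close to an already kept image,
# which tolerates slight per-pixel distortion between frames.
import imagehash
from PIL import Image

def deduplicate(image_paths, max_distance=5):
    kept, hashes = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))
        if all(h - prev > max_distance for prev in hashes):
            kept.append(path)
            hashes.append(h)
    return kept
```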
In this embodiment, the target images are de-duplicated and training samples with excessively high similarity are removed, which alleviates the model overfitting caused by highly similar training samples. For example, when the model is overfitted, the recognition model may only recognize the few cases in which 1 or 2 option boxes are selected.
With the development of networks and mobile terminals, the current full link of a consumer's purchase, viewed as a marketing path, may include: front end (publicity): the consumer is reached by advertisements, live broadcasts and other messages, which creates the impulse to buy the goods. Middle (purchase): for example, the consumer places an order for the goods on an interface of an e-commerce APP or in a live broadcast room, and the goods enter the distribution link. Back end (customer maintenance): for example, the consumer joins the private domain of the anchor's fans (such as a fan group), so that the anchor accumulates and retains fans; the consumer becomes a brand member of the goods, and so on.
In one embodiment, the image recognition method of the present invention may be applied to at least one of the front end, the middle and the back end of the consumer's purchase full link. When applied to the front end, the operation of the user on the interface of the target application program can be recognized through the image recognition method to determine information such as which products triggered the user, or which products in the live broadcast the user intends to buy. When applied to the middle, information such as which goods the user has purchased can be determined. When applied to the back end, information such as the user joining the fan private domain of a certain anchor or becoming a brand member of certain goods can be determined. Further, the image recognition method can be applied at every stage of the consumer's purchase full link to track the operations of the same user on different APP interfaces at each stage, so that the data of the user at multiple marketing contacts (covering multiple stages and multiple APPs) can be acquired and full-link personalized analysis can be performed on the user.
Referring to fig. 3, an embodiment of the invention further provides an image recognition apparatus 30, including: the target image acquisition module 301 is configured to acquire a video and acquire a target image from the video, where the video is obtained by recording a screen on an operation interface of a target application program; an optical character recognition module 302, configured to perform optical character recognition on the target image to obtain text content and first position information included in the target image, where the first position information is used to indicate a position of the text content in the target image; an option box recognition module 303, configured to input the target image into a trained recognition model to recognize a selected option box in the target image and obtain second position information, where the second position information is used to indicate a position of the selected option box in the target image; and the result acquiring module 304 is configured to determine first position information matched with the second position information, and use the text content of the matched first position information as the recognition result of the target image.
In one embodiment, before inputting the target image into the trained recognition model, the image recognition apparatus 30 may further include: a training image acquisition module, used for acquiring training images and dividing the training images into a training set and a test set; a training module, used for training an initial model by taking the training set as training samples to obtain a recognition model; and a test module, used for testing the recognition model through the test set and obtaining the trained recognition model after the test is passed.
In one embodiment, the training images include a stitched image, and the image recognition apparatus 30 may further include: an image stitching module, used for acquiring original images, stitching a plurality of the original images, and reducing the stitched result according to the size of an original image to obtain the stitched image; the original images are images obtained by screen-recording an operation interface of a training application program, and the operation interface of the training application program is provided with at least one option box.
In one embodiment, the training module may include:
a first training unit, configured to execute step a, label a first part of training images in the training set, train the initial model by using the labeled images as training samples, and obtain an intermediate model;
a second training unit, configured to perform step B, obtain a next part of the training images in the training set, input the next part of the training images into the intermediate model, so as to identify an option box in the next part of the training images, output a recognition result, and receive a correction for the recognition result from the outside, where the correction includes labeling an option box in the recognition result that is not accurately identified;
the supplementary training module is used for executing the step C and taking the image marked in the step B as a training sample to train the intermediate model;
a cycle module, configured to repeatedly execute the step B and the step C until the intermediate model is trained;
and the identification model acquisition module is used for acquiring the trained intermediate model as the identification model.
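By way of illustration only, the step A to step C loop carried out by the above training units may be sketched as follows; the label_fn, train_fn and predict_fn callables stand in for the external labeling tool and the detector training/inference interface and are hypothetical names of this illustration.

def iterative_training(train_batches, initial_model, label_fn, train_fn, predict_fn):
    # label_fn, train_fn and predict_fn are hypothetical callables supplied by the caller.
    first, *rest = train_batches
    labeled = label_fn(first, predictions=None)       # step A: manual labeling of the first part
    model = train_fn(initial_model, labeled)          # step A: obtain the intermediate model
    for batch in rest:                                # steps B and C, repeated
        predictions = predict_fn(model, batch)        # step B: model outputs recognition results
        corrected = label_fn(batch, predictions)      # step B: external correction of missed option boxes
        model = train_fn(model, corrected)            # step C: supplementary training
    return model                                      # trained intermediate model used as the recognition model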
In one embodiment, the initial model adopts a YOLO algorithm, and the first training unit further includes: a box identification subunit, configured to identify candidate boxes in the first part of the training images through the initial model; a clustering subunit, configured to cluster the identified candidate boxes to obtain clustered boxes; a selecting subunit, configured to select one or more classes of the clustered boxes as boxes to be labeled based on the similarity between the clustered boxes and the option boxes; and a result acquisition subunit, configured to label the boxes to be labeled to obtain a labeling result.
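By way of illustration only, the clustering and selection performed by the above subunits may be sketched as follows; the geometric features (box width and height), the number of clusters and the reference option-box size are assumptions of this illustration.

import numpy as np
from sklearn.cluster import KMeans

def select_boxes_to_label(boxes, ref_w, ref_h, n_clusters=5):
    # Cluster the boxes proposed by the initial detector on simple geometric
    # features and keep the cluster whose mean size is closest to a reference
    # option box; features, n_clusters and the reference size are assumptions.
    feats = np.array([[x2 - x1, y2 - y1] for x1, y1, x2, y2 in boxes], dtype=float)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)
    ref = np.array([ref_w, ref_h], dtype=float)
    centers = [feats[labels == k].mean(axis=0) for k in range(n_clusters)]
    best = int(np.argmin([np.linalg.norm(c - ref) for c in centers]))
    return [b for b, k in zip(boxes, labels) if k == best]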
In one embodiment, after the boxes to be labeled are labeled and a labeling result is obtained, the first training unit further includes: an adjusting subunit, configured to output the labeling result and acquire an adjustment of the labeling result from the outside; wherein the adjustment includes deleting wrong labels in the labeling result and/or labeling boxes to be labeled that were not identified.
In one embodiment, before the acquiring of the target image from the video, the image recognition apparatus 30 further includes: a key frame selecting module, configured to select key frames from the video as the target images.
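By way of illustration only, a minimal OpenCV sketch of the key frame selecting module is given below; sampling one frame per second is an assumed example interval.

import cv2

def extract_key_frames(video_path, every_n_seconds=1.0):
    # Grab one frame per fixed interval from the screen recording as candidate
    # target images; the sampling interval is an assumed example value.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_seconds))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames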
In one embodiment, after the selecting of a key frame from the video as the target image, the image recognition apparatus 30 further includes: an image similarity calculation module, configured to calculate the similarity between different target images according to at least one of a histogram comparison method, a block histogram comparison method and a perceptual hash algorithm; and a de-duplication module, configured to de-duplicate the target images according to the similarity.
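By way of illustration only, the de-duplication may be sketched with a difference-hash variant of the perceptual hash algorithm mentioned above; the hash size and the Hamming-distance threshold are assumed example values.

from PIL import Image

def dhash(img, size=8):
    # Difference hash: shrink to a (size+1) x size grayscale grid and compare
    # horizontally adjacent pixels to build a bit string.
    small = img.convert("L").resize((size + 1, size), Image.BILINEAR)
    px = list(small.getdata())
    bits = 0
    for row in range(size):
        for col in range(size):
            left = px[row * (size + 1) + col]
            right = px[row * (size + 1) + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def dedupe(images, max_hamming=5):
    # Keep a frame only if its hash differs from all kept frames by more than
    # the threshold; the threshold is an assumed example value.
    kept, hashes = [], []
    for img in images:
        h = dhash(img)
        if all(bin(h ^ other).count("1") > max_hamming for other in hashes):
            kept.append(img)
            hashes.append(h)
    return kept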
For more details of the operation principle and the operation mode of the image recognition apparatus 30, reference may be made to the description of the image recognition method in fig. 1 and fig. 2, and details are not repeated here.
Further, an embodiment of the present invention also discloses a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the technical solution of the image recognition method shown in fig. 1 and fig. 2 is carried out.
Further, an embodiment of the present invention also discloses a computing device including a memory and a processor, the memory storing a computer program capable of running on the processor, and the processor, when running the computer program, carries out the technical solution of the image recognition method shown in fig. 1 and fig. 2.
Specifically, in the embodiment of the present invention, the processor may be a Central Processing Unit (CPU), and the processor may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It will also be appreciated that the memory in the embodiments of the present application can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this document indicates that the former and latter associated objects are in an "or" relationship.
The "plurality" appearing in the embodiments of the present application means two or more.
The terms "first", "second", and the like appearing in the embodiments of the present application are used only to describe and distinguish the objects concerned; they do not indicate an order, do not particularly limit the number of devices in the embodiments of the present application, and do not constitute any limitation on the embodiments of the present application.
The term "connect" in the embodiments of the present application refers to various connection manners, such as direct connection or indirect connection, to implement communication between devices, which is not limited in this embodiment of the present application.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (11)

1. An image recognition method, characterized in that the method comprises:
acquiring a video, and acquiring a target image from the video, wherein the video is obtained by recording a screen of an operation interface of a target application program;
carrying out optical character recognition on the target image to obtain the text content and first position information contained in the target image, wherein the first position information is used for indicating the position of the text content in the target image;
inputting the target image into a trained recognition model to recognize the selected option box in the target image and obtain second position information, wherein the second position information is used for indicating the position of the selected option box in the target image;
and determining first position information matched with the second position information, and taking the text content of the matched first position information as the recognition result of the target image.
2. The method of claim 1, wherein before inputting the target image into the trained recognition model, further comprising:
acquiring a training image, and dividing the training image into a training set and a test set;
training an initial model by taking the training set as a training sample to obtain an identification model;
and testing the recognition model through the test set, and obtaining the trained recognition model after the test is passed.
3. The method of claim 2, wherein the training images comprise stitched images, the method further comprising:
acquiring an original image, splicing a plurality of original images, and reducing the spliced image according to the size of the original image to obtain the spliced image;
the original image is an image obtained by recording a screen of an operation interface of a training application program, and the operation interface of the training application program is provided with at least one option box.
4. The method according to claim 2 or 3, wherein the training an initial model by using the training set as a training sample to obtain the recognition model comprises:
step A, labeling a first part of training images in the training set, and training the initial model by using the labeled images as training samples to obtain an intermediate model;
step B, acquiring a next part of training images in the training set, inputting the next part of training images into the intermediate model to identify option boxes in the next part of training images, outputting identification results, and receiving external correction aiming at the identification results, wherein the correction comprises marking the option boxes which are inaccurately identified in the identification results;
step C, taking the image marked in the step B as a training sample to train the intermediate model; and repeatedly executing the step B and the step C until the training of the intermediate model is finished, and acquiring the trained intermediate model as the recognition model.
5. The method of claim 4, wherein the initial model adopts a YOLO algorithm, and the labeling of the first part of the training images in the training set comprises:
identifying, by the initial model, candidate boxes in the first part of the training images;
clustering the identified candidate boxes to obtain clustered boxes;
selecting one or more classes of the clustered boxes as boxes to be labeled based on the similarity between the clustered boxes and the option boxes;
and labeling the boxes to be labeled to obtain a labeling result.
6. The method according to claim 5, wherein after the labeling of the boxes to be labeled to obtain the labeling result, the method further comprises:
outputting the labeling result, and obtaining an adjustment of the labeling result from the outside;
wherein the adjustment comprises: deleting wrong labels in the labeling result, and/or labeling boxes to be labeled that were not identified.
7. The method according to any one of claims 1 to 3, wherein said obtaining the target image from the video further comprises:
and selecting a key frame from the video as the target image.
8. The method of claim 7, wherein after said selecting a key frame from said video as said target image, further comprising:
calculating the similarity between different target images according to at least one of a histogram comparison method, a block histogram comparison method and a perceptual hash algorithm;
and carrying out duplicate removal on the target image according to the similarity.
9. An image recognition apparatus, characterized in that the apparatus comprises:
the target image acquisition module is used for acquiring a video and acquiring a target image from the video, wherein the video is obtained by recording a screen on an operation interface of a target application program;
the optical character recognition module is used for carrying out optical character recognition on the target image to obtain the text content and first position information contained in the target image, and the first position information is used for indicating the position of the text content in the target image;
the option frame recognition module is used for inputting the target image into a trained recognition model so as to recognize the selected option frame in the target image and obtain second position information, and the second position information is used for indicating the position of the selected option frame in the target image;
and the result acquisition module is used for determining first position information matched with the second position information and taking the text content of the matched first position information as the recognition result of the target image.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
11. A computing device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the method of any one of claims 1 to 8.
CN202210705722.1A 2022-06-21 2022-06-21 Image recognition method and device, computer readable storage medium and computing device Pending CN115147757A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210705722.1A CN115147757A (en) 2022-06-21 2022-06-21 Image recognition method and device, computer readable storage medium and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210705722.1A CN115147757A (en) 2022-06-21 2022-06-21 Image recognition method and device, computer readable storage medium and computing device

Publications (1)

Publication Number Publication Date
CN115147757A true CN115147757A (en) 2022-10-04

Family

ID=83408401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210705722.1A Pending CN115147757A (en) 2022-06-21 2022-06-21 Image recognition method and device, computer readable storage medium and computing device

Country Status (1)

Country Link
CN (1) CN115147757A (en)

Similar Documents

Publication Publication Date Title
US10776970B2 (en) Method and apparatus for processing video image and computer readable medium
CN109165645B (en) Image processing method and device and related equipment
US10395120B2 (en) Method, apparatus, and system for identifying objects in video images and displaying information of same
WO2018095142A1 (en) Livestream interaction method and apparatus
CN110348439B (en) Method, computer readable medium and system for automatically identifying price tags
CN114258559A (en) Techniques for identifying skin tones in images with uncontrolled lighting conditions
CN108717543B (en) Invoice identification method and device and computer storage medium
WO2020133442A1 (en) Text recognition method and terminal device
JP7246104B2 (en) License plate identification method based on text line identification
KR102002024B1 (en) Method for processing labeling of object and object management server
WO2020259510A1 (en) Method and apparatus for detecting information embedding region, electronic device, and storage medium
CN113496208B (en) Video scene classification method and device, storage medium and terminal
CN111414948B (en) Target object detection method and related device
CN111160395A (en) Image recognition method and device, electronic equipment and storage medium
CN113436222A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN110796039B (en) Face flaw detection method and device, electronic equipment and storage medium
WO2022105507A1 (en) Text recording video definition measurement method and apparatus, computer device and storage medium
CN113486715A (en) Image reproduction identification method, intelligent terminal and computer storage medium
CN110363206B (en) Clustering of data objects, data processing and data identification method
CN110348353B (en) Image processing method and device
CN112632926A (en) Data processing method and device for bill, electronic equipment and storage medium
CN112287905A (en) Vehicle damage identification method, device, equipment and storage medium
CN111163332A (en) Video pornography detection method, terminal and medium
US10631050B2 (en) Determining and correlating visual context on a user device with user behavior using digital content on the user device
CN115147757A (en) Image recognition method and device, computer readable storage medium and computing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination