WO2020006961A1 - Image extraction method and device


Info

Publication number
WO2020006961A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
matched
feature vector
position information
training
Application number
PCT/CN2018/116334
Other languages
French (fr)
Chinese (zh)
Inventor
周恺卉
王长虎
Original Assignee
北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Application filed by 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Publication of WO2020006961A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures

Definitions

  • Embodiments of the present application relate to the field of computer technology, and in particular, to a method and an apparatus for extracting an image.
  • Using image recognition models to identify images is a common technique in image recognition technology.
  • An image recognition model is usually a model that is trained using a large number of training samples.
  • In order for an image recognition model to recognize a target image (such as a watermark image, a person image, or an object image) within an image, it usually needs to be trained on sample images that contain the target image.
  • the embodiments of the present application provide a method and a device for extracting an image.
  • an embodiment of the present application provides a method for extracting an image.
  • the method includes: obtaining a reference object image and a set of images to be matched; and inputting the reference object image into a first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector.
  • For each to-be-matched image in the to-be-matched image set, the following extraction steps are performed: input the to-be-matched image into a second sub-network included in the image recognition model to obtain at least one piece of position information and a to-be-matched feature vector corresponding to each piece of position information, where the to-be-matched feature vector is the feature vector of a region image included in the to-be-matched image, and the position information is used to characterize the position of that region image in the to-be-matched image; determine the distance between each obtained to-be-matched feature vector and the reference feature vector; and, in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, extract the to-be-matched image as an image matching the reference object image.
  • the extracting step further includes: determining position information of an area image corresponding to a distance less than or equal to a distance threshold, and outputting the determined position information.
  • the extracting step further includes: generating a matched image including a position marker based on the output position information and the to-be-matched image, where the position marker is used to mark the position, in the matched image, of the region image corresponding to the output position information.
  • the second sub-network includes a dimension transformation layer for transforming feature vectors to a target dimension; and inputting the to-be-matched image into the second sub-network included in the image recognition model to obtain at least one to-be-matched feature vector includes: inputting the to-be-matched image into the second sub-network included in the image recognition model to obtain at least one to-be-matched feature vector having the same dimension as the reference feature vector.
  • the image recognition model is obtained by training through the following steps: obtaining a training sample set, where each training sample includes a sample object image, a sample matching image, and labeled position information of the sample matching image, the labeled position information characterizing the position of a region image included in the sample matching image; selecting a training sample from the training sample set, and performing the following training steps: inputting the sample object image included in the selected training sample into a first sub-network included in an initial model to obtain a first feature vector, and inputting the sample matching image into a second sub-network included in the initial model to obtain at least one piece of position information and a second feature vector corresponding to the position information; determining, from the obtained at least one piece of position information, the position information characterizing the target region image in the sample matching image as target position information, and determining the second feature vector corresponding to the target position information as the target second feature vector; determining, based on a first loss value representing the error of the target position information and a second loss value representing the distance between the target second feature vector and the first feature vector, whether training of the initial model is complete; and, in response to determining that training is complete, determining the initial model as the image recognition model.
  • determining whether training of the initial model is complete based on the first loss value representing the error of the target position information and the second loss value representing the distance between the target second feature vector and the first feature vector includes: using, according to preset weight values, the weighted summation of the first loss value and the second loss value as a total loss value, comparing the total loss value with a target value, and determining whether training of the initial model is complete according to the comparison result.
  • the steps for training the image recognition model further include: in response to determining that training of the initial model is not complete, adjusting the parameters of the initial model, selecting a training sample from the unselected training samples in the training sample set, using the adjusted initial model as the initial model, and continuing the training steps.
  • an embodiment of the present application provides an apparatus for extracting an image.
  • the apparatus includes: an acquiring unit configured to acquire a reference object image and a set of images to be matched; and a generating unit configured to input the reference object image into a first sub-network included in a pre-trained image recognition model and obtain a feature vector of the reference object image as a reference feature vector;
  • and an extraction unit configured to perform the following extraction steps on the to-be-matched images in the to-be-matched image set:
  • input the to-be-matched image into a second sub-network included in the image recognition model to obtain at least one piece of position information and a to-be-matched feature vector corresponding to the position information, where the to-be-matched feature vector is the feature vector of a region image included in the to-be-matched image, and the position information characterizes the position of that region image in the to-be-matched image; determine the distance between each obtained to-be-matched feature vector and the reference feature vector; and, in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, extract the to-be-matched image as an image matching the reference object image.
  • the extraction unit includes: an output module configured to determine position information of a region image corresponding to a distance less than or equal to a distance threshold, and output the determined position information.
  • the extraction unit further includes a generating module configured to generate a matched image including a position marker based on the output position information and the to-be-matched image, where the position marker is used to mark the position, in the matched image, of the region image corresponding to the output position information.
  • the second sub-network includes a dimension transformation layer for transforming feature vectors to a target dimension; and the extraction unit is further configured to: input the to-be-matched image into the second sub-network included in the image recognition model to obtain at least one to-be-matched feature vector having the same dimension as the reference feature vector.
  • the image recognition model is obtained by training through the following steps: obtaining a training sample set, where each training sample includes a sample object image, a sample matching image, and labeled position information of the sample matching image, the labeled position information characterizing the position of a region image included in the sample matching image; selecting a training sample from the training sample set, and performing the following training steps: inputting the sample object image included in the selected training sample into a first sub-network included in an initial model to obtain a first feature vector, and inputting the sample matching image into a second sub-network included in the initial model to obtain at least one piece of position information and a second feature vector corresponding to the position information; determining, from the obtained at least one piece of position information, the position information characterizing the target region image in the sample matching image as target position information, and determining the second feature vector corresponding to the target position information as the target second feature vector; determining, based on a first loss value representing the error of the target position information and a second loss value representing the distance between the target second feature vector and the first feature vector, whether training of the initial model is complete; and, in response to determining that training is complete, determining the initial model as the image recognition model.
  • determining whether training of the initial model is complete based on the first loss value representing the error of the target position information and the second loss value representing the distance between the target second feature vector and the first feature vector includes: using, according to preset weight values, the weighted summation of the first loss value and the second loss value as a total loss value, comparing the total loss value with a target value, and determining whether training of the initial model is complete according to the comparison result.
  • the steps for training the image recognition model further include: in response to determining that training of the initial model is not complete, adjusting the parameters of the initial model, selecting a training sample from the unselected training samples in the training sample set, using the adjusted initial model as the initial model, and continuing the training steps.
  • an embodiment of the present application provides an electronic device.
  • the electronic device includes: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation of the first aspect.
  • an embodiment of the present application provides a computer-readable medium having stored thereon a computer program that, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
  • The method and apparatus for extracting an image provided by the embodiments of the present application use a pre-trained image recognition model to obtain a reference feature vector of a reference object image and at least one to-be-matched feature vector of an image to be matched, and then compare the distance between the reference feature vector and each to-be-matched feature vector to obtain an image matching the reference object image. In this way, even when the training samples used to train the image recognition model do not include the reference object image, the image recognition model can still be used to extract images that match it, which improves the flexibility of image recognition and enriches the means of image recognition.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a method for extracting an image according to the present application
  • FIG. 3 is a flowchart of the training steps for obtaining an image recognition model in the method for extracting an image according to the present application;
  • FIG. 4 is a schematic diagram of an application scenario of a method for extracting an image according to the present application
  • FIG. 5 is a flowchart of still another embodiment of a method for extracting an image according to the present application.
  • FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for extracting an image according to the present application.
  • FIG. 7 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
  • FIG. 1 illustrates an exemplary system architecture 100 to which a method for extracting an image or an apparatus for extracting an image of an embodiment of the present application can be applied.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105.
  • the network 104 is a medium for providing a communication link between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, and 103, such as image processing applications, shooting applications, social platform software, and the like.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • If the terminal devices 101, 102, and 103 are hardware, they can be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, laptop computers, and desktop computers.
  • If the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above. They can be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services), or as a single piece of software or a single software module. This is not specifically limited here.
  • the server 105 may be a server that provides various services, such as a background server that provides support for various applications on the terminal devices 101, 102, and 103.
  • the background server may perform processing such as analysis on the acquired image, and output the processing result (for example, the extracted image matching the reference image).
  • the method for extracting an image provided in the embodiment of the present application may be executed by the server 105, or may be executed by the terminal devices 101, 102, and 103.
  • the device for extracting an image may be provided in the server 105 or in the terminal devices 101, 102, 103.
  • the server may be hardware or software.
  • If the server is hardware, it can be implemented as a distributed server cluster consisting of multiple servers or as a single server.
  • If the server is software, it can be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services), or as a single piece of software or a single software module. This is not specifically limited here.
  • The numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • a flowchart 200 of one embodiment of a method for extracting an image according to the present application is shown.
  • the method for extracting an image includes the following steps:
  • Step 201 Obtain a reference object image and a set of images to be matched.
  • an execution subject of the method for extracting an image may obtain a reference object image and a set of images to be matched remotely or locally through a wired connection or a wireless connection.
  • the reference object image may be an image to be compared with other images, and the reference object image may be an image characterizing an object.
  • Objects can be various things, such as watermarks, signs, faces, objects, and so on.
  • the set of images to be matched may be a set of certain types of images (for example, images containing a trademark) stored in advance.
  • Step 202 Input a reference object image into a first sub-network included in a pre-trained image recognition model, and obtain a feature vector of the reference object image as a reference feature vector.
  • the execution subject may input the reference object image into a first sub-network included in a pre-trained image recognition model, and obtain a feature vector of the reference object image as the reference feature vector.
  • the first sub-network is used to characterize the correspondence between the image and the feature vector of the image.
  • the image recognition model may be various neural network models created based on machine learning technology.
  • the neural network model may have a structure of various neural networks (for example, DenseBox, VGGNet, ResNet, SegNet, etc.).
  • the above reference feature vector may be a vector, extracted by the first sub-network included in the neural network model (for example, a network composed of one or more convolutional layers included in the neural network model), that characterizes features of the image (for example, shape, color, texture, and other characteristics).
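  • As an illustrative aside (not part of the original disclosure), the sketch below shows one way such a first sub-network could be realized in Python: a convolutional backbone truncated before its classification head serves as the feature extractor. The choice of torchvision's ResNet-18 is an assumption for illustration only.

```python
# Hypothetical sketch of a "first sub-network": a convolutional backbone
# truncated before its classification head, used as a feature extractor.
# ResNet-18 is an illustrative assumption, not the patent's architecture.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=None)  # untrained backbone, for shape illustration
first_subnetwork = nn.Sequential(*list(backbone.children())[:-1])  # drop the final fc layer

reference_image = torch.randn(1, 3, 224, 224)  # stand-in for a reference object image
with torch.no_grad():
    reference_feature = first_subnetwork(reference_image).flatten(1)  # shape: (1, 512)
```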
  • Step 203 For the to-be-matched images in the to-be-matched image set, perform the following extraction steps: input the to-be-matched images into a second sub-network included in the image recognition model to obtain at least one position information and a feature vector to be matched corresponding to the location information; Determine the distance between the obtained feature vector to be matched and the reference feature vector; and in response to determining that there is a distance less than or equal to a preset distance threshold in the determined distance, extract the to-be-matched image as an image matching the reference object image.
  • the execution subject may perform the following extraction step on the to-be-matched image:
  • Step 2031 Input the image to be matched into the second sub-network included in the image recognition model, and obtain at least one location information and a feature vector to be matched corresponding to the location information.
  • the second sub-network is used to characterize the correspondence between the image and the position information of the image and the feature vector to be matched of the image.
  • the position information is used to characterize the position of the area image corresponding to the feature vector to be matched in the to-be-matched image.
  • the feature vector to be matched is a feature vector of an area image included in the image to be matched.
  • the second sub-network may determine, from the to-be-matched image and according to the determined at least one piece of position information, the region image characterized by each piece of position information, and then determine the feature vector of each region image.
  • the area image may be an image characterizing an object (eg, a watermark, a logo, etc.).
  • the location information may include coordinate information and identification information. Among them, the coordinate information (such as the coordinates of the corner points of the regional image, the size of the regional image, etc.) is used to indicate the position of the regional image in the image to be matched, and the identification information (such as the serial number of the regional image and the category of the regional image) is used to identify the area image.
  • As an example, if a to-be-matched image includes two watermark images, the position information determined by the second sub-network may be "(1, x1, y1, w1, h1)" and "(2, x2, y2, w2, h2)", where 1 and 2 are the serial numbers of the two watermark images, (x1, y1) and (x2, y2) are the coordinates of the upper-left corners of the two watermark images, w1 and w2 are their widths, and h1 and h2 are their heights, respectively.
  • Then, the execution subject can extract feature vectors of the to-be-matched image, and extract the feature vectors corresponding to the two pieces of position information from them as the to-be-matched feature vectors.
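  • A minimal sketch of this position-information format (serial number, upper-left corner, width, height) follows; the PositionInfo type and the concrete coordinate values are hypothetical placeholders, not identifiers from the disclosure.

```python
# Hypothetical container for the position information format described above:
# (serial number, x of upper-left corner, y of upper-left corner, width, height).
from typing import NamedTuple

class PositionInfo(NamedTuple):
    serial: int
    x: float
    y: float
    w: float
    h: float

# Two watermark regions in one to-be-matched image, as in the example above.
positions = [PositionInfo(1, 10.0, 12.0, 64.0, 32.0),
             PositionInfo(2, 150.0, 40.0, 80.0, 30.0)]
for p in positions:
    print(f"region {p.serial}: upper-left=({p.x}, {p.y}), size={p.w}x{p.h}")
```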
  • the second sub-network may be a neural network based on an existing target detection network (for example, SSD (Single Shot MultiBox Detector), R-CNN (Region-based Convolutional Neural Networks), Faster R-CNN, etc.).
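  • For concreteness, here is a hedged sketch of how an off-the-shelf detector could supply the boxes that play the role of the position information; the use of torchvision's Faster R-CNN is an assumption for illustration, not a statement about the patent's network.

```python
# Sketch: an existing detection network (torchvision's Faster R-CNN, chosen
# purely for illustration) returns boxes, labels, and scores per image that
# could serve as the position information described above.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights=None)  # untrained, for interface illustration
detector.eval()

image = torch.randn(3, 480, 640)  # stand-in for a to-be-matched image
with torch.no_grad():
    outputs = detector([image])   # list with one dict per input image
boxes = outputs[0]["boxes"]       # (N, 4) tensor of x1, y1, x2, y2 box corners
```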
  • the foregoing second sub-network includes a dimension transformation layer for transforming a feature vector to a target dimension.
  • the dimension transformation layer may process the feature vector (for example, the values of some dimensions included in the feature vector are combined by taking an average value); it may also be a pooling layer included in the second sub-network.
  • the pooling layer can be used to down-sample or up-sample the input data to compress or increase the amount of data.
  • the above target dimensions may be various dimensions set by a technician, for example, the same dimensions as those of the reference feature vector.
  • the execution subject may input the to-be-matched image into the second sub-network included in the image recognition model; the second sub-network extracts at least one feature vector of the to-be-matched image, and the dimension transformation layer included in the second sub-network then performs dimension transformation on each extracted feature vector to obtain at least one to-be-matched feature vector having the same dimension as the reference feature vector.
  • As an example, an ROI (Region of Interest) Pooling layer can be used, so that each to-be-matched feature vector has the same dimension as the reference feature vector.
  • the ROI Pooling layer is a well-known technology that is widely studied and applied at present, and will not be repeated here.
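  • As a hedged illustration of the dimension transformation, the sketch below uses torchvision's roi_align as one concrete stand-in for an ROI Pooling layer; the feature-map shape, box coordinates, and output size are placeholders.

```python
# Sketch of the dimension transformation: ROI pooling crops each region from
# a feature map and resizes it to a fixed target size, so every to-be-matched
# feature vector ends up with the same dimension.
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)  # backbone output for one image (placeholder)
# Each box is (batch index, x1, y1, x2, y2) in feature-map coordinates.
boxes = torch.tensor([[0.0, 4.0, 4.0, 20.0, 12.0],
                      [0.0, 30.0, 10.0, 45.0, 22.0]])
pooled = roi_align(feature_map, boxes, output_size=(7, 7))  # (2, 256, 7, 7)
vectors = pooled.flatten(1)  # two fixed-length vectors of dimension 256 * 7 * 7
```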
  • Step 2032 Determine the distance between the obtained feature vector to be matched and the reference feature vector.
  • the execution subject may determine the distance between each of the obtained at least one to-be-matched feature vector and the reference feature vector.
  • the distance may be any of the following: Euclidean distance, Mahalanobis distance, and the like.
  • the preset distance may be any value greater than or equal to 0.
  • the distance can represent the degree of similarity between two feature vectors, that is, the degree of similarity between two images. As an example, the larger the distance between two feature vectors, the less similar the images corresponding to those feature vectors.
  • Step 2033 In response to determining that a distance smaller than or equal to a preset distance threshold exists in the determined distance, extract the image to be matched as an image matching the reference object image.
  • the distance threshold may be a value set by a technician according to experience, or may be a value calculated by the execution subject based on historical data (for example, by averaging recorded historical distance thresholds). Specifically, if a distance smaller than or equal to the distance threshold exists among the determined distances, it indicates that a region image similar to the reference object image exists in the to-be-matched image, that is, the to-be-matched image matches the reference object image.
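  • Steps 2032 and 2033 together amount to a nearest-distance test. The following minimal sketch illustrates the idea; the vector values and the threshold are placeholders, and Euclidean distance is one of the options named above.

```python
# Sketch of steps 2032-2033: compute the Euclidean distance between each
# to-be-matched feature vector and the reference feature vector, then keep
# the image if any distance falls within the preset threshold.
import numpy as np

reference = np.array([0.2, 0.8, 0.5])       # reference feature vector (placeholder)
to_match = [np.array([0.21, 0.79, 0.52]),   # close to the reference
            np.array([0.90, 0.10, 0.30])]   # far from the reference
distance_threshold = 0.1                    # preset threshold (placeholder)

distances = [np.linalg.norm(v - reference) for v in to_match]
if any(d <= distance_threshold for d in distances):
    print("to-be-matched image extracted as a match for the reference object image")
```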
  • the image recognition model can be obtained by training in advance through the following steps:
  • Step 301 Obtain a training sample set.
  • the training samples include sample object images, sample matching images, and labeled position information of the sample matching images.
  • the labeled position information characterizes the position of a region image included in the sample matching image.
  • the sample object image may be an image characterizing an object (for example, a watermark, a logo, a human face, an object, etc.).
  • the number of pieces of labeled position information may be at least one, each piece of labeled position information may correspond to a region image, and each such region image characterizes the same object as the sample object image.
  • Step 302 Select training samples from the training sample set.
  • the selection method and the number of training samples are not limited in this application.
  • the training samples may be selected from the training sample set randomly, or in order of the serial numbers of the training samples.
  • Step 303 Input the sample object image included in the selected training sample into the first sub-network included in the initial model to obtain a first feature vector, and input the sample matching image into the second sub-network included in the initial model to obtain at least one piece of position information and the second feature vector corresponding to the position information.
  • the initial model may be various existing neural network models created based on machine learning technology.
  • the neural network model may have various existing neural network structures (for example, DenseBox, VGGNet, ResNet, SegNet, etc.).
  • Each of the above feature vectors may be a vector composed of data extracted from certain layers (such as a convolution layer) included in the neural network model.
  • the above-mentioned first sub-network and second sub-network are the same as the first sub-network and the second sub-network described in step 202 and step 203, respectively, and details are not described herein again.
  • Step 304 From the obtained at least one position information, determine position information representing the target region image in the sample matching image as the target position information, and determine a second feature vector corresponding to the target position information as the target second feature vector.
  • the above-mentioned target area image may be an area image in which the characterized object is the same as the object characterized by the sample object image.
  • the execution subject performing this step may use position information specified by a technician as the target position information, the region image characterized by that position information as the target region image, and the second feature vector of the target region image as the target second feature vector; alternatively, the execution subject may, according to the obtained position information, determine the similarity between the region image corresponding to each piece of position information and the sample object image, determine the region image with the greatest similarity to the sample object image as the target region image, determine the position information of the target region image as the target position information, and determine the second feature vector of the target region image as the target second feature vector.
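  • A brief sketch of the second option (picking the candidate region closest to the sample object image in feature space) follows; the vectors are placeholders, and distance is used as an inverse proxy for similarity.

```python
# Sketch of one option in step 304: among the candidate second feature
# vectors, take the one closest to the first feature vector as the target
# second feature vector. All values are illustrative placeholders.
import numpy as np

first_vector = np.array([0.30, 0.60, 0.10])
second_vectors = [np.array([0.80, 0.20, 0.50]),
                  np.array([0.31, 0.58, 0.12])]
target_index = int(np.argmin([np.linalg.norm(v - first_vector)
                              for v in second_vectors]))
target_second_vector = second_vectors[target_index]  # the most similar region
```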
  • Step 305 Determine whether training of the initial model is complete based on a first loss value representing the error of the target position information and a second loss value representing the distance between the target second feature vector and the first feature vector.
  • the first loss value may represent a gap between the target position information and the labeled position information corresponding to the target area image.
  • the smaller the first loss value, the smaller the difference between the target position information and the labeled position information corresponding to the target region image, that is, the closer the target position information is to the labeled position information.
  • the first loss value can be obtained according to any of the following loss functions: Softmax loss function, Smooth L1 (smooth L1 norm) loss function, and the like.
  • the second loss value may represent a distance between the target second feature vector and the first feature vector.
  • the larger the second loss value, the greater the distance between the target second feature vector and the first feature vector, that is, the less similar the target region image and the sample object image.
  • the second loss value may be a distance between the target second feature vector and the first feature vector (eg, Euclidean distance, Mahalanobis distance, etc.).
  • the second loss value can be obtained by the Triplet loss function, where the Triplet loss function is as follows:

    $$L = \sum_{i}\left[\,\left\|f(x_i^{a}) - f(x_i^{p})\right\|_2^2 - \left\|f(x_i^{a}) - f(x_i^{n})\right\|_2^2 + \mathit{threshold}\,\right]_{+}$$

  • where $L$ is the second loss value; $\sum$ is the summation sign; $i$ is the serial number of each training sample selected this time; $a$ denotes the sample object image; $p$ denotes the positive sample image (that is, the target region image); and $n$ denotes a negative sample image (that is, a region image in the sample matching image other than the target region image, or a preset image characterizing an object different from the object characterized by the sample object image).
  • $f(x_i^{a})$ denotes the feature vector of the sample object image included in the training sample with serial number $i$, $f(x_i^{p})$ denotes the feature vector of the positive sample image (for example, the target region image) corresponding to the training sample with serial number $i$, and $f(x_i^{n})$ denotes the feature vector of the negative sample image (for example, a region image other than the target region image in the sample matching image) corresponding to the training sample with serial number $i$.
  • $\mathit{threshold}$ denotes a preset distance; $\|f(x_i^{a}) - f(x_i^{p})\|_2^2$ is the first distance, that is, the distance between the first feature vector and the feature vector of the positive sample image, and $\|f(x_i^{a}) - f(x_i^{n})\|_2^2$ is the second distance, that is, the distance between the first feature vector and the feature vector of the negative sample image.
  • the "+" subscript at the lower right of the square brackets means taking the positive part: when the value of the expression inside the brackets is positive, that value is taken; when it is negative, 0 is taken.
  • During training, the parameters of the initial model can be adjusted according to the back-propagation algorithm so that the value of L is minimized or converges, indicating that training is complete.
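  • As a hedged illustration, PyTorch's built-in TripletMarginLoss implements a margin-based triplet objective of the same general shape as the formula above; note that it uses plain (non-squared) L2 distances by default, so it is a stand-in rather than an exact reproduction of the formula.

```python
# Sketch: a margin-based triplet loss over batches of feature vectors.
# The margin plays the role of "threshold" in the formula above.
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)
anchor   = torch.randn(8, 128, requires_grad=True)  # first feature vectors (sample objects)
positive = torch.randn(8, 128)                      # target region feature vectors
negative = torch.randn(8, 128)                      # non-target region feature vectors

second_loss = triplet_loss(anchor, positive, negative)
second_loss.backward()  # gradients used by back-propagation to adjust parameters
```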
  • the execution subject performing this step may obtain a total loss value based on the first loss value and the second loss value, compare the total loss value with a target value, and determine whether training of the initial model is complete according to the comparison result.
  • the target value may be a preset loss value threshold. When the total loss value is less than or equal to the target value, it is determined that training is complete.
  • Specifically, the execution subject performing this step may, according to preset weight values, use the weighted summation of the first loss value and the second loss value as the total loss value, compare the total loss value with the target value, and determine whether training of the initial model is complete based on the comparison result.
  • the above weight values can adjust the proportion of the first loss value and the second loss value in the total loss value, so that the image recognition model can serve different purposes in different application scenarios (for example, some scenarios focus on extracting position information, while others focus on comparing distances between feature vectors).
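  • A minimal sketch of this weighted-sum criterion follows; the weights, loss values, and target value are placeholders rather than values from the disclosure.

```python
# Sketch of step 305's criterion: combine the two loss values with preset
# weights and compare the total against a target value.
w_position, w_distance = 1.0, 0.5     # preset weight values (placeholders)
first_loss, second_loss = 0.12, 0.30  # position error and triplet loss (placeholders)
target_value = 0.20

total_loss = w_position * first_loss + w_distance * second_loss
training_complete = total_loss <= target_value
print(f"total loss {total_loss:.3f}, training complete: {training_complete}")
```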
  • Step 306 In response to determining that the training is completed, determine the initial model as an image recognition model.
  • In addition, the execution subject that trains the image recognition model may, in response to determining that training of the initial model is not complete, adjust the parameters of the initial model, select a training sample from the unselected training samples in the training sample set, use the parameter-adjusted initial model as the initial model, and continue the training steps.
  • As an example, if the initial model is a convolutional neural network, the back-propagation algorithm can be used to adjust the weights in each convolutional layer in the initial model.
  • a training sample may be selected from the unselected training samples in the training sample set, and the initial model adjusted by parameters is used as the initial model, and steps 303 to 306 are continuously performed.
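  • Pulling steps 302 through 306 together, the following high-level training-loop sketch is offered purely as an illustration; the model interface (first_subnetwork, second_subnetwork, pick_target, position_loss, distance_loss) and the weighting are hypothetical placeholders, not components named in the disclosure.

```python
# Hypothetical high-level sketch of the training loop (steps 302-306).
# Every attribute of `model` below is a placeholder for the components
# described in the text, not an API from the patent.
import torch

def train(model, sample_batches, optimizer, target_value, w_distance=0.5):
    for object_img, match_img, labeled_pos in sample_batches:
        first_vec = model.first_subnetwork(object_img)                # step 303
        positions, second_vecs = model.second_subnetwork(match_img)
        tgt_pos, tgt_vec = model.pick_target(positions, second_vecs,
                                             first_vec)               # step 304
        loss1 = model.position_loss(tgt_pos, labeled_pos)             # first loss value
        loss2 = model.distance_loss(tgt_vec, first_vec)               # second loss value
        total = loss1 + w_distance * loss2                            # step 305
        if total.item() <= target_value:
            return model                                              # step 306: done
        optimizer.zero_grad()
        total.backward()              # adjust parameters via back-propagation
        optimizer.step()
    return model
```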
  • the execution subject that trains the image recognition model may be the same as or different from the execution subject of the method for extracting an image. If they are the same, the execution subject may store the structure information and parameter values of the trained image recognition model locally after training. If they are different, the execution subject that trains the image recognition model may send the structure information and parameter values of the trained model to the execution subject of the method for extracting an image after training is complete.
  • FIG. 4 is a schematic diagram of an application scenario of a method for extracting an image according to this embodiment.
  • the server 401 first obtains a watermark image 402 (ie, a reference object image) uploaded by the terminal device 408, and obtains a set of images 403 to be matched locally.
  • the server 401 inputs the watermark image 402 into the first sub-network 4041 included in the pre-trained image recognition model 404, and obtains a feature vector of the watermark image 402 as a reference feature vector 405.
  • Then, the server 401 selects a to-be-matched image 4031 from the to-be-matched image set 403, inputs the to-be-matched image 4031 into the second sub-network 4042 included in the image recognition model 404, and obtains position information 4061, 4062, and 4063 and the corresponding to-be-matched feature vectors 4071, 4072, and 4073, which are the feature vectors of the watermark images 40311, 40312, and 40313 included in the to-be-matched image 4031, respectively.
  • the server 401 determines that the distance between the feature vector to be matched 4071 and the reference feature vector 405 is less than or equal to a preset distance threshold, extracts the to-be-matched image 4031 as an image matching the reference object image, and sends the matched image to the terminal device 408.
  • the server 401 repeatedly selects the image to be matched from the set of images to be matched 403 and the watermarked image 402 to match, thereby extracting multiple images from the set of images to be matched 403 that match the watermarked image 402.
  • In the method provided by the above embodiment of the present application, a pre-trained image recognition model is used to obtain a reference feature vector of a reference object image and at least one to-be-matched feature vector of an image to be matched, and the distance between the reference feature vector and each to-be-matched feature vector is then compared to obtain an image matching the reference object image. This improves the pertinence of matching against the reference image and makes it possible, even when the training samples used to train the image recognition model do not include the reference image, to use the model to extract images that match the reference image, which improves the flexibility of image recognition and enriches the means of image recognition.
  • a flowchart 500 of still another embodiment of a method for extracting an image is shown.
  • the process 500 of the method for extracting an image includes the following steps:
  • Step 501 Obtain a reference object image and a set of images to be matched.
  • step 501 is substantially the same as step 201 in the embodiment corresponding to FIG. 2, and details are not described herein again.
  • Step 502 Input a reference object image into a first sub-network included in a pre-trained image recognition model, and obtain a feature vector of the reference object image as a reference feature vector.
  • step 502 is substantially the same as step 202 in the embodiment corresponding to FIG. 2, and details are not described herein again.
  • Step 503 For the to-be-matched images in the to-be-matched image set, perform the following extraction steps: input the to-be-matched image into a second sub-network included in the image recognition model to obtain at least one piece of position information and a to-be-matched feature vector corresponding to the position information; determine the distance between each obtained to-be-matched feature vector and the reference feature vector; in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, extract the to-be-matched image as an image matching the reference object image; determine the position information of the region image corresponding to a distance less than or equal to the distance threshold, and output the determined position information.
  • the execution subject may perform the following extraction step on the to-be-matched image:
  • Step 5031 Input the image to be matched into the second sub-network included in the image recognition model to obtain at least one position information and a feature vector to be matched corresponding to the position information.
  • Step 5031 is basically the same as step 2031 in the embodiment corresponding to FIG. 2, and details are not described herein again.
  • Step 5032 Determine the distance between the obtained feature vector to be matched and the reference feature vector. Step 5032 is basically the same as step 2032 in the embodiment corresponding to FIG. 2, and details are not described herein again.
  • Step 5033 In response to determining that a distance smaller than or equal to a preset distance threshold exists in the determined distance, extract the image to be matched as an image matching the reference object image. Step 5033 is basically the same as step 2033 in the embodiment corresponding to FIG. 2, and details are not described herein again.
  • Step 5034 Determine the position information of the area image corresponding to the distance less than or equal to the distance threshold, and output the determined position information.
  • Based on the distances determined in step 5032, the execution subject of the method for extracting an image may determine, from the position information obtained in step 5031, the position information corresponding to a distance less than or equal to the preset distance threshold, and output that position information.
  • the execution subject may output position information in various ways.
  • As an example, a display connected to the execution subject may show information included in the position information, such as the identification information and coordinate information of a region image.
  • the above-mentioned execution subject may generate a matched image including a position mark based on the output location information and the image to be matched after outputting the location information.
  • the position marker is used to mark the position, in the matched image, of the region image corresponding to the output position information.
  • the execution subject may draw a frame of a preset shape in the image to be matched according to the output position information, use the drawn frame as a position mark, and use the to-be-matched image including the position mark as a matched image.
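  • For illustration only, the sketch below draws such a rectangular frame with Pillow; the image, coordinates, and file name are placeholders in the (x, y, w, h) format described earlier.

```python
# Sketch of the position marker: draw a rectangular frame on the matched
# image at the output position information.
from PIL import Image, ImageDraw

matched = Image.new("RGB", (640, 480), "white")  # stand-in for a matched image
x, y, w, h = 150, 40, 80, 30                     # output position information (placeholder)
draw = ImageDraw.Draw(matched)
draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
matched.save("matched_with_marker.png")
```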
  • the process 500 of the method for extracting an image in this embodiment highlights the steps of outputting position information. Therefore, the solution described in this embodiment can further determine the position of the target region image included in the image to be matched, and improve the specificity of image recognition.
  • this application provides an embodiment of an apparatus for extracting an image.
  • the apparatus embodiment corresponds to the method embodiment shown in FIG. 2.
  • the device can be specifically applied to various electronic devices.
  • the apparatus 600 for extracting an image in this embodiment includes: an acquiring unit 601 configured to acquire a reference object image and a set of images to be matched; a generating unit 602 configured to input the reference object image into a first sub-network included in a pre-trained image recognition model and obtain a feature vector of the reference object image as a reference feature vector;
  • an extraction unit 603 is configured to perform the following extraction step on the to-be-matched images in the to-be-matched image set:
  • the to-be-matched image is input into a second sub-network included in the image recognition model to obtain at least one piece of position information and a to-be-matched feature vector corresponding to the position information, where the to-be-matched feature vector is the feature vector of a region image included in the to-be-matched image, and the position information is used to characterize the position of the region image in the to-be-matched image; the distance between each obtained to-be-matched feature vector and the reference feature vector is determined; and, in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, the to-be-matched image is extracted as an image matching the reference object image.
  • the obtaining unit 601 may obtain the reference object image and the set of images to be matched from a remote or local source through a wired connection method or a wireless connection method.
  • the reference object image may be an image to be compared with other images, and the reference object image is an image representing an object.
  • Objects can be various things, such as watermarks, signs, faces, objects, and so on.
  • the set of images to be matched may be a set of certain types of images (for example, images containing a trademark) stored in advance.
  • the generating unit 602 may input the reference object image into a first sub-network included in a pre-trained image recognition model, and obtain a feature vector of the reference object image as a reference feature vector.
  • the first sub-network is used to characterize the correspondence between the image and the feature vector of the image.
  • the image recognition model may be various neural network models created based on machine learning technology.
  • the neural network model may have a structure of various neural networks (for example, DenseBox, VGGNet, ResNet, SegNet, etc.).
  • the above reference feature vector may be a vector, extracted by the first sub-network included in the neural network model (for example, a network composed of one or more convolutional layers included in the neural network model), that characterizes features of the image (for example, shape, color, texture, and other characteristics).
  • the extraction unit 603 may perform the following steps on the image to be matched:
  • the to-be-matched image is input into a second sub-network included in the image recognition model, and at least one piece of position information and a to-be-matched feature vector corresponding to the position information are obtained.
  • the second sub-network is used to characterize the correspondence between the image and the position information of the image and the feature vector to be matched of the image.
  • the position information is used to characterize the position of the area image corresponding to the feature vector to be matched in the to-be-matched image.
  • the feature vector to be matched is a feature vector of an area image included in the image to be matched.
  • a distance between the obtained feature vector to be matched and the reference feature vector is determined.
  • the above-mentioned extraction unit 603 may determine a distance between each of the obtained at least one feature vector to be matched and the reference feature vector.
  • the distance may be any of the following: Euclidean distance, Mahalanobis distance, and the like.
  • Then, in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, the to-be-matched image is extracted as an image matching the reference object image.
  • the distance threshold may be a value set by a technician based on experience, or may be a value calculated by the extraction unit 603 according to historical data (for example, by averaging recorded historical distance thresholds).
  • the extraction unit 603 may include an output module configured to determine position information of a region image corresponding to a distance less than or equal to a distance threshold, and output the determined position information.
  • the extraction unit 603 may further include: a generating module configured to generate a matched image including a position marker based on the output position information and the to-be-matched image, where the position marker is used to mark the position, in the matched image, of the region image corresponding to the output position information.
  • the second sub-network may include a dimension transformation layer for transforming feature vectors to a target dimension; and the extraction unit 603 may be further configured to: input the to-be-matched image into the second sub-network included in the image recognition model to obtain at least one to-be-matched feature vector having the same dimension as the reference feature vector.
  • the image recognition model is obtained by training through the following steps: obtaining a training sample set, where each training sample includes a sample object image, a sample matching image, and labeled position information of the sample matching image, the labeled position information characterizing the position of a region image included in the sample matching image; selecting a training sample from the training sample set, and performing the following training steps: inputting the sample object image included in the selected training sample into a first sub-network included in an initial model to obtain a first feature vector, and inputting the sample matching image into a second sub-network included in the initial model to obtain at least one piece of position information and a second feature vector corresponding to the position information; determining, from the obtained at least one piece of position information, the position information characterizing the target region image in the sample matching image as the target position information, and determining the second feature vector corresponding to the target position information as the target second feature vector; and determining, based on a first loss value representing the error of the target position information and a second loss value representing the distance between the target second feature vector and the first feature vector, whether training of the initial model is complete.
  • In some optional implementations, the execution subject that trains the image recognition model may, according to preset weight values, use the weighted summation of the first loss value and the second loss value as the total loss value, compare the total loss value with the target value, and determine whether training of the initial model is complete based on the comparison result.
  • the steps for training the image recognition model may further include: in response to determining that training of the initial model is not complete, adjusting the parameters of the initial model, selecting a training sample from the unselected training samples in the training sample set, using the parameter-adjusted initial model as the initial model, and continuing the training steps.
  • The apparatus provided by the above embodiment of the present application obtains a reference feature vector of a reference object image and at least one to-be-matched feature vector of an image to be matched by using a pre-trained image recognition model, and then compares the distance between the reference feature vector and each to-be-matched feature vector to obtain an image matching the reference object image. This improves the pertinence of matching against the reference image and makes it possible, even when the training samples used to train the image recognition model do not include the reference image, to use the model to extract images matching the reference image, which improves the flexibility of image recognition and enriches the means of image recognition.
  • FIG. 7 illustrates a schematic structural diagram of a computer system 700 suitable for implementing an electronic device (such as a server or a terminal device shown in FIG. 1) in the embodiment of the present application.
  • the electronic device shown in FIG. 7 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
  • the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703.
  • In the RAM 703, various programs and data required for the operation of the system 700 are also stored.
  • the CPU 701, ROM 702, and RAM 703 are connected to each other through a bus 704.
  • An input/output (I/O) interface 705 is also connected to the bus 704.
  • The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage section 708; and a communication section 709 including a network interface card such as a LAN card, a modem, and the like.
  • the communication section 709 performs communication processing via a network such as the Internet.
  • The drive 710 is also connected to the I/O interface 705 as needed.
  • a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 710 as needed, so that a computer program read out therefrom is installed into the storage section 708 as needed.
  • the process described above with reference to the flowchart may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing a method shown in a flowchart.
  • the computer program may be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711.
  • the computer-readable medium described in this application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • the computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations of this application may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer, partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through an Internet connection provided by an Internet service provider).
  • each block in the flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing a specified logical function.
  • the functions noted in the blocks may also occur in an order different from that noted in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units described in the embodiments of the present application may be implemented by software or hardware.
  • the described units may also be provided in a processor, which may, for example, be described as: a processor including an acquiring unit, a generating unit, and an extraction unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquiring unit may also be described as "a unit for obtaining a reference object image and a set of images to be matched".
  • The present application also provides a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device.
  • The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to: obtain a reference object image and a set of images to be matched; input the reference object image into a first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector; and, for each image to be matched in the set, perform the following extraction steps: input the image to be matched into a second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to the position information, where the feature vector to be matched is a feature vector of a region image included in the image to be matched and the position information characterizes the position of the region image in the image to be matched; determine the distance between each obtained feature vector to be matched and the reference feature vector; and, in response to determining that the determined distances include a distance less than or equal to a preset distance threshold, extract the image to be matched as an image matching the reference object image.

Abstract

Embodiments of the present application disclose an image extraction method and device. A specific embodiment of the method comprises: acquiring a reference object image and a set of images to be matched; and inputting the reference object image into a first sub network comprised in a pre-trained image recognition model, and acquiring a feature vector of the reference object image as a reference feature vector, wherein the following extraction steps are executed on images to be matched in the set of images to be matched: inputting the image to be matched into a second sub network comprised in the image recognition model, so as to acquire at least one piece of position information and a feature vector to be matched corresponding to the position information; determining the distance between the acquired feature vector to be matched and the reference feature vector; and in response to determining that the determined distances comprise a distance less than or equal to a preset distance threshold, extracting the image to be matched as an image matching the reference object image. The embodiments improve the flexibility of image recognition, and enrich the means of image recognition.

Description

Method and Device for Extracting Images

This patent application claims priority to Chinese patent application No. 201810715195.6, filed on July 3, 2018, in the name of Beijing ByteDance Network Technology Co., Ltd. and entitled "Method and Device for Extracting Images", which is incorporated herein by reference in its entirety.

Technical Field

Embodiments of the present application relate to the field of computer technology, and in particular, to a method and an apparatus for extracting an image.

Background

At present, image recognition technology is applied in more and more fields, and using an image recognition model to recognize images is a common technique in image recognition. An image recognition model is usually a model trained with a large number of training samples. For an image recognition model to be able to recognize a target image (such as a watermark image, a person image, or an object image) within an image, it is usually necessary to train it with sample images that contain the target image.
Summary of the Invention

The embodiments of the present application provide a method and an apparatus for extracting an image.

In a first aspect, an embodiment of the present application provides a method for extracting an image. The method includes: obtaining a reference object image and a set of images to be matched; inputting the reference object image into a first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector; and, for each image to be matched in the set of images to be matched, performing the following extraction steps: inputting the image to be matched into a second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to the position information, where the feature vector to be matched is a feature vector of a region image included in the image to be matched and the position information is used to characterize the position of the region image in the image to be matched; determining the distance between each obtained feature vector to be matched and the reference feature vector; and, in response to determining that the determined distances include a distance less than or equal to a preset distance threshold, extracting the image to be matched as an image matching the reference object image.

In some embodiments, the extraction steps further include: determining the position information of the region image corresponding to a distance less than or equal to the distance threshold, and outputting the determined position information.

In some embodiments, the extraction steps further include: generating, based on the output position information and the image to be matched, a matched image including a position marker, where the position marker is used to mark the position, in the matched image, of the to-be-matched region image corresponding to the output position information.

In some embodiments, the second sub-network includes a dimension transformation layer for transforming feature vectors to a target dimension; and inputting the image to be matched into the second sub-network included in the image recognition model to obtain at least one feature vector to be matched includes: inputting the image to be matched into the second sub-network included in the image recognition model to obtain at least one feature vector to be matched having the same dimension as the reference feature vector.
In some embodiments, the image recognition model is trained through the following steps: obtaining a training sample set, where each training sample includes a sample object image, a sample matching image, and annotated position information of the sample matching image, the annotated position information characterizing the position of a region image included in the sample matching image; selecting training samples from the training sample set, and performing the following training steps: inputting the sample object image included in a selected training sample into a first sub-network included in an initial model to obtain a first feature vector, and inputting the sample matching image into a second sub-network included in the initial model to obtain at least one piece of position information and a second feature vector corresponding to the position information; determining, from the obtained at least one piece of position information, the position information characterizing a target region image in the sample matching image as target position information, and determining the second feature vector corresponding to the target position information as a target second feature vector; determining, based on a first loss value characterizing the error of the target position information and a second loss value characterizing the gap between the target second feature vector and the first feature vector, whether training of the initial model is complete; and, in response to determining that training is complete, determining the initial model as the image recognition model.

In some embodiments, determining whether training of the initial model is complete based on the first loss value characterizing the error of the target position information and the second loss value characterizing the gap between the target second feature vector and the first feature vector includes: taking, according to preset weight values, the weighted sum of the first loss value and the second loss value as a total loss value, comparing the total loss value with a target value, and determining, according to the comparison result, whether training of the initial model is complete.

In some embodiments, the step of training the image recognition model further includes: in response to determining that training of the initial model is not complete, adjusting the parameters of the initial model, selecting training samples from the unselected training samples in the training sample set, and continuing the training steps using the parameter-adjusted initial model as the initial model.
In a second aspect, an embodiment of the present application provides an apparatus for extracting an image. The apparatus includes: an acquisition unit configured to obtain a reference object image and a set of images to be matched; a generation unit configured to input the reference object image into a first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector; and an extraction unit configured to perform, for each image to be matched in the set of images to be matched, the following extraction steps: inputting the image to be matched into a second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to the position information, where the feature vector to be matched is a feature vector of a region image included in the image to be matched and the position information is used to characterize the position of the region image in the image to be matched; determining the distance between each obtained feature vector to be matched and the reference feature vector; and, in response to determining that the determined distances include a distance less than or equal to a preset distance threshold, extracting the image to be matched as an image matching the reference object image.

In some embodiments, the extraction unit includes: an output module configured to determine the position information of the region image corresponding to a distance less than or equal to the distance threshold, and to output the determined position information.

In some embodiments, the extraction unit further includes: a generation module configured to generate, based on the output position information and the image to be matched, a matched image including a position marker, where the position marker is used to mark the position, in the matched image, of the to-be-matched region image corresponding to the output position information.

In some embodiments, the second sub-network includes a dimension transformation layer for transforming feature vectors to a target dimension; and the extraction unit is further configured to: input the image to be matched into the second sub-network included in the image recognition model to obtain at least one feature vector to be matched having the same dimension as the reference feature vector.

In some embodiments, the image recognition model is trained through the following steps: obtaining a training sample set, where each training sample includes a sample object image, a sample matching image, and annotated position information of the sample matching image, the annotated position information characterizing the position of a region image included in the sample matching image; selecting training samples from the training sample set, and performing the following training steps: inputting the sample object image included in a selected training sample into a first sub-network included in an initial model to obtain a first feature vector, and inputting the sample matching image into a second sub-network included in the initial model to obtain at least one piece of position information and a second feature vector corresponding to the position information; determining, from the obtained at least one piece of position information, the position information characterizing a target region image in the sample matching image as target position information, and determining the second feature vector corresponding to the target position information as a target second feature vector; determining, based on a first loss value characterizing the error of the target position information and a second loss value characterizing the gap between the target second feature vector and the first feature vector, whether training of the initial model is complete; and, in response to determining that training is complete, determining the initial model as the image recognition model.

In some embodiments, determining whether training of the initial model is complete based on the first loss value characterizing the error of the target position information and the second loss value characterizing the gap between the target second feature vector and the first feature vector includes: taking, according to preset weight values, the weighted sum of the first loss value and the second loss value as a total loss value, comparing the total loss value with a target value, and determining, according to the comparison result, whether training of the initial model is complete.

In some embodiments, the step of training the image recognition model further includes: in response to determining that training of the initial model is not complete, adjusting the parameters of the initial model, selecting training samples from the unselected training samples in the training sample set, and continuing the training steps using the parameter-adjusted initial model as the initial model.
In a third aspect, an embodiment of the present application provides an electronic device. The electronic device includes: one or more processors; and a storage device on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method as described in any implementation of the first aspect.

The method and apparatus for extracting an image provided by the embodiments of the present application obtain, by using a pre-trained image recognition model, a reference feature vector of a reference image and at least one feature vector to be matched of an image to be matched, and then obtain images matching the reference image by comparing the distances between the reference feature vector and the feature vectors to be matched. In this way, even when the training samples used to train the image recognition model do not include the reference image, the image recognition model can be used to extract images matching the reference image, which improves the flexibility of image recognition and enriches the means of image recognition.
Brief Description of the Drawings

Other features, objects, and advantages of the present application will become more apparent by reading the detailed description of the non-limiting embodiments made with reference to the following drawings:

FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application can be applied;

FIG. 2 is a flowchart of an embodiment of a method for extracting an image according to the present application;

FIG. 3 is a flowchart of training an image recognition model for a method for extracting an image according to the present application;

FIG. 4 is a schematic diagram of an application scenario of a method for extracting an image according to the present application;

FIG. 5 is a flowchart of still another embodiment of a method for extracting an image according to the present application;

FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for extracting an image according to the present application;

FIG. 7 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
Detailed Description

The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the relevant invention, not to limit it. It should also be noted that, for ease of description, only the parts related to the relevant invention are shown in the drawings.

It should be noted that, where no conflict arises, the embodiments in the present application and the features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the drawings and in conjunction with the embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which a method for extracting an image or an apparatus for extracting an image according to an embodiment of the present application can be applied.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications, such as image processing applications, photographing applications, and social platform software, may be installed on the terminal devices 101, 102, and 103.

The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices with a display screen, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.

The server 105 may be a server providing various services, for example, a background server providing support for the various applications on the terminal devices 101, 102, and 103. The background server may perform processing such as analysis on an acquired image and output the processing result (for example, an extracted image matching the reference image).

It should be noted that the method for extracting an image provided in the embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, and 103. Correspondingly, the apparatus for extracting an image may be provided in the server 105 or in the terminal devices 101, 102, and 103.

It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules used to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. Depending on implementation needs, there may be any number of terminal devices, networks, and servers.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for extracting an image according to the present application is shown. The method for extracting an image includes the following steps:

Step 201: Obtain a reference object image and a set of images to be matched.

In this embodiment, the execution subject of the method for extracting an image (for example, the server or terminal device shown in FIG. 1) may obtain the reference object image and the set of images to be matched remotely or locally through a wired or wireless connection. The reference object image may be an image to be compared with other images, and it may be an image characterizing some object. The object may be any of various things, such as a watermark, a logo, a human face, or a physical object. The set of images to be matched may be a pre-stored set of images of a certain type (for example, images containing trademarks).
Step 202: Input the reference object image into a first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector.

In this embodiment, the above execution subject may input the reference object image into the first sub-network included in the pre-trained image recognition model to obtain the feature vector of the reference object image as the reference feature vector, where the first sub-network is used to characterize the correspondence between an image and its feature vector. In this embodiment, the image recognition model may be any of various neural network models created based on machine learning techniques, and may have the structure of various neural networks (for example, DenseBox, VGGNet, ResNet, or SegNet). The reference feature vector may be a vector composed of data extracted by the first sub-network included in the neural network model (for example, a network composed of one or more convolutional layers of the neural network model) that characterizes features of the image (for example, shape, color, and texture features).
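The patent does not prescribe a concrete implementation, but as a rough illustration, a first sub-network of this kind might look like the following PyTorch-style sketch; the ResNet-18 backbone, the 224x224 input size, and the 512-dimensional output are illustrative assumptions, not part of the disclosure:

```python
import torch
import torchvision.models as models

class FirstSubNetwork(torch.nn.Module):
    """Hypothetical first sub-network: a CNN backbone whose globally
    pooled activations serve as the image-level feature vector."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Keep everything up to and including global average pooling,
        # dropping the classification head.
        self.features = torch.nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, image):            # image: (N, 3, H, W)
        feat = self.features(image)      # (N, 512, 1, 1)
        return feat.flatten(1)           # (N, 512) feature vectors

first_net = FirstSubNetwork()
reference_image = torch.randn(1, 3, 224, 224)  # stand-in for the reference object image
reference_vector = first_net(reference_image)  # reference feature vector, shape (1, 512)
```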
Step 203: For each image to be matched in the set of images to be matched, perform the following extraction steps: input the image to be matched into a second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to the position information; determine the distance between each obtained feature vector to be matched and the reference feature vector; and, in response to determining that the determined distances include a distance less than or equal to a preset distance threshold, extract the image to be matched as an image matching the reference object image.

In this embodiment, for each image to be matched in the set of images to be matched, the above execution subject may perform the following extraction steps on the image:

Step 2031: Input the image to be matched into the second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to the position information. The second sub-network is used to characterize the correspondence between an image, the position information of the image, and the feature vectors to be matched of the image. The position information is used to characterize the position, in the image to be matched, of the region image corresponding to a feature vector to be matched; a feature vector to be matched is the feature vector of a region image included in the image to be matched. In this embodiment, the second sub-network (for example, a network composed of one or more convolutional layers of the neural network model) may determine, from the image to be matched and according to the determined at least one piece of position information, the region image characterized by each piece of position information, and determine the feature vector of each region image. A region image may be an image characterizing some object (for example, a watermark or a logo). Optionally, the position information may include coordinate information and identification information, where the coordinate information (for example, the corner-point coordinates and the size of the region image) indicates the position of the region image in the image to be matched, and the identification information (for example, the serial number or category of the region image) identifies the region image. As an example, suppose an image to be matched includes two watermark images, and the position information determined by the second sub-network is (1, x1, y1, w1, h1) and (2, x2, y2, w2, h2), where 1 and 2 are the serial numbers of the two watermark images, (x1, y1) and (x2, y2) are the coordinates of their upper-left corner points, w1 and w2 are their widths, and h1 and h2 are their heights. Using the second sub-network, the above execution subject can extract the feature vectors of the image to be matched and, from these, extract the feature vectors corresponding to the two pieces of position information as the feature vectors to be matched.
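As a small illustration of the (serial number, x, y, w, h) format in the example above, the following hypothetical helper converts such position information into the corner-coordinate form used by many detection toolkits:

```python
def position_info_to_corners(position_info):
    """Convert (serial_number, x, y, w, h), where (x, y) is the upper-left
    corner point and w, h are the width and height, into the corner form
    (serial_number, x1, y1, x2, y2)."""
    serial_number, x, y, w, h = position_info
    return serial_number, x, y, x + w, y + h

# For the first watermark in the example, (1, x1, y1, w1, h1):
print(position_info_to_corners((1, 10, 20, 50, 30)))  # (1, 10, 20, 60, 50)
```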
In practice, the second sub-network may be a neural network built on an existing object detection network (for example, SSD (Single Shot MultiBox Detector), R-CNN (Region-based Convolutional Neural Networks), or Faster R-CNN). Using the second sub-network, the feature vectors of the to-be-matched region images can be extracted from the image to be matched, which makes the matching between images more targeted and helps improve the efficiency and accuracy of image recognition.

In some optional implementations of this embodiment, the second sub-network includes a dimension transformation layer for transforming feature vectors to a target dimension. The dimension transformation layer may be a formula that processes feature vectors (for example, merging the values of certain dimensions of a feature vector by averaging them), or it may be a pooling layer included in the second sub-network. A pooling layer can be used to down-sample or up-sample the input data, so as to compress or increase the amount of data. The target dimension may be any dimension set by a technician, for example, the same dimension as that of the reference feature vector. The above execution subject may input the image to be matched into the second sub-network included in the image recognition model; the second sub-network extracts at least one feature vector of the image to be matched, and the dimension transformation layer included in the second sub-network then performs a dimension transformation on each extracted feature vector to obtain at least one feature vector to be matched with the same dimension as the reference feature vector. In practice, an ROI Pooling (Region Of Interest Pooling) layer can be used so that every feature vector to be matched has the same dimension as the reference feature vector. The ROI Pooling layer is a well-known technique that is widely studied and applied, and is not described in detail here.
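For instance, torchvision's ROI pooling operator can play the role of such a dimension transformation layer; the feature-map size and the two candidate regions below are made-up values for illustration only:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 38, 38)  # backbone output for one image to be matched
# Two candidate regions as (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0.0, 4.0, 4.0, 20.0, 20.0],
                     [0.0, 10.0, 8.0, 30.0, 24.0]])
pooled = roi_pool(feature_map, rois, output_size=(1, 1))  # (2, 512, 1, 1)
vectors_to_match = pooled.flatten(1)                      # (2, 512), fixed dimension
```

Because output_size is fixed, every region, regardless of its size, yields a vector of the same dimension, which is exactly what comparing against the reference feature vector requires.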
Step 2032: Determine the distance between each obtained feature vector to be matched and the reference feature vector. Specifically, the above execution subject may determine the distance between each of the obtained at least one feature vector to be matched and the reference feature vector. The distance may be any of the following: Euclidean distance, Mahalanobis distance, and the like. The preset distance may be any value greater than or equal to 0. The distance can characterize the degree of similarity between two feature vectors, and thus between two images. As an example, the larger the distance between two feature vectors, the less similar the images corresponding to those feature vectors.

Step 2033: In response to determining that the determined distances include a distance less than or equal to a preset distance threshold, extract the image to be matched as an image matching the reference object image. The distance threshold may be a value set by a technician based on experience, or a value calculated by the above execution subject (for example, as an average) from historical data (for example, recorded historical distance thresholds). Specifically, if any of the determined distances is less than or equal to the distance threshold, this indicates that a region image similar to the reference object image exists in the image to be matched, that is, that the image to be matched matches the reference object image.
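Steps 2032 and 2033 together amount to a nearest-distance test. A minimal sketch, assuming Euclidean distance, a placeholder threshold value, and the tensors from the sketches above, might be:

```python
import torch

def image_matches_reference(vectors_to_match, reference_vector, distance_threshold=0.8):
    """vectors_to_match: (K, D) feature vectors of the K region images;
    reference_vector: (1, D). Returns whether any region lies within the
    distance threshold of the reference, plus the distances themselves."""
    # Euclidean distance between each to-be-matched vector and the reference.
    distances = torch.cdist(vectors_to_match, reference_vector).squeeze(1)  # (K,)
    return bool((distances <= distance_threshold).any()), distances

matched, distances = image_matches_reference(vectors_to_match, reference_vector)
# If matched is True, the image to be matched is extracted as a matching image.
```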
By performing this step, images matching the reference object image can be extracted from the set of images to be matched even when the training samples used to train the image recognition model do not include the reference object image. Moreover, comparing the region images included in the image to be matched with the reference object makes the matching more targeted, thereby improving the accuracy of image recognition.

In some optional implementations of this embodiment, as shown in FIG. 3, the image recognition model may be trained in advance through the following steps:

Step 301: Obtain a training sample set. Each training sample includes a sample object image, a sample matching image, and annotated position information of the sample matching image, where the annotated position information characterizes the position of a region image included in the sample matching image. The sample object image may be an image characterizing some object (for example, a watermark, a logo, a human face, or a physical object). There may be at least one piece of annotated position information, and each piece may correspond to one region image; among these region images is a region image whose characterized object is the same as the object characterized by the sample object image.

Step 302: Select training samples from the training sample set. The manner of selecting training samples and the number selected are not limited in this application. For example, training samples may be selected from the training sample set at random or in the order of their serial numbers.
Step 303: Input the sample object image included in a selected training sample into a first sub-network included in an initial model to obtain a first feature vector, and input the sample matching image into a second sub-network included in the initial model to obtain at least one piece of position information and a second feature vector corresponding to the position information.

The initial model may be any of various existing neural network models created based on machine learning techniques, and may have the structure of various existing neural networks (for example, DenseBox, VGGNet, ResNet, or SegNet). Each of the above feature vectors may be a vector composed of data extracted from certain layers (for example, convolutional layers) included in the neural network model. The first sub-network and the second sub-network here are the same as those described in step 202 and step 203, respectively, and are not described again.

Step 304: From the obtained at least one piece of position information, determine the position information characterizing a target region image in the sample matching image as target position information, and determine the second feature vector corresponding to the target position information as a target second feature vector.

Specifically, the target region image may be a region image whose characterized object is the same as the object characterized by the sample object image. The execution subject of this step may take position information designated by a technician as the target position information, take the region image characterized by the target position information as the target region image, and take the second feature vector of the target region image as the target second feature vector. Alternatively, the execution subject of this step may, based on the obtained position information, determine the similarity between the region image corresponding to each piece of position information and the sample object image, determine the region image with the greatest similarity to the sample object image as the target region image, determine its position information as the target position information, and determine its second feature vector as the target second feature vector.
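The second branch described above (picking the region most similar to the sample object image) could be sketched as follows, treating a smaller feature-vector distance as greater similarity; all tensor names are hypothetical:

```python
import torch

# second_feature_vectors: (K, D), one vector per candidate region image;
# first_feature_vector: (1, D); position_infos: list of K position tuples.
distances = torch.cdist(second_feature_vectors, first_feature_vector).squeeze(1)
target_index = int(distances.argmin())            # most similar region image
target_position_info = position_infos[target_index]
target_second_vector = second_feature_vectors[target_index]
```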
Step 305: Determine, based on a first loss value characterizing the error of the target position information and a second loss value characterizing the gap between the target second feature vector and the first feature vector, whether training of the initial model is complete.

The first loss value can characterize the gap between the target position information and the annotated position information corresponding to the target region image. In general, the smaller the first loss value, the smaller the gap between the target position information and the annotated position information, that is, the closer the target position information is to the annotated position information. In practice, the first loss value may be obtained from any of the following loss functions: a Softmax loss function, a Smooth L1 (smooth L1 norm) loss function, and the like.
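As an illustration of the first loss value, a Smooth L1 loss between the predicted target position information and the annotated position information might be computed as below; the (x, y, w, h) box encoding and the numbers are assumptions carried over from the earlier example:

```python
import torch
import torch.nn.functional as F

predicted_position = torch.tensor([12.0, 18.0, 48.0, 33.0])  # model's target position info
annotated_position = torch.tensor([10.0, 20.0, 50.0, 30.0])  # labeled position info
first_loss = F.smooth_l1_loss(predicted_position, annotated_position)
```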
The second loss value can characterize the distance between the target second feature vector and the first feature vector. In general, the larger the second loss value, the greater the distance between the target second feature vector and the first feature vector, that is, the less similar the target region image and the sample object image. As an example, the second loss value may be the distance between the target second feature vector and the first feature vector (for example, the Euclidean distance or the Mahalanobis distance).
As another example, the second loss value may be obtained from a Triplet loss function, which can be written as:

L = \sum_i \left[ \lVert f(x_i^{a}) - f(x_i^{p}) \rVert_2^2 - \lVert f(x_i^{a}) - f(x_i^{n}) \rVert_2^2 + \mathrm{threshold} \right]_+

Here L is the second loss value, the sum runs over the training samples selected this round, and i is the serial number of each selected training sample. The superscript a denotes the sample object image, p denotes the positive sample image (that is, the target region image), and n denotes the negative sample image (that is, a region image in the sample matching image other than the target region image, or a preset image whose characterized object differs from the object characterized by the sample object image). f(x_i^{a}) is the feature vector of the sample object image included in the training sample with serial number i, f(x_i^{p}) is the feature vector of the corresponding positive sample image (for example, the target region image), and f(x_i^{n}) is the feature vector of the corresponding negative sample image (for example, a region image in the sample matching image other than the target region image). threshold denotes the preset distance. The term \lVert f(x_i^{a}) - f(x_i^{p}) \rVert_2^2 is the first distance (that is, the distance between the first feature vector and the feature vector of the positive sample image), and \lVert f(x_i^{a}) - f(x_i^{n}) \rVert_2^2 is the second distance (that is, the distance between the first feature vector and the feature vector of the negative sample image). The subscript "+" on the square brackets denotes taking the positive part: when the expression inside the brackets evaluates to a positive value, that value is taken, and when it is negative, 0 is taken. In practice, during training, the parameters of the initial model can be adjusted according to the back-propagation algorithm so that the value of L is minimized or converges, which indicates that training is complete.
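The formula above translates almost directly into code; a minimal sketch using squared Euclidean distances is given below, with the batch shapes being assumptions:

```python
import torch
import torch.nn.functional as F

def triplet_second_loss(anchor, positive, negative, threshold=0.2):
    """anchor: first feature vectors f(x_i^a); positive: target-region
    vectors f(x_i^p); negative: non-target vectors f(x_i^n); all of
    shape (batch, dim). threshold is the preset distance (margin)."""
    first_distance = (anchor - positive).pow(2).sum(dim=1)    # ||f(x^a) - f(x^p)||^2
    second_distance = (anchor - negative).pow(2).sum(dim=1)   # ||f(x^a) - f(x^n)||^2
    return F.relu(first_distance - second_distance + threshold).sum()  # [.]_+ summed over i
```

PyTorch also ships a built-in torch.nn.TripletMarginLoss, though it uses the (non-squared) p-norm distance by default, so it is not numerically identical to the squared-distance form written here.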
In this embodiment, the execution subject of this step may obtain a total loss value based on the first loss value and the second loss value, compare the total loss value with a target value, and determine, according to the comparison result, whether training of the initial model is complete. The target value may be a preset loss-value threshold; when the difference between the total loss value and the target value is less than or equal to the loss-value threshold, it is determined that training is complete.

In some optional implementations of this embodiment, the execution subject of this step may take, according to preset weight values, the weighted sum of the first loss value and the second loss value as the total loss value, compare the total loss value with the target value, and determine, according to the comparison result, whether training of the initial model is complete. The weight values adjust the proportions of the first loss value and the second loss value in the total loss value, so that the image recognition model can serve different functions in different application scenarios (for example, some scenarios emphasize extracting position information, while others emphasize comparing the distances between feature vectors).
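Continuing the earlier sketches, the weighted combination could look like the following; the loss values, weight values, and target value are all placeholders a practitioner would tune per scenario:

```python
import torch

first_loss = torch.tensor(0.4)    # placeholder: first loss value (position error)
second_loss = torch.tensor(0.1)   # placeholder: second loss value (feature-distance gap)

position_weight, distance_weight = 1.0, 0.5   # preset weight values (assumed)
target_value = 0.05                           # preset loss-value threshold (assumed)

total_loss = position_weight * first_loss + distance_weight * second_loss
training_complete = bool(total_loss.item() <= target_value)
```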
Step 306: In response to determining that training is complete, determine the initial model as the image recognition model.

In some optional implementations of this embodiment, the execution subject that trains the image recognition model may, in response to determining that training of the initial model is not complete, adjust the parameters of the initial model, select training samples from the unselected training samples in the training sample set, and continue the training steps using the parameter-adjusted initial model as the initial model. For example, assuming the initial model is a convolutional neural network, the back-propagation algorithm may be used to adjust the weights in each convolutional layer of the initial model. Then, training samples may be selected from the unselected training samples in the training sample set, and steps 303 to 306 may be continued with the parameter-adjusted initial model as the initial model.
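A rough sketch of this adjust-and-continue loop, with an assumed SGD optimizer and hypothetical helper names (initial_model, sample_batches, compute_total_loss, target_value), might be:

```python
import torch

optimizer = torch.optim.SGD(initial_model.parameters(), lr=1e-3)

for sample_batch in sample_batches:                # unselected training samples
    optimizer.zero_grad()
    total_loss = compute_total_loss(initial_model, sample_batch)  # as sketched above
    total_loss.backward()                          # back-propagation
    optimizer.step()                               # adjust the model parameters
    if total_loss.item() <= target_value:          # training-completion check
        break   # the initial model is then taken as the image recognition model
```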
It should be noted that the execution subject that trains the image recognition model may be the same as or different from the execution subject of the method for extracting an image. If they are the same, the execution subject that trains the image recognition model may, after training, store the structure information and parameter values of the trained image recognition model locally. If they are different, the execution subject that trains the image recognition model may, after training, send the structure information and parameter values of the trained image recognition model to the execution subject of the method for extracting an image.
With continued reference to FIG. 4, FIG. 4 is a schematic diagram of an application scenario of the method for extracting an image according to this embodiment. In the application scenario of FIG. 4, the server 401 first obtains a watermark image 402 (that is, the reference object image) uploaded by the terminal device 408, and obtains a set of images to be matched 403 locally. The server 401 inputs the watermark image 402 into the first sub-network 4041 included in a pre-trained image recognition model 404, and obtains the feature vector of the watermark image 402 as a reference feature vector 405.

Then, the server 401 selects an image to be matched 4031 from the set of images to be matched 403 and inputs it into the second sub-network 4042 included in the image recognition model 404, obtaining position information 4061, 4062, and 4063 and the corresponding feature vectors to be matched 4071, 4072, and 4073, which are the feature vectors of the watermark images 40311, 40312, and 40313 included in the image to be matched 4031.

Finally, the server 401 determines that the distance between the feature vector to be matched 4071 and the reference feature vector 405 is less than or equal to a preset distance threshold, extracts the image to be matched 4031 as an image matching the reference object image, and sends the matched image to the terminal device 408. By repeatedly selecting images from the set of images to be matched 403 and matching them against the watermark image 402, the server 401 extracts from the set 403 multiple images matching the watermark image 402.

The method provided by the above embodiments of the present application obtains, by using a pre-trained image recognition model, the reference feature vector of a reference image and at least one feature vector to be matched of an image to be matched, and then obtains images matching the reference image by comparing the distances between the reference feature vector and the feature vectors to be matched. This makes the matching against the reference image more targeted and makes it possible, even when the training samples required for training the image recognition model do not include the reference image, to use the image recognition model to extract images matching the reference image, improving the flexibility of image recognition and enriching the means of image recognition.
With further reference to FIG. 5, a flow 500 of still another embodiment of the method for extracting an image is shown. The flow 500 of the method for extracting an image includes the following steps:

Step 501: Obtain a reference object image and a set of images to be matched.

In this embodiment, step 501 is substantially the same as step 201 in the embodiment corresponding to FIG. 2, and is not described again here.

Step 502: Input the reference object image into a first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector.

In this embodiment, step 502 is substantially the same as step 202 in the embodiment corresponding to FIG. 2, and is not described again here.

Step 503: For each image to be matched in the set of images to be matched, perform the following extraction steps: input the image to be matched into a second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to the position information; determine the distance between each obtained feature vector to be matched and the reference feature vector; in response to determining that the determined distances include a distance less than or equal to a preset distance threshold, extract the image to be matched as an image matching the reference object image; and determine the position information of the region image corresponding to a distance less than or equal to the distance threshold, and output the determined position information.
In this embodiment, for each image to be matched in the set of images to be matched, the above execution subject may perform the following extraction steps on the image:

Step 5031: Input the image to be matched into the second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to the position information. Step 5031 is substantially the same as step 2031 in the embodiment corresponding to FIG. 2, and is not described again here.

Step 5032: Determine the distance between each obtained feature vector to be matched and the reference feature vector. Step 5032 is substantially the same as step 2032 in the embodiment corresponding to FIG. 2, and is not described again here.

Step 5033: In response to determining that the determined distances include a distance less than or equal to a preset distance threshold, extract the image to be matched as an image matching the reference object image. Step 5033 is substantially the same as step 2033 in the embodiment corresponding to FIG. 2, and is not described again here.

Step 5034: Determine the position information of the region image corresponding to a distance less than or equal to the distance threshold, and output the determined position information.

In this embodiment, the execution subject of the method for extracting an image (for example, the server or terminal device shown in FIG. 1) may, based on the distances determined in step 5032 that are less than or equal to the preset distance threshold, determine from the at least one piece of position information obtained in step 5031 the position information corresponding to those distances, and output it. The execution subject may output the position information in various ways; for example, the identification information, coordinate information, and other information of the region images included in the position information may be displayed on a display connected to the execution subject.
In some optional implementations of this embodiment, after outputting the position information, the execution subject may generate, based on the output position information and the image to be matched, a matched image including a position marker. The position marker is used to mark, in the matched image, the position of the to-be-matched region image corresponding to the output position information. Specifically, the execution subject may draw a frame of a preset shape in the image to be matched according to the output position information, use the drawn frame as the position marker, and use the image to be matched that includes the position marker as the matched image.
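The position marker could be rendered, for example, with the Pillow library; this is a minimal sketch of one possible rendering, again assuming rectangular position information, and is not the only marker shape the embodiment allows:

    from PIL import Image, ImageDraw

    def draw_position_markers(image_path, matched_positions, out_path):
        """Draw a frame of a preset shape (here, a rectangle) at each output
        position, yielding the matched image that carries position markers."""
        image = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(image)
        for x1, y1, x2, y2 in matched_positions:
            draw.rectangle([x1, y1, x2, y2], outline=(255, 0, 0), width=3)
        image.save(out_path)
        return image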
As can be seen from FIG. 5, compared with the embodiment corresponding to FIG. 2, the flow 500 of the method for extracting an image in this embodiment highlights the step of outputting position information. Accordingly, the solution described in this embodiment can further determine the position of the target region image included in the image to be matched, improving the pertinence of image recognition.
With further reference to FIG. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for extracting an image. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in FIG. 6, the apparatus 600 for extracting an image in this embodiment includes: an acquisition unit 601 configured to acquire a reference object image and a set of images to be matched; a generation unit 602 configured to input the reference object image into the first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector; and an extraction unit 603 configured to perform, for each image to be matched in the set, the following extraction step: input the image to be matched into the second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to each piece of position information, where the feature vector to be matched is the feature vector of a region image included in the image to be matched, and the position information is used to characterize the position of the region image in the image to be matched; determine the distance between each obtained feature vector to be matched and the reference feature vector; and in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, extract the image to be matched as an image matching the reference object image.
In this embodiment, the acquisition unit 601 may acquire the reference object image and the set of images to be matched remotely or locally through a wired or wireless connection. The reference object image is an image to be compared with other images and represents a certain object. The object may be any of various things, such as a watermark, a logo, a human face, or a physical object. The set of images to be matched may be a pre-stored set of images of a certain type (for example, images containing a trademark).
In this embodiment, the generation unit 602 may input the reference object image into the first sub-network included in the pre-trained image recognition model to obtain the feature vector of the reference object image as the reference feature vector. The first sub-network is used to characterize the correspondence between an image and its feature vector. The image recognition model may be any of various neural network models created based on machine learning techniques, and may have the structure of various neural networks (for example, DenseBox, VGGNet, ResNet, or SegNet). The reference feature vector may be a vector of data extracted by the first sub-network (for example, a network composed of one or more convolutional layers of the neural network model) that characterizes features of the image, such as shape, color, and texture.
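The specification does not fix a network architecture or framework. Purely as an illustrative sketch, the first sub-network could be realized as a small convolutional backbone followed by a projection, written here in PyTorch with all layer choices assumed:

    import torch.nn as nn

    class FirstSubNetwork(nn.Module):
        """Maps an image batch to one feature vector per image; applied to the
        reference object image, the output serves as the reference feature vector."""
        def __init__(self, feature_dim=256):
            super().__init__()
            self.backbone = nn.Sequential(                      # stand-in convolutional stack
                nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),                        # global average pooling to 1x1
            )
            self.projection = nn.Linear(128, feature_dim)

        def forward(self, images):                              # images: (N, 3, H, W)
            pooled = self.backbone(images).flatten(1)           # (N, 128)
            return self.projection(pooled)                      # (N, feature_dim)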
In this embodiment, the extraction unit 603 may perform the following steps on the image to be matched:
First, the image to be matched is input into the second sub-network included in the image recognition model to obtain at least one piece of position information and the feature vector to be matched corresponding to each piece of position information. The second sub-network is used to characterize the correspondence between an image and both the position information of the image and the feature vectors to be matched of the image. The position information is used to characterize the position, in the image to be matched, of the region image corresponding to the feature vector to be matched. The feature vector to be matched is the feature vector of a region image included in the image to be matched.
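Purely as an illustration — the specification leaves the detection mechanism of the second sub-network open — a deliberately simplified PyTorch sketch that pairs each candidate region with both a box and an embedding might look as follows; the region count, grid size, and head structure are all assumptions:

    import torch.nn as nn

    class SecondSubNetwork(nn.Module):
        """Maps an image to paired (position information, feature vectors to be
        matched): one box and one embedding per candidate region image."""
        def __init__(self, num_regions=16, feature_dim=256):
            super().__init__()
            self.num_regions, self.feature_dim = num_regions, feature_dim
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4),                        # coarse 4x4 grid of cells
            )
            shared_dim = 64 * 4 * 4
            self.box_head = nn.Linear(shared_dim, num_regions * 4)            # (x1, y1, x2, y2)
            self.embed_head = nn.Linear(shared_dim, num_regions * feature_dim)

        def forward(self, images):                              # images: (N, 3, H, W)
            shared = self.backbone(images).flatten(1)           # (N, shared_dim)
            boxes = self.box_head(shared).view(-1, self.num_regions, 4)
            vectors = self.embed_head(shared).view(-1, self.num_regions, self.feature_dim)
            return boxes, vectors                               # position info, feature vectors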
Then, the distance between each obtained feature vector to be matched and the reference feature vector is determined. Specifically, the extraction unit 603 may determine the distance between each of the obtained at least one feature vector to be matched and the reference feature vector. The distance may be any of the following: a Euclidean distance, a Mahalanobis distance, or the like.
Finally, in response to determining that a distance less than or equal to the preset distance threshold exists among the determined distances, the image to be matched is extracted as an image matching the reference object image. The distance threshold may be a value set empirically by a technician, or a value calculated by the extraction unit 603 (for example, an average) from historical data (for example, recorded historical distance thresholds).
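Taken together, the distance computation and the threshold test amount to a few lines. A sketch using NumPy and the Euclidean distance (one of the distances named above), with illustrative names:

    import numpy as np

    def match_image(reference_vector, candidate_vectors, positions, distance_threshold):
        """Return (is_match, matched_positions) for one image to be matched: the
        image matches when any region's distance is at or below the threshold."""
        distances = np.linalg.norm(candidate_vectors - reference_vector, axis=1)
        keep = distances <= distance_threshold
        return bool(keep.any()), [pos for pos, k in zip(positions, keep) if k]

An image is extracted as a match as soon as any one of its region images is close enough to the reference feature vector, which is what the any() test expresses.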
In some optional implementations of this embodiment, the extraction unit 603 may include an output module configured to determine the position information of the region image corresponding to the distance less than or equal to the distance threshold, and to output the determined position information.
In some optional implementations of this embodiment, the extraction unit 603 may further include a generation module configured to generate, based on the output position information and the image to be matched, a matched image including a position marker, where the position marker is used to mark, in the matched image, the position of the to-be-matched region image corresponding to the output position information.
In some optional implementations of this embodiment, the second sub-network may include a dimension transformation layer for transforming feature vectors to a target dimension; and the extraction unit 603 may be further configured to input the image to be matched into the second sub-network included in the image recognition model to obtain at least one feature vector to be matched having the same dimension as the reference feature vector.
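For illustration, such a dimension transformation layer could be as simple as a fully connected projection onto the reference vector's dimension; a minimal PyTorch sketch, with the layer form assumed:

    import torch.nn as nn

    class DimensionTransform(nn.Module):
        """Projects feature vectors to be matched onto the target dimension so
        they are directly comparable with the reference feature vector."""
        def __init__(self, in_dim, target_dim):
            super().__init__()
            self.project = nn.Linear(in_dim, target_dim)

        def forward(self, vectors):          # vectors: (..., in_dim)
            return self.project(vectors)     # (..., target_dim)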
In some optional implementations of this embodiment, the image recognition model is obtained by training through the following steps: acquiring a training sample set, where a training sample includes a sample object image, a sample matching image, and annotated position information of the sample matching image, the annotated position information characterizing the position of a region image included in the sample matching image; and selecting a training sample from the training sample set and performing the following training step: inputting the sample object image included in the selected training sample into the first sub-network included in an initial model to obtain a first feature vector, and inputting the sample matching image into the second sub-network included in the initial model to obtain at least one piece of position information and second feature vectors corresponding to the position information; determining, from the obtained at least one piece of position information, the position information characterizing the target region image in the sample matching image as target position information, and determining the second feature vector corresponding to the target position information as a target second feature vector; determining, based on a first loss value characterizing the error of the target position information and a second loss value characterizing the gap in the distance between the target second feature vector and the first feature vector, whether training of the initial model is complete; and in response to determining that training is complete, determining the initial model as the image recognition model.
In some optional implementations of this embodiment, the execution subject that trains the image recognition model may take, according to preset weight values, the weighted sum of the first loss value and the second loss value as a total loss value, compare the total loss value with a target value, and determine, according to the comparison result, whether training of the initial model is complete.
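A sketch of this loss combination, with illustrative concrete choices that the specification does not mandate (smooth L1 for the position error, the mean embedding distance for the second loss):

    import torch.nn.functional as F

    def total_loss(predicted_boxes, annotated_boxes, first_vector, target_second_vector,
                   w1=1.0, w2=1.0):
        """Weighted sum of the position-error loss and the feature-distance loss;
        training is deemed complete when this falls to or below the target value."""
        first_loss = F.smooth_l1_loss(predicted_boxes, annotated_boxes)           # position error
        second_loss = F.pairwise_distance(first_vector, target_second_vector).mean()
        return w1 * first_loss + w2 * second_loss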
In some optional implementations of this embodiment, the step of training the image recognition model may further include: in response to determining that training of the initial model is not complete, adjusting the parameters of the initial model, selecting a training sample from the training samples in the training sample set that have not been selected, and continuing the training step with the parameter-adjusted initial model as the initial model.
The apparatus provided by the above embodiment of the present application obtains the reference feature vector of the reference image and at least one feature vector to be matched of an image to be matched by using a pre-trained image recognition model, and then obtains images matching the reference image by comparing the distances between the reference feature vector and the feature vectors to be matched. This improves the pertinence of matching against the reference image, and makes it possible to use the image recognition model to extract images matching the reference image even when the training samples used to train the model do not include the reference image, thereby improving the flexibility of image recognition and enriching the means of image recognition.
Reference is now made to FIG. 7, which shows a schematic structural diagram of a computer system 700 suitable for implementing an electronic device (for example, the server or terminal device shown in FIG. 1) of the embodiments of the present application. The electronic device shown in FIG. 7 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage portion 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the system 700. The CPU 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output portion 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage portion 708 including a hard disk and the like; and a communication portion 709 including a network interface card such as a LAN card or a modem. The communication portion 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage portion 708 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above-mentioned functions defined in the method of the present application are executed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquisition unit, a generation unit, and an extraction unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit for acquiring a reference object image and a set of images to be matched".
As another aspect, the present application further provides a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire a reference object image and a set of images to be matched; input the reference object image into the first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector; and, for an image to be matched in the set of images to be matched, perform the following extraction step: input the image to be matched into the second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to each piece of position information, where the feature vector to be matched is the feature vector of a region image included in the image to be matched, and the position information is used to characterize the position of the region image in the image to be matched; determine the distance between each obtained feature vector to be matched and the reference feature vector; and in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, extract the image to be matched as an image matching the reference object image.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (16)

  1. A method for extracting an image, comprising:
    acquiring a reference object image and a set of images to be matched;
    inputting the reference object image into a first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector; and
    for an image to be matched in the set of images to be matched, performing the following extraction step: inputting the image to be matched into a second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to each piece of position information, wherein the feature vector to be matched is a feature vector of a region image included in the image to be matched, and the position information is used to characterize a position of the region image in the image to be matched; determining a distance between each obtained feature vector to be matched and the reference feature vector; and in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, extracting the image to be matched as an image matching the reference object image.
  2. The method according to claim 1, wherein the extraction step further comprises:
    determining position information of a region image corresponding to a distance less than or equal to the distance threshold, and outputting the determined position information.
  3. The method according to claim 2, wherein the extraction step further comprises:
    generating, based on the output position information and the image to be matched, a matched image including a position marker, wherein the position marker is used to mark a position, in the matched image, of the to-be-matched region image corresponding to the output position information.
  4. The method according to claim 1, wherein the second sub-network includes a dimension transformation layer for transforming a feature vector to a target dimension; and
    the inputting the image to be matched into the second sub-network included in the image recognition model to obtain at least one feature vector to be matched comprises:
    inputting the image to be matched into the second sub-network included in the image recognition model to obtain at least one feature vector to be matched having the same dimension as the reference feature vector.
  5. The method according to any one of claims 1-4, wherein the image recognition model is trained through the following steps:
    acquiring a training sample set, wherein a training sample includes a sample object image, a sample matching image, and annotated position information of the sample matching image, the annotated position information characterizing a position of a region image included in the sample matching image; and
    selecting a training sample from the training sample set, and performing the following training step: inputting the sample object image included in the selected training sample into a first sub-network included in an initial model to obtain a first feature vector, and inputting the sample matching image into a second sub-network included in the initial model to obtain at least one piece of position information and second feature vectors corresponding to the position information; determining, from the obtained at least one piece of position information, position information characterizing a target region image in the sample matching image as target position information, and determining the second feature vector corresponding to the target position information as a target second feature vector; determining, based on a first loss value characterizing an error of the target position information and a second loss value characterizing a gap in the distance between the target second feature vector and the first feature vector, whether training of the initial model is complete; and in response to determining that training is complete, determining the initial model as the image recognition model.
  6. The method according to claim 5, wherein the determining, based on a first loss value characterizing an error of the target position information and a second loss value characterizing a gap in the distance between the target second feature vector and the first feature vector, whether training of the initial model is complete comprises:
    taking, according to preset weight values, a weighted sum of the first loss value and the second loss value as a total loss value, comparing the total loss value with a target value, and determining, according to the comparison result, whether training of the initial model is complete.
  7. The method according to claim 5, wherein the step of training the image recognition model further comprises:
    in response to determining that training of the initial model is not complete, adjusting parameters of the initial model, selecting a training sample from training samples in the training sample set that have not been selected, and continuing the training step with the parameter-adjusted initial model as the initial model.
  8. An apparatus for extracting an image, comprising:
    an acquisition unit configured to acquire a reference object image and a set of images to be matched;
    a generation unit configured to input the reference object image into a first sub-network included in a pre-trained image recognition model to obtain a feature vector of the reference object image as a reference feature vector; and
    an extraction unit configured to perform, for an image to be matched in the set of images to be matched, the following extraction step: inputting the image to be matched into a second sub-network included in the image recognition model to obtain at least one piece of position information and a feature vector to be matched corresponding to each piece of position information, wherein the feature vector to be matched is a feature vector of a region image included in the image to be matched, and the position information is used to characterize a position of the region image in the image to be matched; determining a distance between each obtained feature vector to be matched and the reference feature vector; and in response to determining that a distance less than or equal to a preset distance threshold exists among the determined distances, extracting the image to be matched as an image matching the reference object image.
  9. The apparatus according to claim 8, wherein the extraction unit comprises:
    an output module configured to determine position information of a region image corresponding to a distance less than or equal to the distance threshold, and to output the determined position information.
  10. The apparatus according to claim 9, wherein the extraction unit further comprises:
    a generation module configured to generate, based on the output position information and the image to be matched, a matched image including a position marker, wherein the position marker is used to mark a position, in the matched image, of the to-be-matched region image corresponding to the output position information.
  11. The apparatus according to claim 8, wherein the second sub-network includes a dimension transformation layer for transforming a feature vector to a target dimension; and
    the extraction unit is further configured to:
    input the image to be matched into the second sub-network included in the image recognition model to obtain at least one feature vector to be matched having the same dimension as the reference feature vector.
  12. The apparatus according to any one of claims 8-11, wherein the image recognition model is trained through the following steps:
    acquiring a training sample set, wherein a training sample includes a sample object image, a sample matching image, and annotated position information of the sample matching image, the annotated position information characterizing a position of a region image included in the sample matching image; and
    selecting a training sample from the training sample set, and performing the following training step: inputting the sample object image included in the selected training sample into a first sub-network included in an initial model to obtain a first feature vector, and inputting the sample matching image into a second sub-network included in the initial model to obtain at least one piece of position information and second feature vectors corresponding to the position information; determining, from the obtained at least one piece of position information, position information characterizing a target region image in the sample matching image as target position information, and determining the second feature vector corresponding to the target position information as a target second feature vector; determining, based on a first loss value characterizing an error of the target position information and a second loss value characterizing a gap in the distance between the target second feature vector and the first feature vector, whether training of the initial model is complete; and in response to determining that training is complete, determining the initial model as the image recognition model.
  13. The apparatus according to claim 12, wherein the determining, based on a first loss value characterizing an error of the target position information and a second loss value characterizing a gap in the distance between the target second feature vector and the first feature vector, whether training of the initial model is complete comprises:
    taking, according to preset weight values, a weighted sum of the first loss value and the second loss value as a total loss value, comparing the total loss value with a target value, and determining, according to the comparison result, whether training of the initial model is complete.
  14. The apparatus according to claim 12, wherein the step of training the image recognition model further comprises:
    in response to determining that training of the initial model is not complete, adjusting parameters of the initial model, selecting a training sample from training samples in the training sample set that have not been selected, and continuing the training step with the parameter-adjusted initial model as the initial model.
  15. An electronic device, comprising:
    one or more processors; and
    a storage device on which one or more programs are stored,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-7.
  16. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
PCT/CN2018/116334 2018-07-03 2018-11-20 Image extraction method and device WO2020006961A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810715195.6 2018-07-03
CN201810715195.6A CN108898186B (en) 2018-07-03 2018-07-03 Method and device for extracting image

Publications (1)

Publication Number Publication Date
WO2020006961A1 true WO2020006961A1 (en) 2020-01-09

Family

ID=64347534

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116334 WO2020006961A1 (en) 2018-07-03 2018-11-20 Image extraction method and device

Country Status (2)

Country Link
CN (1) CN108898186B (en)
WO (1) WO2020006961A1 (en)


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109939439B (en) * 2019-03-01 2022-04-05 腾讯科技(深圳)有限公司 Virtual character blocking detection method, model training method, device and equipment
CN111723926B (en) * 2019-03-22 2023-09-12 北京地平线机器人技术研发有限公司 Training method and training device for neural network model for determining image parallax
CN110021052B (en) * 2019-04-11 2023-05-30 北京百度网讯科技有限公司 Method and apparatus for generating fundus image generation model
CN112036421A (en) * 2019-05-16 2020-12-04 搜狗(杭州)智能科技有限公司 Image processing method and device and electronic equipment
CN110660103B (en) * 2019-09-17 2020-12-25 北京三快在线科技有限公司 Unmanned vehicle positioning method and device
CN110969183B (en) * 2019-09-20 2023-11-21 北京方位捷讯科技有限公司 Method and system for determining damage degree of target object according to image data
CN110766081B (en) * 2019-10-24 2022-09-13 腾讯科技(深圳)有限公司 Interface image detection method, model training method and related device
CN110825904B (en) * 2019-10-24 2022-05-06 腾讯科技(深圳)有限公司 Image matching method and device, electronic equipment and storage medium
CN111353526A (en) * 2020-02-19 2020-06-30 上海小萌科技有限公司 Image matching method and device and related equipment
CN111597993B (en) * 2020-05-15 2023-09-05 北京百度网讯科技有限公司 Data processing method and device
CN111797790B (en) * 2020-07-10 2021-11-05 北京字节跳动网络技术有限公司 Image processing method and apparatus, storage medium, and electronic device
CN113590857A (en) * 2021-08-10 2021-11-02 北京有竹居网络技术有限公司 Key value matching method and device, readable medium and electronic equipment


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106780612B (en) * 2016-12-29 2019-09-17 浙江大华技术股份有限公司 Object detecting method and device in a kind of image
CN106951484B (en) * 2017-03-10 2020-10-30 百度在线网络技术(北京)有限公司 Picture retrieval method and device, computer equipment and computer readable medium
CN107944395B (en) * 2017-11-27 2020-08-18 浙江大学 Method and system for verifying and authenticating integration based on neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140105509A1 (en) * 2012-10-15 2014-04-17 Canon Kabushiki Kaisha Systems and methods for comparing images
CN104376326A (en) * 2014-11-02 2015-02-25 吉林大学 Feature extraction method for image scene recognition
CN107239535A (en) * 2017-05-31 2017-10-10 北京小米移动软件有限公司 Similar pictures search method and device
CN107679466A (en) * 2017-09-21 2018-02-09 百度在线网络技术(北京)有限公司 Information output method and device
CN108038880A (en) * 2017-12-20 2018-05-15 百度在线网络技术(北京)有限公司 Method and apparatus for handling image
CN108154196A (en) * 2018-01-19 2018-06-12 百度在线网络技术(北京)有限公司 For exporting the method and apparatus of image

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507289A (en) * 2020-04-22 2020-08-07 上海眼控科技股份有限公司 Video matching method, computer device and storage medium
CN111783872A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Method and device for training model, electronic equipment and computer readable storage medium
CN111783872B (en) * 2020-06-30 2024-02-02 北京百度网讯科技有限公司 Method, device, electronic equipment and computer readable storage medium for training model
CN111984814A (en) * 2020-08-10 2020-11-24 广联达科技股份有限公司 Stirrup matching method and device in construction drawing
CN111984814B (en) * 2020-08-10 2024-04-12 广联达科技股份有限公司 Stirrup matching method and device in building drawing
CN112183627A (en) * 2020-09-28 2021-01-05 中星技术股份有限公司 Method for generating predicted density map network and vehicle annual inspection mark number detection method
CN112488943B (en) * 2020-12-02 2024-02-02 北京字跳网络技术有限公司 Model training and image defogging method, device and equipment
CN112488943A (en) * 2020-12-02 2021-03-12 北京字跳网络技术有限公司 Model training and image defogging method, device and equipment
CN112560958A (en) * 2020-12-17 2021-03-26 北京赢识科技有限公司 Person reception method and device based on portrait recognition and electronic equipment
CN112613386B (en) * 2020-12-18 2023-12-19 宁波大学科学技术学院 Brain wave-based monitoring method and device
CN112613386A (en) * 2020-12-18 2021-04-06 宁波大学科学技术学院 Brain wave-based monitoring method and device
CN112950563A (en) * 2021-02-22 2021-06-11 深圳中科飞测科技股份有限公司 Detection method and device, detection equipment and storage medium
CN113095129A (en) * 2021-03-01 2021-07-09 北京迈格威科技有限公司 Attitude estimation model training method, attitude estimation device and electronic equipment
CN113095129B (en) * 2021-03-01 2024-04-26 北京迈格威科技有限公司 Gesture estimation model training method, gesture estimation device and electronic equipment
CN113033557A (en) * 2021-04-16 2021-06-25 北京百度网讯科技有限公司 Method and device for training image processing model and detecting image
CN113537309A (en) * 2021-06-30 2021-10-22 北京百度网讯科技有限公司 Object identification method and device and electronic equipment
CN113537309B (en) * 2021-06-30 2023-07-28 北京百度网讯科技有限公司 Object identification method and device and electronic equipment
CN113657406A (en) * 2021-07-13 2021-11-16 北京旷视科技有限公司 Model training and feature extraction method and device, electronic equipment and storage medium
CN113657406B (en) * 2021-07-13 2024-04-23 北京旷视科技有限公司 Model training and feature extraction method and device, electronic equipment and storage medium
WO2023213233A1 (en) * 2022-05-06 2023-11-09 墨奇科技(北京)有限公司 Task processing method, neural network training method, apparatus, device, and medium

Also Published As

Publication number Publication date
CN108898186B (en) 2020-03-06
CN108898186A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
WO2020006961A1 (en) Image extraction method and device
CN108509915B (en) Method and device for generating face recognition model
US10902245B2 (en) Method and apparatus for facial recognition
CN109214343B (en) Method and device for generating face key point detection model
US10853623B2 (en) Method and apparatus for generating information
CN108898086B (en) Video image processing method and device, computer readable medium and electronic equipment
CN108229296B (en) Face skin attribute identification method and device, electronic equipment and storage medium
CN108197618B (en) Method and device for generating human face detection model
US20190080148A1 (en) Method and apparatus for generating image
WO2020000879A1 (en) Image recognition method and apparatus
WO2021190115A1 (en) Method and apparatus for searching for target
CN107679466B (en) Information output method and device
CN109101919B (en) Method and apparatus for generating information
WO2019242222A1 (en) Method and device for use in generating information
WO2020024484A1 (en) Method and device for outputting data
CN109993150B (en) Method and device for identifying age
WO2020019591A1 (en) Method and device used for generating information
CN107507153B (en) Image denoising method and device
WO2020062493A1 (en) Image processing method and apparatus
CN111275784B (en) Method and device for generating image
CN108388889B (en) Method and device for analyzing face image
WO2021083069A1 (en) Method and device for training face swapping model
CN108229375B (en) Method and device for detecting face image
CN108509921B (en) Method and apparatus for generating information
CN110490959B (en) Three-dimensional image processing method and device, virtual image generating method and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18925299

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/04/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18925299

Country of ref document: EP

Kind code of ref document: A1