Disclosure of Invention
The embodiment of the invention provides an image data processing method and device, which can reduce the labor cost consumed in the process of labeling massive image data.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a method, including:
obtaining image data according to the obtained keywords; clustering the image data to obtain a clustering center corresponding to the keyword; and screening out the image data matched with the clustering center, labeling the image data matched with the clustering center through the keyword, and importing the labeled image data into a sample library.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the obtaining image data according to the obtained keyword includes: acquiring user operation data, and acquiring text information and image data corresponding to the text information from the user operation data; and extracting keywords from the text information, and using image data corresponding to the text information as image data corresponding to the extracted keywords.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, the acquiring user operation data, and acquiring text information and image data corresponding to the text information from the user operation data includes: acquiring a retrieval record of a user in a specified time period; extracting a search word from the search record to serve as the keyword, and acquiring search item information with click operation; and acquiring image data corresponding to the retrieval item information as image data corresponding to the extracted keyword.
With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, the acquiring user operation data, and acquiring text information and image data corresponding to the text information from the user operation data includes: acquiring uploaded data of a user in a specified time period, and determining attribute information of an interface for displaying the uploaded data; and extracting the keywords from the attribute information, and extracting image data from the uploaded data as image data corresponding to the extracted keywords.
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, before the clustering the image data, the method further includes:
acquiring the characteristic vectors of the histograms of the image data corresponding to the extracted keywords, and calculating to obtain the distance between the characteristic vectors of the histograms; and screening and reserving a piece of image data according to the histogram with the same characteristic vector and the distance between the characteristic vectors within the specified range.
And/or acquiring a specified number of image data from a noise sample library as negative samples, and acquiring the specified number of image data from each image data corresponding to the extracted keyword as positive samples; training a linear SVM classifier by using the feature vectors of the negative sample and the positive sample; and acquiring the confidence coefficient of each image data corresponding to the extracted keywords by the linear SVM classifier according to the feature vector of each image data corresponding to the extracted keywords, and discarding the image data with the confidence coefficient greater than 0.75.
With reference to the first aspect or the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, the clustering the image data to obtain a cluster center corresponding to the keyword includes: extracting a feature vector $x_i$ of each piece of image data, where $x_i$ denotes the feature vector of one piece of image data, and obtaining a set $\rho$ of local densities between the feature vectors and a clustering parameter $\delta$, wherein: the feature distance $d_{ij} = \|x_i - x_j\|_2$ denotes the distance between image data $i$ and image data $j$; and, for the clustering parameter $\delta$, $\{q_i\}$ denotes the index sequence that sorts the local densities $\{\rho_i\}$ in descending order; and obtaining a decision graph of each clustering center according to the set $\rho$ of local densities among the feature vectors and the clustering parameter $\delta$, and screening out the clustering centers corresponding to the keywords by using the decision graph.
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner, the screening out image data that matches the clustering center includes: screening out image data belonging to the same clustering center according to a set rho of local densities among the feature vectors; and for the image data belonging to the same clustering center, sorting the image data according to the sequence from small to large characteristic distances, and acquiring image data with a specified proportion quantity according to the sorting sequence as the image data matched with the clustering center.
With reference to the fifth possible implementation manner of the first aspect, in a seventh possible implementation manner, the method further includes: acquiring a verification image from the image data according to the user operation data; and obtaining the distance between the feature vector of the verification image and the screened clustering center corresponding to the keyword, and removing the clustering center with the distance larger than the maximum threshold value.
In a second aspect, an embodiment of the present invention provides an apparatus, including: an image extraction module, used for obtaining image data according to the acquired keywords; an image clustering module, used for clustering the image data and obtaining a clustering center corresponding to the keyword; an image screening module, used for screening out the image data matched with the clustering center and labeling the image data matched with the clustering center with the keyword; and a sample library management module, used for importing the labeled image data into a sample library.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the image extraction module is specifically configured to obtain user operation data, and obtain text information and image data corresponding to the text information from the user operation data; extracting keywords from the text information, and taking image data corresponding to the text information as image data corresponding to the extracted keywords; the acquiring user operation data and acquiring text information and image data corresponding to the text information from the user operation data includes: acquiring a retrieval record of a user in a specified time period by the image extraction module; extracting a search word from the search record to serve as the keyword, and acquiring search item information with click operation; acquiring image data corresponding to the retrieval item information as image data corresponding to the extracted keyword; or, the acquiring the user operation data and acquiring the text information and the image data corresponding to the text information from the user operation data includes: the image extraction module acquires uploaded data of a user in a specified time period and determines attribute information of an interface for displaying the uploaded data; and extracting the keywords from the attribute information, and extracting image data from the uploaded data as image data corresponding to the extracted keywords.
With reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner, the method further includes: the repeated image removing module is used for acquiring the characteristic vectors of the histograms of the image data corresponding to the extracted keywords before clustering the image data, and calculating the distance between the characteristic vectors of the histograms; and screening and reserving a piece of image data according to the histogram with the same characteristic vector and the distance between the characteristic vectors within the specified range. And/or the noise image removing module is used for acquiring the image data of the specified quantity from the noise sample library as negative samples and acquiring the image data of the specified quantity from the image data corresponding to the extracted keywords as positive samples; training a linear SVM classifier by using the feature vectors of the negative sample and the positive sample; and then, acquiring the confidence coefficient of each image data corresponding to the extracted keywords by the linear SVM classifier according to the feature vector of each image data corresponding to the extracted keywords, and discarding the image data with the confidence coefficient greater than 0.75.
The image data processing method and device provided by the embodiments of the invention extract image feature vectors based on deep learning, cluster the image data by fast search of density peaks, identify from the clustering result the clustering centers whose picture content is closest to the keyword, obtain the image data matched with the keyword according to the clustering of the feature vectors, and label that image data with the matched keyword, thereby completing the image data processing of the images. Compared with the prior-art scheme of manually labeling the sample images in a sample library, the scheme of the invention reduces the labor cost of labeling massive image data, so that effective training data can be provided quickly and efficiently for an image recognition engine or image processing application, and the image data in the training samples can be expanded more easily.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The method flow in this embodiment may be specifically executed on a data processing system as shown in fig. 1, which includes a data processing platform and a database. The data processing platform disclosed in this embodiment may be specifically a server, a workstation, a super computer, or a server cluster system for data processing, which is composed of a plurality of servers. Algorithmic models and programs for clustering may be stored on the data processing platform. And after image data processing is completed on the image data matched with the clustering center, the image data subjected to labeling is stored in a database. The database may be a device cluster including a plurality of server devices and hardware devices such as a storage device, and is configured to store massive image data and keywords corresponding to the image data. It should be noted that the data processing platform may also capture image data and keywords corresponding to the image data from the internet by running data extraction tools such as a web crawler and an image capture program.
In this embodiment, on the basis of the data processing system shown in fig. 1, a device such as a search server and/or a user equipment may also be included, which is specifically shown in fig. 2. The search server may specifically be a server, a workstation, a super computer, or a server cluster system for data processing and composed of a plurality of servers, and is configured to receive a search request sent by the user equipment and return a search result to the user equipment, where user operation data or history data of a search behavior performed by the user equipment through the search server may be stored in a database or may also be stored in a storage device of the search server.
The user equipment may be implemented as a single Device, or integrated into various media data playing devices, such as a set-top box, a Mobile phone, a Tablet Personal Computer (Tablet Personal Computer), a Laptop Computer (Laptop Computer), a multimedia player, a digital camera, a Personal Digital Assistant (PDA), a navigation Device, a Mobile Internet Device (MID), or a Wearable Device (Wearable Device).
An embodiment of the present invention provides an image data processing method, as shown in fig. 3, including:
S1, obtaining image data according to the acquired keywords.
The data processing platform can extract the corresponding keywords and obtain image data according to the acquired keywords. The image data may be captured from the internet and stored in a database, for example by capturing a commodity picture from the internet through a web crawler and recording the name information corresponding to the commodity picture as a keyword. Alternatively, the data processing platform may extract the image data directly from a business system such as an online shopping system, for example by acquiring the commodity pictures displayed in a shop from a server that stores shop data in the online shopping system, and recording the commodity name corresponding to each commodity picture as a keyword. The image data collected by the data processing platform may be imported into the sample library and used as the massive image data to be labeled in this embodiment, and a keyword tag set of the collected image data is established in the data processing platform or the database. Alternatively, the data processing platform first establishes a keyword tag set of the image data, then uses the keywords to obtain search results including image data from portal websites and e-commerce websites through a search engine, captures the search results with a web crawler, and constructs an image data set corresponding to the same keyword tag set.
S2, clustering the image data and obtaining a clustering center corresponding to the keyword.
In this embodiment, the data processing platform may extract the feature vector $x_i$ of each piece of image data. Specifically, the data processing platform may use the trained deep learning network to extract the feature vector of each piece of image data from the filtered image data in the sample library, where $x_i$ denotes the feature vector of one piece of image data. The feature distance $d_{ij} = \|x_i - x_j\|_2$ denotes the distance between image data $i$ and image data $j$. For the clustering parameter $\delta$, $\{q_i\}$ denotes the index sequence that sorts the local densities $\{\rho_i\}$ in descending order.
A set ρ of local densities among the feature vectors and a clustering parameter δ are acquired, and a decision graph of the clustering centers is obtained from the set ρ and the clustering parameter δ. For example: a sample clustering center decision graph is obtained from the clustering parameter δ and the local density parameter ρ, and a plurality of clustering centers are screened from the clustering result according to a certain rule and the distribution of the decision graph. The first 20 images of the search results based on the same keyword tag set are then taken as verification images, and the distances between the feature vectors of the verification images and the obtained clustering centers are evaluated, so that false clustering centers with no verification image nearby are removed.
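The computation of ρ and δ and the reading of the decision graph can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the Gaussian kernel, the cutoff distance `d_c`, the convention used for δ of the densest point, the ranking of candidates by the product ρ·δ, and all function names are hypothetical choices made for the example.

```python
import math


def density_peaks(vectors, d_c):
    """Return (rho, delta) for a list of feature vectors.

    Assumption: rho uses a Gaussian kernel with cutoff distance d_c.
    """
    n = len(vectors)
    # Pairwise feature distances d_ij = ||x_i - x_j||_2
    d = [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    rho = [
        sum(math.exp(-((d[i][j] / d_c) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    # q: indices that sort rho in descending order
    q = sorted(range(n), key=lambda i: rho[i], reverse=True)
    delta = [0.0] * n
    for rank, i in enumerate(q):
        if rank == 0:
            # Assumed convention: the densest point gets its largest distance
            delta[i] = max(d[i])
        else:
            # Distance to the nearest point of higher density
            delta[i] = min(d[i][j] for j in q[:rank])
    return rho, delta


def pick_centers(rho, delta, k):
    """Pick the k points with the largest rho * delta (assumed selection rule)."""
    gamma = [r * dl for r, dl in zip(rho, delta)]
    return sorted(range(len(rho)), key=lambda i: gamma[i], reverse=True)[:k]
```

Points that combine a high local density with a large δ stand out in the upper right of the decision graph; ranking by ρ·δ is one simple way to read them off automatically.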
The clustering centers corresponding to the keywords are screened out by using the decision graph. For example: after the false clustering centers have been filtered out, the non-center images are assigned to the remaining clustering centers by traversing them in descending order of ρ value. The samples belonging to each clustering center are then sorted by feature distance, and the 30% of images closest to the clustering center are taken as the images that finally conform to the content of the keyword and are labeled with the keyword, completing the image data processing of the images.
In this embodiment, a verification image may be acquired from the image data according to the user operation data. And obtaining the distance between the feature vector of the verification image and the screened clustering center corresponding to the keyword, and removing the clustering center with the distance larger than the maximum threshold value.
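The removal of false clustering centers can be illustrated with a short sketch. It assumes that a center is kept when at least one verification image lies within the maximum threshold; the text does not fix whether the nearest, the mean, or another distance is used, so the nearest-distance rule and the name `filter_centers` are assumptions.

```python
import math


def filter_centers(center_vectors, verification_vectors, max_dist):
    """Keep centers that have a verification image within max_dist.

    Assumption: the distance of a center to the verification set is the
    distance to its nearest verification feature vector.
    """
    kept = []
    for center in center_vectors:
        nearest = min(math.dist(center, v) for v in verification_vectors)
        if nearest <= max_dist:
            kept.append(center)
    return kept
```

For instance, with centers at `(0, 0)` and `(10, 10)` and verification vectors close to the origin, only the first center survives a threshold of `1.0`.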
S3, screening out the image data matched with the clustering center, labeling the image data matched with the clustering center through the keyword, and importing the labeled image data into a sample library.
The sample library into which the image data is imported may be used for model training of an image recognition engine or image processing application, for algorithm optimization performed by a business system, or for other applications that require sample-based training. Because the sample library serves a training role in these applications, it may be referred to as a training sample library in this embodiment. Effective training data can thus be provided quickly and efficiently for an image recognition engine or image processing application, and the image data in the training samples can be updated and expanded quickly. For example: the data processing platform screens out the image data belonging to the same clustering center according to the set ρ of local densities among the feature vectors. The image data belonging to the same clustering center is then sorted in ascending order of feature distance, and a specified proportion of the image data is taken in that order as the image data matched with the clustering center.
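The assignment and selection step can be sketched as below. The sketch assumes the usual density-peak assignment rule (each non-center image joins the cluster of its nearest neighbor of higher density) and rounds the specified proportion up to a whole number of images; both choices, and the function names, are assumptions made for illustration.

```python
import math


def assign_to_centers(vectors, rho, center_ids):
    """Map every image index to the index of its clustering center."""
    order = sorted(range(len(vectors)), key=lambda i: rho[i], reverse=True)
    label = {c: c for c in center_ids}
    for pos, i in enumerate(order):
        if i in label:
            continue
        # Points of higher density are already labeled; fall back to the
        # centers themselves if the densest point is not a center.
        candidates = order[:pos] or list(center_ids)
        nearest = min(candidates, key=lambda j: math.dist(vectors[i], vectors[j]))
        label[i] = label[nearest]
    return label


def select_matching(vectors, rho, center_ids, proportion=0.3):
    """Per center, keep the given proportion of members closest to it."""
    label = assign_to_centers(vectors, rho, center_ids)
    selected = {}
    for c in center_ids:
        members = [i for i in range(len(vectors)) if label[i] == c]
        # Ascending feature distance to the clustering center
        members.sort(key=lambda i: math.dist(vectors[i], vectors[c]))
        keep = math.ceil(len(members) * proportion)
        selected[c] = members[:keep]
    return selected
```

The images returned for each center are the ones that would be labeled with the keyword and imported into the sample library; the default `proportion=0.3` mirrors the 30% figure used in the example of this embodiment.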
The image data processing method provided by the embodiment of the invention extracts image feature vectors based on deep learning, clusters the image data by fast search of density peaks, identifies from the clustering result the clustering centers whose picture content is closest to the keyword, obtains the image data matched with the keyword according to the clustering of the feature vectors, and labels that image data with the matched keyword, thereby completing the image data processing of the images. The labor cost of labeling massive image data is reduced, so that effective training data can be provided quickly and efficiently for an image recognition engine or image processing application, and the image data in the training samples can be expanded more easily.
In this embodiment, the specific manner of obtaining the image data according to the obtained keyword includes:
acquiring user operation data, and acquiring text information and image data corresponding to the text information from the user operation data; and extracting keywords from the text information, and using the image data corresponding to the text information as the image data corresponding to the extracted keywords.
The specific mode of acquiring user operation data and acquiring text information and image data corresponding to the text information from the user operation data includes:
acquiring the retrieval records of the user in a specified time period; extracting a search term from the retrieval record as the keyword, and acquiring the retrieval item information on which a click operation was performed; and then acquiring the image data corresponding to the retrieval item information as the image data corresponding to the extracted keyword. For example: the image data in the database and the keywords corresponding to the image data may be contained in the search records or operation history records produced when a user searches through a search engine. Both kinds of record may include user operation data, which may specifically include: the characters input when the user performs a search operation (for example, when the user searches by keyword, the keyword is input on the user equipment and reported to the search server); uploaded local picture information (for example, when the user searches by picture, the picture data is input on the user equipment and reported to the search server); and the search results, which specifically include the text information returned to the user equipment during the search and the image data corresponding to that text information.
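As an illustration of this retrieval-record path, the sketch below groups clicked result images under the search term that produced them. The record layout (the `time`, `query`, `results`, `image` and `clicked` fields) and the function name are hypothetical; the embodiment does not prescribe a storage format.

```python
from collections import defaultdict
from datetime import datetime


def images_by_keyword(search_records, start, end):
    """Group clicked result images by search term within [start, end]."""
    grouped = defaultdict(list)
    for record in search_records:
        # Only records inside the specified time period are considered
        if not (start <= record["time"] <= end):
            continue
        for item in record["results"]:
            # Only retrieval items the user actually clicked are kept
            if item["clicked"]:
                grouped[record["query"]].append(item["image"])
    return dict(grouped)
```

Restricting the pairs to clicked items uses the click as a weak signal that the image is relevant to the search term, which is why the later filtering and clustering steps are still applied.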
Or: acquiring the uploaded data of a user in a specified time period, and determining the attribute information of an interface for displaying the uploaded data. And extracting the keywords from the attribute information, and extracting image data from the uploaded data as image data corresponding to the extracted keywords.
Further, the data processing platform may filter the collected image data to remove duplicate images and remove noisy image data, such as: for the image data imported into the sample library, the data processing platform can be used for collecting histogram feature vectors of the image data, calculating the distance between the histogram feature vectors under the same keyword, and then only keeping one of the image data with the same and similar feature vectors. In particular, the manner of filtering and removing duplicate images may be understood as: only one same image is reserved due to repeated collection; the way of filtering and removing the noisy image can be understood as: extracting bottom layer features of the image to construct a feature vector of the collected image data, inputting the feature vector into a pre-trained noise image classifier, and filtering out noise images irrelevant to keywords in the collected image. Therefore, in this embodiment, before performing the clustering process on the image data, the method further includes:
the way of filtering and removing duplicate images: and acquiring the characteristic vectors of the histograms of the image data corresponding to the extracted keywords, and calculating to obtain the distance between the characteristic vectors of the histograms. And screening and reserving a piece of image data according to the histogram with the same characteristic vector and the distance between the characteristic vectors within the specified range.
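A minimal sketch of this duplicate filter follows. It assumes each image is available as a non-empty flat list of 8-bit grey values, uses a normalized 8-bin histogram as the feature vector, and treats images whose histogram distance is at most `max_dist` as duplicates; the bin count, the threshold and the function names are illustrative assumptions.

```python
import math


def histogram(pixels, bins=8):
    """Normalized grey-level histogram of a flat list of 0..255 values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]


def deduplicate(images, max_dist=0.05):
    """Keep one image per group of identical or near-identical histograms.

    images: list of (name, pixels) pairs collected under the same keyword.
    """
    kept = []  # (name, histogram) of the images retained so far
    for name, pixels in images:
        h = histogram(pixels)
        # Retain the image only if no kept image is within the threshold
        if all(math.dist(h, kept_h) > max_dist for _, kept_h in kept):
            kept.append((name, h))
    return [name for name, _ in kept]
```

The first image of each near-duplicate group is the one retained, so the result depends on the input order.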
Manner of filtering and removing noisy images: a specified number of image data is acquired from the noise sample library as negative samples, and the same specified number of image data is acquired from the image data corresponding to the extracted keyword as positive samples. A linear SVM (Support Vector Machine) classifier is trained using the feature vectors (such as HOG (Histogram of Oriented Gradient) features) of the negative samples and the positive samples. The linear SVM classifier then produces a confidence for each piece of image data corresponding to the extracted keyword from its feature vector, and the image data with a confidence greater than 0.75 is discarded. For example: the data processing platform constructs a noise sample library containing 5000 or more noise pictures. The first 60 images obtained through a search engine with the same keyword are used as positive samples, 60 images randomly selected from the noise sample library are used as negative samples, and their feature vectors are extracted to train a linear SVM classifier. The feature vectors of the candidate images are then input into the trained linear SVM classifier, and any image whose feature vector is judged to be a negative sample with a confidence greater than 0.75 is discarded as a noise image.
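The noise filter can be sketched with a small, dependency-free linear SVM. Everything below is an assumption made for illustration: the classifier is trained by sub-gradient descent on the regularized hinge loss, the hyperparameters are arbitrary, and the confidence that an image is noise is obtained by passing the negated decision value through a logistic function, since the text does not state how the confidence is derived.

```python
import math


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Train w, b on labels +1 (keyword images) and -1 (noise images)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights at every step
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                # Hinge-loss sub-gradient step for a margin violation
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def noise_confidence(w, b, x):
    """Assumed confidence that x is noise: logistic of the negated score."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(score))


def drop_noise(candidates, w, b, threshold=0.75):
    """Discard candidates judged to be noise with confidence > threshold."""
    return [x for x in candidates if noise_confidence(w, b, x) <= threshold]
```

In practice the inputs would be HOG or similar feature vectors of the positive, negative and candidate images, and a library implementation of the linear SVM could replace `train_linear_svm`.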
In a practical application of this embodiment, traditional manual labeling proceeds at an average speed of 5,000 pictures per person per day, whereas for the same pictures to be labeled, a single node device in the data processing platform of this embodiment can label 500,000 pictures per day. The efficiency is thus improved by a factor of 100, the labeling cost is reduced, and mislabeling occurs less often than with manual labeling.
In the prior art, manual labeling of sample data is costly and is influenced by human subjectivity, so that the basic quality of part of the pre-labeled data is low and a model optimized on it performs poorly in an actual service environment. The scheme of this embodiment requires no manual intervention and fully automates the collection, sorting, filtering and labeling of images. Specifically, image feature vectors are extracted based on deep learning, the image data is clustered by fast search of density peaks, the clustering centers whose picture content is closest to the keyword are identified from the clustering result, the image data matched with the keyword is obtained according to the clustering of the feature vectors, and that image data is labeled with the matched keyword, thereby completing the image data processing of the images. This solves the problem of excessive labor cost in labeling massive image data, so that effective training data can be provided quickly and efficiently for an image recognition engine or image processing application, and the image data in the training samples can be expanded more easily.
An embodiment of the present invention provides an image data processing apparatus, as shown in fig. 4, including:
an image extraction module, used for obtaining image data according to the acquired keywords;
an image clustering module, used for clustering the image data and obtaining a clustering center corresponding to the keyword;
an image screening module, used for screening out the image data matched with the clustering center and labeling the image data matched with the clustering center with the keyword; and
a sample library management module, used for importing the labeled image data into a sample library.
The sample library into which the image data is imported may be used for model training of an image recognition engine or image processing application, for algorithm optimization performed by a business system, or for other applications that require sample-based training. Because the sample library serves a training role in these applications, it may be referred to as a training sample library in this embodiment. Effective training data can thus be provided quickly and efficiently for an image recognition engine or image processing application, and the image data in the training samples can be updated and expanded quickly.
In this embodiment, the image extraction module is specifically configured to acquire user operation data, and acquire text information and image data corresponding to the text information from the user operation data. And extracting keywords from the text information, and using image data corresponding to the text information as image data corresponding to the extracted keywords.
The acquiring user operation data and acquiring text information and image data corresponding to the text information from the user operation data includes: and acquiring a retrieval record of the user in a specified time period by the image extraction module. And extracting a search word from the search record to serve as the keyword, and acquiring search item information with click operation. And acquiring image data corresponding to the retrieval item information as image data corresponding to the extracted keyword. Or, the acquiring the user operation data and acquiring the text information and the image data corresponding to the text information from the user operation data includes: and the image extraction module acquires the uploaded data of the user in a specified time period and determines the attribute information of an interface for displaying the uploaded data. And extracting the keywords from the attribute information, and extracting image data from the uploaded data as image data corresponding to the extracted keywords.
The image data processing apparatus provided in this embodiment, as shown in fig. 5, further includes:
and the repeated image removing module is used for acquiring the characteristic vectors of the histograms of the image data corresponding to the extracted keywords before clustering the image data, and calculating the distance between the characteristic vectors of the histograms. And screening and reserving a piece of image data according to the histogram with the same characteristic vector and the distance between the characteristic vectors within the specified range.
And/or the noise image removing module is used for acquiring a specified number of image data from the noise sample library as negative samples and acquiring the specified number of image data from each image data corresponding to the extracted keywords as positive samples. And training a linear SVM classifier by using the feature vectors of the negative sample and the positive sample. And then, acquiring the confidence coefficient of each image data corresponding to the extracted keywords by the linear SVM classifier according to the feature vector of each image data corresponding to the extracted keywords, and discarding the image data with the confidence coefficient greater than 0.75.
The image data processing device provided by the embodiment of the invention extracts image feature vectors based on deep learning, clusters the image data by fast search of density peaks, identifies from the clustering result the clustering centers whose picture content is closest to the keyword, obtains the image data matched with the keyword according to the clustering of the feature vectors, and labels that image data with the matched keyword, thereby completing the image data processing of the images. The labor cost of labeling massive image data is reduced, so that effective training data can be provided quickly and efficiently for an image recognition engine or image processing application, and the image data in the training samples can be expanded more easily.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.