TW201915787A - Search method and processing device - Google Patents

Search method and processing device

Info

Publication number
TW201915787A
TW201915787A TW107127419A
Authority
TW
Taiwan
Prior art keywords
text
image
feature vector
target image
search
Prior art date
Application number
TW107127419A
Other languages
Chinese (zh)
Inventor
劉瑞濤
劉宇
徐良鵬
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司 filed Critical 香港商阿里巴巴集團服務有限公司
Publication of TW201915787A publication Critical patent/TW201915787A/en

Classifications

    • G06F40/30 Semantic analysis (G06F40/00 Handling natural language data)
    • G06F16/334 Query execution (G06F16/33 Querying of unstructured textual data)
    • G06F16/51 Indexing; data structures therefor; storage structures (G06F16/50 Information retrieval of still image data)
    • G06F16/56 Retrieval of still image data having vectorial format
    • G06F16/583 Retrieval of still image data using metadata automatically derived from the content
    • G06F16/5838 Retrieval using automatically derived metadata, using colour
    • G06F16/5846 Retrieval using automatically derived metadata, using extracted text
    • G06F16/5866 Retrieval using manually generated information, e.g. tags, keywords, comments
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06Q30/0625 Electronic shopping: item investigation, directed with specific intent or strategy
    • G06V20/20 Scene-specific elements in augmented reality scenes
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Finance (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method including extracting an image feature vector of a target image, the image feature vector representing the image content of the target image; and determining, in the same vector space, a text corresponding to the target image according to a correlation between the image feature vector and a text feature vector of the text, the text feature vector representing the semantics of the text. The method avoids the low efficiency and high demands on system processing capability of conventional techniques, so that image tagging can be implemented simply and accurately.

Description

Search method and processing device

The present invention belongs to the field of Internet technologies, and in particular relates to a search method and a processing device.

With the continuous development of the Internet, e-commerce, and related technologies, the demand for image data keeps growing, and more effective analysis and use of image data can have a great impact on e-commerce. When processing image data, recommending tags for images makes image aggregation, image classification, image retrieval, and similar tasks far more effective, so the demand for recommending tags for image data is also growing.

For example, user A wants to search for a product by image. In this case, if images can be tagged automatically, then after the user uploads an image, category keywords and attribute keywords related to the image can be recommended automatically. Likewise, in other scenarios involving image data, text (for example, tags) can be recommended for an image automatically, without manual classification and labeling.

No effective solution has yet been proposed for tagging images simply and efficiently.

The object of the present invention is to provide a search method and a processing device that can tag images simply and efficiently. The search method and processing device provided by the present invention are implemented as follows.

A search method, the method comprising: extracting an image feature vector of a target image, wherein the image feature vector is used to represent the image content of the target image; and determining, in the same vector space, a tag corresponding to the target image according to the correlation between the image feature vector and the text feature vector of a tag, wherein the text feature vector is used to represent the semantics of the tag.

A processing device, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements: extracting an image feature vector of a target image, wherein the image feature vector is used to represent the image content of the target image; and determining, in the same vector space, a tag corresponding to the target image according to the correlation between the image feature vector and the text feature vector of a tag, wherein the text feature vector is used to represent the semantics of the tag.

A search method, the method comprising: extracting image features of a target image, wherein the image features are used to represent the image content of the target image; and determining, in the same vector space, a text corresponding to the target image according to the correlation between the image features and the text features of a text, wherein the text features are used to represent the semantics of the text.

A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the above method.

With the method and processing device for determining image tags provided by the present invention, a text can be searched for directly from the input target image (searching text by image), so no image-matching operation needs to be added to the matching process: the corresponding text is obtained directly by determining the correlation between the image feature vector and the text feature vector. This solves the problems of low efficiency and high demands on system processing capability found in existing text-recommendation approaches, and achieves the technical effect of implementing image tagging simply and accurately.

In order to enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

Some methods for recommending text for images already exist. For example, an image-to-image search model can be trained to produce an image feature vector for each image; for any two images, the greater the similarity between their image feature vectors, the more similar the two images are. Based on this principle, existing search methods generally collect an image set whose images cover the whole application scenario as far as possible. One or more images similar to the image input by the user are then determined from the image set by search and matching based on image feature vectors, the texts attached to these images are taken as a candidate text set, and one or more texts with relatively high confidence are selected from that set as the text recommended for the image.

This kind of search method has to maintain an image set covering the whole application scenario; the accuracy of the text recommendation depends on the size of the image set and on the precision of the texts attached to it, and the texts usually have to be labeled manually, which makes the method cumbersome to implement.

In view of the problems of the above image-to-image text recommendation method, a text-by-image approach can be adopted instead: the recommended text is searched for and determined directly from the input target image, without adding an image-matching operation to the matching process. The corresponding text is obtained directly by matching against the target image; that is, text can be recommended for the target image by searching text by image.

The above text may be a short tag, a long tag, specific text content, and so on. The present invention does not limit the specific form of the text content, which can be chosen according to actual needs. For example, when a picture is uploaded in an e-commerce scenario, the text may be a short tag; in a system that matches poems with pictures, the text may be lines of poetry. That is, different types of text content can be selected for different application scenarios.

Features can be extracted from the image and from the texts, the correlation between the image and each text in the text set can then be computed from the extracted features, and the text of the target image can be determined according to the level of correlation. On this basis, this example provides a search method. As shown in FIG. 1, an image feature vector representing the image content of the target image and, for each text, a text feature vector representing the semantics of the text are extracted, and the correlation between the image feature vector and the text feature vectors is computed to determine the text corresponding to the target image. A minimal sketch of this matching step follows.
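By way of illustration only, the following Python sketch shows this matching step, assuming that the two encoders already map images and texts into the same vector space; the vector dimension and the randomly generated candidate set used here are illustrative assumptions, not part of the disclosure:

    import numpy as np

    def rank_texts(image_vec, text_vecs):
        """Rank candidate texts by correlation with the image.

        Correlation is measured by Euclidean distance in the shared
        vector space: the smaller the distance, the more related the
        text is to the image, as described in this disclosure.
        """
        dists = np.linalg.norm(text_vecs - image_vec, axis=1)
        order = np.argsort(dists)          # most related first
        return order, dists

    # Hypothetical usage: the vectors would come from the trained encoders.
    image_vec = np.random.rand(2048)         # encoded target image
    text_vecs = np.random.rand(1000, 2048)   # encoded candidate texts
    order, dists = rank_texts(image_vec, text_vecs)
    best_text_index = order[0]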
That is, the data of the two modalities, text and image, can be converted by their respective encoders into feature vectors in the same space, the correlation between a text and an image can then be measured by the distance between the features, and a text with high correlation is taken as the text of the target image.

In one embodiment, the image may be uploaded through a client, where the client may be a terminal device or software operated by a user. Specifically, the client may be a terminal device such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart watch, or another wearable device. The client may also be software running on such a terminal device, for example an application such as mobile Taobao, Alipay, or a browser.

In one embodiment, considering the processing speed in practical applications, the text feature vector of each text may be extracted in advance, so that after the target image is obtained, only the image feature vector of the target image needs to be extracted; the text feature vectors do not have to be extracted again. This avoids repeated computation and improves processing speed and efficiency.

As shown in FIG. 2, the text determined for the target image may be selected in, but not limited to, the following ways:

1) One or more texts whose correlation between the text feature vector and the image feature vector of the target image is greater than a preset threshold are taken as the text corresponding to the target image. For example, if the preset threshold is 0.7, the texts whose text feature vectors have a correlation greater than 0.7 with the image feature vector of the target image are taken as the text determined for the target image.

2) The texts whose correlation between the text feature vector and the image feature vector of the target image ranks within a preset number are taken as the text of the target image. For example, if the preset number is 4, the texts are sorted by the correlation between their text feature vectors and the image feature vector of the target image, and the 4 top-ranked texts are taken as the text determined for the target image.

It is worth noting that the selection rules listed above are only illustrative. In an actual implementation, other determination strategies may be used; for example, the texts that both rank within the preset number and exceed the preset threshold may be taken as the determined text. Which strategy to use can be chosen according to actual needs, and the present invention does not specifically limit this; both rules are sketched below.
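By way of illustration, the two selection rules above (and the combined strategy) may be sketched as follows, assuming the correlation has already been computed as a score where a larger value means a closer match; the threshold 0.7 and the number 4 follow the examples above:

    def select_by_threshold(scores, texts, threshold=0.7):
        # Rule 1: keep every text whose correlation exceeds the threshold.
        return [t for t, s in zip(texts, scores) if s > threshold]

    def select_top_n(scores, texts, n=4):
        # Rule 2: keep the n texts with the highest correlation.
        ranked = sorted(zip(texts, scores), key=lambda p: p[1], reverse=True)
        return [t for t, _ in ranked[:n]]

    def select_combined(scores, texts, threshold=0.7, n=4):
        # Combined strategy mentioned above: within the top n AND above the threshold.
        ranked = sorted(zip(texts, scores), key=lambda p: p[1], reverse=True)[:n]
        return [t for t, s in ranked if s > threshold]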
In order to obtain the image feature vector of the target image and the text feature vectors of the texts simply and efficiently, encoding models can be trained and then used to extract the image feature vectors and text feature vectors.

As shown in FIG. 2, taking tags as the texts as an example, an image encoding model and a tag encoding model can be established, and the image feature vectors and text feature vectors can be extracted through these two models.

In one embodiment, the encoding models can be established as follows.

S1: Obtain the user search queries of a target scenario (for example, a search engine or e-commerce) and the image data clicked based on the search texts; a large amount of image/multi-tag data can be obtained from these behavior data. The search texts and the clicked image data may come from the historical search and access logs of the target scenario.

S2: Perform word segmentation and part-of-speech analysis on the obtained search texts; remove characters such as digits, punctuation marks, and garbled characters, and keep the visually groundable words (for example, nouns, verbs, and adjectives), which can be used as tags.

S3: Deduplicate the image data clicked based on the search texts.

S4: Merge tags with similar meanings in the tag set, and remove tags without practical meaning as well as tags that cannot be recognized visually (for example, "development" or "problem").

S5: Considering that an <image, single tag> data set is more conducive to network convergence than an <image, multiple tags> data set, the <image, multiple tags> data can be converted into <image, single tag> pairs. For example, the multi-tag pair <image, tag1:tag2:tag3> can be converted into three single-tag pairs <image, tag1>, <image, tag2>, and <image, tag3>. During training, in each triplet, one image corresponds to only one positive-sample tag.

S6: Train on the obtained single-tag pairs to obtain an image encoding model for extracting image feature vectors from images and a tag encoding model for extracting text feature vectors from tags, such that the image feature vector and the text feature vector of the same image-tag pair are as correlated as possible. A sketch of the data-preparation steps follows.
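By way of illustration, steps S1 to S5 may be sketched as follows; the log record format, the stop-tag list, and the word segmenter `segment` are illustrative assumptions:

    import re

    KEEP_POS = {"noun", "verb", "adjective"}    # visually groundable words (S2)
    STOP_TAGS = {"development", "problem"}      # tags with no visual meaning (S4)

    def log_to_pairs(click_log, segment):
        """click_log: iterable of (search_text, image_id) click records (S1).
        segment: a word segmenter returning (word, pos) tuples (assumed given).
        Returns deduplicated <image, single tag> pairs (S3, S5)."""
        pairs = set()
        for search_text, image_id in click_log:
            # S2: strip digits, punctuation, and other non-word characters
            cleaned = re.sub(r"[\d\W_]+", " ", search_text)
            for word, pos in segment(cleaned):
                if pos in KEEP_POS and word not in STOP_TAGS:
                    # S5: one tag per pair; the set also deduplicates
                    # repeated <image, tag> records (cf. S3)
                    pairs.add((image_id, word))
        return pairs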
For example, the image encoding model may be a neural network that uses ResNet-152 for image feature extraction: the original image is normalized to a preset pixel size (for example, 224x224 pixels) as input, and the pool5-layer feature is taken as the network output; the output feature vector has length 2048. On top of this neural network, transfer learning with a non-linear transformation is applied to obtain the final feature vector that reflects the image content. As shown in FIG. 2, the image in FIG. 2 can be converted into a feature vector that reflects the image content.
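By way of illustration, such an image encoder may be sketched in PyTorch as follows, assuming torchvision's pretrained ResNet-152; the projection dimension and the choice of non-linearity are illustrative assumptions:

    import torch.nn as nn
    from torchvision import models

    class ImageEncoder(nn.Module):
        def __init__(self, embed_dim=512):
            super().__init__()
            backbone = models.resnet152(weights="IMAGENET1K_V1")
            # keep everything up to and including global average pooling,
            # which yields the 2048-dim "pool5" feature described above
            self.backbone = nn.Sequential(*list(backbone.children())[:-1])
            # non-linear transformation used for transfer learning
            self.project = nn.Sequential(
                nn.Linear(2048, embed_dim),
                nn.Tanh(),
            )

        def forward(self, images):                     # images: (B, 3, 224, 224)
            feats = self.backbone(images).flatten(1)   # (B, 2048)
            return self.project(feats)                 # (B, embed_dim)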
The tag encoding model may convert each tag into a vector through one-hot encoding. Considering that one-hot vectors are generally sparse and long, the one-hot encoding can be converted into a lower-dimensional dense vector through an embedding layer for convenience of processing, and the resulting vector is used as the text feature vector corresponding to the tag. For the text network, a two-layer fully connected structure can be used, with additional non-linear computation layers added to strengthen the expressive power of the text feature vectors, so as to obtain the text feature vectors of the N tags corresponding to an image. That is, each tag is finally converted into a fixed-length real-valued vector. For example, the tag "連衣裙" (dress) in FIG. 2 is converted into a text feature vector through the tag encoding model; this text feature vector reflects the original semantics and can therefore be compared conveniently with the image feature vector.
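By way of illustration, such a tag encoder may be sketched as follows; the vocabulary size, embedding dimension, and hidden width are illustrative assumptions, and the output dimension is chosen to match the image encoder so that both modalities land in the same vector space:

    import torch.nn as nn

    class TagEncoder(nn.Module):
        def __init__(self, vocab_size=50_000, embed_dim=300, out_dim=512):
            super().__init__()
            # embedding layer: turns the (implicit) one-hot tag index
            # into a dense, low-dimensional vector
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # two fully connected layers with non-linearities
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, 512),
                nn.ReLU(),
                nn.Linear(512, out_dim),
                nn.Tanh(),
            )

        def forward(self, tag_ids):                   # tag_ids: (B,) integer indices
            return self.mlp(self.embed(tag_ids))      # (B, out_dim)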
In one embodiment, considering that comparing many tags at the same time requires a fast computer and places high demands on the processor, the correlation between the image feature vector and the text feature vector of each of the multiple tags can be determined one by one, as shown in FIG. 3. After each correlation is determined, the result is stored on disk instead of keeping all results in memory. After the correlations between the image feature vector and all tags in the tag set have been computed, the similarities can be sorted, or a similarity judgment can be made, to determine one or more tag texts that can serve as tags of the target image, for example as sketched below.
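By way of illustration, this memory-saving loop may be sketched as follows, appending each correlation result to a file on disk and sorting only afterwards; the CSV file format is an illustrative choice:

    import csv
    import numpy as np

    def score_tags_to_disk(image_vec, tag_vec_iter, path="scores.csv"):
        """tag_vec_iter yields (tag, vector) pairs one at a time, so only
        one text feature vector is held in memory per iteration."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            for tag, vec in tag_vec_iter:
                dist = float(np.linalg.norm(image_vec - vec))
                writer.writerow([tag, dist])      # persist, then move on

    def best_tags(path="scores.csv", n=4):
        with open(path, newline="") as f:
            rows = [(tag, float(d)) for tag, d in csv.reader(f)]
        rows.sort(key=lambda r: r[1])             # smaller distance = more related
        return [tag for tag, _ in rows[:n]]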
In order to determine the correlation between a text feature vector and an image feature vector, the correlation can be characterized by Euclidean distance. Both the text feature vector and the image feature vector are represented as vectors, so in the same vector space the correlation between the two can be determined by comparing the Euclidean distance between them.

Specifically, the images and texts can be mapped into the same feature space, so that the feature vectors of images and texts lie in the same vector space. The training can then drive the text feature vectors that are highly correlated with an image feature vector to be close to it in this space, while uncorrelated ones stay far away. The correlation between an image and a text can therefore be determined by computing from the text feature vector and the image feature vector.

Specifically, the matching degree between a text feature vector and an image feature vector may be the Euclidean distance between the two vectors: the smaller the Euclidean distance computed from the two vectors, the better the match between them; conversely, the larger the Euclidean distance, the worse the match.

In one embodiment, the Euclidean distance between the text feature vector and the image feature vector is computed in the same vector space; the smaller the Euclidean distance, the higher the correlation between the two, and the larger the Euclidean distance, the lower the correlation. Therefore, during model training, a small Euclidean distance can be used as the training objective to obtain the final encoding models. Correspondingly, when determining the correlation, the correlation between the image and the text can be determined based on the Euclidean distance, so that the texts more relevant to the image are selected.
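By way of illustration, this training objective may be sketched as a triplet loss over the single-tag pairs of step S5, pulling the matching tag toward the image in the shared space and pushing a sampled non-matching tag away; the margin value is an illustrative assumption, since the disclosure states only that a small Euclidean distance is the training target:

    import torch.nn.functional as F

    def triplet_step(img_enc, tag_enc, images, pos_tags, neg_tags, margin=0.2):
        """One training step on a batch of <image, positive tag> pairs,
        with sampled negative tags (one positive tag per image, per S5)."""
        v_img = img_enc(images)          # (B, D)
        v_pos = tag_enc(pos_tags)        # (B, D) matching tags
        v_neg = tag_enc(neg_tags)        # (B, D) non-matching tags
        d_pos = F.pairwise_distance(v_img, v_pos)   # Euclidean distances
        d_neg = F.pairwise_distance(v_img, v_neg)
        # want d_pos small and d_neg large, up to the margin
        loss = F.relu(d_pos - d_neg + margin).mean()
        return loss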
The above uses only Euclidean distance to measure the correlation between the image feature vector and the text feature vector. In an actual implementation, the correlation can also be determined in other ways, for example using cosine distance or Manhattan distance. In addition, in some cases the correlation may or may not be a numeric value; for example, it may only be a character-based representation of a degree or trend. In that case, a preset rule can quantize the content of the character-based representation to a specific value, and the quantized value can subsequently be used to determine the correlation between the two vectors. For example, if the value of a certain dimension is "medium", the character can be quantized as the binary or hexadecimal value of its ASCII code. The matching degree between two vectors in the embodiments of the present invention is not limited to the above.

After the correlation between the image feature vector and the text feature vectors has been computed and the texts corresponding to the target image have been determined, it may happen that some of the obtained texts overlap, or that completely unrelated texts were determined. To improve the precision of text determination, erroneous texts can be removed, or the texts can be deduplicated, so that the finally determined texts are more accurate.
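By way of illustration, the alternative measures mentioned above may be sketched as follows for comparison with the Euclidean measure:

    import numpy as np

    def euclidean(a, b):
        return np.linalg.norm(a - b)        # smaller = more related

    def manhattan(a, b):
        return np.abs(a - b).sum()          # L1 variant of the same idea

    def cosine_distance(a, b):
        # 1 - cosine similarity; ignores vector magnitudes
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))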
In one embodiment, when tags are determined by sorting by similarity and taking the first N as the determined tags, tags of the same attribute will inevitably be selected several times. For example, for a picture of a bowl, both "bowl" and "basin" may appear among the highly correlated tags, while no tag about color or style ranks near the top, so none of them is selected. In this case, the top-ranked tags can simply be pushed as the determined tags, as described above; alternatively, rules can be set to define several tag categories, and the most correlated tag in each category is selected as a determined tag, for example one product type, one color, one style, and so on. Which strategy to adopt can be chosen according to actual needs, and the present invention does not limit this.

For example, suppose the tags ranked first and second by correlation are red with correlation 0.8 and purple with correlation 0.7. If the strategy is set to recommend all of the top-ranked tags, both red and purple are recommended. If the strategy is set to select only one tag per category, for example only one color tag, then red is selected as the recommended tag because its correlation is greater than that of purple. A per-category selection is sketched below.

In the above example, the data of the two modalities, text and image, are converted by their respective encoding models into feature vectors in the same vector space, the correlation between a tag and the image is then measured by the distance between the feature vectors, and the highly correlated tags are taken as the text determined for the image.

It is worth noting that the approach introduced in the above example unifies images and texts into the same vector space, so that correlation matching can be performed directly between images and texts. The above example applies this approach to searching text by image, that is, given an image, tagging the image, generating description information, generating related text information, and so on. In an actual implementation, it can also be applied to searching images by text, that is, given a text, searching for the matching pictures; the processing method and idea are similar to searching text by image above and are not repeated here.
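By way of illustration, the one-tag-per-category rule of the red/purple example may be sketched as follows, assuming a hypothetical mapping from each tag to its category:

    def pick_one_per_category(scored_tags, tag_category):
        """scored_tags: list of (tag, correlation);
        tag_category: dict mapping tag -> category.
        Keeps only the highest-correlated tag in each category."""
        best = {}
        for tag, score in scored_tags:
            cat = tag_category.get(tag, "other")
            if cat not in best or score > best[cat][1]:
                best[cat] = (tag, score)
        return [tag for tag, _ in best.values()]

    # With red 0.8 and purple 0.7 both in the "color" category,
    # only red is recommended, as in the example above.
    tags = [("red", 0.8), ("purple", 0.7), ("dress", 0.9)]
    cats = {"red": "color", "purple": "color", "dress": "product type"}
    print(pick_one_per_category(tags, cats))   # ['red', 'dress']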
The above search method is described below with reference to several specific scenarios. It is worth noting that these scenarios serve only to better explain the present invention and do not constitute an improper limitation on it.

1) Listing a product on an e-commerce site. As shown in FIG. 4, user A intends to sell a second-hand dress. After taking a photo and uploading the picture to the e-commerce platform, the user normally has to set the tags for the picture, for example entering "long", "red", "dress" as tags of the image, which inevitably increases the user's operations. With the above method for determining image tags of the present invention, automatic tagging can be achieved: after user A uploads the photo, the system back end can automatically recognize and tag the picture. The image feature vector of the uploaded picture is extracted, and its correlations with the pre-extracted text feature vectors of multiple tags are computed to obtain the correlation between the image feature vector and each tag text. Then, according to the level of correlation, the tags for the uploaded photo are determined and applied automatically, which reduces user operations and improves the user experience.

2) Photo albums. Photos that have been taken, or downloaded from the Internet, are stored in a cloud album or a mobile-phone album. With the above method, the image feature vector of the picture is extracted, its correlations with the pre-extracted text feature vectors of multiple tags are computed, and the tags for the photo are determined by the level of correlation and applied automatically. After tagging, photos can be classified more conveniently, and the target picture can be located faster when the album is searched later.

3) Searching products by image. In search modes such as Pailitao (拍立淘), the user uploads a picture, and related or similar products are then found based on this picture. In this case, after the user uploads the picture, the above method can be used to extract the image feature vector of the uploaded picture and compute its correlations with the pre-extracted text feature vectors of multiple tags, thereby determining the tags for the uploaded picture by the level of correlation. Once the picture is tagged, the search can be performed through the applied tags, which can effectively improve search accuracy and increase the recall rate.

4) Searching poems by image. As shown in FIG. 5, some applications or scenarios need to match a poem to a picture; after the user uploads a picture, the matching poem can be found based on that picture. In this case, the above method extracts the image feature vector of the uploaded picture and computes its correlations with the pre-extracted text feature vectors of multiple poems, yielding the correlation between the image feature vector and the text feature vector of each poem. Then, according to the level of correlation, the poem corresponding to the uploaded photo is determined, and the content of the poem, or information such as its title and author, can be presented.

The four scenarios above are only examples; in an actual implementation, the method can be used in other scenarios as well. It suffices to collect image-tag pairs for a given scenario and train on them to obtain an image encoding model and a text encoding model that fit that scenario.
The method embodiments provided by the embodiments of the present invention can be executed in a mobile terminal, a computer terminal, a server, or a similar computing device. Taking execution on a server as an example, FIG. 6 is a hardware structure block diagram of a server for a search method according to an embodiment of the present invention. As shown in FIG. 6, the server 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. A person of ordinary skill in the art can understand that the structure shown in FIG. 6 is only schematic and does not limit the structure of the above electronic device. For example, the server 10 may include more or fewer components than shown in FIG. 6, or have a configuration different from that shown in FIG. 6.

The memory 104 can be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the search method in the embodiments of the present invention. The processor 102 runs the software programs and modules stored in the memory 104 to execute various functional applications and data processing, that is, to implement the above search method. The memory 104 may include high-speed random-access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memories disposed remotely with respect to the processor 102, and these remote memories may be connected to the server 10 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The transmission module 106 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the server 10. In one example, the transmission module 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission module 106 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
Referring to FIG. 7, in a software implementation, the search apparatus is applied in a server and may include an extraction unit and a determination unit. The extraction unit is configured to extract an image feature vector of a target image, where the image feature vector is used to represent the image content of the target image. The determination unit is configured to determine, in the same vector space, the tag corresponding to the target image according to the correlation between the image feature vector and the text feature vector of a tag, where the text feature vector is used to represent the semantics of the tag.

In one embodiment, the determination unit may further be configured to determine the correlation between the target image and a tag according to the Euclidean distance between the image feature vector and the text feature vector, before determining the tag corresponding to the target image according to the correlation between the image feature vector and the text feature vector of the tag.

In one embodiment, the determination unit may specifically be configured to take, as the tags corresponding to the target image, one or more tags whose correlation between the text feature vector and the image feature vector of the target image is greater than a preset threshold; or to take, as the tags of the target image, the tags whose correlation with the image feature vector of the target image ranks within a preset number.

In one embodiment, the determination unit may specifically be configured to determine, one by one, the correlation between the image feature vector and the text feature vector of each of multiple tags, and, after the similarities between the image feature vector and the text feature vectors of the multiple tags have been determined, to determine the tag corresponding to the target image based on the determined similarities.

In one embodiment, the extraction unit may further be configured to, before extracting the image feature vector of the target image, obtain search-and-click behavior data, where the search-and-click behavior data includes search texts and image data clicked based on the search texts; convert the search-and-click behavior data into multiple image-tag pairs; and train, according to the multiple image-tag pairs, the data models for extracting image feature vectors and tag features.

In one embodiment, converting the search-and-click behavior data into multiple image-tag pairs may include: performing word segmentation and part-of-speech analysis on the search texts; determining tags from the data obtained by the word segmentation and part-of-speech analysis; deduplicating the image data clicked based on the search texts; and establishing image-tag pairs according to the determined tags and the deduplicated image data.

With the method and processing device for determining image tags provided by the present invention, the recommended tags are searched for and determined directly from the input target image (searching text by image), so no image-matching operation needs to be added to the matching process: the corresponding tag text is obtained directly by determining the correlation between the image feature vector and the text feature vector. This solves the problems of low efficiency and high demands on system processing capability found in existing tag-recommendation approaches, and achieves the technical effect of implementing image tagging simply and accurately.

Although the present invention provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-creative labor. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only one. When an actual apparatus or client product executes, the steps may be executed sequentially or in parallel according to the methods shown in the embodiments or drawings (for example, in a parallel-processor or multi-threaded environment).

The apparatuses or modules set forth in the above embodiments may specifically be implemented by a computer chip or an entity, or by a product having a certain function. For convenience of description, the above apparatus is described with its functions divided into various modules. When implementing the present invention, the functions of the modules may be implemented in one or more pieces of software and/or hardware, and a module implementing a certain function may also be implemented by a combination of multiple sub-modules or sub-units.
The methods, apparatuses, or modules described in the present invention can be implemented in computer-readable program code, and the controller can be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320; a memory controller can also be implemented as part of the control logic of a memory. A person skilled in the art also knows that, in addition to implementing the controller in pure computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component; or the devices for implementing various functions can even be regarded as both software modules implementing the method and structures within the hardware component.

Some modules of the apparatus of the present invention can be described in the general context of computer-executable instructions executed by a computer, for example program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform specific tasks or implement specific abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media including storage devices.

From the description of the above embodiments, a person skilled in the art can clearly understand that the present invention can be implemented by means of software plus the necessary hardware. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product, or embodied in the implementation process of data migration. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.

The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. All or part of the present invention can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.

Although the present invention has been described through embodiments, a person of ordinary skill in the art knows that the present invention has many variations and changes without departing from its spirit, and it is hoped that the appended claims cover these variations and changes without departing from the spirit of the present invention.

102 processor
106 transmission module
10 server
104 non-volatile memory

In order to more clearly explain the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments recorded in the present invention, and a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of an embodiment of the search method provided by the present invention;

FIG. 2 is a schematic diagram of the establishment of the image encoding model and the tag encoding model provided by the present invention;

FIG. 3 is a flowchart of another embodiment of the search method provided by the present invention;

FIG. 4 is a schematic diagram of automatic image tagging provided by the present invention;

FIG. 5 is a schematic diagram of searching poems by image provided by the present invention;

FIG. 6 is a schematic diagram of the architecture of the server provided by the present invention;

FIG. 7 is a structural block diagram of the search apparatus provided by the present invention.

Claims (15)

1. A search method, characterized in that the method comprises: extracting an image feature vector of a target image, wherein the image feature vector is used to characterize the image content of the target image; and, in the same vector space, determining the text corresponding to the target image according to the correlation between the image feature vector and the text feature vectors of texts, wherein a text feature vector is used to characterize the semantics of a text.

2. The method according to claim 1, wherein before the text corresponding to the target image is determined according to the correlation between the image feature vector and the text feature vector of a text, the method further comprises: determining the correlation between the target image and the text according to the Euclidean distance between the image feature vector and the text feature vector.

3. The method according to claim 1, wherein determining the text corresponding to the target image according to the correlation between the image feature vector and the text feature vectors of texts comprises: taking, as the texts corresponding to the target image, one or more texts whose text feature vectors have a correlation with the image feature vector of the target image greater than a preset threshold; or taking, as the texts of the target image, a preset number of texts whose correlations with the image feature vector of the target image rank highest.

4. The method according to claim 1, wherein determining the text corresponding to the target image according to the correlation between the image feature vector and the text feature vectors of texts comprises: determining, one by one, the correlation between the image feature vector and the text feature vector of each of a plurality of texts; and, after the similarities between the image feature vector and the text feature vectors of the plurality of texts have been determined, determining the text corresponding to the target image based on the determined similarities.

5. The method according to claim 1, wherein before the image feature vector of the target image is extracted, the method further comprises: obtaining search click behavior data, wherein the search click behavior data comprises search texts and images clicked on the basis of the search texts; converting the search click behavior data into a plurality of image-text pairs; and training, according to the plurality of image-text pairs, a data model for extracting image feature vectors and text feature vectors.
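Claims 2 to 4 describe the relevance computation at retrieval time. The following is a minimal Python sketch of one way to realize it, assuming the target image and the candidate texts have already been embedded into the same vector space; the 1/(1+distance) mapping from Euclidean distance to correlation, and names such as `text_index`, are illustrative assumptions rather than details given in the patent.

```python
import numpy as np

def correlation(image_vec: np.ndarray, text_vec: np.ndarray) -> float:
    """Correlation derived from the Euclidean distance between the two
    vectors: a smaller distance yields a higher correlation. The patent
    only states that correlation is determined from the Euclidean
    distance; the 1/(1+d) mapping is an assumed, illustrative choice."""
    distance = float(np.linalg.norm(image_vec - text_vec))
    return 1.0 / (1.0 + distance)

def texts_for_image(image_vec, text_index, threshold=None, top_n=None):
    """Select texts for a target image either by a preset correlation
    threshold or by keeping the top-N most correlated texts."""
    # Score each candidate text one by one against the image vector.
    scored = sorted(
        ((text, correlation(image_vec, vec)) for text, vec in text_index.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if threshold is not None:
        return [text for text, score in scored if score > threshold]
    return [text for text, _ in scored[:top_n]]

# Illustrative usage with a toy two-dimensional "shared space".
text_index = {
    "sunset": np.array([0.9, 0.1]),
    "beach":  np.array([0.7, 0.35]),
    "cat":    np.array([0.1, 0.9]),
}
image_vec = np.array([0.85, 0.2])
print(texts_for_image(image_vec, text_index, top_n=2))  # ['sunset', 'beach']
```

Either selection rule can be used on the same ranked list, which is why the claim presents the threshold and the top-N variants as alternatives.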
6. The method according to claim 5, wherein converting the search click behavior data into a plurality of image-text pairs comprises: performing word segmentation and part-of-speech analysis on the search texts; determining texts from the data obtained by the word segmentation and part-of-speech analysis; performing deduplication on the image data clicked on the basis of the search texts; and establishing image-text pairs according to the determined texts and the deduplicated image data.

7. The method according to claim 6, wherein the image-text pairs comprise single-tag pairs, each single-tag pair carrying one image and one text.

8. A processing device, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements a method for determining the text of an image, the method comprising: extracting an image feature vector of a target image, wherein the image feature vector is used to characterize the image content of the target image; and, in the same vector space, determining the text corresponding to the target image according to the correlation between the image feature vector and the text feature vectors of texts, wherein a text feature vector is used to characterize the semantics of a text.

9. The processing device according to claim 8, wherein before the text corresponding to the target image is determined according to the correlation between the image feature vector and the text feature vector of a text, the processor is further configured to determine the correlation between the target image and the text according to the Euclidean distance between the image feature vector and the text feature vector.

10. The processing device according to claim 8, wherein the processor determining the text corresponding to the target image according to the correlation between the image feature vector and the text feature vectors of texts comprises: taking, as the texts corresponding to the target image, one or more texts whose text feature vectors have a correlation with the image feature vector of the target image greater than a preset threshold; or taking, as the texts of the target image, a preset number of texts whose correlations with the image feature vector of the target image rank highest.
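Claims 5 to 7 outline how search click logs become training data. The sketch below illustrates those steps under stated assumptions: `segment_with_pos` is a hypothetical stand-in for a real word-segmentation and part-of-speech tool, and the content-word filter is an assumed heuristic, not a rule from the patent.

```python
def segment_with_pos(query: str) -> list[tuple[str, str]]:
    """Stand-in for word segmentation plus part-of-speech analysis.
    Here: whitespace split, every token tagged as a noun. A real system
    would use a proper segmenter (e.g. jieba.posseg for Chinese)."""
    return [(token, "noun") for token in query.split()]

KEEP_POS = {"noun", "adjective"}  # assumed filter: keep content words only

def click_log_to_pairs(click_log):
    """Turn (search text, clicked image) records into single-tag
    image-text pairs: segment the query, keep content words, and drop
    duplicate (image, text) combinations (the deduplication step)."""
    seen = set()
    pairs = []
    for query, image_id in click_log:
        for word, pos in segment_with_pos(query):
            if pos not in KEEP_POS:
                continue
            if (image_id, word) in seen:  # already seen this pair
                continue
            seen.add((image_id, word))
            pairs.append((image_id, word))  # one image, one text (claim 7)
    return pairs

log = [("red sunset beach", "img_001"), ("sunset", "img_001")]
print(click_log_to_pairs(log))
# [('img_001', 'red'), ('img_001', 'sunset'), ('img_001', 'beach')]
```

Each resulting single-tag pair carries exactly one image and one text, which is the unit the training step in claim 5 consumes.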
11. The processing device according to claim 8, wherein the processor determining the text corresponding to the target image according to the correlation between the image feature vector and the text feature vectors of texts comprises: determining, one by one, the correlation between the image feature vector and the text feature vector of each of a plurality of texts; and, after the similarities between the image feature vector and the text feature vectors of the plurality of texts have been determined, determining the text corresponding to the target image based on the determined similarities.

12. The processing device according to claim 8, wherein before the image feature vector of the target image is extracted, the processor is further configured to: obtain search click behavior data, wherein the search click behavior data comprises search texts and images clicked on the basis of the search texts; convert the search click behavior data into a plurality of image-text pairs; and train, according to the plurality of image-text pairs, a data model for extracting image feature vectors and text feature vectors.

13. The processing device according to claim 12, wherein the processor converting the search click behavior data into a plurality of image-text pairs comprises: performing word segmentation and part-of-speech analysis on the search texts; determining texts from the data obtained by the word segmentation and part-of-speech analysis; performing deduplication on the image data clicked on the basis of the search texts; and establishing image-text pairs according to the determined texts and the deduplicated image data.

14. A search method, characterized in that the method comprises: extracting image features of a target image, wherein the image features are used to characterize the image content of the target image; and, in the same vector space, determining the text corresponding to the target image according to the correlation between the image features and the text features of texts, wherein the text features are used to characterize the semantics of the texts.

15. A computer-readable storage medium having computer instructions stored thereon which, when executed, implement the steps of the method according to any one of claims 1 to 7.
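The "data model for extracting image feature vectors and text feature vectors" of claims 5 and 12 (and the image coding model and tag coding model of FIG. 2) suggests a two-tower arrangement: one encoder per modality, trained so that matched image-text pairs land close together in the shared space. Below is a rough PyTorch sketch under that reading; the layer sizes, the triplet loss, and the in-batch negatives are illustrative choices, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Two encoders projecting images and texts into one vector space."""
    def __init__(self, image_dim: int, vocab_size: int, embed_dim: int = 128):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Linear(image_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        self.text_encoder = nn.Sequential(
            nn.EmbeddingBag(vocab_size, 256),  # mean-pools token embeddings
            nn.Linear(256, embed_dim),
        )

    def forward(self, image_feats, text_ids):
        return self.image_encoder(image_feats), self.text_encoder(text_ids)

# One training step: pull matched image-text pairs together in the shared
# space and push mismatched pairs apart.
model = DualEncoder(image_dim=2048, vocab_size=10_000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.TripletMarginLoss(margin=1.0)  # Euclidean distance by default

image_feats = torch.randn(32, 2048)           # e.g. pre-extracted CNN features
text_ids = torch.randint(0, 10_000, (32, 8))  # tokenized tags per image
img_vecs, txt_vecs = model(image_feats, text_ids)
negatives = txt_vecs.roll(1, dims=0)          # shifted batch as negatives

optimizer.zero_grad()
loss = loss_fn(img_vecs, txt_vecs, negatives)
loss.backward()
optimizer.step()
```

The triplet loss here uses Euclidean distance, which lines up with the distance-based correlation of claim 2; once trained, the two encoders produce the feature vectors consumed by the retrieval step sketched after claim 5.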
TW107127419A 2017-10-10 2018-08-07 Search method and processing device TW201915787A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710936315.0A CN110069650B (en) 2017-10-10 2017-10-10 Searching method and processing equipment
??201710936315.0 2017-10-10

Publications (1)

Publication Number Publication Date
TW201915787A true TW201915787A (en) 2019-04-16

Family

ID=65993310

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107127419A TW201915787A (en) 2017-10-10 2018-08-07 Search method and processing device

Country Status (4)

Country Link
US (1) US20190108242A1 (en)
CN (1) CN110069650B (en)
TW (1) TW201915787A (en)
WO (1) WO2019075123A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304435B (en) * 2017-09-08 2020-08-25 腾讯科技(深圳)有限公司 Information recommendation method and device, computer equipment and storage medium
CN110163050B (en) * 2018-07-23 2022-09-27 腾讯科技(深圳)有限公司 Video processing method and device, terminal equipment, server and storage medium
US11210830B2 (en) * 2018-10-05 2021-12-28 Life Covenant Church, Inc. System and method for associating images and text
US11146862B2 (en) * 2019-04-16 2021-10-12 Adobe Inc. Generating tags for a digital video
CN110175256B (en) * 2019-05-30 2024-06-07 上海联影医疗科技股份有限公司 Image data retrieval method, device, equipment and storage medium
CN110378726A (en) * 2019-07-02 2019-10-25 阿里巴巴集团控股有限公司 A kind of recommended method of target user, system and electronic equipment
WO2021042763A1 (en) * 2019-09-03 2021-03-11 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image searches based on word vectors and image vectors
CN112560398B (en) * 2019-09-26 2023-07-04 百度在线网络技术(北京)有限公司 Text generation method and device
CN110706771B * 2019-10-10 2023-06-30 复旦大学附属中山医院 Method, device, server and storage medium for generating multi-modal patient education content
CN110765301B (en) * 2019-11-06 2022-02-25 腾讯科技(深圳)有限公司 Picture processing method, device, equipment and storage medium
CN110990617B (en) * 2019-11-27 2024-04-19 广东智媒云图科技股份有限公司 Picture marking method, device, equipment and storage medium
CN111309151B (en) * 2020-02-28 2022-09-16 桂林电子科技大学 Control method of school monitoring equipment
CN111428652B (en) * 2020-03-27 2021-06-08 恒睿(重庆)人工智能技术研究院有限公司 Biological characteristic management method, system, equipment and medium
CN111428063B (en) * 2020-03-31 2023-06-30 杭州博雅鸿图视频技术有限公司 Image feature association processing method and system based on geographic space position division
CN111708900B (en) * 2020-06-17 2023-08-25 北京明略软件系统有限公司 Expansion method and expansion device for tag synonyms, electronic equipment and storage medium
CN112015923A (en) * 2020-09-04 2020-12-01 平安科技(深圳)有限公司 Multi-mode data retrieval method, system, terminal and storage medium
CN112559820B (en) * 2020-12-17 2022-08-30 中国科学院空天信息创新研究院 Sample data set intelligent question setting method, device and equipment based on deep learning
CN113127663B (en) * 2021-04-01 2024-02-27 深圳力维智联技术有限公司 Target image searching method, device, equipment and computer readable storage medium
CN113157871B (en) * 2021-05-27 2021-12-21 宿迁硅基智能科技有限公司 News public opinion text processing method, server and medium applying artificial intelligence
CN113407767A (en) * 2021-06-29 2021-09-17 北京字节跳动网络技术有限公司 Method and device for determining text relevance, readable medium and electronic equipment
CN114329006A (en) * 2021-09-24 2022-04-12 腾讯科技(深圳)有限公司 Image retrieval method, device, equipment and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8521759B2 (en) * 2011-05-23 2013-08-27 Rovi Technologies Corporation Text-based fuzzy search
US9218546B2 (en) * 2012-06-01 2015-12-22 Google Inc. Choosing image labels
CN105426356B (en) * 2015-10-29 2019-05-21 杭州九言科技股份有限公司 A kind of target information recognition methods and device
US9633048B1 (en) * 2015-11-16 2017-04-25 Adobe Systems Incorporated Converting a text sentence to a series of images
CN106021364B (en) * 2016-05-10 2017-12-12 百度在线网络技术(北京)有限公司 Foundation, image searching method and the device of picture searching dependency prediction model
CN106997387B (en) * 2017-03-28 2019-08-09 中国科学院自动化研究所 Based on the multi-modal automaticabstracting of text-images match

Also Published As

Publication number Publication date
US20190108242A1 (en) 2019-04-11
CN110069650A (en) 2019-07-30
CN110069650B (en) 2024-02-09
WO2019075123A1 (en) 2019-04-18

Similar Documents

Publication Publication Date Title
TW201915787A (en) Search method and processing device
CN112199375B (en) Cross-modal data processing method and device, storage medium and electronic device
CN106980868B (en) Embedding space for images with multiple text labels
CN106980867B (en) Modeling semantic concepts in an embedding space as distributions
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN109284729B (en) Method, device and medium for acquiring face recognition model training data based on video
CN106203242B (en) Similar image identification method and equipment
US11341186B2 (en) Cognitive video and audio search aggregation
US10755447B2 (en) Makeup identification using deep learning
KR20180122926A (en) Method for providing learning service and apparatus thereof
CN111897939B (en) Visual dialogue method, training method, device and equipment for visual dialogue model
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN110083729B (en) Image searching method and system
CN108319888B (en) Video type identification method and device and computer terminal
US20170116521A1 (en) Tag processing method and device
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN113094552A (en) Video template searching method and device, server and readable storage medium
CN112883258A (en) Information recommendation method and device, electronic equipment and storage medium
CN111651674B (en) Bidirectional searching method and device and electronic equipment
CN111507285A (en) Face attribute recognition method and device, computer equipment and storage medium
CN111552767A (en) Search method, search device and computer equipment
CN113407851A (en) Method, device, equipment and medium for determining recommendation information based on double-tower model
CN111506596A (en) Information retrieval method, information retrieval device, computer equipment and storage medium
CN113254687B (en) Image retrieval and image quantification model training method, device and storage medium
CN113869063A (en) Data recommendation method and device, electronic equipment and storage medium