WO2023020005A1 - Training method for a neural network model, image retrieval method, device, and medium

Training method for a neural network model, image retrieval method, device, and medium

Info

Publication number
WO2023020005A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
neural network
network model
recognition
text
Application number
PCT/CN2022/089626
Other languages
English (en)
French (fr)
Inventor
陈玥蓉
姚锟
孙逸鹏
韩钧宇
刘经拓
Original Assignee
北京百度网讯科技有限公司
Application filed by 北京百度网讯科技有限公司
Priority to JP2022573483A (published as JP2023541752A)
Publication of WO2023020005A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/50 Information retrieval of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval using metadata automatically derived from the content
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Definitions

  • The present disclosure relates to the field of artificial intelligence technology, in particular to computer vision and deep learning technology, applicable to scenarios such as image processing and image recognition, and specifically to a training method for a neural network model, an image retrieval method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
  • Artificial intelligence is the discipline of using computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), spanning both hardware-level and software-level technologies.
  • Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technology.
  • The present disclosure provides a training method for a neural network model, an image retrieval method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
  • According to one aspect, a training method for a neural network model is provided, where the neural network model includes an image recognition neural network model and a text recognition neural network model. The method includes: obtaining a sample image and semantic text information corresponding to the sample image; inputting the sample image into the image recognition neural network model and obtaining a first feature vector, output by the image recognition neural network model, corresponding to the sample image; inputting the semantic text information into the text recognition neural network model and obtaining a second feature vector, output by the text recognition neural network model, corresponding to the semantic text information; calculating a first loss value based on the first feature vector and the second feature vector; and adjusting parameters of the image recognition neural network model and the text recognition neural network model at least based on the first loss value.
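  • The following is a minimal, non-authoritative sketch of the joint training step described above, written in PyTorch. The encoder architectures, dimensions, and the keyword-id representation of the semantic text are illustrative assumptions; the patent only fixes the pairing of the two models through a shared loss, not the internals.

```python
# Hedged sketch of one training step: both encoders are trained so that the
# image's first feature vector and the text's second feature vector agree.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):        # stand-in image recognition model
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))
    def forward(self, x):
        return self.net(x)

class TextEncoder(nn.Module):         # stand-in text recognition model
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)  # bag of keyword ids
    def forward(self, ids):
        return self.emb(ids)

image_model, text_model = ImageEncoder(), TextEncoder()
optimizer = torch.optim.Adam(
    list(image_model.parameters()) + list(text_model.parameters()), lr=1e-4)

images = torch.randn(8, 3, 64, 64)          # batch of sample images
keywords = torch.randint(0, 1000, (8, 4))   # keyword ids of the semantic text

first_vec = image_model(images)             # first feature vector
second_vec = text_model(keywords)           # second feature vector
# First loss value: Euclidean distance between the two feature vectors.
first_loss = (first_vec - second_vec).norm(dim=1).mean()

optimizer.zero_grad()
first_loss.backward()                       # adjust both models' parameters
optimizer.step()
```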
  • According to another aspect, an image retrieval method based on a neural network model is provided, where the neural network model is obtained through the above training method and includes an image recognition neural network model and a text recognition neural network model. The method includes: inputting an image to be detected into the image recognition neural network model and obtaining an image feature vector output by the image recognition neural network model; and determining, from a database, a first matching image set for the image to be detected based on its image feature vector.
  • According to another aspect, a training apparatus for a neural network model is provided, where the neural network model includes an image recognition neural network model and a text recognition neural network model. The apparatus includes: a first acquisition unit configured to obtain a sample image and semantic text information corresponding to the sample image; a second acquisition unit configured to input the sample image into the image recognition neural network model and obtain a first feature vector, output by the image recognition neural network model, corresponding to the sample image; a third acquisition unit configured to input the semantic text information into the text recognition neural network model and obtain a second feature vector, output by the text recognition neural network model, corresponding to the semantic text information; a calculation unit configured to calculate a first loss value based on the first feature vector and the second feature vector; and a parameter adjustment unit configured to adjust parameters of the image recognition neural network model at least based on the first loss value.
  • According to another aspect, an image retrieval apparatus based on a neural network model is provided, where the neural network model is obtained through the above training method and includes an image recognition neural network model and a text recognition neural network model. The apparatus includes: a first acquisition unit configured to input an image to be detected into the image recognition neural network model and obtain an image feature vector output by the image recognition neural network model; and a first determination unit configured to determine, from a database, a first matching image set for the image to be detected based on its image feature vector.
  • According to another aspect, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above image retrieval method or neural network model training method.
  • According to another aspect, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform the above image retrieval method or neural network model training method.
  • According to another aspect, a computer program product is provided, including a computer program that, when executed by a processor, implements the above image retrieval method or neural network model training method.
  • According to one or more embodiments, the image recognition neural network and the text recognition neural network are trained using sample images and the semantic text information corresponding to those images, so that the image recognition neural network learns the semantic information of the images.
  • In actual use, the trained image recognition neural network is used to obtain image features; because these features contain the semantic information of the image, the accuracy of image retrieval results can be improved.
  • Fig. 1 shows a flowchart of a training method for a neural network model according to an embodiment of the present disclosure;
  • Fig. 2 shows a flowchart of another training method for a neural network model according to an embodiment of the present disclosure;
  • Fig. 3 shows a flowchart of another training method for a neural network model according to an embodiment of the present disclosure;
  • Fig. 4 shows a flowchart of an image retrieval method according to an embodiment of the present disclosure;
  • Fig. 5 shows a flowchart of another image retrieval method according to an embodiment of the present disclosure;
  • Fig. 6 shows a flowchart of another image retrieval method according to an embodiment of the present disclosure;
  • Fig. 7 shows a flowchart of another image retrieval method according to an embodiment of the present disclosure;
  • Fig. 8 shows a flowchart of another image retrieval method according to an embodiment of the present disclosure;
  • Fig. 9 shows a structural block diagram of a training apparatus for a neural network model according to an embodiment of the present disclosure;
  • Fig. 10 shows a structural block diagram of an image retrieval apparatus according to an exemplary embodiment of the present disclosure; and
  • Fig. 11 shows a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
  • The use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional, temporal, or importance relationship of these elements; such terms are only used to distinguish one element from another.
  • In some examples, a first element and a second element may refer to the same instance of the element, while in some cases they may refer to different instances based on the context.
  • In the related art, retrieval techniques that use only image features cannot make full use of the text information that can accompany an image, and therefore cannot adequately learn the correlations among the parts of an image.
  • Text information is itself comprehensive, multi-dimensional supervision; without its participation, a neural network model is relatively weak at representing objects with complex angles and variable shapes (such as clothing).
  • To solve this problem, during training the image recognition neural network and the text recognition neural network are combined, and a sample image and the semantic text information corresponding to it are input respectively, so that the image recognition neural network can better learn the semantic features of images.
  • In actual use, only the trained image recognition neural network is used, and similarity is then computed against the feature vectors stored in the database. This allows the semantic features of the image to be learned better and more accurate results to be output.
  • As shown in Fig. 1, the training method 100 may include: step 101, obtaining a sample image and semantic text information corresponding to the sample image; step 102, inputting the sample image into the image recognition neural network model and obtaining a first feature vector, output by the image recognition neural network model, corresponding to the sample image; step 103, inputting the semantic text information into the text recognition neural network model and obtaining a second feature vector, output by the text recognition neural network model, corresponding to the semantic text information; step 104, calculating a first loss value based on the first feature vector and the second feature vector; and step 105, adjusting parameters of the image recognition neural network model and the text recognition neural network model at least based on the first loss value.
  • In this way, the image recognition neural network can learn the semantic information of the image.
  • In one example, the sample image may be an image of an item, and the semantic text information is richer information that reflects the content of the image.
  • For example, if the sample image includes a mouse, a keyboard, and a monitor, the corresponding semantic text information may include "wireless mouse and keyboard", "liquid crystal display", as well as brand names and colors.
  • As another example, the sample image may be an image of a top, in which case the corresponding semantic text information may include brand names, "top", "short sleeve", "sportswear", and the like.
  • According to some embodiments, before the sample image is input into the image recognition neural network model, the sample image may be preprocessed, and the preprocessed sample image is then input into the image recognition neural network model to obtain the first feature vector corresponding to the sample image.
  • Preprocessing may include resizing the sample image, correcting its angle, and so on.
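  • As an illustration only, preprocessing along these lines could look like the sketch below; the target size, rotation angle, and the helper name `preprocess` are assumptions, not prescribed by the patent.

```python
# Hedged sketch of sample-image preprocessing: resize and angle correction.
from PIL import Image

def preprocess(path, size=(224, 224), angle=0.0):
    """Load a sample image, optionally correct its angle, and resize it."""
    img = Image.open(path).convert("RGB")
    if angle:
        img = img.rotate(angle, expand=True)  # simple angle correction
    return img.resize(size)
```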
  • According to some embodiments, before the semantic text information is input into the text recognition neural network model, at least one keyword in the semantic text information may be obtained, and the at least one keyword corresponding to the semantic text information is input into the text recognition neural network model to obtain the second feature vector corresponding to the semantic text information.
  • For example, and without limitation, the Euclidean distance between the first feature vector and the second feature vector may be calculated and determined as the first loss value, so that the model parameters are adjusted based on the first loss value to make it as small as possible.
  • According to some embodiments, the sample image includes a sample object.
  • As shown in Fig. 2, the training method 200 for the neural network model may include:
  • Step 201: Obtain a sample image and semantic text information corresponding to the sample image.
  • Step 202: Mark the ground-truth bounding box surrounding the sample object in the sample image and the ground-truth label of the sample object.
  • According to some embodiments, at least one keyword of the semantic text information may be obtained, and one or more of those keywords may be used as the ground-truth label.
  • In one example, a sample image of a cat and the semantic text information corresponding to the sample image are obtained, where the semantic text information may include keywords such as "cat", the cat's breed, the cat's color, and the cat's features (such as short legs or short ears).
  • In this scenario, the ground-truth label can be the keyword "cat" included in the semantic text information.
  • Step 203: Input the sample image into the image recognition neural network model, and obtain the first feature vector corresponding to the sample image, the predicted bounding box, and the predicted label of the sample object output by the model.
  • Step 204: Calculate a second loss value based on the predicted bounding box, the predicted label, the ground-truth bounding box, and the ground-truth label.
  • According to some embodiments, this loss value can be calculated from the intersection-over-union between the ground-truth bounding box and the predicted bounding box, from the distance between their centers, or in other ways; no limitation is imposed here.
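  • A minimal sketch of the intersection-over-union option follows; turning it into a loss as `1 - IoU` is a common choice and an assumption here, since the patent leaves the exact formulation open.

```python
# Hedged sketch of a box loss based on intersection-over-union (IoU).
# Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

# Second loss value from the ground-truth and predicted boxes (illustrative).
box_loss = 1.0 - iou((10, 10, 50, 50), (12, 8, 48, 52))
```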
  • Step 205: Input the semantic text information into the text recognition neural network model, and obtain the second feature vector corresponding to the semantic text information output by the model.
  • Step 206: Calculate a first loss value based on the first feature vector and the second feature vector.
  • Step 207: Adjust the parameters of the image recognition neural network model based on the first loss value and the second loss value.
  • Step 208: Adjust the parameters of the text recognition neural network model based on the first loss value.
  • In the above training method, the image recognition neural network model also performs target detection, which allows the model to better extract the image features of the sample object.
  • Step 201, step 205, and step 206 in Fig. 2 are similar to step 101, step 103, and step 104 in Fig. 1, respectively, and are not repeated here.
  • According to some embodiments, as shown in Fig. 3, the training method 300 for the neural network model may include:
  • Step 301: Obtain a sample image and semantic text information corresponding to the sample image.
  • Step 302: Mark the ground-truth label of the sample object in the sample image.
  • Step 303: Determine the foreground area where the sample object is located in the sample image.
  • Step 304: Crop the sample image to obtain a foreground image.
  • Step 305: Input the foreground image into the image recognition neural network model, and obtain the first feature vector corresponding to the sample image and the predicted label of the sample object output by the model.
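  • A sketch of how steps 303 and 304 could be realized is shown below, assuming the foreground region comes from a binary mask produced by some detector; the helper names are hypothetical.

```python
# Hedged sketch: derive the foreground region from a binary mask and crop.
import numpy as np
from PIL import Image

def foreground_box(mask: np.ndarray):
    """Return (left, top, right, bottom) of the nonzero region of the mask."""
    ys, xs = np.nonzero(mask)                 # assumes the mask is non-empty
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def crop_foreground(image: Image.Image, mask: np.ndarray) -> Image.Image:
    """Crop the sample image to the foreground area of the sample object."""
    return image.crop(foreground_box(mask))
```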
  • Step 306: Calculate a third loss value based on the predicted label and the ground-truth label.
  • According to some embodiments, the predicted label and the ground-truth label are mapped into a common space to obtain their respective feature vectors, the Euclidean distance between the two feature vectors is calculated, and the third loss value is obtained from it.
  • Step 307: Input the semantic text information into the text recognition neural network model, and obtain the second feature vector corresponding to the semantic text information output by the model.
  • Step 308: Calculate a first loss value based on the first feature vector and the second feature vector.
  • Step 309: Adjust the parameters of the image recognition neural network model based on the first loss value and the third loss value.
  • According to some embodiments, a loss value represents the gap between a predicted value and the true value. When training the neural network, the loss value can be continuously reduced by iteratively updating all the parameters in the network, so as to train a more accurate neural network model.
  • Step 310: Adjust the parameters of the text recognition neural network model based on the first loss value.
  • Step 301, step 307, and step 308 in Fig. 3 are similar to step 101, step 103, and step 104 in Fig. 1, respectively, and are not repeated here.
  • According to another aspect, an image retrieval method based on a neural network model is provided, where the neural network model is obtained through the above training method and includes an image recognition neural network model and a text recognition neural network model.
  • As shown in Fig. 4, the image retrieval method 400 may include:
  • Step 401: Input the image to be detected into the image recognition neural network model, and obtain the image feature vector output by the image recognition neural network model.
  • In the above solution, the trained image recognition neural network is used to obtain image features; because these features contain the semantic information of the image, the accuracy of the image retrieval result can be improved.
  • In one example, the image recognition neural network model may be a hierarchical transformer model built by introducing the layered construction commonly used in convolutional neural networks. The model combines a CNN with self-attention structures: the front layers use a convolutional neural network with a sliding-window mechanism to extract low-level features, while the deep layers use transformer blocks with a self-attention mechanism to extract high-level features. On image retrieval tasks this yields a very marked improvement.
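  • The sketch below illustrates this hybrid design in PyTorch; every size, depth, and pooling choice is an assumption for illustration, since the patent does not specify the architecture beyond "convolutions in front, self-attention in the deep layers".

```python
# Hedged sketch of a hierarchical CNN + self-attention image encoder.
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, dim=128, heads=4, depth=2):
        super().__init__()
        # Front layers: sliding-window convolutions for low-level features.
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.GELU())
        # Deep layers: transformer blocks with self-attention for
        # high-level features.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        feat = self.conv(x)                       # (B, dim, H', W')
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H'*W', dim)
        return self.attn(tokens).mean(dim=1)      # pooled image feature vector

vec = HybridEncoder()(torch.randn(2, 3, 224, 224))  # (2, 128) feature vectors
```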
  • Step 402: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
  • According to some embodiments, multiple image feature vectors corresponding to multiple images are stored in the database, and the Euclidean distance between the image feature vector of the image to be detected and each of the stored image feature vectors is calculated. The images in the database that match the image to be detected can then be determined based on the corresponding Euclidean distances.
  • According to some embodiments, the image feature vectors stored in the database can likewise be obtained through an image recognition neural network model trained by any of the methods 100, 200, and 300 shown in Figs. 1-3.
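  • A minimal sketch of this distance-based lookup is given below; the vector values and the cutoff of ten results are placeholders, not values from the patent.

```python
# Hedged sketch of step 402: rank database images by Euclidean distance to
# the query's image feature vector.
import numpy as np

db_vectors = np.random.rand(1000, 128)   # feature vectors of database images
query = np.random.rand(128)              # feature vector of image to detect

dists = np.linalg.norm(db_vectors - query, axis=1)
first_matching_set = np.argsort(dists)[:10]  # indices of the closest images
```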
  • According to some embodiments, the image to be detected includes a target object. As shown in Fig. 5, the image retrieval method 500 may include:
  • Step 501: Input the image to be detected into the image recognition neural network model, and obtain the image feature vector, the target bounding box of the target object, and the target label of the target object output by the model.
  • Step 502: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
  • Step 503: Input the target label into the text recognition neural network model, and obtain the text feature vector output by the model.
  • Step 504: Determine at least one matching image of the image to be detected from the first matching image set based on the text feature vector.
  • Step 502 in Fig. 5 is similar to step 402 in Fig. 4, and details are not repeated here.
  • According to some embodiments, the Euclidean distances between the text feature vector output by the text recognition neural network model and the text feature vectors corresponding to the images in the first matching image set determined in step 502 are calculated, and at least one matching image of the input image to be detected is determined from the first matching image set.
  • In this way, after the first matching image set is determined for the image to be detected through the image recognition neural network model, a further determination is made using the semantic text information related to the image to be detected, through the text neural network model trained together with the image recognition neural network model.
  • The finally determined images therefore match the input image to be detected more closely.
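  • One way this two-stage refinement could be coded is sketched below; the candidate ids and stored text features are illustrative placeholders.

```python
# Hedged sketch of steps 503-504: re-rank the first matching set using the
# text feature vector obtained from the target label.
import numpy as np

first_set = np.array([4, 17, 23, 42, 99])       # candidate ids from step 502
text_query = np.random.rand(64)                 # text feature of target label
cand_text = np.random.rand(len(first_set), 64)  # stored text features

order = np.argsort(np.linalg.norm(cand_text - text_query, axis=1))
final_matches = first_set[order[:3]]            # closest matches by text
```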
  • According to some embodiments, as shown in Fig. 6, the image retrieval method 600 includes:
  • Step 601: Input the image to be detected into the image recognition neural network model, and obtain the image feature vector, the target bounding box of the target object, and the target label of the target object output by the model.
  • Step 602: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
  • Step 603: Input the target label into the text recognition neural network model, and obtain the text feature vector output by the model.
  • Step 604: Determine, from the database, a second matching image set for the image to be detected based on the text feature vector.
  • According to some embodiments, the Euclidean distance between the text feature vector and each text feature vector stored in the database is calculated, and qualifying images are selected; together these images form the second matching image set for the image to be detected.
  • According to some embodiments, the text feature vectors stored in the database can likewise be obtained through the text recognition neural network model within a neural network model trained by any of the methods 100, 200, and 300 shown in Figs. 1-3.
  • Step 605: Determine at least one matching image of the image to be detected based on the first matching image set and the second matching image set.
  • According to some embodiments, the image feature vector and the text feature vector are each compared against the data in the database: the Euclidean distances between the image feature vector and the image feature vectors in the database, and between the text feature vector and the text feature vectors in the database, are calculated, and the final matching images are determined from the two comparison results.
  • In one example, the images included in both comparison results are taken as the matching images, or the two comparison results are sorted by similarity and the several highest-scoring images are taken as the final matching images.
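  • The two merge strategies named above can be sketched in plain Python as follows; the function names and the cutoff `k` are illustrative assumptions.

```python
# Hedged sketch of step 605: combine the first and second matching sets.
def intersect(first_set, second_set):
    """Keep only images present in both comparison results."""
    second = set(second_set)
    return [img for img in first_set if img in second]

def top_scored(scored_a, scored_b, k=5):
    """scored_* are (image_id, similarity) pairs; higher means more similar.
    Pool both rankings and keep the k best-scoring images."""
    pooled = {}
    for img, s in list(scored_a) + list(scored_b):
        pooled[img] = max(pooled.get(img, float("-inf")), s)
    return sorted(pooled, key=pooled.get, reverse=True)[:k]
```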
  • Steps 601 to 603 in Fig. 6 are similar to steps 501 to 503 in Fig. 5, and are not repeated here.
  • According to some embodiments, as shown in Fig. 7, the image retrieval method 700 may include:
  • Step 701: Determine the foreground area where the target object in the image to be detected is located.
  • Step 702: Crop the image to be detected to obtain a foreground image.
  • Step 703: Use the foreground image as the input of the image recognition neural network model, and obtain the image feature vector and the target label of the target object output by the model.
  • Step 704: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
  • Step 705: Input the target label into the text recognition neural network model, and obtain the text feature vector output by the model.
  • Step 706: Determine at least one matching image of the image to be detected from the first matching image set based on the text feature vector.
  • Steps 704 to 706 in Fig. 7 are similar to steps 502 to 504 in Fig. 5, and are not repeated here.
  • According to some embodiments, as shown in Fig. 8, the image retrieval method 800 may include:
  • Step 801: Determine the foreground area where the target object in the image to be detected is located.
  • Step 802: Crop the image to be detected to obtain a foreground image.
  • Step 803: Use the foreground image as the input of the image recognition neural network model, and obtain the image feature vector and the target label of the target object output by the model.
  • Step 804: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
  • Step 805: Input the target label into the text recognition neural network model, and obtain the text feature vector output by the model.
  • Step 806: Determine, from the database, a second matching image set for the image to be detected based on the text feature vector.
  • Step 807: Determine at least one matching image of the image to be detected based on the first matching image set and the second matching image set.
  • Using the image retrieval method in the embodiments of the present disclosure makes full use of the semantic text information of images, improving the accuracy of image retrieval results and the user experience.
  • For example, the target image is an image of a top, and the semantic information of the image may include brand names, "top", "short sleeve", "sportswear", and so on.
  • The results retrieved by existing retrieval methods include bottoms bearing the same brand logo, whereas the image retrieval method in the embodiments of the present disclosure can make full use of the semantic text information corresponding to the image, so that the retrieval results include only sports short-sleeved tops bearing the same brand logo. This greatly improves the accuracy of image retrieval results and the user experience.
  • Steps 804 to 807 in Fig. 8 are similar to steps 602 to 605 in Fig. 6, and are not repeated here.
  • According to another aspect, a training apparatus 900 for a neural network model is also provided. As shown in Fig. 9, the neural network model training apparatus 900 includes: a first acquisition unit 901 configured to obtain a sample image and semantic text information corresponding to the sample image; a second acquisition unit 902 configured to input the sample image into the image recognition neural network model and obtain a first feature vector, output by the image recognition neural network model, corresponding to the sample image; a third acquisition unit 903 configured to input the semantic text information into the text recognition neural network model and obtain a second feature vector, output by the text recognition neural network model, corresponding to the semantic text information; a calculation unit 904 configured to calculate a first loss value based on the first feature vector and the second feature vector; and a parameter adjustment unit 905 configured to adjust parameters of the image recognition neural network model at least based on the first loss value.
  • According to some embodiments, the training apparatus 900 further includes: a first labeling unit configured to mark the ground-truth bounding box surrounding the sample object in the sample image and the ground-truth label of the sample object.
  • According to some embodiments, the calculation unit 904 is further configured to calculate a second loss value based on the predicted bounding box, the predicted label, the ground-truth bounding box, and the ground-truth label.
  • According to some embodiments, the parameter adjustment unit 905 includes: a first parameter adjustment subunit configured to adjust parameters of the image recognition neural network model based on the first loss value and the second loss value; and a second parameter adjustment subunit configured to adjust parameters of the text recognition neural network model based on the first loss value.
  • According to some embodiments, the first labeling unit is further configured to obtain at least one keyword of the semantic text information and use one or more of those keywords as the ground-truth label.
  • According to some embodiments, the training apparatus 900 further includes: a determination unit configured to determine, before the sample image is input into the image recognition neural network model, the foreground area where the sample object in the sample image is located; and a cropping unit configured to crop the sample image to obtain a foreground image and use the foreground image as the input of the image recognition neural network model.
  • According to some embodiments, the training apparatus 900 further includes: a second labeling unit configured to mark the ground-truth label of the sample object in the sample image.
  • According to some embodiments, the calculation unit 904 is further configured to calculate a third loss value based on the predicted label and the ground-truth label.
  • According to some embodiments, the parameter adjustment unit 905 includes: a third parameter adjustment subunit configured to adjust parameters of the image recognition neural network model based on the first loss value and the third loss value; and a fourth parameter adjustment subunit configured to adjust parameters of the text recognition neural network model based on the first loss value.
  • According to another aspect, an image retrieval apparatus 1000 based on a neural network model is also provided, where the neural network model is obtained through the above training method and includes an image recognition neural network model and a text recognition neural network model.
  • As shown in Fig. 10, the image retrieval apparatus 1000 includes: a first acquisition unit 1001 configured to input an image to be detected into the image recognition neural network model and obtain an image feature vector output by the image recognition neural network model; and a first determination unit 1002 configured to determine, from a database, a first matching image set for the image to be detected based on its image feature vector.
  • According to some embodiments, the image to be detected includes a target object, and the output of the image recognition neural network model further includes a target bounding box surrounding the target object and a target label of the target object.
  • According to some embodiments, the image retrieval apparatus 1000 further includes: a second acquisition unit configured to input the target label into the text recognition neural network model and obtain the text feature vector output by the text recognition neural network model; and a second determination unit configured to determine at least one matching image of the image to be detected from the first matching image set based on the text feature vector.
  • According to some embodiments, the image retrieval apparatus 1000 further includes: a third acquisition unit configured to input the target label into the text recognition neural network model and obtain the text feature vector output by the text recognition neural network model; a third determination unit configured to determine, from the database, a second matching image set for the image to be detected based on the text feature vector; and a fourth determination unit configured to determine at least one matching image of the image to be detected based on the first matching image set and the second matching image set.
  • According to some embodiments, the image to be detected includes a target object, and the image retrieval apparatus 1000 further includes: a fifth determination unit configured to determine, before the image to be detected is input into the image recognition neural network model, the foreground area where the target object in the image to be detected is located; and a cropping unit configured to crop the image to be detected to obtain a foreground image and use the foreground image as the input of the image recognition neural network model.
  • According to some embodiments, the output of the image recognition neural network model further includes a target label of the target object.
  • According to some embodiments, the image retrieval apparatus 1000 further includes: a fourth acquisition unit configured to input the target label into the text recognition neural network model and obtain a text feature vector output by the text recognition neural network model; and a sixth determination unit configured to determine at least one matching image of the image to be detected from the first matching image set based on the text feature vector.
  • According to some embodiments, the output of the image recognition neural network model further includes the target label of the target object, and the image retrieval apparatus 1000 further includes: a fifth acquisition unit configured to input the target label into the text recognition neural network model and obtain the text feature vector output by the text recognition neural network model; a seventh determination unit configured to determine, from the database, a second matching image set for the image to be detected based on the text feature vector; and an eighth determination unit configured to determine at least one matching image of the image to be detected based on the first matching image set and the second matching image set.
  • The operations of units 1001 and 1002 of the image retrieval apparatus 1000 are similar to the operations of steps 401 and 402 described above and are not repeated here.
  • In the technical solution of the present disclosure, the acquisition, storage, and application of any user personal information involved comply with the relevant laws and regulations and do not violate public order and good morals.
  • According to embodiments of the present disclosure, an electronic device, a readable storage medium, and a computer program product are also provided.
  • Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the disclosure described and/or claimed herein.
  • As shown in Fig. 11, the device 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1102 or loaded from a storage unit 1108 into a random-access memory (RAM) 1103. The RAM 1103 can also store various programs and data necessary for the operation of the device 1100.
  • The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to one another through a bus 1104.
  • An input/output (I/O) interface 1105 is also connected to the bus 1104.
  • The input unit 1106 may be any type of device capable of inputting information to the device 1100; it may receive input digital or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, keyboard, touch screen, trackpad, trackball, joystick, microphone, and/or remote control.
  • The output unit 1107 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer.
  • The storage unit 1108 may include, but is not limited to, magnetic disks and optical disks.
  • The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
  • The computing unit 1101 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, or microcontroller.
  • The computing unit 1101 performs the various methods and processes described above, such as the neural network model training method and the image retrieval method.
  • For example, in some embodiments, the neural network model training method and the image retrieval method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1108.
  • In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109.
  • When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the neural network model training method and the image retrieval method described above can be performed.
  • Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the neural network model training method and the image retrieval method in any other appropriate manner (for example, by means of firmware).
  • Various implementations of the systems and techniques described above can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, partly on the machine and partly on a remote machine as a stand-alone software package, or entirely on the remote machine or server.
  • In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.
  • A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user can be received in any form (including acoustic, speech, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., a data server), or middleware components (e.g., an application server), or front-end components (e.g., a user computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components.
  • The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
  • A computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • The client-server relationship arises from computer programs running on the respective computers and having a client-server relationship to each other.
  • The server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.
  • It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above.
  • For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides an image retrieval method, apparatus, device, and medium, relating to the field of artificial intelligence technology, specifically to computer vision and deep learning, and applicable to scenarios such as image processing and image recognition. The image retrieval method includes: inputting an image to be detected into an image recognition neural network model, and obtaining an image feature vector output by the image recognition neural network model; and determining, from a database, a first matching image set for the image to be detected based on the image feature vector of the image to be detected.

Description

Training method for a neural network model, image retrieval method, device, and medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202110945344.X, filed on August 17, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of artificial intelligence technology, in particular to computer vision and deep learning technology, applicable to scenarios such as image processing and image recognition, and specifically relates to a training method for a neural network model, an image retrieval method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline of using computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technology.
With the spread of the Internet, the advantages of online shopping have become more prominent, and it has increasingly become an important form of shopping. At the same time, searching for products within shopping applications is a prominent user need. Besides keyword search, searching for products by image is currently a primary method.
The methods described in this section are not necessarily methods that have previously been conceived or employed. Unless otherwise indicated, it should not be assumed that any method described in this section qualifies as prior art merely because it is included in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered to have been recognized in any prior art.
Summary
The present disclosure provides a training method for a neural network model, an image retrieval method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to one aspect of the present disclosure, a training method for a neural network model is provided, where the neural network model includes an image recognition neural network model and a text recognition neural network model. The method includes: obtaining a sample image and semantic text information corresponding to the sample image; inputting the sample image into the image recognition neural network model, and obtaining a first feature vector, output by the image recognition neural network model, corresponding to the sample image; inputting the semantic text information into the text recognition neural network model, and obtaining a second feature vector, output by the text recognition neural network model, corresponding to the semantic text information; calculating a first loss value based on the first feature vector and the second feature vector; and adjusting parameters of the image recognition neural network model and the text recognition neural network model at least based on the first loss value.
According to another aspect of the present disclosure, an image retrieval method based on a neural network model is provided, where the neural network model is obtained through the above training method and includes an image recognition neural network model and a text recognition neural network model. The method includes: inputting an image to be detected into the image recognition neural network model, and obtaining an image feature vector output by the image recognition neural network model; and determining, from a database, a first matching image set for the image to be detected based on the image feature vector of the image to be detected.
According to another aspect of the present disclosure, a training apparatus for a neural network model is provided, where the neural network model includes an image recognition neural network model and a text recognition neural network model. The apparatus includes: a first acquisition unit configured to obtain a sample image and semantic text information corresponding to the sample image; a second acquisition unit configured to input the sample image into the image recognition neural network model and obtain a first feature vector, output by the image recognition neural network model, corresponding to the sample image; a third acquisition unit configured to input the semantic text information into the text recognition neural network model and obtain a second feature vector, output by the text recognition neural network model, corresponding to the semantic text information; a calculation unit configured to calculate a first loss value based on the first feature vector and the second feature vector; and a parameter adjustment unit configured to adjust parameters of the image recognition neural network model at least based on the first loss value.
According to another aspect of the present disclosure, an image retrieval apparatus based on a neural network model is provided, where the neural network model is obtained through the above training method and includes an image recognition neural network model and a text recognition neural network model. The apparatus includes: a first acquisition unit configured to input an image to be detected into the image recognition neural network model and obtain an image feature vector output by the image recognition neural network model; and a first determination unit configured to determine, from a database, a first matching image set for the image to be detected based on the image feature vector of the image to be detected.
According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above image retrieval method or the training method for a neural network model.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform the above image retrieval method or the training method for a neural network model.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the above image retrieval method or the training method for a neural network model.
According to one or more embodiments of the present disclosure, training the image recognition neural network and the text recognition neural network with sample images and the semantic text information corresponding to those images enables the image recognition neural network to learn the semantic information of images. In actual use, the trained image recognition neural network is used to obtain image features; because these features contain the semantic information of the image, the accuracy of image retrieval results can be improved.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief description of the drawings
The accompanying drawings exemplarily show embodiments, constitute a part of the specification, and together with the written description serve to explain exemplary implementations of the embodiments. The embodiments shown are for illustration only and do not limit the scope of the claims. Throughout the drawings, the same reference numerals refer to similar but not necessarily identical elements.
Fig. 1 shows a flowchart of a training method for a neural network model according to an embodiment of the present disclosure;
Fig. 2 shows a flowchart of another training method for a neural network model according to an embodiment of the present disclosure;
Fig. 3 shows a flowchart of another training method for a neural network model according to an embodiment of the present disclosure;
Fig. 4 shows a flowchart of an image retrieval method according to an embodiment of the present disclosure;
Fig. 5 shows a flowchart of another image retrieval method according to an embodiment of the present disclosure;
Fig. 6 shows a flowchart of another image retrieval method according to an embodiment of the present disclosure;
Fig. 7 shows a flowchart of another image retrieval method according to an embodiment of the present disclosure;
Fig. 8 shows a flowchart of another image retrieval method according to an embodiment of the present disclosure;
Fig. 9 shows a structural block diagram of a training apparatus for a neural network model according to an embodiment of the present disclosure;
Fig. 10 shows a structural block diagram of an image retrieval apparatus according to an exemplary embodiment of the present disclosure; and
Fig. 11 shows a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; they should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described here without departing from the scope of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
In the present disclosure, unless otherwise stated, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional, temporal, or importance relationship of these elements; such terms are only used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, while in some cases they may refer to different instances based on the context.
The terms used in the description of the various examples in the present disclosure are for the purpose of describing particular examples only and are not intended to be limiting. Unless the context clearly indicates otherwise, if the number of an element is not specifically limited, there may be one or more of that element. In addition, the term "and/or" used in the present disclosure covers any one of the listed items and all possible combinations thereof.
In the related art, retrieval techniques that use only image features cannot make full use of the text information that can accompany an image, and therefore cannot adequately learn the correlations among the parts of an image. Text information is itself comprehensive, multi-dimensional supervision; without its participation, a neural network model is relatively weak at representing objects with complex angles and variable shapes (such as clothing).
To solve the above problem, during training of the neural network model the image recognition neural network and the text recognition neural network are combined, and a sample image and the semantic text information corresponding to it are input respectively, so that the image recognition neural network can better learn the semantic features of images. In actual use, only the trained image recognition neural network is used, and similarity is then computed against the feature vectors stored in the database. This allows the semantic features of the image to be learned better and more accurate results to be output.
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
According to one aspect of the present disclosure, a training method for a neural network model is provided. As shown in Fig. 1, the training method 100 may include: step 101, obtaining a sample image and semantic text information corresponding to the sample image; step 102, inputting the sample image into the image recognition neural network model, and obtaining a first feature vector, output by the image recognition neural network model, corresponding to the sample image; step 103, inputting the semantic text information into the text recognition neural network model, and obtaining a second feature vector, output by the text recognition neural network model, corresponding to the semantic text information; step 104, calculating a first loss value based on the first feature vector and the second feature vector; and step 105, adjusting parameters of the image recognition neural network model and the text recognition neural network model at least based on the first loss value. By training the image recognition neural network and the text recognition neural network with sample images and the corresponding semantic text information, the image recognition neural network can learn the semantic information of images.
In one example, the sample image may be an image of an item, and the semantic text information is richer information that reflects the content of the image. For example, if the sample image includes a mouse, a keyboard, and a monitor, the corresponding semantic text information may include "wireless mouse and keyboard", "liquid crystal display", as well as brand names and colors. As another example, the sample image may be an image of a top, in which case the corresponding semantic text information may include brand names, "top", "short sleeve", "sportswear", and the like.
According to some embodiments, before the sample image is input into the image recognition neural network model, the sample image may be preprocessed, and the preprocessed sample image is then input into the image recognition neural network model to obtain the first feature vector corresponding to the sample image. Preprocessing may include resizing the sample image, correcting its angle, and so on.
According to some embodiments, before the semantic text information is input into the text recognition neural network model, at least one keyword in the semantic text information may be obtained, and the at least one keyword corresponding to the semantic text information is input into the text recognition neural network model to obtain the second feature vector corresponding to the semantic text information.
For example, and without limitation, the Euclidean distance between the first feature vector and the second feature vector may be calculated and determined as the first loss value, so that the model parameters are adjusted based on the first loss value to make it as small as possible.
According to some embodiments, the sample image includes a sample object. As shown in Fig. 2, the training method 200 for the neural network model may include:
Step 201: Obtain a sample image and semantic text information corresponding to the sample image.
Step 202: Mark the ground-truth bounding box surrounding the sample object in the sample image and the ground-truth label of the sample object.
According to some embodiments, at least one keyword of the semantic text information may be obtained, and one or more of those keywords may be used as the ground-truth label.
In one example, a sample image of a cat and the semantic text information corresponding to the sample image are obtained, where the semantic text information may include keywords such as "cat", the cat's breed, the cat's color, and the cat's features (such as short legs or short ears). In this scenario, the ground-truth label can be the keyword "cat" included in the semantic text information.
Step 203: Input the sample image into the image recognition neural network model, and obtain the first feature vector corresponding to the sample image, the predicted bounding box, and the predicted label of the sample object output by the model.
Step 204: Calculate a second loss value based on the predicted bounding box, the predicted label, the ground-truth bounding box, and the ground-truth label.
According to some embodiments, this loss value can be calculated from the intersection-over-union between the ground-truth bounding box and the predicted bounding box, from the distance between their centers, or in other ways; no limitation is imposed here.
Step 205: Input the semantic text information into the text recognition neural network model, and obtain the second feature vector corresponding to the semantic text information output by the model.
Step 206: Calculate a first loss value based on the first feature vector and the second feature vector.
Step 207: Adjust the parameters of the image recognition neural network model based on the first loss value and the second loss value.
Step 208: Adjust the parameters of the text recognition neural network model based on the first loss value.
In the above training method, the image recognition neural network model also performs target detection, which allows the model to better extract the image features of the sample object.
Step 201, step 205, and step 206 in Fig. 2 are similar to step 101, step 103, and step 104 in Fig. 1, respectively, and are not repeated here.
According to some embodiments, as shown in Fig. 3, the training method 300 for the neural network model may include:
Step 301: Obtain a sample image and semantic text information corresponding to the sample image.
Step 302: Mark the ground-truth label of the sample object in the sample image.
Step 303: Determine the foreground area where the sample object is located in the sample image.
Step 304: Crop the sample image to obtain a foreground image.
Step 305: Input the foreground image into the image recognition neural network model, and obtain the first feature vector corresponding to the sample image and the predicted label of the sample object output by the model.
Step 306: Calculate a third loss value based on the predicted label and the ground-truth label.
According to some embodiments, the predicted label and the ground-truth label are mapped into a common space to obtain their respective feature vectors, the Euclidean distance between the two feature vectors is calculated, and the third loss value is obtained from it.
Step 307: Input the semantic text information into the text recognition neural network model, and obtain the second feature vector corresponding to the semantic text information output by the model.
Step 308: Calculate a first loss value based on the first feature vector and the second feature vector.
Step 309: Adjust the parameters of the image recognition neural network model based on the first loss value and the third loss value.
According to some embodiments, the parameters are adjusted based on the first loss value obtained in step 308 and the third loss value, obtained in step 306, that represents the gap between the predicted label and the ground-truth label; a loss value represents the difference between a predicted value and the true value. When training the neural network, the loss value can be continuously reduced by iteratively updating all the parameters in the network, so as to train a more accurate neural network model.
Step 310: Adjust the parameters of the text recognition neural network model based on the first loss value.
Step 301, step 307, and step 308 in Fig. 3 are similar to step 101, step 103, and step 104 in Fig. 1, respectively, and are not repeated here.
According to another aspect of the present disclosure, an image retrieval method based on a neural network model is also provided, where the neural network model is obtained through the above training method and includes an image recognition neural network model and a text recognition neural network model. As shown in Fig. 4, the image retrieval method 400 may include:
Step 401: Input the image to be detected into the image recognition neural network model, and obtain the image feature vector output by the image recognition neural network model.
In the above solution, the trained image recognition neural network is used to obtain image features; because these features contain the semantic information of the image, the accuracy of the image retrieval result can be improved.
In one example, the image recognition neural network model may be a hierarchical transformer model built by introducing the layered construction commonly used in convolutional neural networks. The model combines a CNN with self-attention structures: the front layers use a convolutional neural network with a sliding-window mechanism to extract low-level features, while the deep layers use transformer blocks with a self-attention mechanism to extract high-level features, which yields a very marked improvement on image retrieval tasks.
Step 402: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
According to some embodiments, multiple image feature vectors corresponding to multiple images are stored in the database, and the Euclidean distance between the image feature vector of the image to be detected and each of the stored image feature vectors is calculated. The images in the database that match the image to be detected can then be determined based on the corresponding Euclidean distances.
According to some embodiments, the image feature vectors stored in the database can likewise be obtained through an image recognition neural network model trained by any of the methods 100, 200, and 300 shown in Figs. 1-3.
According to some embodiments, the image to be detected includes a target object. As shown in Fig. 5, the image retrieval method 500 may include:
Step 501: Input the image to be detected into the image recognition neural network model, and obtain the image feature vector, the target bounding box of the target object, and the target label of the target object output by the model.
Step 502: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
Step 503: Input the target label into the text recognition neural network model, and obtain the text feature vector output by the model.
Step 504: Determine at least one matching image of the image to be detected from the first matching image set based on the text feature vector.
Step 502 in Fig. 5 is similar to step 402 in Fig. 4 and is not repeated here.
According to some embodiments, the Euclidean distances between the text feature vector output by the text recognition neural network model and the text feature vectors corresponding to the images in the first matching image set determined in step 502 are calculated, and at least one matching image of the input image to be detected is determined from the first matching image set.
In this way, after the first matching image set is determined for the image to be detected through the image recognition neural network model, a further determination is made using the semantic text information related to the image to be detected, through the text neural network model trained together with the image recognition neural network model; the finally determined images therefore match the input image to be detected more closely.
According to some embodiments, as shown in Fig. 6, the image retrieval method 600 includes:
Step 601: Input the image to be detected into the image recognition neural network model, and obtain the image feature vector, the target bounding box of the target object, and the target label of the target object output by the model.
Step 602: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
Step 603: Input the target label into the text recognition neural network model, and obtain the text feature vector output by the model.
Step 604: Determine, from the database, a second matching image set for the image to be detected based on the text feature vector.
According to some embodiments, the Euclidean distance between the text feature vector and each text feature vector stored in the database is calculated, and qualifying images are selected; together these images form the second matching image set for the image to be detected.
According to some embodiments, the text feature vectors stored in the database can likewise be obtained through the text recognition neural network model within a neural network model trained by any of the methods 100, 200, and 300 shown in Figs. 1-3.
Step 605: Determine at least one matching image of the image to be detected based on the first matching image set and the second matching image set.
According to some embodiments, the image feature vector and the text feature vector are each compared against the data in the database: the Euclidean distances between the image feature vector and the image feature vectors in the database, and between the text feature vector and the text feature vectors in the database, are calculated, and the final matching images are determined from the two comparison results.
In one example, the images included in both comparison results are taken as the matching images, or the two comparison results are sorted by similarity and the several highest-scoring images are taken as the final matching images.
Steps 601 to 603 in Fig. 6 are similar to steps 501 to 503 in Fig. 5 and are not repeated here.
According to some embodiments, as shown in Fig. 7, the image retrieval method 700 may include:
Step 701: Determine the foreground area where the target object in the image to be detected is located.
Step 702: Crop the image to be detected to obtain a foreground image.
Step 703: Use the foreground image as the input of the image recognition neural network model, and obtain the image feature vector and the target label of the target object output by the model.
Step 704: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
Step 705: Input the target label into the text recognition neural network model, and obtain the text feature vector output by the model.
Step 706: Determine at least one matching image of the image to be detected from the first matching image set based on the text feature vector.
Steps 704 to 706 in Fig. 7 are similar to steps 502 to 504 in Fig. 5 and are not repeated here.
According to some embodiments, as shown in Fig. 8, the image retrieval method 800 may include:
Step 801: Determine the foreground area where the target object in the image to be detected is located.
Step 802: Crop the image to be detected to obtain a foreground image.
Step 803: Use the foreground image as the input of the image recognition neural network model, and obtain the image feature vector and the target label of the target object output by the model.
Step 804: Determine, from the database, a first matching image set for the image to be detected based on its image feature vector.
Step 805: Input the target label into the text recognition neural network model, and obtain the text feature vector output by the model.
Step 806: Determine, from the database, a second matching image set for the image to be detected based on the text feature vector.
Step 807: Determine at least one matching image of the image to be detected based on the first matching image set and the second matching image set.
Using the image retrieval method in the embodiments of the present disclosure makes full use of the semantic text information of images, improving the accuracy of image retrieval results and the user experience.
For example, the target image is an image of a top, and the semantic information of the image may include brand names, "top", "short sleeve", "sportswear", and so on. The results retrieved by existing retrieval methods include bottoms bearing the same brand logo, whereas the image retrieval method in the embodiments of the present disclosure can make full use of the semantic text information corresponding to the image, so that the retrieval results include only sports short-sleeved tops bearing the same brand logo, greatly improving the accuracy of image retrieval results and the user experience.
Steps 804 to 807 in Fig. 8 are similar to steps 602 to 605 in Fig. 6 and are not repeated here.
According to another aspect of the present disclosure, a training apparatus 900 for a neural network model is also provided. As shown in Fig. 9, the neural network model training apparatus 900 includes: a first acquisition unit 901 configured to obtain a sample image and semantic text information corresponding to the sample image; a second acquisition unit 902 configured to input the sample image into the image recognition neural network model and obtain a first feature vector, output by the image recognition neural network model, corresponding to the sample image; a third acquisition unit 903 configured to input the semantic text information into the text recognition neural network model and obtain a second feature vector, output by the text recognition neural network model, corresponding to the semantic text information; a calculation unit 904 configured to calculate a first loss value based on the first feature vector and the second feature vector; and a parameter adjustment unit 905 configured to adjust parameters of the image recognition neural network model at least based on the first loss value.
According to some embodiments, the training apparatus 900 further includes: a first labeling unit configured to mark the ground-truth bounding box surrounding the sample object in the sample image and the ground-truth label of the sample object.
According to some embodiments, the calculation unit 904 is further configured to calculate a second loss value based on the predicted bounding box, the predicted label, the ground-truth bounding box, and the ground-truth label.
According to some embodiments, the parameter adjustment unit 905 includes: a first parameter adjustment subunit configured to adjust parameters of the image recognition neural network model based on the first loss value and the second loss value; and a second parameter adjustment subunit configured to adjust parameters of the text recognition neural network model based on the first loss value.
According to some embodiments, the first labeling unit is further configured to obtain at least one keyword of the semantic text information and use one or more of those keywords as the ground-truth label.
According to some embodiments, the training apparatus 900 further includes: a determination unit configured to determine, before the sample image is input into the image recognition neural network model, the foreground area where the sample object in the sample image is located; and a cropping unit configured to crop the sample image to obtain a foreground image and use the foreground image as the input of the image recognition neural network model.
According to some embodiments, the training apparatus 900 further includes: a second labeling unit configured to mark the ground-truth label of the sample object in the sample image.
According to some embodiments, the calculation unit 904 is further configured to calculate a third loss value based on the predicted label and the ground-truth label.
According to some embodiments, the parameter adjustment unit 905 includes: a third parameter adjustment subunit configured to adjust parameters of the image recognition neural network model based on the first loss value and the third loss value; and a fourth parameter adjustment subunit configured to adjust parameters of the text recognition neural network model based on the first loss value.
The operations of units 901 to 905 of the training apparatus 900 are similar to the operations of steps 101 to 105 described above and are not repeated here.
According to another aspect of the present disclosure, an image retrieval apparatus 1000 based on a neural network model is also provided, where the neural network model is obtained through the above training method and includes an image recognition neural network model and a text recognition neural network model. As shown in Fig. 10, the image retrieval apparatus 1000 includes: a first acquisition unit 1001 configured to input an image to be detected into the image recognition neural network model and obtain an image feature vector output by the image recognition neural network model; and a first determination unit 1002 configured to determine, from a database, a first matching image set for the image to be detected based on its image feature vector.
According to some embodiments, the image to be detected includes a target object, and the output of the image recognition neural network model further includes a target bounding box surrounding the target object and a target label of the target object.
According to some embodiments, the image retrieval apparatus 1000 further includes: a second acquisition unit configured to input the target label into the text recognition neural network model and obtain the text feature vector output by the text recognition neural network model; and a second determination unit configured to determine at least one matching image of the image to be detected from the first matching image set based on the text feature vector.
According to some embodiments, the image retrieval apparatus 1000 further includes: a third acquisition unit configured to input the target label into the text recognition neural network model and obtain the text feature vector output by the text recognition neural network model; a third determination unit configured to determine, from the database, a second matching image set for the image to be detected based on the text feature vector; and a fourth determination unit configured to determine at least one matching image of the image to be detected based on the first matching image set and the second matching image set.
According to some embodiments, the image to be detected includes a target object, and the image retrieval apparatus 1000 further includes: a fifth determination unit configured to determine, before the image to be detected is input into the image recognition neural network model, the foreground area where the target object in the image to be detected is located; and a cropping unit configured to crop the image to be detected to obtain a foreground image and use the foreground image as the input of the image recognition neural network model.
According to some embodiments, the output of the image recognition neural network model further includes a target label of the target object.
According to some embodiments, the image retrieval apparatus 1000 further includes: a fourth acquisition unit configured to input the target label into the text recognition neural network model and obtain a text feature vector output by the text recognition neural network model; and a sixth determination unit configured to determine at least one matching image of the image to be detected from the first matching image set based on the text feature vector.
According to some embodiments, the output of the image recognition neural network model further includes the target label of the target object, and the image retrieval apparatus 1000 further includes: a fifth acquisition unit configured to input the target label into the text recognition neural network model and obtain the text feature vector output by the text recognition neural network model; a seventh determination unit configured to determine, from the database, a second matching image set for the image to be detected based on the text feature vector; and an eighth determination unit configured to determine at least one matching image of the image to be detected based on the first matching image set and the second matching image set.
The operations of units 1001 and 1002 of the image retrieval apparatus 1000 are similar to the operations of steps 401 and 402 described above and are not repeated here.
In the technical solutions of the present disclosure, the acquisition, storage, application and other processing of any user personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, an electronic device, a readable storage medium and a computer program product are further provided.
Referring to FIG. 11, a structural block diagram of an electronic device 1100 that can serve as a server or a client of the present disclosure will now be described; it is an example of a hardware device that can be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 11, the device 1100 includes a computing unit 1101, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a random access memory (RAM) 1103. The RAM 1103 may also store various programs and data required for the operation of the device 1100. The computing unit 1101, the ROM 1102 and the RAM 1103 are connected to one another via a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
Multiple components of the device 1100 are connected to the I/O interface 1105, including an input unit 1106, an output unit 1107, the storage unit 1108 and a communication unit 1109. The input unit 1106 may be any type of device capable of inputting information to the device 1100; it may receive input numeric or character information and generate key signal inputs related to the user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone and/or a remote control. The output unit 1107 may be any type of device capable of presenting information, and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator and/or a printer. The storage unit 1108 may include, but is not limited to, a magnetic disk and an optical disc. The communication unit 1109 allows the device 1100 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver and/or chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device and/or the like.
The computing unit 1101 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 1101 performs the various methods and processes described above, such as the neural network model training method and the image retrieval method. For example, in some embodiments, the neural network model training method and the image retrieval method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the neural network model training method and the image retrieval method described above can be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the neural network model training method and the image retrieval method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (for example, a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input or tactile input.
The systems and techniques described herein may be implemented in a computing system including a back-end component (for example, as a data server), or a computing system including a middleware component (for example, an application server), or a computing system including a front-end component (for example, a user computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system including any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include a local area network (LAN), a wide area network (WAN) and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps recorded in the present disclosure may be performed in parallel, sequentially or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
Although the embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it should be understood that the above-described methods, systems and devices are merely exemplary embodiments or examples, and the scope of the present invention is not limited by these embodiments or examples but only by the allowed claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced by equivalent elements. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (25)

  1. A training method for a neural network model, the neural network model comprising an image recognition neural network model and a text recognition neural network model, the method comprising:
    acquiring a sample image and semantic text information corresponding to the sample image;
    inputting the sample image into the image recognition neural network model, and obtaining a first feature vector corresponding to the sample image output by the image recognition neural network model;
    inputting the semantic text information into the text recognition neural network model, and obtaining a second feature vector corresponding to the semantic text information output by the text recognition neural network model;
    calculating a first loss value based on the first feature vector and the second feature vector; and
    adjusting parameters of the image recognition neural network model and the text recognition neural network model at least based on the first loss value.
  2. The method of claim 1, wherein the sample image comprises a sample object, and the output of the image recognition neural network model further comprises a predicted bounding box and a predicted label of the sample object,
    and the method further comprises:
    labeling a ground-truth bounding box enclosing the sample object in the sample image and a ground-truth label of the sample object;
    calculating a second loss value based on the predicted bounding box, the predicted label, the ground-truth bounding box and the ground-truth label,
    wherein adjusting the parameters of the image recognition neural network model and the text recognition neural network model at least based on the first loss value comprises:
    adjusting the parameters of the image recognition neural network model based on the first loss value and the second loss value; and
    adjusting the parameters of the text recognition neural network model based on the first loss value.
  3. The method of claim 2, further comprising:
    acquiring at least one keyword of the semantic text information, and taking one or more of the at least one keyword as the ground-truth label.
  4. The method of claim 1, wherein the sample image comprises a sample object, and the method further comprises:
    before inputting the sample image into the image recognition neural network model, determining a foreground region in which the sample object is located in the sample image;
    cropping the sample image to obtain a foreground image, and taking the foreground image as the input of the image recognition neural network model.
  5. The method of claim 4, wherein the output of the image recognition neural network model further comprises a predicted label of the sample object,
    and the method further comprises:
    labeling a ground-truth label of the sample object in the sample image;
    calculating a third loss value based on the predicted label and the ground-truth label,
    wherein adjusting the parameters of the image recognition neural network model and the text recognition neural network model at least based on the first loss value comprises:
    adjusting the parameters of the image recognition neural network model based on the first loss value and the third loss value; and
    adjusting the parameters of the text recognition neural network model based on the first loss value.
  6. An image retrieval method based on a neural network model, the neural network model being obtained by training with the training method of any one of claims 1-5, the neural network model comprising an image recognition neural network model and a text recognition neural network model, the method comprising:
    inputting an image to be detected into the image recognition neural network model, and obtaining an image feature vector output by the image recognition neural network model; and
    determining, based on the image feature vector of the image to be detected, a first matching image set for the image to be detected from a database.
  7. The method of claim 6, wherein the image to be detected comprises a target object, and the output of the image recognition neural network model further comprises a target bounding box enclosing the target object and a target label of the target object,
    and the method further comprises:
    inputting the target label into the text recognition neural network model, and obtaining a text feature vector output by the text recognition neural network model; and
    determining, based on the text feature vector, at least one matching image for the image to be detected from the first matching image set.
  8. The method of claim 6, wherein the image to be detected comprises a target object, and the output of the image recognition neural network model further comprises a target bounding box enclosing the target object and a target label of the target object,
    and the method further comprises:
    inputting the target label into the text recognition neural network model, and obtaining a text feature vector output by the text recognition neural network model;
    determining, based on the text feature vector, a second matching image set for the image to be detected from the database; and
    determining, based on the first matching image set and the second matching image set, at least one matching image for the image to be detected.
  9. The method of claim 6, wherein the image to be detected comprises a target object, and the method further comprises:
    before inputting the image to be detected into the image recognition neural network model, determining a foreground region in which the target object is located in the image to be detected;
    cropping the image to be detected to obtain a foreground image, and taking the foreground image as the input of the image recognition neural network model.
  10. The method of claim 9, wherein the output of the image recognition neural network model further comprises a target label of the target object,
    and the method further comprises:
    inputting the target label into the text recognition neural network model, and obtaining a text feature vector output by the text recognition neural network model; and
    determining, based on the text feature vector, at least one matching image for the image to be detected from the first matching image set.
  11. The method of claim 9, wherein the output of the image recognition neural network model further comprises a target label of the target object,
    and the method further comprises:
    inputting the target label into the text recognition neural network model, and obtaining a text feature vector output by the text recognition neural network model;
    determining, based on the text feature vector, a second matching image set for the image to be detected from the database; and
    determining, based on the first matching image set and the second matching image set, at least one matching image for the image to be detected.
  12. A training apparatus for a neural network model, the neural network model comprising an image recognition neural network model and a text recognition neural network model, the apparatus comprising:
    a first acquisition unit configured to acquire a sample image and semantic text information corresponding to the sample image;
    a second acquisition unit configured to input the sample image into the image recognition neural network model and obtain a first feature vector corresponding to the sample image output by the image recognition neural network model;
    a third acquisition unit configured to input the semantic text information into the text recognition neural network model and obtain a second feature vector corresponding to the semantic text information output by the text recognition neural network model;
    a calculation unit configured to calculate a first loss value based on the first feature vector and the second feature vector; and
    a parameter adjustment unit configured to adjust parameters of the image recognition neural network model at least based on the first loss value.
  13. The apparatus of claim 12, wherein the sample image comprises a sample object, and the output of the image recognition neural network model further comprises a predicted bounding box and a predicted label of the sample object,
    and the apparatus further comprises:
    a first labeling unit configured to label a ground-truth bounding box enclosing the sample object in the sample image and a ground-truth label of the sample object,
    wherein the calculation unit is further configured to calculate a second loss value based on the predicted bounding box, the predicted label, the ground-truth bounding box and the ground-truth label,
    wherein the parameter adjustment unit comprises:
    a first parameter adjustment subunit configured to adjust the parameters of the image recognition neural network model based on the first loss value and the second loss value; and
    a second parameter adjustment subunit configured to adjust the parameters of the text recognition neural network model based on the first loss value.
  14. The apparatus of claim 12, wherein the first labeling unit is further configured to acquire at least one keyword of the semantic text information and take one or more of the at least one keyword as the ground-truth label.
  15. The apparatus of claim 12, wherein the sample image comprises a sample object, and the apparatus further comprises:
    a determination unit configured to determine, before the sample image is input into the image recognition neural network model, a foreground region in which the sample object is located in the sample image; and
    a cropping unit configured to crop the sample image to obtain a foreground image and take the foreground image as the input of the image recognition neural network model.
  16. The apparatus of claim 12, wherein the output of the image recognition neural network model further comprises a predicted label of the sample object,
    and the apparatus further comprises:
    a second labeling unit configured to label a ground-truth label of the sample object in the sample image,
    wherein the calculation unit is further configured to calculate a third loss value based on the predicted label and the ground-truth label,
    wherein the parameter adjustment unit comprises:
    a third parameter adjustment subunit configured to adjust the parameters of the image recognition neural network model based on the first loss value and the third loss value; and
    a fourth parameter adjustment subunit configured to adjust the parameters of the text recognition neural network model based on the first loss value.
  17. An image retrieval apparatus based on a neural network model, the neural network model being obtained by training with the training method of any one of claims 1-5, the neural network model comprising an image recognition neural network model and a text recognition neural network model, the apparatus comprising:
    a first acquisition unit configured to input an image to be detected into the image recognition neural network model and obtain an image feature vector output by the image recognition neural network model; and
    a first determination unit configured to determine, based on the image feature vector of the image to be detected, a first matching image set for the image to be detected from a database.
  18. The apparatus of claim 17, wherein the image to be detected comprises a target object, and the output of the image recognition neural network model further comprises a target bounding box enclosing the target object and a target label of the target object,
    and the apparatus further comprises:
    a second acquisition unit configured to input the target label into the text recognition neural network model and obtain a text feature vector output by the text recognition neural network model; and
    a second determination unit configured to determine, based on the text feature vector, at least one matching image for the image to be detected from the first matching image set.
  19. The apparatus of claim 17, wherein the image to be detected comprises a target object, and the output of the image recognition neural network model further comprises a target bounding box enclosing the target object and a target label of the target object,
    and the apparatus further comprises:
    a third acquisition unit configured to input the target label into the text recognition neural network model and obtain a text feature vector output by the text recognition neural network model;
    a third determination unit configured to determine, based on the text feature vector, a second matching image set for the image to be detected from the database; and
    a fourth determination unit configured to determine, based on the first matching image set and the second matching image set, at least one matching image for the image to be detected.
  20. The apparatus of claim 17, wherein the image to be detected comprises a target object, and the apparatus further comprises:
    a fifth determination unit configured to determine, before the image to be detected is input into the image recognition neural network model, a foreground region in which the target object is located in the image to be detected;
    a cropping unit configured to crop the image to be detected to obtain a foreground image and take the foreground image as the input of the image recognition neural network model.
  21. The apparatus of claim 20, wherein the output of the image recognition neural network model further comprises a target label of the target object,
    and the apparatus further comprises:
    a fourth acquisition unit configured to input the target label into the text recognition neural network model and obtain a text feature vector output by the text recognition neural network model; and
    a sixth determination unit configured to determine, based on the text feature vector, at least one matching image for the image to be detected from the first matching image set.
  22. The apparatus of claim 20, wherein the output of the image recognition neural network model further comprises a target label of the target object,
    and the apparatus further comprises:
    a fifth acquisition unit configured to input the target label into the text recognition neural network model and obtain a text feature vector output by the text recognition neural network model;
    a seventh determination unit configured to determine, based on the text feature vector, a second matching image set for the image to be detected from the database; and
    an eighth determination unit configured to determine, based on the first matching image set and the second matching image set, at least one matching image for the image to be detected.
  23. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
  24. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method of any one of claims 1-11.
  25. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-11.
PCT/CN2022/089626 2021-08-17 2022-04-27 Neural network model training method, image retrieval method, device, and medium WO2023020005A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2022573483A JP2023541752A (ja) 2021-08-17 2022-04-27 Neural network model training method, image retrieval method, device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110945344.X 2021-08-17
CN202110945344.XA CN113656582B (zh) 2021-08-17 2021-08-17 Neural network model training method, image retrieval method, device, and medium

Publications (1)

Publication Number Publication Date
WO2023020005A1 (zh) 2023-02-23

Family

ID=78492122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089626 WO2023020005A1 (zh) 2021-08-17 2022-04-27 Neural network model training method, image retrieval method, device, and medium

Country Status (3)

Country Link
JP (1) JP2023541752A (zh)
CN (1) CN113656582B (zh)
WO (1) WO2023020005A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656582B (zh) 2021-08-17 2022-11-18 北京百度网讯科技有限公司 Neural network model training method, image retrieval method, device, and medium
CN114118379B (zh) * 2021-12-02 2023-03-24 北京百度网讯科技有限公司 Neural network training method, image processing method, apparatus, device, and medium
CN114155543B (zh) * 2021-12-08 2022-11-29 北京百度网讯科技有限公司 Neural network training method, document image understanding method, apparatus, and device
CN114612749B (zh) * 2022-04-20 2023-04-07 北京百度网讯科技有限公司 Neural network model training method and apparatus, electronic device, and medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095829B (zh) * 2016-06-01 2019-08-06 华侨大学 Cross-media retrieval method based on deep learning and consistent representation space learning
CN107730343A (zh) * 2017-09-15 2018-02-23 广州唯品会研究院有限公司 User commodity information pushing method and device based on picture attribute extraction
CN108062421A (zh) * 2018-01-09 2018-05-22 焦点科技股份有限公司 Large-scale multi-scale semantic image retrieval method
CN111860084B (zh) * 2019-04-30 2024-04-16 千寻位置网络有限公司 Image feature matching and positioning method and apparatus, and positioning system
CN112163114B (zh) * 2020-09-10 2024-03-22 华中科技大学 Feature-fusion-based image retrieval method
CN112784912A (zh) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Image recognition method and apparatus, and neural network model training method and apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104298749A (zh) * 2014-10-14 2015-01-21 杭州淘淘搜科技有限公司 Commodity retrieval method fusing image vision and text semantics
US11017019B1 (en) * 2015-08-14 2021-05-25 Shutterstock, Inc. Style classification for authentic content search
CN109522967A (zh) * 2018-11-28 2019-03-26 广州逗号智能零售有限公司 Commodity positioning and identification method, apparatus, device, and storage medium
CN110866140A (zh) * 2019-11-26 2020-03-06 腾讯科技(深圳)有限公司 Image feature extraction model training method, image search method, and computer device
CN112364195A (zh) * 2020-10-22 2021-02-12 天津大学 Zero-shot image retrieval method based on attribute-guided adversarial hashing network
CN112612913A (zh) * 2020-12-28 2021-04-06 厦门市美亚柏科信息股份有限公司 Image search method and system
CN113656582A (zh) * 2021-08-17 2021-11-16 北京百度网讯科技有限公司 Neural network model training method, image retrieval method, device, and medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311271A (zh) * 2023-03-22 2023-06-23 北京百度网讯科技有限公司 Text image processing method and apparatus
CN116311271B (zh) * 2023-03-22 2023-12-26 北京百度网讯科技有限公司 Text image processing method and apparatus
CN116612204A (zh) * 2023-06-01 2023-08-18 北京百度网讯科技有限公司 Image generation method, training method, apparatus, electronic device, and storage medium
CN116612204B (zh) * 2023-06-01 2024-05-03 北京百度网讯科技有限公司 Image generation method, training method, apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
JP2023541752A (ja) 2023-10-04
CN113656582A (zh) 2021-11-16
CN113656582B (zh) 2022-11-18

Similar Documents

Publication Publication Date Title
WO2023020005A1 (zh) 2023-02-23 Neural network model training method, image retrieval method, device, and medium
US10043109B1 (en) Attribute similarity-based search
US20230005284A1 (en) Method for training image-text matching model, computing device, and storage medium
US20210312211A1 (en) Training method of image-text matching model, bi-directional search method, and relevant apparatus
US11069338B2 (en) Interactive method, interactive terminal, storage medium, and computer device
KR101768521B1 (ko) Method and system for providing information data on an object included in an image
US10642887B2 (en) Multi-modal image ranking using neural networks
US9875445B2 (en) Dynamic hybrid models for multimodal analysis
AU2015259118B2 (en) Natural language image search
US20190034814A1 (en) Deep multi-task representation learning
WO2020143314A1 (zh) Search-engine-based question answering method and apparatus, storage medium, and computer device
CN110446063B (zh) Video cover generation method and apparatus, and electronic device
KR101754473B1 (ko) Method and system for summarizing and providing a document as image-based content
EP3872652B1 (en) Method and apparatus for processing video, electronic device, medium and product
WO2022161302A1 (zh) Action recognition method and apparatus, device, storage medium, and computer program product
CN111831826A (zh) Training method for a cross-domain text classification model, classification method, and apparatus
US20230290174A1 (en) Weakly supervised semantic parsing
CN112765387A (zh) Image retrieval method, image retrieval apparatus, and electronic device
KR20190118108A (ko) Electronic device and control method thereof
CN114840734B (zh) Training method for a multimodal representation model, cross-modal retrieval method, and apparatus
WO2023142406A1 (zh) Ranking method, ranking model training method and apparatus, electronic device, and medium
CN115131604A (zh) Multi-label image classification method and apparatus, electronic device, and storage medium
Zhang et al. Application and analysis of image recognition technology based on Artificial Intelligence--machine learning algorithm as an example
JP2023531759A (ja) Lane boundary detection model training method and apparatus, electronic device, storage medium, and computer program
CN113642481A (zh) Recognition method, training method, apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022573483

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22857323

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE