CN115858839B - Cross-modal LOGO retrieval method, system, terminal and storage medium - Google Patents


Info

Publication number
CN115858839B
CN115858839B (application CN202310121845.5A)
Authority
CN
China
Prior art keywords
logo
cross
modal
image
modal data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310121845.5A
Other languages
Chinese (zh)
Other versions
CN115858839A (en)
Inventor
Kong Ou (孔欧)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mido Technology Co ltd
Original Assignee
Shanghai Mdata Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mdata Information Technology Co ltd filed Critical Shanghai Mdata Information Technology Co ltd
Priority claimed from CN202310121845.5A
Publication of application CN115858839A
Application granted; publication of granted patent CN115858839B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a cross-modal LOGO retrieval method, system, terminal and storage medium, comprising the following steps: inputting cross-modal data information; filtering the cross-modal data information to obtain cross-modal data information containing a LOGO; extracting feature values of the LOGO-containing cross-modal data information and determining the LOGO categories it contains; and performing a cross-modal LOGO retrieval task based on the feature values and the LOGO categories. The cross-modal LOGO retrieval method, system, terminal and storage medium can detect and identify a specific LOGO comprehensively and accurately, realizing cross-modal LOGO retrieval between voice-modality data and image-modality data; by filtering out cross-modal data that contains no LOGO, invalid computation is eliminated and LOGO retrieval efficiency is improved.

Description

Cross-modal LOGO retrieval method, system, terminal and storage medium
Technical Field
The application belongs to the technical field of LOGO recognition, and in particular relates to a cross-modal LOGO retrieval method, system, terminal and storage medium.
Background
A trademark (LOGO) is the identifying symbol of an enterprise or organization and represents the strength and uniqueness of a brand. LOGO retrieval shows great commercial value and application prospects in fields such as commodity image search, sports event sponsorship, marketing campaign evaluation, intelligent traffic management and intellectual property protection.
Existing LOGO retrieval methods basically perform retrieval on single-modality data; for example, similarity matching for a specific LOGO within one modality is realized by a text retrieval method or an image retrieval method. However, information carriers on the Internet now tend to be multi-source and multi-modal: a commercial advertisement, for instance, contains textual descriptions as well as product pictures and audio. Information collected from a single modality is therefore incomplete, and the recall ratio is low. In addition, because a LOGO is generally small and often suffers from deformation and uneven illumination depending on the shooting angle, retrieval that relies on a single modality also has low accuracy.
Disclosure of Invention
The invention aims to provide a cross-modal LOGO retrieval method, system, terminal and storage medium to solve the technical problem that LOGO information cannot be retrieved across modalities in the prior art.
In a first aspect, the present application provides a cross-modal LOGO retrieval method, comprising the following steps: inputting cross-modal data information; filtering the cross-modal data information to obtain cross-modal data information containing a LOGO; extracting feature values of the LOGO-containing cross-modal data information and determining the LOGO categories it contains; and performing a cross-modal LOGO retrieval task based on the feature values and the LOGO categories.
In one implementation of the first aspect, the cross-modal data information includes voice-modality data and image-modality data.
In this method, a specific LOGO can be detected and identified comprehensively and accurately, realizing cross-modal LOGO retrieval between voice-modality data and image-modality data.
In one implementation of the first aspect, filtering the cross-modal data information to obtain cross-modal data information containing the LOGO comprises the following steps:
when voice-modality data is input, recognizing the voice-modality data with a pre-trained speech recognition model to obtain a text recognition result;
judging whether the text recognition result contains any member of a preset LOGO set;
if so, processing the text recognition result with a pre-trained speech synthesis model to obtain filtered voice-modality data containing the LOGO; otherwise, not continuing with the subsequent steps.
In this implementation, filtering out voice-modality data that contains no LOGO eliminates invalid computation and improves LOGO retrieval efficiency.
In one implementation of the first aspect, filtering the cross-modal data information to obtain cross-modal data information containing the LOGO comprises the following steps:
when image-modality data is input, recognizing the image-modality data with a pre-trained image recognition model to obtain an image recognition result;
judging whether the image recognition result contains any member of a preset LOGO set;
if so, processing the image recognition result with a pre-trained LOGO recognition model to obtain filtered image-modality data containing the LOGO; otherwise, not continuing with the subsequent steps.
In this implementation, filtering out image-modality data that contains no LOGO eliminates invalid computation and improves LOGO retrieval efficiency.
In one implementation of the first aspect, processing the image recognition result with a pre-trained LOGO recognition model to obtain filtered image-modality data containing the LOGO comprises the following steps:
obtaining the positioning information of the LOGO images in the image recognition result with the pre-trained LOGO recognition model;
cropping the image recognition result based on the positioning information of the LOGO images to obtain the LOGO images;
and taking the LOGO images as the filtered image-modality data containing the LOGO.
In one implementation of the first aspect, extracting the feature values of the LOGO-containing cross-modal data information and determining the LOGO categories it contains comprises the following steps:
extracting the feature values of the filtered voice-modality data and determining the LOGO categories in the filtered voice-modality data;
and extracting the feature values of the filtered image-modality data and determining the LOGO categories in the filtered image-modality data.
In one implementation of the first aspect, performing the cross-modal LOGO retrieval task based on the feature values and the LOGO categories comprises the following steps:
calculating a first similarity based on the feature values of the filtered voice-modality data and the feature values of the filtered image-modality data;
calculating a second similarity based on the LOGO categories in the filtered voice-modality data and the LOGO categories in the filtered image-modality data;
and fusing the first similarity and the second similarity into a LOGO similarity, and performing the cross-modal information retrieval task based on the LOGO similarity.
In a second aspect, the present application provides a cross-modal LOGO retrieval system comprising:
the data input module is used for inputting cross-modal data information;
the data screening module is used for filtering the cross-modal data information to obtain cross-modal data information containing LOGO;
the feature extraction module is used for extracting the feature value of the cross-modal data information containing LOGO and determining the LOGO category in the cross-modal data information containing LOGO;
and the LOGO retrieval module is used for realizing a cross-mode LOGO retrieval task based on the characteristic value and the LOGO category.
The beneficial effects of the system of the second aspect correspond to those described above for the technical features of the first aspect.
In a third aspect, the present application provides a cross-modal LOGO retrieval terminal, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, so that the cross-modal LOGO search terminal executes any one of the cross-modal LOGO search methods.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the above-described cross-modal LOGO retrieval method.
As described above, the cross-modal LOGO retrieval method, system, terminal and storage medium have the following beneficial effects.
(1) A specific LOGO can be detected and identified comprehensively and accurately, realizing cross-modal LOGO retrieval between voice-modality data and image-modality data.
(2) By filtering out cross-modal data that contains no LOGO, invalid computation is eliminated and LOGO retrieval efficiency is improved.
Drawings
FIG. 1 is a flow chart illustrating a cross-modal LOGO retrieval method in accordance with an embodiment of the present application.
FIG. 2 is a schematic diagram of a cross-modal LOGO retrieval system in accordance with an embodiment of the present application.
FIG. 3 is a schematic diagram of a cross-modal LOGO retrieval system in accordance with an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a cross-modal LOGO search terminal according to an embodiment of the present application.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.
It should be noted that, the illustrations provided in the following embodiments merely illustrate the basic concepts of the application by way of illustration, and only the components related to the application are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complex.
In addition, descriptions such as "first" and "second" are provided for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may also be combined with each other, but only where the combination can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, their combination should be regarded as non-existent and outside the protection scope of the present application.
The following embodiments of the present application provide a cross-modal LOGO retrieval method, which can be applied to a terminal device. The terminal in the present application may include a mobile phone, a tablet computer, a notebook computer, a wearable device, a vehicle-mounted device, an augmented reality (Augmented Reality, AR)/virtual reality (Virtual Reality, VR) device, an ultra-mobile personal computer (Ultra-Mobile Personal Computer, UMPC), a netbook, a personal digital assistant (Personal Digital Assistant, PDA), etc. with a wireless charging function, and the specific type of the terminal is not limited in this embodiment of the present application.
For example, the terminal may be a Station (ST) in a wireless-charging-enabled WLAN, a wireless-charging-enabled cellular telephone, a cordless telephone, a Session Initiation Protocol (Session Initiation Protocol, SIP) telephone, a wireless local loop (Wireless Local Loop, WLL) station, a personal digital assistant (Personal Digital Assistant, PDA) device, a wireless-charging-enabled handheld device, a computing device or other processing device, a computer, a laptop computer, a handheld communication device, a handheld computing device, and/or other devices for communicating over a wireless system, as well as next-generation communication systems, such as a mobile terminal in a 5G network, a mobile terminal in a future evolved public land mobile network (Public Land Mobile Network, PLMN), or a mobile terminal in a future evolved non-terrestrial network (Non-Terrestrial Network, NTN), etc.
For example, the terminal may communicate with the network and other devices via wireless communications. The wireless communications may use any communication standard or protocol including, but not limited to, Global System for Mobile communications (Global System for Mobile communications, GSM), General Packet Radio Service (General Packet Radio Service, GPRS), Code Division Multiple Access (Code Division Multiple Access, CDMA), Wideband Code Division Multiple Access (Wideband Code Division Multiple Access, WCDMA), Long Term Evolution (Long Term Evolution, LTE), email, Short Message Service (Short Messaging Service, SMS), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include the Global Positioning System (Global Positioning System, GPS), the Global Navigation Satellite System (Global Navigation Satellite System, GLONASS), the BeiDou Navigation Satellite System (BeiDou Navigation Satellite System, BDS), the Quasi-Zenith Satellite System (Quasi-Zenith Satellite System, QZSS) and/or Satellite Based Augmentation Systems (Satellite Based Augmentation Systems, SBAS). The following describes the technical solutions in the embodiments of the present application in detail with reference to the drawings.
As shown in fig. 1, the present embodiment provides a cross-modal LOGO searching method, which includes the following steps.
S1, cross-mode data information is input.
In an embodiment, the cross-modal data information includes voice-modality data and image-modality data.
The voice-modality data includes speech features, such as speaking frequency, intensity, speed and clarity, which reflect the speaker's emotion, as well as semantic features and lexical features, which reflect the content the speaker wants to express. In the embodiment of the application, semantic features and lexical features are extracted from the voice-modality data of a general speech library; the general speech library contains sentences or speech segments embedded with LOGO information, and the semantic and lexical features are associated with that LOGO information.
The image-modality data includes a variety of graphical image features, such as color features, texture features, shape features and spatial-relationship features. Similarly to the semantic and lexical features extracted from voice-modality data, the embodiment of the application extracts visual feature vectors from the image-modality data of a general image library; the general image library contains graphics or images embedded with LOGO information, and the visual feature vectors are associated with that LOGO information.
S2, filtering the cross-modal data information to obtain cross-modal data information containing LOGO.
In one embodiment, filtering the cross-modal data information to obtain cross-modal data information including LOGO includes the following steps.
S211, when voice-modality data is input, the voice-modality data is recognized with a pre-trained speech recognition model to obtain a text recognition result.
Specifically, a Conformer model is used as the pre-trained speech recognition model in the embodiment of the application. The Conformer model uses automatic speech recognition (Automatic Speech Recognition, ASR) technology to convert voice-modality data into text-modality data. In this embodiment, the input of the Conformer model is the voice-modality data and the output is the text recognition result A. The processing of the voice-modality data includes judging whether the text recognition result A is null; if it is non-null, the text recognition result A is segmented with the Jieba word segmenter to obtain the segmented text recognition result A; otherwise, the subsequent steps are not performed. The segmented text recognition result A contains B Chinese words (i.e., B word feature vectors).
S212, judging whether the text recognition result contains any member of a preset LOGO set.
Specifically, the preset LOGO set covers representative brands among sports brands, electronics brands, automobiles, foods, handbags, cosmetics and other commodities, and can be collected by shooting in real scenes or by crawling the Internet with a web crawler. The modality of the LOGOs in the preset LOGO set is not limited, so the segmented text recognition result A can be compared with every LOGO in the preset LOGO set to determine whether it matches any of them. In addition, the types and number of LOGOs in the set must be chosen reasonably to ensure the accuracy and comprehensiveness of the retrieval results.
S213, if so, the text recognition result is processed with a pre-trained speech synthesis model to obtain filtered voice-modality data containing the LOGO; otherwise, the subsequent steps are not continued.
Specifically, a VITS model is used as the pre-trained speech synthesis model in the embodiment of the application. VITS is an end-to-end speech synthesis model that maps characters or phonemes directly to waveforms; it adopts a GAN-like adversarial training scheme, which effectively improves synthesis quality. In this embodiment, if any of the B word feature vectors in the segmented text recognition result A belongs to the preset LOGO set, the matching word feature vectors are taken as the input of the VITS model, which synthesizes N high-quality audio signals, where N is a natural number and N ≤ B; these audio signals constitute the filtered voice-modality data containing the LOGO. If none of the B word feature vectors belongs to the preset LOGO set, the subsequent steps are not continued. In this implementation, filtering out voice-modality data that contains no LOGO eliminates invalid computation and improves LOGO retrieval efficiency.
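The control flow of S211 to S213 can be sketched as follows. This is a minimal illustration only: `recognize_speech`, `segment` and `synthesize` are hypothetical stand-ins for the Conformer ASR, Jieba segmentation and VITS synthesis steps, and the LOGO set is made up.

```python
# Sketch of the speech-filtering branch (S211-S213). The model calls are
# placeholders (hypothetical functions), so only the control flow is shown.

LOGO_SET = {"nike", "adidas", "apple"}  # hypothetical preset LOGO set

def recognize_speech(audio):
    # stand-in for Conformer ASR; returns a transcript string (may be empty)
    return audio.get("transcript", "")

def segment(text):
    # stand-in for Jieba word segmentation
    return text.split()

def synthesize(word):
    # stand-in for VITS synthesis; returns one audio signal per matched word
    return {"word": word, "waveform": [0.0]}

def filter_speech_modality(audio):
    text = recognize_speech(audio)          # S211: speech recognition
    if not text:                            # null transcript: stop here
        return None
    words = segment(text)
    hits = [w for w in words if w.lower() in LOGO_SET]  # S212: LOGO check
    if not hits:                            # no LOGO: skip later steps
        return None
    return [synthesize(w) for w in hits]    # S213: N filtered audio signals

result = filter_speech_modality({"transcript": "buy Nike shoes today"})
```

Inputs without a LOGO (or with an empty transcript) return `None`, which is the "do not continue with the subsequent steps" branch.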
In one embodiment, filtering the cross-modal data information to obtain cross-modal data information including LOGO includes the following steps.
S221, when image-modality data is input, the image-modality data is recognized with a pre-trained image recognition model to obtain an image recognition result.
Specifically, a YOLOv5 model is used as the pre-trained image recognition model in the embodiment of the application. The YOLOv5 model applies a single neural network to the entire image, divides the image into a grid, and predicts class probabilities and bounding boxes for each cell, achieving automated extraction and classification of image features. Compared with YOLOv4, YOLOv5 has smaller model weights, shorter training time and faster inference without a loss of detection accuracy. In this embodiment, the input of the YOLOv5 model is the image-modality data and the output is the image recognition result C.
S222, judging whether the image recognition result contains any member of the preset LOGO set.
Specifically, the preset LOGO set may be the same as that described in S212 above. The modality of the LOGOs in the set is not limited, so the image recognition result C can be compared with every LOGO in the preset LOGO set to determine whether it matches any of them. Likewise, the types and number of LOGOs must be chosen reasonably to ensure the accuracy and comprehensiveness of the retrieval results. It should also be noted that, to better handle LOGO retrieval in complex natural environments, LOGO information can be deliberately collected under different indoor and outdoor illumination, angle and occlusion conditions when building the data set.
S223, if so, the image recognition result is processed with a pre-trained LOGO recognition model to obtain filtered image-modality data containing the LOGO; otherwise, the subsequent steps are not continued.
In one embodiment, processing the image recognition result with a pre-trained LOGO recognition model to obtain filtered image-modality data containing the LOGO comprises the following steps:
obtaining the positioning information of the LOGO images in the image recognition result with the pre-trained LOGO recognition model; cropping the image recognition result based on the positioning information of the LOGO images to obtain the LOGO images; and taking the LOGO images as the filtered image-modality data containing the LOGO.
Specifically, the pre-trained LOGO recognition model obtains M sets of LOGO coordinates (xmin, ymin, xmax, ymax) from the image recognition result C, and M LOGO images are cropped from the image-modality data based on these coordinates to obtain the filtered image-modality data containing the LOGO. In this implementation, filtering out image-modality data that contains no LOGO eliminates invalid computation and improves LOGO retrieval efficiency.
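The cropping step can be sketched with numpy array slicing, assuming the (xmin, ymin, xmax, ymax) boxes described above; note that numpy indexes rows (y) before columns (x). The image and box values here are dummies.

```python
# Minimal sketch of cropping LOGO regions from an image using the
# (xmin, ymin, xmax, ymax) boxes a LOGO recognition model would supply.
import numpy as np

def crop_logos(image, boxes):
    """image: H x W x C array; boxes: list of (xmin, ymin, xmax, ymax)."""
    # rows are sliced by y, columns by x
    return [image[ymin:ymax, xmin:xmax] for (xmin, ymin, xmax, ymax) in boxes]

image = np.zeros((100, 200, 3), dtype=np.uint8)  # dummy 100x200 RGB image
crops = crop_logos(image, [(10, 20, 60, 50)])    # one detected LOGO box
# each crop keeps height = ymax - ymin and width = xmax - xmin
```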
S3, extracting the feature values of the LOGO-containing cross-modal data information, and determining the LOGO categories it contains.
In one embodiment, this comprises the following steps.
S31, extracting the feature values of the filtered voice-modality data, and determining the LOGO categories in the filtered voice-modality data.
Specifically, a Wav2Vec model is used to obtain the feature values of each audio signal. Wav2Vec is an unsupervised pre-trained speech model: it is trained on a large amount of unlabeled speech data, so it can map raw speech samples to a feature space that better represents the data. In this embodiment, the input of the Wav2Vec model is a single audio signal and the output is a [Q, 768] speech feature matrix, where Q is a natural number; after a Global Average Pooling operation, a [1, 768] vector of 768 speech feature values is obtained. For N audio signals, N such [1, 768] speech feature vectors are obtained.
A 2-layer fully connected head (Fully Connected layers) is built to determine the LOGO categories in the filtered voice-modality data. Its input is the [1, 768] vector of 768 speech feature values, and its output is the LOGO category set corresponding to those feature values. For N audio signals, N LOGO category sets are output.
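The pooling and classification path of S31 can be sketched with numpy. The [Q, 768] matrix stands in for Wav2Vec output; the weights are random placeholders, and the class count K is an assumption, so only the shapes and dataflow are meaningful.

```python
# Sketch of S31: a [Q, 768] speech feature matrix is reduced by global
# average pooling to [1, 768], then classified by a 2-layer fully
# connected head. All weights are random stand-ins, not trained values.
import numpy as np

rng = np.random.default_rng(0)
Q, D, H, K = 37, 768, 256, 10           # K = number of LOGO classes (assumed)

features = rng.normal(size=(Q, D))       # stand-in for Wav2Vec frame features
pooled = features.mean(axis=0, keepdims=True)  # global average pooling -> [1, 768]

W1, b1 = rng.normal(size=(D, H)) * 0.01, np.zeros(H)
W2, b2 = rng.normal(size=(H, K)) * 0.01, np.zeros(K)

hidden = np.maximum(pooled @ W1 + b1, 0.0)     # first FC layer + ReLU
logits = hidden @ W2 + b2                      # second FC layer -> [1, K] scores
predicted_class = int(logits.argmax())
```

For N audio signals, this is simply repeated N times, yielding N pooled vectors and N category predictions.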
S32, extracting the feature values of the filtered image-modality data, and determining the LOGO categories in the filtered image-modality data.
Specifically, a ViT model is used to obtain the feature values of the filtered image-modality data and the LOGO categories it contains. The ViT model cuts the image into fixed-size patches, treats the patches as words, applies a linear projection and positional encoding, and feeds them into a Transformer model, thereby realizing image feature extraction and classification. In this embodiment, the input of the ViT model is a single LOGO image, and the output is a [1, 768] vector of 768 image feature values together with the corresponding LOGO category set. For M LOGO images, M [1, 768] image feature vectors and M LOGO category sets are output.
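The ViT patching step described above can be sketched with numpy reshapes: each fixed-size patch is flattened into one token. The 32x32 image and 16-pixel patch size are arbitrary choices for illustration (conveniently, 16 x 16 x 3 = 768, matching the feature dimension used in this embodiment).

```python
# Sketch of ViT-style patching: cut an image into fixed-size patches and
# flatten each patch into a token, analogous to words in a sentence.
import numpy as np

def image_to_patches(image, patch=16):
    """image: H x W x C with H, W divisible by patch; returns [N, patch*patch*C]."""
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)              # gather each patch's pixels
    return x.reshape(-1, patch * patch * C)     # one flattened token per patch

img = np.arange(32 * 32 * 3).reshape(32, 32, 3).astype(float)
tokens = image_to_patches(img, patch=16)        # 4 tokens of length 768
```

In a real ViT these tokens would then receive a linear projection and positional encodings before entering the Transformer.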
S4, realizing a cross-mode LOGO retrieval task based on the characteristic value and the LOGO category.
In one embodiment, implementing a cross-modal LOGO retrieval task based on the feature values and the LOGO category includes the following steps.
S41, calculating the first similarity based on the feature values of the filtered voice-modality data and the feature values of the filtered image-modality data.
Specifically, cosine similarity is used to measure the distance between the feature values of the filtered voice-modality data and those of the filtered image-modality data. The former comprises N [1, 768] speech feature vectors and the latter M [1, 768] image feature vectors; computing the similarity of every pair requires N × M cosine similarity calculations, and the largest value is selected as the first similarity Branch_2_simi.
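S41 can be sketched as a cosine-similarity matrix over all N x M pairs, keeping the maximum. The feature vectors below are random placeholders; only the pairwise computation is the point.

```python
# Sketch of S41: cosine similarity between every speech feature vector and
# every image feature vector (N x M comparisons); the maximum becomes
# Branch_2_simi. Feature values here are random stand-ins.
import numpy as np

def cosine_matrix(A, B):
    """A: [N, D], B: [M, D] -> [N, M] matrix of cosine similarities."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

rng = np.random.default_rng(1)
speech_feats = rng.normal(size=(3, 768))   # N = 3 filtered audio signals
image_feats = rng.normal(size=(2, 768))    # M = 2 cropped LOGO images

sims = cosine_matrix(speech_feats, image_feats)  # 3 x 2 similarity matrix
branch_2_simi = float(sims.max())                # first similarity
```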
S42, calculating the second similarity based on the LOGO categories in the filtered voice-modality data and the LOGO categories in the filtered image-modality data.
Specifically, cosine similarity is used to measure the distance between the LOGO categories in the filtered voice-modality data and those in the filtered image-modality data. There are N LOGO category sets in the filtered voice-modality data and M in the filtered image-modality data; computing the similarity of every pair requires N × M cosine similarity calculations, and the largest value is selected as the second similarity Branch_3_simi.
S43, fusing the first similarity and the second similarity into a LOGO similarity, and performing the cross-modal information retrieval task based on the LOGO similarity.
Specifically, the first similarity and the second similarity are fused by the formula Branch_2_simi × 0.7 + Branch_3_simi × 0.3; the result is taken as the final LOGO similarity between the audio and the image, and mutual retrieval between audio and images is realized based on this LOGO similarity.
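The fusion step is a fixed weighted sum; the 0.7 and 0.3 weights come from the text, while the example similarity values are made up.

```python
# The fusion in S43, as given: Branch_2_simi * 0.7 + Branch_3_simi * 0.3.
# The weights are from the embodiment; the inputs below are example values.

def fuse_similarity(branch_2_simi, branch_3_simi, w_feat=0.7, w_cls=0.3):
    """Weighted fusion of the feature-level and class-level similarities."""
    return branch_2_simi * w_feat + branch_3_simi * w_cls

logo_simi = fuse_similarity(0.9, 0.6)   # 0.9*0.7 + 0.6*0.3 = 0.81
```

Retrieval then ranks candidate audio/image pairs by this fused LOGO similarity.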
The protection scope of the cross-modal LOGO retrieval method described in the embodiments of the present application is not limited to the execution order of the steps listed herein; any scheme realized in the prior art by adding, removing or replacing steps according to the principles of the present application falls within the protection scope of the present application.
As shown in fig. 2 and 3, the present embodiment provides a cross-modal LOGO retrieval system, which includes the following modules.
The data input module 1 is used for inputting cross-modal data information.
And the data screening module 2 is used for filtering the cross-modal data information to obtain the cross-modal data information containing LOGO.
The feature extraction module 3 is configured to extract a feature value of the cross-modal data information including the LOGO, and determine a LOGO category in the cross-modal data information including the LOGO.
And the LOGO retrieval module 4 is used for realizing a cross-mode LOGO retrieval task based on the characteristic value and the LOGO category.
It should be noted that the structures and principles of the data input module 1, the data screening module 2, the feature extraction module 3 and the LOGO retrieval module 4 correspond one-to-one with the steps of the cross-modal LOGO retrieval method above, so they are not described again here.
The cross-modal LOGO retrieval system provided in this embodiment can implement the cross-modal LOGO retrieval method described in the present application, but the devices implementing that method are not limited to the structure of the system listed in this embodiment; all structural modifications and substitutions made in the prior art according to the principles of the present application fall within the protection scope of the present application.
As shown in fig. 4, this embodiment provides a cross-modal LOGO search terminal, including: a processor 51 and a memory 52.
The memory 52 is used for storing a computer program.
The processor 51 is configured to execute the computer program stored in the memory, so that the cross-modal LOGO retrieval terminal performs the cross-modal LOGO retrieval method.
Preferably, the processor 51 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The embodiment of the application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the above cross-modal LOGO retrieval method.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing a processor, where the program may be stored in a computer-readable storage medium, and the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof. The storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center containing one or more integrated available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)).
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, or methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the purposes of the embodiments of the present application. For example, functional modules/units in various embodiments of the present application may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Embodiments of the present application may also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in the embodiments of the present application are implemented in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, or data center to another by wired means (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, or microwave).
When executed by a computer, the computer program product causes the computer to perform the method of the preceding method embodiment. The computer program product may be a software installation package, which may be downloaded and executed on a computer whenever the aforementioned method is required.
The description of each process or structure corresponding to a drawing has its own emphasis; for parts of a process or structure not described in detail, refer to the related descriptions of the other processes or structures.
In summary, the cross-modal LOGO retrieval method, system, terminal and storage medium provided by this application solve the technical problem that LOGO information cannot be retrieved across modalities in the prior art; they can comprehensively and accurately detect and identify a specific LOGO, and realize cross-modal LOGO retrieval between voice modality data and image modality data. By filtering out cross-modal data that contains no LOGO, invalid processing is eliminated and LOGO retrieval efficiency is improved.
The foregoing embodiments merely illustrate the principles and effects of the present application and are not intended to limit it. Those of ordinary skill in the art may modify or vary the above embodiments without departing from the spirit and scope of the present application. Accordingly, all equivalent modifications and variations accomplished by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of this application.

Claims (9)

1. A cross-modal LOGO retrieval method, characterized by comprising the following steps:
inputting cross-modal data information;
filtering the cross-modal data information to obtain cross-modal data information containing LOGO;
extracting feature values of the cross-modal data information containing a LOGO, and determining the LOGO categories in the cross-modal data information containing a LOGO;
performing a cross-modal LOGO retrieval task based on the feature values and the LOGO categories;
wherein performing the cross-modal LOGO retrieval task based on the feature values and the LOGO categories comprises the following steps:
calculating a first similarity based on the feature value of the filtered voice modality data and the feature value of the filtered image modality data;
calculating a second similarity based on the LOGO category in the filtered voice modality data and the LOGO category in the filtered image modality data;
and fusing the first similarity and the second similarity to generate a LOGO similarity, and performing the cross-modal information retrieval task based on the LOGO similarity.
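The fusion step above can be sketched as follows. This is a minimal illustration, assuming cosine similarity for the feature-value comparison, an exact-match indicator for the category comparison, and a weighted sum with weight `alpha` as the fusion rule; none of these choices are fixed by the claim itself:

```python
import math

def cosine_similarity(a, b):
    # First similarity: cosine similarity between the two modalities' feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def category_similarity(voice_logo, image_logo):
    # Second similarity: 1.0 when both modalities resolve to the same LOGO category.
    return 1.0 if voice_logo == image_logo else 0.0

def fused_logo_similarity(voice_feat, image_feat, voice_logo, image_logo, alpha=0.5):
    # Fuse the two similarities into one LOGO similarity score.
    s1 = cosine_similarity(voice_feat, image_feat)
    s2 = category_similarity(voice_logo, image_logo)
    return alpha * s1 + (1 - alpha) * s2
```

Retrieval would then rank candidate pairs by this fused score; the weight `alpha` is a tunable assumption, not a value taken from the patent.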
2. The cross-modal LOGO retrieval method as claimed in claim 1, wherein the cross-modal data information comprises voice modality data and image modality data.
3. The cross-modal LOGO retrieval method as claimed in claim 1, wherein filtering the cross-modal data information to obtain cross-modal data information containing LOGO comprises the steps of:
when voice modality data is input, recognizing the voice modality data by using a pre-trained speech recognition model to obtain a text recognition result;
judging whether the text recognition result contains any member of a preset LOGO set;
and if yes, processing the text recognition result by using a pre-trained speech synthesis model to obtain filtered voice modality data containing a LOGO; otherwise, not continuing with the subsequent steps.
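The voice-filtering branch above reduces to a keyword check on the transcript. A minimal sketch, assuming the speech recognition output is already available as a transcript string (the pre-trained ASR and speech synthesis models are not specified in the claim, so they are left out):

```python
def filter_voice_modality(transcript, logo_set):
    """Keep voice modality data only if its transcript mentions a preset LOGO.

    `transcript` stands in for the text recognition result produced by the
    pre-trained speech recognition model; the speech synthesis step that
    would rebuild the filtered audio is omitted. Returns the matched LOGO
    names, or None to signal that subsequent steps should be skipped.
    """
    hits = [logo for logo in sorted(logo_set) if logo in transcript]
    return hits or None
```

A `None` result corresponds to the claim's "otherwise, not continuing with the subsequent steps".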
4. The cross-modal LOGO retrieval method as claimed in claim 1, wherein filtering the cross-modal data information to obtain cross-modal data information containing LOGO comprises the steps of:
when image modality data is input, recognizing the image modality data by using a pre-trained image recognition model to obtain an image recognition result;
judging whether the image recognition result contains any member of the preset LOGO set;
and if yes, processing the image recognition result by using a pre-trained LOGO recognition model to obtain filtered image modality data containing a LOGO; otherwise, not continuing with the subsequent steps.
5. The cross-modal LOGO retrieval method as claimed in claim 4, wherein processing the image recognition result by using a pre-trained LOGO recognition model to obtain filtered image modality data containing a LOGO comprises the following steps:
obtaining positioning information of the LOGO image in the image recognition result by using the pre-trained LOGO recognition model;
cropping the image recognition result based on the positioning information of the LOGO image to obtain the LOGO image in the image recognition result;
and taking the LOGO image as the filtered image modality data containing a LOGO.
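The cropping step above amounts to cutting the located region out of the recognized image. A sketch, assuming the positioning information is an `(x, y, w, h)` box and the image a row-major 2D array; the patent fixes neither format:

```python
def crop_logo(image, box):
    # Cut the LOGO region out of the image recognition result.
    # `box` is assumed (x, y, width, height) positioning information.
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]
```

In practice the same slice would be applied to a real image array (e.g. a NumPy tensor or a Pillow image) rather than nested lists.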
6. The method of claim 1, wherein extracting the feature values of the cross-modal data information containing a LOGO and determining the LOGO categories in the cross-modal data information containing a LOGO comprises the following steps:
extracting the feature value of the filtered voice modality data, and determining the LOGO category in the filtered voice modality data;
and extracting the feature value of the filtered image modality data, and determining the LOGO category in the filtered image modality data.
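The two extractions above run identically on each filtered modality. A sketch with hypothetical `feature_extractor` and `classifier` callables, since the patent names no concrete models:

```python
def describe_modality(feature_extractor, classifier, sample):
    # Bundle one filtered sample's feature value and LOGO category
    # for the downstream retrieval stage.
    return {"features": feature_extractor(sample), "logo": classifier(sample)}
```

The retrieval stage would then compare the `features` and `logo` entries of the voice-side and image-side results, as in claim 1.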
7. A cross-modal LOGO retrieval system comprising:
the data input module is used for inputting cross-modal data information;
the data screening module is used for filtering the cross-modal data information to obtain cross-modal data information containing LOGO;
the feature extraction module is used for extracting the feature value of the cross-modal data information containing LOGO and determining the LOGO category in the cross-modal data information containing LOGO;
the LOGO retrieval module is used for performing a cross-modal LOGO retrieval task based on the feature values and the LOGO categories;
wherein performing the cross-modal LOGO retrieval task based on the feature values and the LOGO categories comprises the following steps:
calculating a first similarity based on the feature value of the filtered voice modality data and the feature value of the filtered image modality data;
calculating a second similarity based on the LOGO category in the filtered voice modality data and the LOGO category in the filtered image modality data;
and fusing the first similarity and the second similarity to generate a LOGO similarity, and performing the cross-modal information retrieval task based on the LOGO similarity.
8. A cross-modal LOGO retrieval terminal comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, so that the cross-modal LOGO search terminal performs the cross-modal LOGO search method as claimed in any one of claims 1 to 6.
9. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the cross-modal LOGO retrieval method of any one of claims 1 to 6.
CN202310121845.5A 2023-02-16 2023-02-16 Cross-modal LOGO retrieval method, system, terminal and storage medium Active CN115858839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310121845.5A CN115858839B (en) 2023-02-16 2023-02-16 Cross-modal LOGO retrieval method, system, terminal and storage medium


Publications (2)

Publication Number Publication Date
CN115858839A CN115858839A (en) 2023-03-28
CN115858839B true CN115858839B (en) 2023-05-30

Family

ID=85658215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310121845.5A Active CN115858839B (en) 2023-02-16 2023-02-16 Cross-modal LOGO retrieval method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN115858839B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132925B (en) * 2023-10-26 2024-02-06 成都索贝数码科技股份有限公司 Intelligent stadium method and device for sports event

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3020845C (en) * 2016-04-14 2023-06-20 Ader Bilgisayar Hizmetleri Ve Ticaret A.S. Content based search and retrieval of trademark images
CN107562812B (en) * 2017-08-11 2021-01-15 北京大学 Cross-modal similarity learning method based on specific modal semantic space modeling
CN110209844B (en) * 2019-05-17 2021-08-31 腾讯音乐娱乐科技(深圳)有限公司 Multimedia data matching method, device and storage medium
WO2022041940A1 (en) * 2020-08-31 2022-03-03 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Cross-modal retrieval method, training method for cross-modal retrieval model, and related device


Similar Documents

Publication Publication Date Title
CN115858839B (en) Cross-modal LOGO retrieval method, system, terminal and storage medium
CN105574848A (en) A method and an apparatus for automatic segmentation of an object
WO2024041479A1 (en) Data processing method and apparatus
CN112364829B (en) Face recognition method, device, equipment and storage medium
CN112766284B (en) Image recognition method and device, storage medium and electronic equipment
EP4134921A1 (en) Method for training video label recommendation model, and method for determining video label
CN114881711B (en) Method for carrying out exception analysis based on request behaviors and electronic equipment
CN113449070A (en) Multimodal data retrieval method, device, medium and electronic equipment
WO2024083121A1 (en) Data processing method and apparatus
CN114529462A (en) Millimeter wave image target detection method and system based on improved YOLO V3-Tiny
CN113627229A (en) Object detection method, system, device and computer storage medium
CN113139110B (en) Regional characteristic processing method, regional characteristic processing device, regional characteristic processing equipment, storage medium and program product
CN111782980B (en) Mining method, device, equipment and storage medium for map interest points
CN110704650B (en) OTA picture tag identification method, electronic equipment and medium
CN115966061B (en) Disaster early warning processing method, system and device based on 5G message
CN116403573A (en) Speech recognition method
CN113361519B (en) Target processing method, training method of target processing model and device thereof
CN109242642A (en) Recommend the method and apparatus of boarding application
CN116883708A (en) Image classification method, device, electronic equipment and storage medium
CN114360528A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN113823266A (en) Keyword detection method, device, equipment and storage medium
CN116630633B (en) Automatic labeling method and system for semantic segmentation, storage medium and electronic equipment
CN116108147A (en) Cross-modal retrieval method, system, terminal and storage medium based on feature fusion
CN116761249B (en) Indoor positioning method, fingerprint library construction method, electronic equipment and storage medium
CN116910174B (en) Data storage management method and device for data shelter and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301ab, No.10, Lane 198, zhangheng Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai 201204

Patentee after: Shanghai Mido Technology Co.,Ltd.

Address before: Room 301ab, No.10, Lane 198, zhangheng Road, Pudong New Area pilot Free Trade Zone, Shanghai 201204

Patentee before: SHANGHAI MDATA INFORMATION TECHNOLOGY Co.,Ltd.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A cross modal logo retrieval method, system, terminal, and storage medium

Granted publication date: 20230530

Pledgee: Bank of Communications Ltd. Shanghai New District Branch

Pledgor: Shanghai Mido Technology Co.,Ltd.

Registration number: Y2024310000145
