WO2021180109A1 - Electronic device and search method thereof, and medium - Google Patents

Electronic device and search method thereof, and medium

Info

Publication number
WO2021180109A1
WO2021180109A1 · PCT/CN2021/079905 · CN2021079905W
Authority
WO
WIPO (PCT)
Prior art keywords
search
feature vector
data
index
electronic device
Prior art date
Application number
PCT/CN2021/079905
Other languages
French (fr)
Chinese (zh)
Inventor
吴大
李艳明
唐吴全
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021180109A1 publication Critical patent/WO2021180109A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 — Details of database functions independent of the retrieved data types
    • G06F16/901 — Indexing; Data structures therefor; Storage structures
    • G06F16/903 — Querying
    • G06F16/9032 — Query formulation
    • G06F16/90335 — Query processing

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an electronic device and a search method and medium for the electronic device.
  • In recent years, the rapid development of machine learning and deep learning has greatly advanced search functions.
  • At present, mobile phones offer a global search function that can search for pictures in the gallery, search the Internet through a browser, and search within other applications.
  • Most existing search technologies are content-based. Taking image search as an example, the main process is: images are labeled manually or automatically by a pre-trained model, the labels are saved in a database, and the keywords entered by the user are then matched against the label text in the database to return the search results. Moreover, most searches operate on single-modal media data, for example searching the gallery for pictures via text input, or searching for pictures by image on some Internet services.
  • The embodiments of this application provide an electronic device and a search method and medium for the electronic device, which map the feature vectors of multi-modal data into a unified high-dimensional vector space, thereby realizing a multi-modal global search on the electronic device through a single model.
  • In a first aspect, the embodiments of the present application provide an electronic device and a search method for the electronic device, the method including:
  • obtaining search data input by a user; extracting features of the search data and generating a search feature vector of the search data based on the extracted features; comparing the search feature vector with multiple index feature vectors in an index library to select the index feature vectors in the index library whose similarity to the search feature vector is greater than a similarity threshold, where in the index library there is a correspondence between the multiple index feature vectors and multiple result data of multiple modalities; and outputting the result data corresponding to the selected index feature vectors as the search result, where the result data included in the search result has multiple modalities.
  • That is, the method first obtains the search data input by the user, for example image data, and extracts its low-level features, such as the color, texture, and gray level of the image. It then generates the feature vector corresponding to these low-level features, compares that feature vector for similarity with the index feature vectors stored in the index library to find the vectors highly similar to it, and finally, according to the index relationships in the index library, determines the matching feature vectors and the result data corresponding to them. A minimal sketch of this flow follows.
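A minimal sketch of the search flow described above, under the assumption of a hypothetical model.encode() that maps data of any modality to a feature vector in the unified space, and an index library held as (index_vector, data_id, modality) rows; this is an illustration, not the patent's implementation.

```python
def global_search(query_data, model, index_library, threshold):
    query_vector = model.encode(query_data)  # search feature vector
    results = []
    for index_vector, data_id, modality in index_library:
        # Euclidean distance: smaller distance means higher similarity.
        d = sum((q - v) ** 2 for q, v in zip(query_vector, index_vector)) ** 0.5
        if d < threshold:
            results.append((data_id, modality))
    return results  # result data spanning multiple modalities
```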
  • In a possible implementation of the first aspect, the similarity between the search feature vector and the index feature vectors is calculated by the following formula:

  $d = \sqrt{\sum_i (x_i - y_i)^2}$

  • where d represents the similarity (the Euclidean distance) between the search feature vector and a feature vector stored in the index library, $x_i$ represents the i-th component of the feature vector of the input data, $y_i$ represents the i-th component of a feature vector stored in the index library, and i ranges over the dimensions of the feature vectors.
  • That is, the similarity between the search feature vector and the index feature vectors can be calculated as a Euclidean distance; it is understandable that the similarity can also be calculated in other ways, for example with the Pearson coefficient. Both measures are sketched below.
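A sketch of the two similarity measures named above, using numpy. The example vectors are invented, and the Pearson coefficient is the alternative the text permits, not the formula the patent requires.

```python
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_coefficient(x, y):
    # Closer to 1 means more similar; defined for vectors of length >= 2.
    return np.corrcoef(x, y)[0, 1]

x = np.array([222.0, 20.0, 5.0])
y = np.array([220.0, 21.0, 5.0])
print(euclidean_distance(x, y))   # ~2.24
print(pearson_coefficient(x, y))  # ~1.0 for these nearly parallel vectors
```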
  • In a possible implementation of the first aspect, the electronic device itself holds the index library, and the multiple result data of multiple modalities that correspond to the multiple index feature vectors in the index library are data on the electronic device.
  • In a possible implementation of the first aspect, the electronic device is a mobile terminal. That said, the electronic device is not limited to mobile terminals such as mobile phones; it can also be an electronic device such as a server or a PC.
  • In a possible implementation of the first aspect, the user inputs the search data on the negative screen (the leftmost home-screen page) of the mobile terminal. That is, the multi-modal search on the electronic device can be applied to the negative screen of mobile terminals such as mobile phones.
  • In a possible implementation of the first aspect, the user inputs the search data in the memo application of the mobile terminal. That is, the multi-modal search on the electronic device can also be applied to the memos of mobile terminals such as mobile phones.
  • In a possible implementation of the first aspect, the multiple modalities include image, video, audio, text, and detection data from the sensors of the electronic device. A modality refers to the source form or mode of existence of data, so multi-modal data includes image, text, video, audio, and other data.
  • In a second aspect, an embodiment of the present application provides an electronic device, which includes:
  • an obtaining module, used to obtain the search data input by the user;
  • a feature extraction module, used to extract features of the search data and generate the search feature vector of the search data based on the extracted features;
  • a similarity calculation module, used to compare the search feature vector with the multiple index feature vectors in the index library, to select the index feature vectors in the index library whose similarity to the search feature vector is greater than the similarity threshold, where in the index library there is a correspondence between the multiple index feature vectors and multiple result data of multiple modalities; and
  • an output module, used to output the result data corresponding to the selected index feature vectors as the search result, where the result data included in the search result has multiple modalities. A structural sketch of these modules follows.
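A structural sketch of the four modules of the second-aspect device, composed into one searcher class; the class and method names are illustrative, not taken from the patent, and encode() is the same hypothetical model interface as in the earlier sketch.

```python
class MultiModalSearcher:
    def __init__(self, model, index_library, threshold):
        self.model = model                   # backs the feature extraction module
        self.index_library = index_library   # (index_vector, data_id, modality) rows
        self.threshold = threshold

    def obtain(self, user_input):            # obtaining module
        return user_input

    def extract(self, search_data):          # feature extraction module
        return self.model.encode(search_data)

    def select_similar(self, query_vector):  # similarity calculation module
        for vector, data_id, modality in self.index_library:
            d = sum((q - v) ** 2 for q, v in zip(query_vector, vector)) ** 0.5
            if d < self.threshold:
                yield data_id, modality

    def search(self, user_input):            # output module
        return list(self.select_similar(self.extract(self.obtain(user_input))))
```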
  • In a third aspect, an embodiment of the present application provides a machine-readable medium on which instructions are stored; when the instructions are executed on a machine, the machine executes any one of the possible methods of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions executed by one or more processors of the system, and a processor, which is one of the processors of the system, for executing any one of the possible methods of the first aspect.
  • an embodiment of the present application provides an electronic device that has the function of implementing the above search method.
  • the function can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units corresponding to the above-mentioned functions.
  • Fig. 1 shows a multi-modal global search scene 10 according to some embodiments of the present application.
  • Fig. 2 shows a schematic diagram of a method for performing a global search in the mobile phone 100 according to some embodiments of the present application.
  • Fig. 3 shows a schematic diagram of a process of generating a multi-modal search model according to some embodiments of the present application.
  • Fig. 4 shows a schematic diagram of a process of establishing an index library according to some embodiments of the present application.
  • Fig. 5a shows a negative screen of a mobile phone 100 according to some embodiments of the present application.
  • Fig. 5b shows a schematic diagram of performing a global search on a negative screen of a mobile phone according to some embodiments of the present application.
  • Fig. 6 shows a schematic diagram of performing a global search in a mobile phone memo according to some embodiments of the present application.
  • Fig. 7 shows a schematic structural diagram of an electronic device according to some embodiments of the present application.
  • Fig. 8 shows a schematic structural diagram of another electronic device according to some embodiments of the present application.
  • Fig. 9 shows a software structure block diagram of an electronic device according to some embodiments of the present application.
  • the illustrative embodiments of the present application include, but are not limited to, an electronic device and a search method, medium and system of the electronic device.
  • A modality refers to the source or the form of each type of information. Different types of information, such as images, voices, texts, and videos, represent different modalities; likewise, the detection results of radar, infrared sensors, accelerometers, and the like can be regarded as different modalities because they come from different sources.
  • Fig. 1 shows a multi-modal global search scene 10 according to some embodiments of the present application.
  • the scene 10 includes an electronic device 100 and an electronic device 200.
  • The electronic device 100 can use the multi-modal search model trained by the electronic device 200 to realize multi-modal search: when data of a certain modality is input, the multi-modal search model generates the feature vector corresponding to that data, the generated feature vector is compared with the feature vectors in the multi-modal feature vector index library, and the data of each modality corresponding to the index feature vectors that meet the predetermined condition is output as the search result.
  • the electronic device 200 can train a multimodal search model by using multiple modal data and the characteristics of each modal data.
  • the feature vectors generated by the same or similar modal data are the same or similar.
  • That two feature vectors are similar means that the difference between them is less than a similarity threshold. The difference between feature vectors can be represented by the Euclidean distance between them: the larger the Euclidean distance, the greater the difference between the vectors and the smaller their similarity.
  • the electronic device 200 can not only train a multi-modal search model, but can also use a multi-modal search model completed by its own training to implement various search functions.
  • the electronic device 100 and the electronic device 200 may include, but are not limited to, laptop computers, desktop computers, tablet computers, mobile phones, wearable devices, head-mounted displays, servers, mobile email devices, Portable game consoles, portable music players, reader devices, televisions with one or more processors embedded or coupled therein, or other electronic devices that can access the Internet.
  • the electronic device 100 is a mobile phone and the electronic device 200 is a server as an example to illustrate the technical solution of the present application.
  • In the technical solution of the present application, a multi-modal search model capable of searching multi-modal data can be trained on the server 200 and then transplanted to the mobile phone 100, so that a global search of the multi-modal data on the mobile phone 100 is realized.
  • Fig. 2 shows a technical solution of using the server 200 to train a multimodal search model and transplanting the trained multimodal search model to the mobile phone 100 to perform a global search according to some embodiments of the present application. Specifically, as shown in Figure 2:
  • When training the multi-modal search model, the server 200 first needs to perform feature extraction on the sample data used for training.
  • the sample data may include data of multiple modalities, for example, image, voice, text, video, sensor test data, and so on.
  • These sample data (for example, an image, a speech, or a text) are generally unstructured data with different structures, which have the characteristics of high dimensionality, different forms of expression, and a lot of redundant information. Therefore, it is necessary to extract the initial feature vector that can characterize the sample data. It is understandable that these initial feature vectors can be one-dimensional or multi-dimensional. For example, a person’s performance ranking can be represented by the person’s Chinese performance, mathematics performance, and English performance.
  • the initial feature vector of the person’s performance ranking has three dimensions, namely (Chinese performance, mathematics performance, and English performance).
  • For text data, the feature vector of a single character can be one-dimensional, namely the code value of the character. A sentence such as "Xiaobai is a dog" can be represented by multiple one-dimensional initial feature vectors: the initial feature vector of the word "Xiaobai", the initial feature vector of the word "is", and the initial feature vector of the word "dog"; together, these three initial feature vectors represent the sentence "Xiaobai is a dog".
  • For example, for the hand drawing, speech, and text describing building A, feature extraction algorithm 1, feature extraction algorithm 2, and feature extraction algorithm 3 can be used respectively to generate the initial feature vectors of these three modalities of data.
  • Specifically, the initial feature vector T1 of the hand drawing of building A can be generated by the residual network Resnet-34 algorithm. For example, T1 can be (h1, h2), where h1 is the gray-scale value of the hand drawing and h2 can be the value of the hand-drawing size.
  • The Mel Frequency Cepstrum Coefficient (MFCC) algorithm can be used to generate the initial feature vector T2 of the speech describing building A. For example, T2 can be (h3, h4), where h3 and h4 may be values representing certain characteristics of the speech, such as its frequency and pitch.
  • The initial feature vector T3 of the text describing building A can be generated through an attention-based bidirectional long short-term memory network (BiLSTM+Attention). For example, T3 is (h5), where h5 can be the feature value of the character encoding of the text describing building A.
  • In this way, the initial feature vectors T1, T2, and T3 describing the hand drawing, speech, and text of building A in the above example can be expressed as T1 (hand-drawing gray-scale feature value, hand-drawing size feature value), T2 (speech frequency feature value, speech pitch feature value), and T3 (text encoding feature value). A hedged sketch of this per-modality extraction follows.
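A hedged sketch of producing the per-modality initial feature vectors, with off-the-shelf stand-ins for the algorithms named above: torchvision's Resnet-34 for the hand drawing and librosa's MFCC for the speech, while plain character codes stand in for the BiLSTM+Attention text encoder for brevity.

```python
import torch
import torchvision.models as models
import librosa

def image_initial_vector(image_tensor):            # T1; image_tensor: (1, 3, H, W)
    resnet = models.resnet34()
    resnet.fc = torch.nn.Identity()                # keep 512-dim features, drop classifier
    resnet.eval()
    with torch.no_grad():
        return resnet(image_tensor).squeeze(0)

def speech_initial_vector(waveform, sample_rate):  # T2
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate)
    return mfcc.mean(axis=1)                       # average MFCCs over time frames

def text_initial_vector(text):                     # T3 (crude stand-in)
    return [float(ord(ch)) for ch in text]         # per-character code values
```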
  • In other embodiments, the difference-of-Gaussians function can also be used to generate the initial feature vector of an image, and the bag-of-words model can be used to generate the initial feature vector of a text.
  • Feature vector clustering refers to the fact that after sample data that are related to each other are input into the multimodal search model, the final feature vectors output are similar or identical to each other.
  • correlation means that the content represented by each modal data is the same or similar.
  • Specifically, multiple low-level features can be extracted from each piece of sample data. That the features extracted from two pieces of data are similar can mean that a certain proportion of the extracted features are the same or similar, or that the difference between the values the two pieces of data yield for the same feature is less than a predetermined threshold. For example, suppose 10 and 12 features are extracted from image A and audio data B respectively; if 9 features of image A are the same as 9 features of audio data B, image A and audio data B can be considered similar.
  • For another example, suppose the features extracted from image A are "shepherd dog" and "adult dog", and the features extracted from image B are "husky" and "adult dog". For a dog-detection task, image A and image B both represent dogs, so their features are similar; for a dog-species-recognition task, the features of image A and image B are not similar. A sketch of this proportion-based rule follows.
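A sketch of the proportion rule above. The proportion values are assumptions chosen for illustration; in practice, whether two images count as similar depends on the task the threshold is tuned for.

```python
def features_similar(features_a, features_b, proportion=0.8):
    # Two pieces of data count as similar when at least `proportion`
    # of their extracted features coincide.
    shared = set(features_a) & set(features_b)
    return len(shared) / max(len(features_a), len(features_b)) >= proportion

image_a = {"shepherd dog", "adult dog"}
image_b = {"husky", "adult dog"}
print(features_similar(image_a, image_b, proportion=0.5))  # True: dog detection
print(features_similar(image_a, image_b, proportion=0.8))  # False: species recognition
```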
  • the clustering process of feature vectors will be described.
  • After the initial feature vectors T1, T2, and T3 describing the hand drawing, speech, and text of building A are obtained, they are not yet in a unified feature-vector representation space, because they were generated independently by different feature extraction algorithms. Their vector values can therefore differ greatly, making it impossible to judge the correlation between the feature vectors. For example, the initial feature vector of a hand drawing may be represented by its gray value (for example, 222), while the speech is represented by its frequency feature value (for example, 20 Hz); the two are not directly comparable.
  • Therefore, a Long Short-Term Memory (LSTM) network can be used to map the initial feature vectors into a unified representation space. The output of the LSTM model at each moment depends on the output at the previous moment. For example, input the text "I love the motherland" into the long short-term memory network model shown in Figure 3; the model outputs a feature vector at each moment in time sequence. When "I" is input at t1, the network generates the feature vector H1 (h6) corresponding to "I", where h6 represents the encoded feature value of "I". When "love" is input at t2, the network generates the feature vector H2 (h6, h7) corresponding to "I love", where h7 represents the encoded feature value of "love". When "the motherland" is input at t3, the network generates the feature vector H3; since H3 carries all the features of "I love the motherland", H3 is used to represent the whole sentence.
  • For non-sequential data such as an image, a vector value representing its feature is output at a certain moment, and the vector values at the other moments are 0. For example, for an image input, the feature vector H4 (h9) output at t1 represents its gray value, and the feature vectors output at t2 and t3 are H5 (h9, 0) and H6 (h9, 0, 0). A sketch of this per-time-step behaviour follows.
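A sketch of the per-time-step behaviour described above, using PyTorch's nn.LSTM: each step's output depends on the previous step, and the last step's output serves as the final feature vector of the whole input ("I love the motherland" -> H3). The dimensions here are illustrative.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
sequence = torch.randn(1, 3, 8)        # 3 steps: "I", "love", "the motherland"
outputs, (h_n, c_n) = lstm(sequence)   # outputs: (1, 3, 16), one vector per step
final_vector = outputs[:, -1, :]       # H3: carries features of all three steps
```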
  • In some embodiments, the training process of the Long Short-Term Memory (LSTM) network model is as follows. Whether two feature vectors are similar can be calculated by the following formula:

  $d = \sqrt{\sum_i (x_i - y_i)^2}$

  • where d represents the similarity between the two feature vectors, $x_i$ and $y_i$ represent the i-th components of the two feature vectors being compared, and i ranges over the dimensions of the feature vectors.
  • Suppose the intermediate feature vectors T1', T2', and T3' are obtained, with T1' = (a1, a2, a3), T2' = (b1, b2, b3), and T3' = (c1, c2, c3). Then the similarity between T1' and T2' can be expressed as:

  $d = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2}$

  • If the pairwise similarities show that T1', T2', and T3' are the same or similar, the intermediate feature vectors T1', T2', and T3' are taken as the final feature vectors.
  • the similarity threshold can be set according to actual needs, and this application is not limited here.
  • In other embodiments, a loss function can also be used to measure the similarity between the intermediate feature vectors obtained after the initial feature vectors of sample data of different modalities are input into the LSTM model. Specifically, the initial feature vectors of the same or similar sample data are input into the LSTM model to obtain the intermediate feature vectors, the loss function is used to calculate the error between the input initial feature vectors and the output intermediate feature vectors, and the partial derivatives are obtained based on this error. The model parameters of the LSTM model are then adjusted based on the obtained partial derivatives, as in the hedged sketch below.
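A hedged sketch of the training step this paragraph describes: the intermediate vectors of two related samples (say, the hand drawing and the speech of building A) are pulled together by minimizing their Euclidean distance. The patent does not pin down the exact loss, so the distance itself is used as the loss here, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def final_vector(initial_sequence):
    outputs, _ = model(initial_sequence)
    return outputs[:, -1, :]

def training_step(seq_a, seq_b):
    # seq_a, seq_b: initial feature vector sequences of two related samples
    optimizer.zero_grad()
    loss = torch.dist(final_vector(seq_a), final_vector(seq_b))  # Euclidean distance
    loss.backward()    # partial derivatives with respect to the model parameters
    optimizer.step()   # adjust the LSTM model parameters
    return loss.item()

print(training_step(torch.randn(1, 3, 8), torch.randn(1, 3, 8)))
```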
  • For example, suppose the initial feature vector corresponding to the album cover image describing music C is T4, the initial feature vector of the audio of music C is T5, and the initial feature vector of the text describing music C is T6. The initial feature vectors T1, T2, and T3 describing the three modalities of building A (hand drawing, speech, text) and the initial feature vectors T4, T5, and T6 describing the three modalities of music C are input into the LSTM model as training data, and the relevant parameters of the long short-term memory network are adjusted so that the pairwise Euclidean distances between the final feature vectors T1', T2', and T3' are less than the predetermined similarity threshold, clustering the feature vectors of the modal data describing building A together. The same method then makes the pairwise Euclidean distances between the final feature vectors T4', T5', and T6', obtained from the initial feature vectors T4, T5, and T6 of music C, less than the predetermined similarity threshold, thus clustering the feature vectors of the modal data describing music C together.
  • Although the server 200 is used to train the multi-modal search model in the foregoing embodiment, in other embodiments other computer equipment may also be used to train the multi-modal search model; there is no restriction here.
  • Specifically, an Android project can be built; the model is read and parsed through the model-reading interface of that project and then compiled to generate an APK (Android application package) file, which is installed on the mobile phone 100 to complete the transplantation of the multi-modal search model. Then the various modal data on the mobile phone 100 (image, voice, text, video, etc.) are input into the multi-modal search model to obtain the feature vector corresponding to each piece of data, and the index relationship between each piece of data and its feature vector is established, yielding the index library.
  • For example, image 1 to image 100, speech 1 to speech 50, and text 1 to text 80 on the mobile phone 100 can be input into the multi-modal search model, obtaining the feature vectors T1 to T100 corresponding to image 1 to image 100, and likewise feature vectors for the speech and text data.
  • the index relationship between each feature vector obtained above and the identification of the corresponding data can be established.
  • The data identification of a piece of data can be an identifier assigned to the above-mentioned image, text, or voice file, or the location at which the above-mentioned data is stored in the mobile phone 100.
  • The index library may exist in the form of a database inside the multi-modal search model, and the feature vector and the identification of the corresponding data may be stored in the database as fields, as in the sketch below.
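A sketch of keeping the index library in a database, with the feature vector and the identification of the corresponding data stored as fields, as this paragraph suggests; the SQLite schema is an assumption made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect("index_library.db")
conn.execute("""CREATE TABLE IF NOT EXISTS index_library (
                    data_id  TEXT PRIMARY KEY,  -- identification of the data
                    modality TEXT,              -- image / voice / text / ...
                    vector   TEXT               -- feature vector serialized as JSON
                )""")
conn.execute("INSERT OR REPLACE INTO index_library VALUES (?, ?, ?)",
             ("image_1", "image", json.dumps([0.12, 0.98, 0.45])))
conn.commit()
```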
  • the specific process of establishing an index library on the mobile phone 100 is shown in FIG. 4:
  • First, the trained multi-modal search model is transplanted to the mobile phone 100. The multi-modal search model is then used to generate the feature vectors of the various modal data on the mobile phone 100, such as images, voice, and text (in other embodiments, possibly also video, sensor detection data, etc.). These feature vectors generated by the multi-modal search model are stored in the mobile phone 100, and at the same time an index relationship between each piece of data on the mobile phone 100 and its feature vector is established to obtain the index library.
  • For example, the model generates the respective feature vectors T1, T2, T3, ..., T150, ..., T230 for the data on the phone and stores these feature vectors in the mobile phone 100, and at the same time stores the index relationship between these data and their respective feature vectors in the index library, as in the sketch below.
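A sketch of the offline indexing pass just described: every piece of on-device data is run through the (transplanted) multi-modal search model and the vector-to-data index relationship is recorded; encode() is the same hypothetical model interface as in the earlier sketches.

```python
def build_index(model, device_data):
    # device_data: iterable of (data_id, modality, raw_data) items on the phone
    index_library = []
    for data_id, modality, raw in device_data:
        vector = model.encode(raw)                 # T1 ... T230
        index_library.append((vector, data_id, modality))
    return index_library
```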
  • After the index library is established, a global search can be implemented on the mobile phone 100: when searching, data of any of various modalities can be input, for example image, text, voice, video, or sensor detection data. The multi-modal search model converts this input into a feature vector, and the search results are obtained by comparing the converted feature vector corresponding to the search input with the feature vectors in the index library. The search results can include data of the various modalities on the mobile phone.
  • For example, the user can implement a global search by entering search keywords in the search box on the negative screen of the mobile phone 100.
  • The mobile phone 100 generates the feature vector of the keyword input by the user through the multi-modal search model transplanted onto it, and can then calculate, with the Euclidean-distance formula above, the similarity between the keyword's feature vector and the feature vectors in the index library established above. For the feature vectors in the index library whose similarity to the keyword's feature vector is greater than the similarity threshold, the data of the various modalities corresponding to these feature vectors is obtained according to the index relationships in the index library and displayed on the negative screen of the mobile phone 100.
  • As shown in Figure 5b, the user enters the content to be searched, for example "woman wearing a hat", in the search box on the negative screen of the mobile phone. The mobile phone 100 performs a global search based on the specific keywords "wearing a hat" and "woman", and then displays the search results below the search bar (for example, the images among all images that match the keywords, schedule memos containing the keywords, and the specific voice recordings and voice transcripts containing the keywords).
  • In other embodiments, users can also enter the content they want to search for in the search bar on the negative screen of the mobile phone through voice or picture input.
  • Specifically, the multi-modal search model in the mobile phone 100 extracts the features "wearing a hat" and "woman" from the search text and generates the corresponding feature vector T_search-text, then calculates with the formula above the similarity between T_search-text and the feature vectors in the index library. The feature vectors whose similarity is greater than the similarity threshold are selected, the data corresponding to them is found, and that data is output as the search result; for example, the search result includes image one, image two, and image three of a woman wearing a hat, related content in the phone's schedule, and related audio files in the voice memos. In the search results, an image can be displayed directly, as a thumbnail, or as its name together with a thumbnail. A sketch of this query flow follows.
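A sketch of the negative-screen query flow: rank the index entries by distance to the keyword's feature vector, keep those within the threshold, and group them by modality for display below the search bar. The names are illustrative, reusing the hypothetical encode() interface from the earlier sketches.

```python
from collections import defaultdict

def negative_screen_query(model, index_library, keyword, threshold):
    q = model.encode(keyword)                      # e.g. T_search_text
    hits = defaultdict(list)                       # modality -> matching data ids
    for vector, data_id, modality in index_library:
        d = sum((a - b) ** 2 for a, b in zip(q, vector)) ** 0.5
        if d < threshold:
            hits[modality].append(data_id)
    return hits  # e.g. {"image": [...], "schedule": [...], "audio": [...]}
```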
  • For another example, when the user enters the keyword "building A", the mobile phone generates the feature vector T_building of the keyword "building A" through the multi-modal search model, then calculates with the formula above the similarity between T_building and the feature vectors stored in the multi-modal index library, determines that the feature vectors T1, T2, and T3 are close to T_building, and then outputs, according to T1, T2, and T3, the various modal data of building A, namely the hand drawing of building A and the speech and text describing building A.
  • When performing a global search, the user can also input voice through the voice input unit 170a. The mobile phone then generates the feature vector T_voice of the input voice through the multi-modal search model, calculates the similarity between T_voice and the feature vectors in the index library, determines the search results according to the calculation result, and displays them on the negative screen of the mobile phone 100 through the display screen.
  • For example, when the user speaks "meeting arrangement", the multi-modal search model extracts the feature vector T_meeting-arrangement of the phrase, then uses the similarity calculation above to compare T_meeting-arrangement with each feature vector in the index library. Finding that T_meeting-arrangement is close to the feature vectors T150 and T230, the phone determines that the search results for "meeting arrangement" are the meeting recording and the Monday agenda memo, and outputs the meeting recording audio and the Monday agenda memo to the negative screen of the mobile phone 100 through the display screen.
  • The user can also input an image through the image input unit 193a. The mobile phone then generates the feature vector T_image of the input image through the multi-modal search model, calculates the similarity between T_image and the feature vectors in the index library, determines the search result according to the calculation result, and displays it on the negative screen of the mobile phone 100 through the display screen.
  • For example, the user inputs an image of building A through the image input unit 190a. The multi-modal search model extracts the feature vector T_building-A-image of the image, then uses the similarity calculation above to compare it with each feature vector in the index library. Finding that T_building-A-image is close to T1, the phone determines that a search result for "building A" is the hand drawing of building A; and, according to the index relationships between the feature vectors of the various modalities in the index library, the other modal data related to the hand drawing of building A, such as the speech describing building A and the text describing building A, are all output on the mobile phone 100 through the display screen.
  • the solution of the present application is also suitable for global search in the memo of the mobile phone 100.
  • As shown in Figure 6, the user can enter specific text, voice, or a picture in the memo search bar 600 of the mobile phone to search.
  • For example, the user can search for pictures of a woman wearing a hat as follows.
  • When the user searches in the memo search bar 600 of the mobile phone 100, the mobile phone 100 performs the corresponding search according to the modality of the user's input data.
  • If the user enters the text "woman wearing a hat", the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_hat-woman-text of the input text, and uses the calculation method above to compute the similarity between T'_hat-woman-text and the feature vectors T_hat-woman-image, T_hat-woman-text, and T_hat-woman-audio stored in the index library. Having determined that T'_hat-woman-text is similar or close to T_hat-woman-text, the phone then, according to the index relationships in the index library, outputs all the related content corresponding to the closest feature vectors: the memo about the woman wearing a hat, with the image of the woman wearing a hat (or the identifier of the image) and the audio about the woman wearing a hat attached, and displays the number of items that meet the search criteria.
  • Similarly, if the user inputs the voice "woman wearing a hat", the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_hat-woman-audio of the input voice, computes the similarity between T'_hat-woman-audio and the feature vectors T_hat-woman-image, T_hat-woman-text, and T_hat-woman-audio stored in the index library, and determines that T'_hat-woman-audio is similar or close to T_hat-woman-audio. Then, according to the index relationships in the index library, it outputs all the related content corresponding to the closest feature vectors, that is, the memo about the woman wearing a hat, with the image (or its identifier) and the audio attached, and displays the number of items that meet the search criteria.
  • FIG. 7 shows a schematic structural diagram of an electronic device. It can be understood that the specific technical details in the foregoing search method are also applicable to the electronic device. In order to avoid repetition, it will not be repeated here.
  • the electronic device includes:
  • the obtaining module 701 is used to obtain search data input by the user;
  • the feature extraction module 702 is configured to extract features of the search data, and generate a search feature vector of the search data based on the extracted features;
  • a similarity calculation module 703, used to compare the search feature vector with the multiple index feature vectors in the index library, to select the index feature vectors in the index library whose similarity to the search feature vector is greater than the similarity threshold, where in the index library there is a correspondence between the multiple index feature vectors and multiple result data of multiple modalities, and the more similar the extracted features of different result data are, the greater the similarity between the index feature vectors corresponding to those result data; and
  • an output module 704, used to output the result data corresponding to the selected index feature vectors as the search result, where the result data included in the search result has multiple modalities.
  • FIG. 8 shows a schematic structural diagram of an electronic device 800 according to an embodiment of the present application.
  • The electronic device 800 can be used to train the aforementioned multi-modal search model, and can also receive the aforementioned multi-modal model from another electronic device and then perform a global search on the data of various modalities on the electronic device 800 based on that model.
  • The electronic device 800 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, and an antenna 2.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light Sensor 180L, bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 800.
  • the electronic device 800 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, and perform the above-mentioned feature extraction of modal data and training of a multi-modal search model.
  • The processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
  • the mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G and the like applied to the electronic device 800.
  • the mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
  • The wireless communication module 160 can provide wireless communication solutions applied on the electronic device 800, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the electronic device 800 implements a display function through a GPU, a display screen 194, and an application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, and the like.
  • the display screen 194 includes a display panel.
  • The display panel can adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 800 may include one or N display screens 194, and N is a positive integer greater than one.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the electronic device 800 may include one or N cameras 193, and N is a positive integer greater than one.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 800 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • The NPU is a neural-network (NN) computing processor; with it, applications such as intelligent cognition of the electronic device 800 can be realized, for example image recognition, face recognition, voice recognition, text understanding, and image clustering.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program (such as a sound playback function, an image playback function, etc.) required by at least one function, and the like.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 800.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the electronic device 800 by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • The internal memory 121 can also store the various modal data on the mobile phone 100, the multi-modal search model transplanted to the mobile phone 100, the model's intermediate calculation data, the model parameters, the index library, and so on.
  • the electronic device 800 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 800 can listen to music through the speaker 170A, or listen to a hands-free call.
  • The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the electronic device 800 answers a call or voice message, it can receive the voice by bringing the receiver 170B close to the human ear.
  • The microphone 170C, also called a "mic" or "mouthpiece", is used to convert sound signals into electrical signals. The user can speak with the mouth close to the microphone 170C to input a sound signal into it.
  • the electronic device 800 may be provided with at least one microphone 170C. In other embodiments, the electronic device 800 can be provided with two microphones 170C, which can implement noise reduction functions in addition to collecting sound signals. In some other embodiments, the electronic device 800 may also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • the earphone interface 170D is used to connect wired earphones.
  • The earphone interface 170D may be a USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • FIG. 9 is a software structure block diagram of an electronic device in an embodiment of the present application.
  • the electronic device 900 can be used to train the aforementioned multi-modal search model, and can also receive the aforementioned multi-modal model from other electronic devices, and then perform a global search on the data of various modalities on the electronic device 900 based on the aforementioned multi-modal model.
  • As Figure 9 shows, the software system of the electronic device can adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture.
  • the embodiment of the present invention takes an Android system with a layered architecture as an example to exemplarily illustrate the software structure of a terminal device.
  • The layered architecture divides the software into several layers, each with a clear role and division of labor; the layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications such as phone, camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and so on.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, and so on.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the telephone manager is used to provide the communication function of the terminal device. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, and so on.
  • the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window.
  • For example, prompting with a text message in the status bar, sounding a prompt tone, vibrating the terminal device, flashing the indicator light, and so on.
  • Android Runtime includes a core library and a virtual machine, and is responsible for the scheduling and management of the Android system.
  • The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An electronic device, a search method therefor, and a medium. The method comprises: obtaining search data input by a user; extracting features of the search data, and generating a search feature vector of the search data on the basis of the extracted features; comparing the search feature vector with a plurality of index feature vectors in an index library to select the index feature vectors in the index library whose similarity to the search feature vector is greater than a similarity threshold, wherein there is a correspondence in the index library between the plurality of index feature vectors and a plurality of result data of a plurality of modalities; and outputting the result data corresponding to the selected index feature vectors as a search result, wherein the result data comprised in the search result has the plurality of modalities, so as to achieve a multi-modal global search function.

Description

Electronic device and search method and medium for the electronic device

This application claims priority to the Chinese patent application No. 202010164088.6, entitled "Electronic Device and Search Method and Medium for the Electronic Device", filed on March 10, 2020, the entire content of which is incorporated herein by reference.
Technical Field

This application relates to the field of artificial intelligence, and in particular to an electronic device and a search method and medium for the electronic device.

Background

In recent years, the rapid development of machine learning and deep learning has greatly advanced search functions. At present, mobile phones offer a global search function that can search for pictures in the gallery, search the Internet through a browser, and search within other applications. Most existing search technologies are content-based. Taking image search as an example, the main process is: images are labeled manually or automatically by a pre-trained model, the labels are saved in a database, and the keywords entered by the user are matched against the label text in the database to return the search results. Moreover, most searches operate on single-modal media data, for example searching the gallery for pictures via text input, or searching for pictures by image on some Internet services.
Summary of the Invention

The embodiments of this application provide an electronic device and a search method and medium for the electronic device, which map the feature vectors of multi-modal data into a unified high-dimensional vector space, thereby realizing a multi-modal global search on the electronic device through a single model.

In a first aspect, the embodiments of the present application provide an electronic device and a search method for the electronic device, the method including:

obtaining search data input by a user; extracting low-level features of the search data, and generating a search feature vector of the search data based on the extracted low-level features; comparing the search feature vector with multiple index feature vectors in an index library to select the index feature vectors in the index library whose similarity to the search feature vector is greater than a similarity threshold, where in the index library there is a correspondence between the multiple index feature vectors and multiple result data of multiple modalities; and outputting the result data corresponding to the selected index feature vectors as the search result, where the result data included in the search result has multiple modalities. That is, the method first obtains the search data input by the user, for example image data, and extracts its low-level features, such as the color, texture, and gray level of the image; it then generates the feature vector corresponding to these low-level features, compares that feature vector for similarity with the index feature vectors stored in the index library to find the vectors highly similar to it, and finally, according to the index relationships in the index library, determines the matching feature vectors and the result data corresponding to them.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The similarity between the search feature vector and an index feature vector is calculated by the following formula:
$$d = \sqrt{\sum_{i}\left(x_{i} - y_{i}\right)^{2}}$$
where d represents the closeness between the search feature vector and a feature vector stored in the index library, x_i denotes the i-th component of the feature vector of the input data, y_i denotes the i-th component of a feature vector stored in the index library, and i runs over the dimensions of the feature vectors.
That is, the similarity between the search feature vector and an index feature vector can be calculated as a Euclidean distance. It can be understood that the similarity can also be calculated in other ways, for example by the Pearson correlation coefficient.
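As an illustration of this comparison, the sketch below implements the Euclidean-distance measure of the formula above together with the Pearson-correlation alternative just mentioned; this is example code under assumed inputs, not part of the application itself.

```python
import numpy as np

def euclidean_distance(x, y):
    """d = sqrt(sum_i (x_i - y_i)^2); a smaller d means a higher similarity."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_similarity(x, y):
    """Alternative measure mentioned in the text: Pearson correlation in [-1, 1]."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Example: a query vector against two stored index vectors
query = [0.9, 0.1, 0.4]
print(euclidean_distance(query, [0.8, 0.2, 0.5]))  # small distance -> similar
print(euclidean_distance(query, [0.1, 0.9, 0.9]))  # large distance -> dissimilar
```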
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The electronic device has the index library, and the plurality of result data of the plurality of modalities corresponding to the plurality of index feature vectors in the index library are data on the electronic device.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The electronic device is a mobile terminal. It should be noted that the electronic device is not limited to a mobile terminal such as a mobile phone; it may also be an electronic device such as a server or a PC.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The user inputs the search data on the negative one screen of the mobile terminal. That is, the multi-modal search on the electronic device can be applied to the negative one screen of a mobile terminal such as a mobile phone.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The user inputs the search data in the memo of the mobile terminal. That is, the multi-modal search on the electronic device can be applied to the memo of a mobile terminal such as a mobile phone.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The plurality of modalities includes images, video, audio, text, and detection data from sensors of the electronic device. A modality refers to the source form or the form of existence of data, so data of multiple modalities includes images, text, video, audio, and other data.
In a second aspect, an embodiment of the present application provides an electronic device, and the electronic device includes:
an obtaining module, configured to obtain search data input by a user;
a feature extraction module, configured to extract features of the search data, and to generate a search feature vector of the search data based on the extracted features;
a similarity calculation module, configured to compare the search feature vector with a plurality of index feature vectors in an index library, to select an index feature vector from the index library whose similarity to the search feature vector is greater than a similarity threshold,
where, in the index library, there is a correspondence between the plurality of index feature vectors and a plurality of result data of a plurality of modalities; and
an output module, configured to output the result data corresponding to the selected index feature vector as a search result, where the result data included in the search result has a plurality of modalities.
In a third aspect, an embodiment of the present application provides a machine-readable medium having instructions stored thereon; when executed on a machine, the instructions cause the machine to perform any one of the possible methods of the foregoing first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory configured to store instructions to be executed by one or more processors of a system, and a processor, which is one of the processors of the system, configured to perform any one of the possible methods of the foregoing first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device that has the function of implementing the foregoing search method. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the foregoing function.
Description of the drawings
Fig. 1 shows a multi-modal global search scenario 10 according to some embodiments of the present application.
Fig. 2 shows a schematic diagram of a method for performing a global search on the mobile phone 100 according to some embodiments of the present application.
Fig. 3 shows a schematic diagram of a process of generating a multi-modal search model according to some embodiments of the present application.
Fig. 4 shows a schematic diagram of a process of building an index library according to some embodiments of the present application.
Fig. 5a shows a negative one screen of a mobile phone 100 according to some embodiments of the present application.
Fig. 5b shows a schematic diagram of performing a global search on the negative one screen of a mobile phone according to some embodiments of the present application.
Fig. 6 shows a schematic diagram of performing a global search in a mobile phone memo according to some embodiments of the present application.
Fig. 7 shows a schematic structural diagram of an electronic device according to some embodiments of the present application.
Fig. 8 shows a schematic structural diagram of another electronic device according to some embodiments of the present application.
Fig. 9 shows a software structure block diagram of an electronic device according to some embodiments of the present application.
Detailed description
Illustrative embodiments of the present application include, but are not limited to, an electronic device, and a search method, a medium, and a system of the electronic device.
It can be understood that the terms "first", "second", and the like used in this application may be used herein to describe various elements, but unless otherwise specified, these elements are not limited by these terms. These terms are only used to distinguish one element from another.
It can be understood that, in the embodiments of the present application, a modality refers to the source or the form of each kind of information: different kinds of information such as images, speech, text, and video represent different modalities, and test results from radar, infrared sensors, accelerometers, and the like can also be regarded as different modalities because they come from different sources.
The embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a multi-modal global search scenario 10 according to some embodiments of the present application. Specifically, as shown in Fig. 1, the scenario 10 includes an electronic device 100 and an electronic device 200. The electronic device 100 can use a multi-modal search model trained by the electronic device 200 to implement multi-modal search: when data of a certain modality is input, the multi-modal search model generates a feature vector corresponding to that data, the generated feature vector is compared with the feature vectors in a multi-modal feature vector index library, and the data of each modality corresponding to the feature vectors in the index library that satisfy a predetermined condition is output as the search result. The electronic device 200 can train the multi-modal search model by using data of multiple modalities and the features of each kind of modal data; in this multi-modal search model, modal data of the same or similar type generate feature vectors that are the same or similar. Here, two feature vectors being similar means that the difference between them is less than a similarity threshold, and the difference between feature vectors can be represented by the Euclidean distance between them: the larger the Euclidean distance, the larger the difference between the vectors and the smaller their similarity. In addition, the electronic device 200 can not only train the multi-modal search model, but can also use the multi-modal search model it has trained itself to implement various search functions.
It can be understood that, in this application, the electronic device 100 and the electronic device 200 may include, but are not limited to, a laptop computer, a desktop computer, a tablet computer, a mobile phone, a wearable device, a head-mounted display, a server, a mobile email device, a portable game console, a portable music player, a reader device, a television with one or more processors embedded or coupled therein, or another electronic device capable of accessing a network.
In the following, the technical solution of the present application is described with reference to Figs. 2-6, taking as an example the case where the electronic device 100 is a mobile phone and the electronic device 200 is a server.
As described above, in some embodiments of the present application, a multi-modal search model capable of searching multi-modal data may first be trained on the server 200, and the multi-modal search model may then be ported to the mobile phone 100 to implement a global search over the data of each modality on the mobile phone 100. Fig. 2 shows, according to some embodiments of the present application, a technical solution in which the server 200 trains a multi-modal search model and the trained multi-modal search model is ported to the mobile phone 100 for global search. Specifically, as shown in Fig. 2:
(1) Training of the multi-modal search model
A) Generating initial feature vectors
When training the multi-modal search model, the server 200 first needs to perform feature extraction on the sample data used for training. It can be understood that, in this application, the sample data may include data of multiple modalities, for example images, speech, text, video, and sensor test data. Such sample data (for example, an image, a piece of speech, or a piece of text) are generally unstructured data of differing structures, characterized by high dimensionality, widely varying forms of expression, and a large amount of redundant information. It is therefore necessary to extract initial feature vectors that can characterize the sample data. It can be understood that these initial feature vectors may be one-dimensional or multi-dimensional. For example, a person's overall ranking can be jointly represented by the person's Chinese score, mathematics score, and English score; the initial feature vector of the ranking then has three dimensions, namely (Chinese score, mathematics score, English score). As another example, the feature vector of a single character may be one-dimensional, namely the character's code value, while a sentence such as "Xiaobai is a dog" can be jointly characterized by several one-dimensional initial feature vectors: the initial feature vector of the word "Xiaobai", that of the word "is", and that of the word "dog" together represent the sentence "Xiaobai is a dog".
Further, assuming that building A can be described simultaneously by three mutually correlated modalities of data, namely a hand drawing, speech, and text, feature extraction algorithm 1, feature extraction algorithm 2, and feature extraction algorithm 3 can be used respectively to generate the initial feature vectors of these three kinds of modal data.
As shown in Fig. 3, for example, the initial feature vector T_1 of the hand drawing of building A can be generated by the residual network ResNet-34; for instance, T_1 may be (h_1, h_2), where h_1 is a gray-scale value of the hand drawing and h_2 is a value describing the size of the hand drawing. The initial feature vector T_2 of the speech describing building A can then be generated by the speech feature extraction algorithm of Mel-frequency cepstral coefficients (MFCC); for example, T_2 may be (h_3, h_4), where h_3 and h_4 are values representing certain characteristics of the speech describing building A, such as its frequency and pitch. The initial feature vector T_3 of the text describing building A can then be generated by an attention-based bidirectional long short-term memory network (BiLSTM+Attention); for example, T_3 may be (h_5), where h_5 is a feature value of the character encoding of the text describing building A. For a clearer understanding, the initial feature vectors T_1, T_2, and T_3 of the hand drawing, speech, and text describing building A in the above example can thus be written as T_1 (hand-drawing gray-scale feature value, hand-drawing size feature value), T_2 (speech frequency feature value, speech pitch feature value), and T_3 (text encoding feature value).
In addition, other algorithms may also be used to generate the initial feature vectors of each kind of modal data; for example, the difference of Gaussians (DoG) may be used to generate the initial feature vector of an image, and the bag-of-words model, a text feature extraction algorithm, may be used to generate the initial feature vector of a text.
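As a rough illustration of this per-modality extraction step, the sketch below derives an initial feature vector for an image with a pretrained ResNet-34 and for audio with MFCCs. The use of torchvision and librosa, and the 512-dimensional image vector (rather than the toy two-component (h_1, h_2) above), are assumptions made for the example; the application does not prescribe particular libraries or dimensions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
import librosa
from PIL import Image

# Image branch: a ResNet-34 with its classification head removed yields a 512-d initial vector
resnet = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def image_initial_vector(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(x).squeeze(0).numpy()  # initial feature vector of the image (cf. T_1)

def audio_initial_vector(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # time-averaged MFCCs as the initial feature vector (cf. T_2)
```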
It can be understood that the feature extraction algorithms mentioned here are all part of the multi-modal search model.
B) Clustering of feature vectors
Feature vector clustering means that, after sample data that are correlated with each other are input into the multi-modal search model, the final feature vectors that are output are similar or identical to each other. Here, correlation means that the content represented by the data of each modality is the same or similar. For example, multiple low-level features can be extracted from each piece of sample data, and the features extracted from two pieces of data may be considered similar when a certain proportion of the extracted features are the same or similar, or when the difference between the feature values of the same feature extracted from the two pieces of data is less than a predetermined threshold. For example, suppose 10 and 12 features are extracted from image A and audio data B respectively; if 9 features of image A are the same as 9 features of audio data B, image A and audio data B can be considered similar. As another example, if the features extracted from image A are "shepherd dog" and "adult dog" while the features extracted from image B are "husky" and "adult dog", then in some applications that only require recognition of the animal species, image A and image B both represent dogs and their features can be considered similar, whereas in applications that require recognition of the dog breed, the features of image A and image B can be considered dissimilar.
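The following toy function illustrates this overlap-based notion of relatedness; the overlap ratio used as the threshold is an assumption for the example, not a value fixed by the application.

```python
def are_related(features_a, features_b, min_overlap_ratio=0.75):
    """Two samples are treated as related when enough extracted features coincide."""
    features_a, features_b = set(features_a), set(features_b)
    shared = len(features_a & features_b)
    return shared / min(len(features_a), len(features_b)) >= min_overlap_ratio

# Example: 9 of image A's 10 features also occur among audio B's 12 features
image_a = {f"f{i}" for i in range(10)}
audio_b = {f"f{i}" for i in range(1, 13)}
print(are_related(image_a, audio_b))  # True: 9/10 >= 0.75
```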
Referring now to Fig. 3, and taking the three kinds of modal data describing building A above as an example, the clustering process of the feature vectors is explained. As shown in Fig. 3, after the initial feature vectors T_1, T_2, and T_3 of the hand drawing, speech, and text describing building A are obtained by the technique in A), these initial feature vectors, having been generated independently by different feature extraction algorithms, do not lie in a unified feature vector representation space; their vector values can differ greatly from one another, so the correlation between the feature vectors cannot be judged. For example, the initial feature vector of the hand drawing is represented by the gray-scale value of the hand drawing (for example, 222), while the speech is represented by a frequency feature value of the speech (for example, 20 Hz); although the hand drawing and the speech both describe building A, the vector values of their initial feature vectors differ greatly. It is therefore necessary to cluster the feature vectors so that the feature vectors of the two become similar; for example, a long short-term memory (LSTM) network model is trained so that it can cluster initial feature vectors lying in different vector spaces into similar vectors, that is, map them into the same vector space.
As shown in Fig. 3, the output of the LSTM model at each time step depends on its output at the previous time step. Specifically, for example, when a piece of text "I love my motherland" is input into the long short-term memory network model shown in Fig. 3, the model outputs a feature vector at each time step in sequence: if "I" is input into the network at time t_1, the network generates the feature vector H_1 (h_6) corresponding to "I" at t_1, where h_6 is the encoded feature value of "I"; if "love" is input at time t_2, the network generates the feature vector H_2 (h_6, h_7) corresponding to "love" at t_2, where h_7 is the encoded feature value of "love"; and if "motherland" is input at time t_3, the network generates the feature vector H_3 (h_6, h_7, h_8) corresponding to "motherland" at t_3, where h_8 is the encoded feature value of "motherland". Finally, since the feature vector H_3 carries all the features of "I love my motherland", H_3 is used to represent the sentence. For data such as images, which have no sequential dependency, the LSTM model can be considered to output the vector value representing their features at one time step, with vector values of 0 at the other time steps; for example, the feature vector H_4 (h_9) output at time t_1 represents the gray-scale value, while the feature vectors output at t_2 and t_3 are H_5 (h_9, 0) and H_6 (h_9, 0, 0).
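A minimal PyTorch sketch of such a mapping network is shown below: per-modality initial vectors are fed through a shared LSTM so that correlated inputs can be driven toward nearby points of one unified space. The layer sizes and the final projection layer are assumptions for illustration; the application does not fix a specific architecture.

```python
import torch
import torch.nn as nn

class UnifiedSpaceLSTM(nn.Module):
    """Maps initial feature vectors of any modality into one shared vector space."""
    def __init__(self, input_dim=128, hidden_dim=64, unified_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, unified_dim)

    def forward(self, x):
        # x: (batch, seq_len, input_dim); a sentence is a sequence of word vectors,
        # while an image can be a length-1 sequence (its features at a single time step,
        # assuming initial vectors have been padded/projected to a common input_dim)
        out, _ = self.lstm(x)
        return self.project(out[:, -1, :])  # the last time step carries the whole input's features
```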
Specifically, the training process of the LSTM model is as follows:
I) A plurality of sample data of different modalities is prepared in advance. These data may represent different objects or describe different events, but the data describing the same object or the same event are correlated with each other. The initial feature vectors of these sample data are then generated by means of A) above.
II) The initial feature vectors of the mutually correlated modal data describing the same object or the same event are input into the LSTM model to obtain the final feature vectors output by the LSTM (the LSTM at this point can be regarded as part of the multi-modal search model). Whether these final feature vectors are similar or identical is then calculated; if they are not, the model parameters of the LSTM are adjusted, the foregoing initial feature vectors are input into the LSTM again, and whether the output final feature vectors are the same or similar is calculated again.
This operation is repeated until the final feature vectors output by the LSTM model are similar or identical to one another; the initial feature vectors, which lay in different vector spaces because of the different modalities of the initial data, have then been mapped into the same vector space, and the training of the LSTM model is complete.
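Steps I) and II) can be sketched as the training loop below, which keeps adjusting the LSTM's parameters until the pairwise Euclidean distances between the unified vectors of correlated samples fall below the threshold. The optimizer choice (Adam) and the distance-sum loss are assumptions made for the example; the application does not prescribe them.

```python
import itertools
import torch

def train_clustering(model, groups, threshold=0.1, lr=1e-3, max_epochs=1000):
    """groups: list of tensors, each of shape (n_modalities, seq_len, input_dim),
    holding the initial feature vectors of one object's correlated modal data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        max_dist = 0.0
        for group in groups:
            vecs = model(group)  # one unified vector per modality of the same object/event
            pairs = list(itertools.combinations(range(len(vecs)), 2))
            # pull the vectors of correlated modal data toward each other
            loss = sum(torch.dist(vecs[i], vecs[j]) for i, j in pairs)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                max_dist = max(max_dist,
                               max(torch.dist(vecs[i], vecs[j]).item() for i, j in pairs))
        if max_dist < threshold:  # all correlated vectors are now "similar": training done
            break
    # a fuller setup would also push apart vectors of *different* objects
    # (cf. the building A vs. music C example later in the text)
```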
In some embodiments, whether two feature vectors are similar can be calculated by the following formula:
$$d = \sqrt{\sum_{i}\left(x_{i} - y_{i}\right)^{2}}$$
where d represents the closeness between the two feature vectors, x_i denotes the i-th component of the feature vector of the input data, y_i denotes the i-th component of a feature vector stored in the index library, and i runs over the dimensions of the feature vectors. That is, d is the Euclidean distance between the two feature vectors, where a larger Euclidean distance corresponds to a smaller similarity between them.
For example, after the initial feature vectors T_1, T_2, and T_3 of the three modalities (hand drawing, speech, text) describing building A above are input into the LSTM model, intermediate feature vectors T_1', T_2', and T_3' are obtained.
Here, for the feature vectors T_1', T_2', and T_3', assume T_1' = (a_1, a_2, a_3), T_2' = (b_1, b_2, b_3), and T_3' = (c_1, c_2, c_3). Then the closeness between T_1' and T_2' can be expressed as:
$$d_{1} = \sqrt{\left(a_{1} - b_{1}\right)^{2} + \left(a_{2} - b_{2}\right)^{2} + \left(a_{3} - b_{3}\right)^{2}}$$
The closeness between T_2' and T_3' can be expressed as:
$$d_{2} = \sqrt{\left(b_{1} - c_{1}\right)^{2} + \left(b_{2} - c_{2}\right)^{2} + \left(b_{3} - c_{3}\right)^{2}}$$
When both d_1 and d_2 are less than the predetermined similarity threshold, T_1', T_2', and T_3' are considered the same or similar, and the intermediate feature vectors T_1', T_2', and T_3' at this point are taken as the final feature vectors.
It can be understood that, in actual model training, the similarity threshold can be set according to actual needs; this is not limited in this application.
In addition, in some embodiments, a loss function may also be used to calculate the similarity between the intermediate feature vectors obtained after the initial feature vectors of sample data of different modalities are input into the LSTM model. Specifically, the initial feature vectors of sample data of the same or similar type are input into the LSTM model to obtain intermediate feature vectors; a loss function is used to calculate the error between the input initial feature vectors and the output intermediate feature vectors; partial derivatives are computed from this error; and the model parameters of the LSTM model are then adjusted based on the computed partial derivatives.
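As a brief illustration of this loss-driven adjustment, the fragment below computes a loss, derives the partial derivatives by automatic differentiation, and applies a gradient step by hand. The mean-squared-error loss is an assumed example, since the application does not name a specific loss function.

```python
import torch

loss_fn = torch.nn.MSELoss()

def adjust_once(model, inputs, targets, lr=1e-3):
    """One update: loss -> partial derivatives (autograd) -> parameter adjustment."""
    preds = model(inputs)
    loss = loss_fn(preds, targets)
    model.zero_grad()
    loss.backward()              # computes d(loss)/d(parameter) for every model parameter
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad     # adjust the LSTM's parameters along the gradient
    return loss.item()
```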
To make the above training process clearer, a simple example based on the building A mentioned above is given below. For example, in addition to building A, there are an album cover image, text, and audio data describing music C, where the initial feature vector corresponding to the album cover image describing music C is T_4, the initial feature vector corresponding to the text describing music C is T_5, and the initial feature vector corresponding to the audio describing music C is T_6. The initial feature vectors T_1, T_2, T_3 of the three modalities describing building A (hand drawing, speech, text) and the initial feature vectors T_4, T_5, T_6 of the three modalities describing music C (album cover image, text, and audio) are then input into the LSTM model as training data. By adjusting the relevant parameters of the long short-term memory network, the pairwise Euclidean distances between the finally obtained T_1', T_2', and T_3' are made smaller than the predetermined similarity threshold, so that the feature vectors of the modal data describing building A are clustered together; in the same way, the pairwise Euclidean distances between the final feature vectors T_4', T_5', and T_6' corresponding to the initial feature vectors T_4, T_5, T_6 describing music C are made smaller than the predetermined similarity threshold, so that the feature vectors of the modal data describing music C are clustered together.
In addition, it can be understood that, in other embodiments of the present application, the similarity between two feature vectors may also be determined in other ways, not limited to the Euclidean distance of the above formula and the loss function; for example, the similarity between feature vectors may also be calculated by cosine similarity or the Pearson correlation coefficient.
It can be understood that, although the server 200 is used to train the multi-modal search model in the foregoing embodiments, other computer devices may also be used to train the multi-modal search model in other embodiments; this is not limited here.
(2) Establishing the index relationship between the multi-modal data on the mobile phone 100 and the feature vectors
Continuing to refer to Fig. 2, after the multi-modal search model has been trained on the server 200, an Android project can be created; the model is read and parsed through the model reading interface of the project, then compiled to generate an APK (Android application package) file, which is installed on the mobile phone 100 to complete the porting of the multi-modal search model. The data of the various modalities on the mobile phone 100 (images, speech, text, video, and the like) are then input into the multi-modal search model to obtain the feature vector corresponding to each piece of data, and the index relationship between each piece of data and its feature vector is established, yielding the index library.
For example, image 1 to image 100, speech 1 to speech 50, and text 1 to text 80 on the mobile phone 100 can all be input into the multi-modal search model to obtain feature vectors T_1 to T_100 corresponding to image 1 to image 100, feature vectors T_101 to T_150 corresponding to speech 1 to speech 50, and feature vectors T_151 to T_230 corresponding to text 1 to text 80. An index relationship between each feature vector obtained above and an identifier of the corresponding data can then be established. For example, the data identifier may be an identifier set for the image, text, or speech file, or the name of the data on the mobile phone 100, or the entire source data may be used as the identifier. For instance, an index relationship is established between T_1 and the name "20200107adefeg" of image 1, and this index relationship is then stored in the index library.
It can be understood that the index library may exist in the multi-modal search model in the form of a database, and the feature vectors and the identifiers of the corresponding data may be stored in the database in the form of fields.
In some embodiments, the specific process of building the index library on the mobile phone 100 is shown in Fig. 4: the trained multi-modal search model is ported to the mobile phone 100; the feature vectors of the data of the various modalities on the mobile phone 100, such as images, speech, and text (in other embodiments, video, sensor detection data, and the like may also be included), are generated by the multi-modal search model and stored on the mobile phone 100; at the same time, the index relationships between the data on the mobile phone 100 and their feature vectors are established, yielding the index library. For example, the hand drawing, speech, and text of building A, together with other data such as a meeting recording and a Monday schedule memo, are passed through the multi-modal search model to generate their respective feature vectors T_1, T_2, T_3, T_150, and T_230; these feature vectors are stored on the mobile phone 100, and at the same time the index relationships between these data and their respective feature vectors are stored in the index library.
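The following sketch shows one way such an index library could be materialized as a small database: each device file is run through the ported model, and the resulting vector is stored against the file's identifier. The table layout and the model call are illustrative assumptions; the application only requires that feature vectors and data identifiers be kept in correspondence.

```python
import sqlite3
import numpy as np

def build_index_library(db_path, device_files, model):
    """device_files: iterable of (identifier, modality, data); model(data, modality) -> vector."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS index_library ("
                "data_id TEXT PRIMARY KEY, modality TEXT, feature_vector BLOB)")
    for data_id, modality, data in device_files:
        vec = np.asarray(model(data, modality), dtype=np.float32)
        con.execute("INSERT OR REPLACE INTO index_library VALUES (?, ?, ?)",
                    (data_id, modality, vec.tobytes()))
    con.commit()
    con.close()

# e.g. build_index_library("index.db",
#                          [("20200107adefeg", "image", image1_bytes), ...],
#                          multimodal_model)
```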
(3) Global search
Continuing to refer to Fig. 2, after the index library has been built on the mobile phone 100, a global search can be implemented on the mobile phone 100. That is, at search time, data of various modalities, for example images, text, speech, video, and sensor detection data, can be input; the multi-modal search model converts these data into feature vectors, and the search result is obtained by comparing the feature vector corresponding to the converted search input data with the feature vectors in the index library, where the search result may include data of various modalities on the mobile phone.
In some embodiments, the user can perform a global search by entering a search keyword in the search box on the negative one screen of the mobile phone 100. Specifically, as shown in Fig. 5a, when performing a global search on the negative one screen of the mobile phone 100, the user can search by entering a keyword: for example, the user enters the search keyword in the search box on the negative one screen of the mobile phone 100, and the mobile phone 100 generates the feature vector of the keyword entered by the user through the multi-modal search model ported into the global search. The similarity between the feature vector of the keyword and the feature vectors in the index library built above can then be calculated by the above formula for the Euclidean distance between vectors; for the feature vectors in the index library whose similarity to the keyword's feature vector is greater than the similarity threshold, the data of the various modalities corresponding to these feature vectors are obtained according to the index relationships in the index library, and these data are displayed on the negative one screen of the mobile phone 100.
For example, as shown in Fig. 5b, the user enters the content to be searched for, such as "woman wearing a hat", in the search box on the negative one screen of the mobile phone; the mobile phone 100 performs a global search based on the specific keywords such as "wearing a hat" and "woman", and then displays the search results (for example, all images containing the above keywords, schedule memos containing the above keywords, and the specific speech and speech transcripts containing the above keywords) below the search bar. The user can also enter the desired content in the search bar on the negative one screen of the mobile phone by means such as voice or picture input.
More specifically, the multi-modal search model in the mobile phone 100 extracts the features "wearing a hat" and "woman" of the above search text, generates the feature vector T_searchtext corresponding to the search text, calculates the similarity between T_searchtext and the feature vectors in the index library by the above formula, selects the feature vectors whose similarity is greater than the similarity threshold, and then finds the data corresponding to those feature vectors and outputs them as the search result; for example, the search result includes image 1, image 2, and image 3 showing a woman wearing a hat, the related content in the mobile phone schedule, and the related audio files in the voice memos. In the search results, the images themselves may be displayed directly, or thumbnails of the images, or the names and thumbnails of the images.
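Building on the index-library sketch above, the following fragment shows how such a query could be answered: the query vector is compared with every stored vector, and entries within the distance threshold are returned regardless of their modality. Again, the threshold value and the helper names are assumptions for illustration.

```python
import sqlite3
import numpy as np

def global_search(db_path, query_vec, distance_threshold=0.5):
    """Return (data_id, modality) pairs whose index vectors lie close to the query vector."""
    query_vec = np.asarray(query_vec, dtype=np.float32)
    con = sqlite3.connect(db_path)
    hits = []
    for data_id, modality, blob in con.execute(
            "SELECT data_id, modality, feature_vector FROM index_library"):
        vec = np.frombuffer(blob, dtype=np.float32)
        d = np.linalg.norm(query_vec - vec)
        if d < distance_threshold:   # close in the unified space, hence a relevant result
            hits.append((d, data_id, modality))
    con.close()
    # closest matches first; results may mix images, schedule entries, audio, ...
    return [(data_id, modality) for d, data_id, modality in sorted(hits)]
```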
As another example, when the keyword entered by the user is "building A", the mobile phone generates the feature vector T_building of the keyword "building A" through the multi-modal search model, calculates the similarity between T_building and the feature vectors stored in the multi-modal feature vector index database by the above formula, determines that the feature vectors T_1, T_2, and T_3 are close to the feature vector T_building of the keyword "building A", and then outputs, according to the feature vectors T_1, T_2, and T_3, the data of each modality of building A, namely the hand drawing of building A and the speech and text describing building A.
Similarly, when performing a global search, the user may also input speech through the speech input unit 170a; the mobile phone then generates the feature vector T_speech of the speech input by the user through the multi-modal search model, calculates the similarity between T_speech and the feature vectors in the above index library, determines the search result according to the calculation result, and displays the search result on the negative one screen of the mobile phone 100 through the display screen. For example, the user inputs the speech "meeting arrangement" through 170a; the multi-modal search model extracts the feature vector T_meeting of "meeting arrangement", the similarity between T_meeting and each feature vector in the index library is calculated using the above similarity calculation method, and upon finding that T_meeting is close to the feature vectors T_150 and T_230, the mobile phone determines that the search results for "meeting arrangement" are the meeting recording and the Monday schedule memo, and outputs the meeting recording audio and the Monday schedule memo on the negative one screen of the mobile phone 100 through the display screen.
In addition, the user may also input an image through the image input unit 193a; the mobile phone then generates the feature vector T_image of the image input by the user through the multi-modal search model, calculates the similarity between T_image and the feature vectors in the above index library, determines the search result according to the calculation result, and displays the search result on the negative one screen of the mobile phone 100 through the display screen. For example, the user inputs an image of "building A" through the image input unit 193a; the multi-modal search model extracts the feature vector T_buildingA_image of "building A", the similarity between T_buildingA_image and each feature vector stored in the index library is calculated using the above similarity calculation method, and upon finding that T_buildingA_image is close to T_1, the mobile phone determines that the search result for "building A" is the hand drawing of building A; moreover, according to the index relationships between the modal feature vectors in the index library, the other modal data correlated with the hand drawing of building A, such as the speech describing building A and the text describing building A, are all output on the negative one screen of the mobile phone 100 through the display screen.
Further, in another embodiment of the present application, the solution of the present application is also applicable to a global search in the memo of the mobile phone 100. Specifically, as shown in Fig. 6, the user can search by entering specific text, speech, or a picture in the search bar 600 of the mobile phone memo.
More specifically, the process by which the user searches for a picture of a woman wearing a hat can be as follows:
When the user searches in the memo search bar 600 of the mobile phone 100, the mobile phone 100 performs a corresponding search according to the modality of the user's input data:
(a) When the user enters the text "woman wearing a hat", the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_text of the input text, calculates, using the above calculation method, the similarity between T'_text and the feature vectors T_image, T_text, and T_audio of the "woman wearing a hat" data stored in the index library, and determines that T'_text is similar or close to T_text. Then, according to the index relationships in the index library, the mobile phone 100 outputs all related content corresponding to the feature vector closest to T'_text, that is, the memo about the woman wearing a hat, or the image (or the image's ID) and the audio of the woman wearing a hat in the attachments, and at the same time displays the number of matching items found (for example, "1 item found"). In the search results, the images themselves may be displayed directly, or thumbnails of the images, or the names and thumbnails of the images.
(b) When the user inputs the speech "woman wearing a hat" through the speech input unit 170a during the search, the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_audio of the input speech, calculates the similarity between T'_audio and the feature vectors T_image, T_text, and T_audio stored in the index library, and determines that T'_audio is similar or close to T_audio. Then, according to the index relationships in the index library, the mobile phone 100 outputs all related content corresponding to the feature vector closest to T'_audio, that is, the memo about the woman wearing a hat, or the image (or the image's ID) and the audio of the woman wearing a hat in the attachments, and at the same time displays the number of matching items found (for example, "1 item found"). In the search results, the images themselves may be displayed directly, or thumbnails of the images, or the names and thumbnails of the images.
(c) When the user inputs an image of a "woman wearing a hat" through the image input unit 193a during the search, the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_image of the input image, calculates the similarity between T'_image and the feature vectors T_image, T_text, and T_audio stored in the index library, and determines that T'_image is similar or close to T_image. Then, according to the index relationships in the index library, the mobile phone 100 outputs all related content corresponding to the feature vector closest to T'_image, that is, the memo about the woman wearing a hat, or the image (or the image's ID) and the audio of the woman wearing a hat in the attachments, and at the same time displays the number of matching items found (for example, "1 item found"). In the search results, the images themselves may be displayed directly, or thumbnails of the images, or the names and thumbnails of the images.
In addition, corresponding to the above search method, Fig. 7 shows a schematic structural diagram of an electronic device. It can be understood that the specific technical details of the above search method also apply to this electronic device; to avoid repetition, they are not described again here.
As shown in Fig. 7, the electronic device includes:
an obtaining module 701, configured to obtain search data input by a user;
a feature extraction module 702, configured to extract features of the search data, and to generate a search feature vector of the search data based on the extracted features;
a similarity calculation module 703, configured to compare the search feature vector with a plurality of index feature vectors in an index library, to select an index feature vector from the index library whose similarity to the search feature vector is greater than a similarity threshold,
where, in the index library, there is a correspondence between the plurality of index feature vectors and a plurality of result data of a plurality of modalities, and the more similar the extracted features of different result data are, the greater the similarity between the index feature vectors corresponding to the different result data; and
an output module 704, configured to output the result data corresponding to the selected index feature vector as a search result, where the result data included in the search result has a plurality of modalities.
In addition, Fig. 8 shows a schematic structural diagram of an electronic device 800 according to an embodiment of the present application. The electronic device 800 can be used to train the above multi-modal search model; it can also receive the above multi-modal model from another electronic device and then perform, based on that model, a global search over the data of the various modalities on the electronic device 800. The electronic device 800 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the electronic device 800. In other embodiments of the present application, the electronic device 800 may include more or fewer components than shown, or combine certain components, or split certain components, or have a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units and performs the above feature extraction of modal data and training of the multi-modal search model. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and the like. The different processing units may be independent devices or may be integrated into one or more processors.
The controller can generate operation control signals according to instruction operation codes and timing signals, completing the control of instruction fetching and instruction execution.
The charging management module 140 is configured to receive charging input from a charger, which may be a wireless charger or a wired charger.
The power management module 141 is configured to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, the wireless communication module 160, and the like.
移动通信模块150可以提供应用在电子设备800上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。The mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G and the like applied to the electronic device 800. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
无线通信模块160可以提供应用在电子设备800上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system, GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。The wireless communication module 160 can provide applications on the electronic device 800 including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), bluetooth (BT), and global navigation satellites. System (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication technology (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions. The wireless communication module 160 may be one or more devices integrating at least one communication processing module.
电子设备800通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。The electronic device 800 implements a display function through a GPU, a display screen 194, and an application processor. The GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations and is used for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备800可以包括1个或N个显示屏194,N为大于1的正整数。The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel can adopt liquid crystal display (LCD), organic light-emitting diode (OLED), active matrix organic light-emitting diode or active-matrix organic light-emitting diode (active-matrix organic light-emitting diode). AMOLED, flexible light-emitting diode (FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diode (QLED), etc. In some embodiments, the electronic device 800 may include one or N display screens 194, and N is a positive integer greater than one.
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备800可以包括1个或N个摄像头193,N为大于1的正整数。The camera 193 is used to capture still images or videos. The object generates an optical image through the lens and is projected to the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal. ISP outputs digital image signals to DSP for processing. DSP converts digital image signals into standard RGB, YUV and other formats of image signals. In some embodiments, the electronic device 800 may include one or N cameras 193, and N is a positive integer greater than one.
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备800在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 800 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备800的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解、图像聚类等。NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example, the transfer mode between human brain neurons, it can quickly process input information, and it can also continuously self-learn. Through the NPU, applications such as intelligent cognition of the electronic device 800 can be realized, such as image recognition, face recognition, voice recognition, text understanding, image clustering, and so on.
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备800使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行电子设备800的各种功能应用以及数据处理。同时,内部存储器121也可以存储手机100中的各种模态数据、以及移植至手机100的多模态搜索模型并且存储模型的中间计算数据,存储模型参数,索引库等。The internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions. The internal memory 121 may include a storage program area and a storage data area. Among them, the storage program area can store an operating system, an application program (such as a sound playback function, an image playback function, etc.) required by at least one function, and the like. The data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 800. In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like. The processor 110 executes various functional applications and data processing of the electronic device 800 by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor. At the same time, the internal memory 121 can also store various modal data in the mobile phone 100, a multi-modal search model transplanted to the mobile phone 100 and store intermediate calculation data of the model, store model parameters, an index library, etc.
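Purely as an illustrative sketch (not part of the application itself), the following Python fragment shows one way such an on-device index library could pair index feature vectors with multi-modal result data; the class names, fields, and the URI-based addressing are hypothetical assumptions:

    from dataclasses import dataclass, field

    @dataclass
    class IndexEntry:
        vector: list[float]   # index feature vector in the unified vector space
        modality: str         # e.g. "image", "video", "audio", "text", "sensor"
        uri: str              # where the corresponding result data lives on the device

    @dataclass
    class IndexLibrary:
        entries: list[IndexEntry] = field(default_factory=list)

        def add(self, vector: list[float], modality: str, uri: str) -> None:
            # Each stored vector keeps a correspondence to its result data,
            # which is what allows a single query to return results of
            # several modalities.
            self.entries.append(IndexEntry(vector, modality, uri))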
The electronic device 800 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 800 can play music or hands-free calls through the speaker 170A.
The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 800 answers a call or a voice message, the voice can be heard by bringing the receiver 170B close to the ear.
The microphone 170C, also called a "mic", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 800 may be provided with at least one microphone 170C. In other embodiments, the electronic device 800 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 800 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.
The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
Reference is now made to FIG. 9, which is a software structure block diagram of an electronic device in an embodiment of the present application. The electronic device 900 can be used to train the aforementioned multi-modal search model, or it can receive the model from another electronic device and then perform a global search over data of various modalities on the electronic device 900 based on the model. The software system of the electronic device may adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present invention takes an Android system with a layered architecture as an example to illustrate the software structure of the terminal device.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 9, the application packages may include applications such as Phone, Camera, Gallery, Calendar, Call, Map, Navigation, WLAN, Bluetooth, Music, Video, and Messages.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 9, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and so on.
The window manager is used to manage window programs. The window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
The content provider is used to store and retrieve data and to make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, the phone book, and so on.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may be composed of one or more views. For example, a display interface that includes a short-message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the terminal device, for example management of the call state (connected, hung up, and so on).
The resource manager provides applications with various resources, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages that disappear automatically after a short stay without user interaction, for example notifications of download completion and message reminders. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is sounded, the terminal device vibrates, or an indicator light flashes.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core libraries consist of two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine performs functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include multiple functional modules, for example a surface manager, media libraries, a three-dimensional graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager is used to manage the display subsystem and provides fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as static image files. The media libraries can support a variety of audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
Although the present application has been illustrated and described with reference to certain preferred embodiments thereof, those of ordinary skill in the art will understand that various changes in form and detail may be made therein without departing from the spirit and scope of the present application.

Claims (10)

  1. A search method for an electronic device, characterized by comprising:
    obtaining search data input by a user;
    extracting features of the search data, and generating a search feature vector of the search data based on the extracted features;
    comparing the search feature vector with a plurality of index feature vectors in an index library, to select an index feature vector in the index library whose similarity with the search feature vector is greater than a similarity threshold,
    wherein, in the index library, there is a correspondence between the plurality of index feature vectors and a plurality of result data of a plurality of modalities; and
    outputting the result data corresponding to the selected index feature vector as a search result, wherein the result data included in the search result has a plurality of modalities.
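As an informal illustration only (a sketch, not the claimed implementation), the steps of claim 1 can be read as the following Python flow; extract_features, similarity, and the threshold value are hypothetical placeholders, and index_library refers to the illustrative IndexLibrary structure sketched in the description above:

    def search(query, index_library, extract_features, similarity, threshold=0.8):
        # Obtain the search data input by the user and generate its
        # search feature vector from the extracted features.
        search_vector = extract_features(query)
        results = []
        # Compare the search feature vector with every index feature
        # vector in the index library.
        for entry in index_library.entries:
            if similarity(search_vector, entry.vector) > threshold:
                # The selected index vectors map back to result data,
                # which may span several modalities.
                results.append((entry.modality, entry.uri))
        return results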
  2. The search method according to claim 1, characterized in that the similarity between the search feature vector and the index feature vector is calculated by the following formula:
    [Formula: see image PCTCN2021079905-appb-100001]
    where d denotes the similarity between the search feature vector and the plurality of feature vectors stored in the index library, x_i denotes the feature vector of the input data, y_i denotes a feature vector stored in the index library, and i denotes a dimension of the feature vector of the input data or of the feature vectors stored in the index library.
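The claimed formula itself appears only as an image in the published text and is not reproduced here. Purely as a hedged illustration, the sketch below computes cosine similarity over the dimensions i, which is consistent with selecting vectors whose similarity exceeds a threshold; whether the claimed formula is cosine similarity or some other measure is an assumption of this sketch:

    import math

    def similarity(x, y):
        # d: similarity between the search feature vector x and a stored
        # index feature vector y, accumulated over each dimension i.
        # Cosine similarity is assumed here; the actual claimed formula is
        # shown only in the unreproduced image above.
        dot = sum(x[i] * y[i] for i in range(len(x)))
        norm_x = math.sqrt(sum(v * v for v in x))
        norm_y = math.sqrt(sum(v * v for v in y))
        denom = norm_x * norm_y
        return dot / denom if denom else 0.0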
  3. The search method according to claim 1 or 2, characterized in that the index library resides on the electronic device, and the plurality of result data of the plurality of modalities that correspond to the plurality of index feature vectors in the index library are data on the electronic device.
  4. The search method according to claim 3, characterized in that the electronic device is a mobile terminal.
  5. The search method according to claim 4, characterized in that the user inputs the search data on the negative one screen of the mobile terminal.
  6. The search method according to claim 4, characterized in that the user inputs the search data in a memo of the mobile terminal.
  7. The search method according to any one of claims 1 to 6, characterized in that the plurality of modalities include image, video, audio, text, and detection data of a sensor of the electronic device.
  8. An electronic device, characterized by comprising:
    an obtaining module, configured to obtain search data input by a user;
    a feature extraction module, configured to extract features of the search data and to generate a search feature vector of the search data based on the extracted features;
    a similarity calculation module, configured to compare the search feature vector with a plurality of index feature vectors in an index library, to select an index feature vector in the index library whose similarity with the search feature vector is greater than a similarity threshold,
    wherein, in the index library, there is a correspondence between the plurality of index feature vectors and a plurality of result data of a plurality of modalities; and
    an output module, configured to output the result data corresponding to the selected index feature vector as a search result, wherein the result data included in the search result has a plurality of modalities.
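Again as an illustrative sketch only (the module names mirror the claim, while the internals and constructor arguments are assumptions), the claimed modules could be wired together as follows, reusing the hypothetical helpers from the earlier sketches:

    class ElectronicDevice:
        def __init__(self, index_library, extract_features, similarity, threshold):
            self.index_library = index_library
            self.extract_features = extract_features  # feature extraction module
            self.similarity = similarity              # similarity calculation module
            self.threshold = threshold

        def obtain(self, user_input):
            # Obtaining module: receive the user's search data.
            return user_input

        def search(self, user_input):
            query = self.obtain(user_input)
            vector = self.extract_features(query)
            # Output module: the returned result data may have several modalities.
            return [(e.modality, e.uri)
                    for e in self.index_library.entries
                    if self.similarity(vector, e.vector) > self.threshold]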
  9. A machine-readable medium, characterized in that instructions are stored on the machine-readable medium, and the instructions, when executed on a machine, cause the machine to perform the method according to any one of claims 1 to 7.
  10. An electronic device, comprising: a memory for storing instructions to be executed by one or more processors of a system, and a processor, being one of the processors of the system, configured to perform the method according to any one of claims 1 to 7.
PCT/CN2021/079905 2020-03-10 2021-03-10 Electronic device and search method thereof, and medium WO2021180109A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010164088.6A CN111460231A (en) 2020-03-10 2020-03-10 Electronic device, search method for electronic device, and medium
CN202010164088.6 2020-03-10

Publications (1)

Publication Number Publication Date
WO2021180109A1 true WO2021180109A1 (en) 2021-09-16

Family

ID=71678431

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/079905 WO2021180109A1 (en) 2020-03-10 2021-03-10 Electronic device and search method thereof, and medium

Country Status (2)

Country Link
CN (1) CN111460231A (en)
WO (1) WO2021180109A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460231A (en) * 2020-03-10 2020-07-28 华为技术有限公司 Electronic device, search method for electronic device, and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060233325A1 (en) * 2005-04-14 2006-10-19 Cheng Wu System and method for management of call data using a vector based model and relational data structure
CN101334796A (en) * 2008-02-29 2008-12-31 浙江师范大学 Personalized and synergistic integration network multimedia search and enquiry method
CN103955543A (en) * 2014-05-20 2014-07-30 电子科技大学 Multimode-based clothing image retrieval method
CN109783655A (en) * 2018-12-07 2019-05-21 西安电子科技大学 A kind of cross-module state search method, device, computer equipment and storage medium
CN110851629A (en) * 2019-10-14 2020-02-28 信阳农林学院 Image retrieval method
CN111460231A (en) * 2020-03-10 2020-07-28 华为技术有限公司 Electronic device, search method for electronic device, and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102346778B (en) * 2011-10-11 2013-08-21 北京百度网讯科技有限公司 Method and equipment for providing searching result
BR112019021201A8 (en) * 2017-04-10 2023-04-04 Hewlett Packard Development Co MACHINE LEARNING IMAGE SEARCH
CN110309324B (en) * 2018-03-09 2024-03-22 北京搜狗科技发展有限公司 Searching method and related device
CN109740077B (en) * 2018-12-29 2021-02-12 北京百度网讯科技有限公司 Answer searching method and device based on semantic index and related equipment thereof

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210294840A1 (en) * 2020-03-19 2021-09-23 Adobe Inc. Searching for Music
US11461649B2 (en) * 2020-03-19 2022-10-04 Adobe Inc. Searching for music
US20230097356A1 (en) * 2020-03-19 2023-03-30 Adobe Inc. Searching for Music
US11636342B2 (en) 2020-03-19 2023-04-25 Adobe Inc. Searching for music
CN116089368A (en) * 2022-08-01 2023-05-09 荣耀终端有限公司 File searching method and related device
CN116089368B (en) * 2022-08-01 2023-12-19 荣耀终端有限公司 File searching method and related device

Also Published As

Publication number Publication date
CN111460231A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
WO2021180109A1 (en) Electronic device and search method thereof, and medium
US11783191B2 (en) Method and electronic device for providing text-related image
WO2021036906A1 (en) Picture processing method and apparatus
WO2023125335A1 (en) Question and answer pair generation method and electronic device
CN111985240B (en) Named entity recognition model training method, named entity recognition method and named entity recognition device
CN110209784B (en) Message interaction method, computer device and storage medium
WO2022100221A1 (en) Retrieval processing method and apparatus, and storage medium
WO2021254411A1 (en) Intent recognigion method and electronic device
WO2024040865A1 (en) Video editing method and electronic device
US20240105159A1 (en) Speech processing method and related device
WO2021147421A1 (en) Automatic question answering method and apparatus for man-machine interaction, and intelligent device
CN114281956A (en) Text processing method and device, computer equipment and storage medium
CN112765387A (en) Image retrieval method, image retrieval device and electronic equipment
CN111444321B (en) Question answering method, device, electronic equipment and storage medium
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
WO2021238371A1 (en) Method and apparatus for generating virtual character
CN110245334A (en) Method and apparatus for output information
KR20210120203A (en) Method for generating metadata based on web page
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
CN111597823A (en) Method, device and equipment for extracting central word and storage medium
CN116431838B (en) Document retrieval method, device, system and storage medium
WO2023168997A1 (en) Cross-modal retrieval method and related device
CN117076702B (en) Image searching method and electronic equipment
WO2022143083A1 (en) Application search method and device, and medium
WO2024012171A1 (en) Binary quantization method, neural network training method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21768297

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21768297

Country of ref document: EP

Kind code of ref document: A1