WO2021180109A1 - Electronic device and search method thereof, and medium - Google Patents

Electronic device and search method thereof, and medium

Info

Publication number
WO2021180109A1
WO2021180109A1 · PCT/CN2021/079905 · CN2021079905W
Authority
WO
WIPO (PCT)
Prior art keywords
search
feature vector
data
index
electronic device
Prior art date
Application number
PCT/CN2021/079905
Other languages
French (fr)
Chinese (zh)
Inventor
吴大
李艳明
唐吴全
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021180109A1 publication Critical patent/WO2021180109A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 — Details of database functions independent of the retrieved data types
    • G06F16/901 — Indexing; Data structures therefor; Storage structures
    • G06F16/903 — Querying
    • G06F16/9032 — Query formulation
    • G06F16/90335 — Query processing

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an electronic device and a search method and medium for the electronic device.
  • In recent years, the rapid development of machine learning and deep learning has greatly advanced search functions.
  • At present, mobile phones offer a global search function that can search for pictures in the gallery, search the Internet through a browser, and search within other applications.
  • Most existing search technologies are content-based. Taking image search as an example, the main process is: images are labeled manually or automatically by a pre-trained model, the labels are saved in a database, and the keywords entered by the user are then matched against the label text in the database to return the search results. Moreover, most searches operate on single-modal media data, for example searching the gallery for pictures via text input, or searching for pictures by image on some Internet services.
  • The embodiments of this application provide an electronic device and a search method and medium for the electronic device, which map the feature vectors of multi-modal data into a unified high-dimensional vector space, thereby realizing a multi-modal global search on the electronic device through a single model.
  • In a first aspect, the embodiments of the present application provide an electronic device and a search method for the electronic device, the method including:
  • obtaining search data input by a user; extracting features of the search data and generating a search feature vector of the search data based on the extracted features; comparing the search feature vector with multiple index feature vectors in an index library to select the index feature vectors in the index library whose similarity to the search feature vector is greater than a similarity threshold, where in the index library there is a correspondence between the multiple index feature vectors and multiple result data of multiple modalities; and outputting the result data corresponding to the selected index feature vectors as the search result, where the result data included in the search result has multiple modalities.
  • That is, the method first obtains the search data input by the user, for example image data, and extracts its low-level features, such as the color, texture, and gray level of the image. It then generates the feature vector corresponding to these low-level features, compares that feature vector for similarity with the index feature vectors stored in the index library to find the vectors highly similar to it, and finally, according to the index relationships in the index library, determines the matching feature vectors and the result data corresponding to them. A minimal sketch of this flow follows.
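A minimal sketch of the search flow described above, under the assumption of a hypothetical model.encode() that maps data of any modality to a feature vector in the unified space, and an index library held as (index_vector, data_id, modality) rows; this is an illustration, not the patent's implementation.

```python
def global_search(query_data, model, index_library, threshold):
    query_vector = model.encode(query_data)  # search feature vector
    results = []
    for index_vector, data_id, modality in index_library:
        # Euclidean distance: smaller distance means higher similarity.
        d = sum((q - v) ** 2 for q, v in zip(query_vector, index_vector)) ** 0.5
        if d < threshold:
            results.append((data_id, modality))
    return results  # result data spanning multiple modalities
```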
  • In a possible implementation of the first aspect, the similarity between the search feature vector and the index feature vectors is calculated by the following formula:

  $d = \sqrt{\sum_i (x_i - y_i)^2}$

  • where d represents the similarity (the Euclidean distance) between the search feature vector and a feature vector stored in the index library, $x_i$ represents the i-th component of the feature vector of the input data, $y_i$ represents the i-th component of a feature vector stored in the index library, and i ranges over the dimensions of the feature vectors.
  • That is, the similarity between the search feature vector and the index feature vectors can be calculated as a Euclidean distance; it is understandable that the similarity can also be calculated in other ways, for example with the Pearson coefficient. Both measures are sketched below.
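A sketch of the two similarity measures named above, using numpy. The example vectors are invented, and the Pearson coefficient is the alternative the text permits, not the formula the patent requires.

```python
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_coefficient(x, y):
    # Closer to 1 means more similar; defined for vectors of length >= 2.
    return np.corrcoef(x, y)[0, 1]

x = np.array([222.0, 20.0, 5.0])
y = np.array([220.0, 21.0, 5.0])
print(euclidean_distance(x, y))   # ~2.24
print(pearson_coefficient(x, y))  # ~1.0 for these nearly parallel vectors
```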
  • In a possible implementation of the first aspect, the electronic device itself holds the index library, and the multiple result data of multiple modalities that correspond to the multiple index feature vectors in the index library are data on the electronic device.
  • In a possible implementation of the first aspect, the electronic device is a mobile terminal. That said, the electronic device is not limited to mobile terminals such as mobile phones; it can also be an electronic device such as a server or a PC.
  • In a possible implementation of the first aspect, the user inputs the search data on the negative screen (the leftmost home-screen page) of the mobile terminal. That is, the multi-modal search on the electronic device can be applied to the negative screen of mobile terminals such as mobile phones.
  • In a possible implementation of the first aspect, the user inputs the search data in the memo application of the mobile terminal. That is, the multi-modal search on the electronic device can also be applied to the memos of mobile terminals such as mobile phones.
  • In a possible implementation of the first aspect, the multiple modalities include image, video, audio, text, and detection data from the sensors of the electronic device. A modality refers to the source form or mode of existence of data, so multi-modal data includes image, text, video, audio, and other data.
  • In a second aspect, an embodiment of the present application provides an electronic device, which includes:
  • an obtaining module, used to obtain the search data input by the user;
  • a feature extraction module, used to extract features of the search data and generate the search feature vector of the search data based on the extracted features;
  • a similarity calculation module, used to compare the search feature vector with the multiple index feature vectors in the index library, to select the index feature vectors in the index library whose similarity to the search feature vector is greater than the similarity threshold, where in the index library there is a correspondence between the multiple index feature vectors and multiple result data of multiple modalities; and
  • an output module, used to output the result data corresponding to the selected index feature vectors as the search result, where the result data included in the search result has multiple modalities. A structural sketch of these modules follows.
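A structural sketch of the four modules of the second-aspect device, composed into one searcher class; the class and method names are illustrative, not taken from the patent, and encode() is the same hypothetical model interface as in the earlier sketch.

```python
class MultiModalSearcher:
    def __init__(self, model, index_library, threshold):
        self.model = model                   # backs the feature extraction module
        self.index_library = index_library   # (index_vector, data_id, modality) rows
        self.threshold = threshold

    def obtain(self, user_input):            # obtaining module
        return user_input

    def extract(self, search_data):          # feature extraction module
        return self.model.encode(search_data)

    def select_similar(self, query_vector):  # similarity calculation module
        for vector, data_id, modality in self.index_library:
            d = sum((q - v) ** 2 for q, v in zip(query_vector, vector)) ** 0.5
            if d < self.threshold:
                yield data_id, modality

    def search(self, user_input):            # output module
        return list(self.select_similar(self.extract(self.obtain(user_input))))
```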
  • In a third aspect, an embodiment of the present application provides a machine-readable medium on which instructions are stored; when the instructions are executed on a machine, the machine executes any one of the possible methods of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions executed by one or more processors of the system, and a processor, which is one of the processors of the system, for executing any one of the possible methods of the first aspect.
  • an embodiment of the present application provides an electronic device that has the function of implementing the above search method.
  • the function can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units corresponding to the above-mentioned functions.
  • Fig. 1 shows a multi-modal global search scene 10 according to some embodiments of the present application.
  • Fig. 2 shows a schematic diagram of a method for performing a global search in the mobile phone 100 according to some embodiments of the present application.
  • Fig. 3 shows a schematic diagram of a process of generating a multi-modal search model according to some embodiments of the present application.
  • Fig. 4 shows a schematic diagram of a process of establishing an index library according to some embodiments of the present application.
  • Fig. 5a shows a negative screen of a mobile phone 100 according to some embodiments of the present application.
  • Fig. 5b shows a schematic diagram of performing a global search on a negative screen of a mobile phone according to some embodiments of the present application.
  • Fig. 6 shows a schematic diagram of performing a global search in a mobile phone memo according to some embodiments of the present application.
  • Fig. 7 shows a schematic structural diagram of an electronic device according to some embodiments of the present application.
  • Fig. 8 shows a schematic structural diagram of another electronic device according to some embodiments of the present application.
  • Fig. 9 shows a software structure block diagram of an electronic device according to some embodiments of the present application.
  • the illustrative embodiments of the present application include, but are not limited to, an electronic device and a search method, medium and system of the electronic device.
  • A modality refers to the source or the form of each type of information. Different types of information, such as images, voices, texts, and videos, represent different modalities; likewise, the detection results of radar, infrared sensors, accelerometers, and the like can be regarded as different modalities because they come from different sources.
  • Fig. 1 shows a multi-modal global search scene 10 according to some embodiments of the present application.
  • the scene 10 includes an electronic device 100 and an electronic device 200.
  • The electronic device 100 can use the multi-modal search model trained by the electronic device 200 to realize multi-modal search: when data of a certain modality is input, the multi-modal search model generates the feature vector corresponding to that data, the generated feature vector is compared with the feature vectors in the multi-modal feature vector index library, and the data of each modality corresponding to the index feature vectors that meet the predetermined condition is output as the search result.
  • the electronic device 200 can train a multimodal search model by using multiple modal data and the characteristics of each modal data.
  • the feature vectors generated by the same or similar modal data are the same or similar.
  • That two feature vectors are similar means that the difference between them is less than a similarity threshold. The difference between feature vectors can be represented by the Euclidean distance between them: the larger the Euclidean distance, the greater the difference between the vectors and the smaller their similarity.
  • the electronic device 200 can not only train a multi-modal search model, but can also use a multi-modal search model completed by its own training to implement various search functions.
  • the electronic device 100 and the electronic device 200 may include, but are not limited to, laptop computers, desktop computers, tablet computers, mobile phones, wearable devices, head-mounted displays, servers, mobile email devices, Portable game consoles, portable music players, reader devices, televisions with one or more processors embedded or coupled therein, or other electronic devices that can access the Internet.
  • the electronic device 100 is a mobile phone and the electronic device 200 is a server as an example to illustrate the technical solution of the present application.
  • In the technical solution of the present application, a multi-modal search model capable of searching multi-modal data can be trained on the server 200 and then transplanted to the mobile phone 100, so that a global search of the multi-modal data on the mobile phone 100 is realized.
  • Fig. 2 shows a technical solution of using the server 200 to train a multimodal search model and transplanting the trained multimodal search model to the mobile phone 100 to perform a global search according to some embodiments of the present application. Specifically, as shown in Figure 2:
  • When training the multi-modal search model, the server 200 first needs to perform feature extraction on the sample data used for training.
  • the sample data may include data of multiple modalities, for example, image, voice, text, video, sensor test data, and so on.
  • These sample data (for example, an image, a speech, or a text) are generally unstructured data with different structures, which have the characteristics of high dimensionality, different forms of expression, and a lot of redundant information. Therefore, it is necessary to extract the initial feature vector that can characterize the sample data. It is understandable that these initial feature vectors can be one-dimensional or multi-dimensional. For example, a person’s performance ranking can be represented by the person’s Chinese performance, mathematics performance, and English performance.
  • the initial feature vector of the person’s performance ranking has three dimensions, namely (Chinese performance, mathematics performance, and English performance).
  • For text data, the feature vector of a single character can be one-dimensional, namely the code value of the character. A sentence such as "Xiaobai is a dog" can be represented by multiple one-dimensional initial feature vectors: the initial feature vector of the word "Xiaobai", the initial feature vector of the word "is", and the initial feature vector of the word "dog"; together, these three initial feature vectors represent the sentence "Xiaobai is a dog".
  • For example, for the hand drawing, speech, and text describing building A, feature extraction algorithm 1, feature extraction algorithm 2, and feature extraction algorithm 3 can be used respectively to generate the initial feature vectors of these three modalities of data.
  • Specifically, the initial feature vector T1 of the hand drawing of building A can be generated by the residual network Resnet-34 algorithm. For example, T1 can be (h1, h2), where h1 is the gray-scale value of the hand drawing and h2 can be the value of the hand-drawing size.
  • The Mel Frequency Cepstrum Coefficient (MFCC) algorithm can be used to generate the initial feature vector T2 of the speech describing building A. For example, T2 can be (h3, h4), where h3 and h4 may be values representing certain characteristics of the speech, such as its frequency and pitch.
  • The initial feature vector T3 of the text describing building A can be generated through an attention-based bidirectional long short-term memory network (BiLSTM+Attention). For example, T3 is (h5), where h5 can be the feature value of the character encoding of the text describing building A.
  • In this way, the initial feature vectors T1, T2, and T3 describing the hand drawing, speech, and text of building A in the above example can be expressed as T1 (hand-drawing gray-scale feature value, hand-drawing size feature value), T2 (speech frequency feature value, speech pitch feature value), and T3 (text encoding feature value). A hedged sketch of this per-modality extraction follows.
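A hedged sketch of producing the per-modality initial feature vectors, with off-the-shelf stand-ins for the algorithms named above: torchvision's Resnet-34 for the hand drawing and librosa's MFCC for the speech, while plain character codes stand in for the BiLSTM+Attention text encoder for brevity.

```python
import torch
import torchvision.models as models
import librosa

def image_initial_vector(image_tensor):            # T1; image_tensor: (1, 3, H, W)
    resnet = models.resnet34()
    resnet.fc = torch.nn.Identity()                # keep 512-dim features, drop classifier
    resnet.eval()
    with torch.no_grad():
        return resnet(image_tensor).squeeze(0)

def speech_initial_vector(waveform, sample_rate):  # T2
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate)
    return mfcc.mean(axis=1)                       # average MFCCs over time frames

def text_initial_vector(text):                     # T3 (crude stand-in)
    return [float(ord(ch)) for ch in text]         # per-character code values
```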
  • In other embodiments, the difference-of-Gaussians function can also be used to generate the initial feature vector of an image, and the bag-of-words model can be used to generate the initial feature vector of a text.
  • Feature vector clustering refers to the fact that after sample data that are related to each other are input into the multimodal search model, the final feature vectors output are similar or identical to each other.
  • correlation means that the content represented by each modal data is the same or similar.
  • Specifically, multiple low-level features can be extracted from each piece of sample data. That the features extracted from two pieces of data are similar can mean that a certain proportion of the extracted features are the same or similar, or that the difference between the values the two pieces of data yield for the same feature is less than a predetermined threshold. For example, suppose 10 and 12 features are extracted from image A and audio data B respectively; if 9 features of image A are the same as 9 features of audio data B, image A and audio data B can be considered similar.
  • For another example, suppose the features extracted from image A are "shepherd dog" and "adult dog", and the features extracted from image B are "husky" and "adult dog". For a dog-detection task, image A and image B both represent dogs, so their features are similar; for a dog-species-recognition task, the features of image A and image B are not similar. A sketch of this proportion-based rule follows.
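A sketch of the proportion rule above. The proportion values are assumptions chosen for illustration; in practice, whether two images count as similar depends on the task the threshold is tuned for.

```python
def features_similar(features_a, features_b, proportion=0.8):
    # Two pieces of data count as similar when at least `proportion`
    # of their extracted features coincide.
    shared = set(features_a) & set(features_b)
    return len(shared) / max(len(features_a), len(features_b)) >= proportion

image_a = {"shepherd dog", "adult dog"}
image_b = {"husky", "adult dog"}
print(features_similar(image_a, image_b, proportion=0.5))  # True: dog detection
print(features_similar(image_a, image_b, proportion=0.8))  # False: species recognition
```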
  • the clustering process of feature vectors will be described.
  • After the initial feature vectors T1, T2, and T3 describing the hand drawing, speech, and text of building A are obtained, they are not yet in a unified feature-vector representation space, because they were generated independently by different feature extraction algorithms. Their vector values can therefore differ greatly, making it impossible to judge the correlation between the feature vectors. For example, the initial feature vector of a hand drawing may be represented by its gray value (for example, 222), while the speech is represented by its frequency feature value (for example, 20 Hz); the two are not directly comparable.
  • Therefore, a Long Short-Term Memory (LSTM) network can be used to map the initial feature vectors into a unified representation space. The output of the LSTM model at each moment depends on the output at the previous moment. For example, input the text "I love the motherland" into the long short-term memory network model shown in Figure 3; the model outputs a feature vector at each moment in time sequence. When "I" is input at t1, the network generates the feature vector H1 (h6) corresponding to "I", where h6 represents the encoded feature value of "I". When "love" is input at t2, the network generates the feature vector H2 (h6, h7) corresponding to "I love", where h7 represents the encoded feature value of "love". When "the motherland" is input at t3, the network generates the feature vector H3; since H3 carries all the features of "I love the motherland", H3 is used to represent the whole sentence.
  • For non-sequential data such as an image, a vector value representing its feature is output at a certain moment, and the vector values at the other moments are 0. For example, for an image input, the feature vector H4 (h9) output at t1 represents its gray value, and the feature vectors output at t2 and t3 are H5 (h9, 0) and H6 (h9, 0, 0). A sketch of this per-time-step behaviour follows.
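A sketch of the per-time-step behaviour described above, using PyTorch's nn.LSTM: each step's output depends on the previous step, and the last step's output serves as the final feature vector of the whole input ("I love the motherland" -> H3). The dimensions here are illustrative.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
sequence = torch.randn(1, 3, 8)        # 3 steps: "I", "love", "the motherland"
outputs, (h_n, c_n) = lstm(sequence)   # outputs: (1, 3, 16), one vector per step
final_vector = outputs[:, -1, :]       # H3: carries features of all three steps
```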
  • In some embodiments, the training process of the Long Short-Term Memory (LSTM) network model is as follows. Whether two feature vectors are similar can be calculated by the following formula:

  $d = \sqrt{\sum_i (x_i - y_i)^2}$

  • where d represents the similarity between the two feature vectors, $x_i$ and $y_i$ represent the i-th components of the two feature vectors being compared, and i ranges over the dimensions of the feature vectors.
  • Suppose the intermediate feature vectors T1', T2', and T3' are obtained, with T1' = (a1, a2, a3), T2' = (b1, b2, b3), and T3' = (c1, c2, c3). Then the similarity between T1' and T2' can be expressed as:

  $d = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2}$

  • If the pairwise similarities show that T1', T2', and T3' are the same or similar, the intermediate feature vectors T1', T2', and T3' are taken as the final feature vectors.
  • the similarity threshold can be set according to actual needs, and this application is not limited here.
  • In other embodiments, a loss function can also be used to measure the similarity between the intermediate feature vectors obtained after the initial feature vectors of sample data of different modalities are input into the LSTM model. Specifically, the initial feature vectors of the same or similar sample data are input into the LSTM model to obtain the intermediate feature vectors, the loss function is used to calculate the error between the input initial feature vectors and the output intermediate feature vectors, and the partial derivatives are obtained based on this error. The model parameters of the LSTM model are then adjusted based on the obtained partial derivatives, as in the hedged sketch below.
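A hedged sketch of the training step this paragraph describes: the intermediate vectors of two related samples (say, the hand drawing and the speech of building A) are pulled together by minimizing their Euclidean distance. The patent does not pin down the exact loss, so the distance itself is used as the loss here, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def final_vector(initial_sequence):
    outputs, _ = model(initial_sequence)
    return outputs[:, -1, :]

def training_step(seq_a, seq_b):
    # seq_a, seq_b: initial feature vector sequences of two related samples
    optimizer.zero_grad()
    loss = torch.dist(final_vector(seq_a), final_vector(seq_b))  # Euclidean distance
    loss.backward()    # partial derivatives with respect to the model parameters
    optimizer.step()   # adjust the LSTM model parameters
    return loss.item()

print(training_step(torch.randn(1, 3, 8), torch.randn(1, 3, 8)))
```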
  • For example, suppose the initial feature vector corresponding to the album cover image describing music C is T4, the initial feature vector of the audio of music C is T5, and the initial feature vector of the text describing music C is T6. The initial feature vectors T1, T2, and T3 describing the three modalities of building A (hand drawing, speech, text) and the initial feature vectors T4, T5, and T6 describing the three modalities of music C are input into the LSTM model as training data, and the relevant parameters of the long short-term memory network are adjusted so that the pairwise Euclidean distances between the final feature vectors T1', T2', and T3' are less than the predetermined similarity threshold, clustering the feature vectors of the modal data describing building A together. The same method then makes the pairwise Euclidean distances between the final feature vectors T4', T5', and T6', obtained from the initial feature vectors T4, T5, and T6 of music C, less than the predetermined similarity threshold, thus clustering the feature vectors of the modal data describing music C together.
  • Although the server 200 is used to train the multi-modal search model in the foregoing embodiment, in other embodiments other computer equipment may also be used to train the multi-modal search model; there is no restriction here.
  • Specifically, an Android project can be built; the model is read and parsed through the model-reading interface of that project and then compiled to generate an APK (Android application package) file, which is installed on the mobile phone 100 to complete the transplantation of the multi-modal search model. Then the various modal data on the mobile phone 100 (image, voice, text, video, etc.) are input into the multi-modal search model to obtain the feature vector corresponding to each piece of data, and the index relationship between each piece of data and its feature vector is established, yielding the index library.
  • For example, image 1 to image 100, speech 1 to speech 50, and text 1 to text 80 on the mobile phone 100 can be input into the multi-modal search model, obtaining the feature vectors T1 to T100 corresponding to image 1 to image 100, and likewise feature vectors for the speech and text data.
  • the index relationship between each feature vector obtained above and the identification of the corresponding data can be established.
  • The data identification of a piece of data can be an identifier assigned to the above-mentioned image, text, or voice file, or the location at which the above-mentioned data is stored in the mobile phone 100.
  • The index library may exist in the form of a database inside the multi-modal search model, and the feature vector and the identification of the corresponding data may be stored in the database as fields, as in the sketch below.
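A sketch of keeping the index library in a database, with the feature vector and the identification of the corresponding data stored as fields, as this paragraph suggests; the SQLite schema is an assumption made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect("index_library.db")
conn.execute("""CREATE TABLE IF NOT EXISTS index_library (
                    data_id  TEXT PRIMARY KEY,  -- identification of the data
                    modality TEXT,              -- image / voice / text / ...
                    vector   TEXT               -- feature vector serialized as JSON
                )""")
conn.execute("INSERT OR REPLACE INTO index_library VALUES (?, ?, ?)",
             ("image_1", "image", json.dumps([0.12, 0.98, 0.45])))
conn.commit()
```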
  • the specific process of establishing an index library on the mobile phone 100 is shown in FIG. 4:
  • First, the trained multi-modal search model is transplanted to the mobile phone 100. The multi-modal search model is then used to generate the feature vectors of the various modal data on the mobile phone 100, such as images, voice, and text (in other embodiments, possibly also video, sensor detection data, etc.). These feature vectors generated by the multi-modal search model are stored in the mobile phone 100, and at the same time an index relationship between each piece of data on the mobile phone 100 and its feature vector is established to obtain the index library.
  • For example, the model generates the respective feature vectors T1, T2, T3, ..., T150, ..., T230 for the data on the phone and stores these feature vectors in the mobile phone 100, and at the same time stores the index relationship between these data and their respective feature vectors in the index library, as in the sketch below.
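A sketch of the offline indexing pass just described: every piece of on-device data is run through the (transplanted) multi-modal search model and the vector-to-data index relationship is recorded; encode() is the same hypothetical model interface as in the earlier sketches.

```python
def build_index(model, device_data):
    # device_data: iterable of (data_id, modality, raw_data) items on the phone
    index_library = []
    for data_id, modality, raw in device_data:
        vector = model.encode(raw)                 # T1 ... T230
        index_library.append((vector, data_id, modality))
    return index_library
```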
  • After the index library is established, a global search can be implemented on the mobile phone 100: when searching, data of any of various modalities can be input, for example image, text, voice, video, or sensor detection data. The multi-modal search model converts this input into a feature vector, and the search results are obtained by comparing the converted feature vector corresponding to the search input with the feature vectors in the index library. The search results can include data of the various modalities on the mobile phone.
  • For example, the user can implement a global search by entering search keywords in the search box on the negative screen of the mobile phone 100.
  • The mobile phone 100 generates the feature vector of the keyword input by the user through the multi-modal search model transplanted onto it, and can then calculate, with the Euclidean-distance formula above, the similarity between the keyword's feature vector and the feature vectors in the index library established above. For the feature vectors in the index library whose similarity to the keyword's feature vector is greater than the similarity threshold, the data of the various modalities corresponding to these feature vectors is obtained according to the index relationships in the index library and displayed on the negative screen of the mobile phone 100.
  • As shown in Figure 5b, the user enters the content to be searched, for example "woman wearing a hat", in the search box on the negative screen of the mobile phone. The mobile phone 100 performs a global search based on the specific keywords "wearing a hat" and "woman", and then displays the search results below the search bar (for example, the images among all images that match the keywords, schedule memos containing the keywords, and the specific voice recordings and voice transcripts containing the keywords).
  • In other embodiments, users can also enter the content they want to search for in the search bar on the negative screen of the mobile phone through voice or picture input.
  • Specifically, the multi-modal search model in the mobile phone 100 extracts the features "wearing a hat" and "woman" from the search text and generates the corresponding feature vector T_search-text, then calculates with the formula above the similarity between T_search-text and the feature vectors in the index library. The feature vectors whose similarity is greater than the similarity threshold are selected, the data corresponding to them is found, and that data is output as the search result; for example, the search result includes image one, image two, and image three of a woman wearing a hat, related content in the phone's schedule, and related audio files in the voice memos. In the search results, an image can be displayed directly, as a thumbnail, or as its name together with a thumbnail. A sketch of this query flow follows.
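A sketch of the negative-screen query flow: rank the index entries by distance to the keyword's feature vector, keep those within the threshold, and group them by modality for display below the search bar. The names are illustrative, reusing the hypothetical encode() interface from the earlier sketches.

```python
from collections import defaultdict

def negative_screen_query(model, index_library, keyword, threshold):
    q = model.encode(keyword)                      # e.g. T_search_text
    hits = defaultdict(list)                       # modality -> matching data ids
    for vector, data_id, modality in index_library:
        d = sum((a - b) ** 2 for a, b in zip(q, vector)) ** 0.5
        if d < threshold:
            hits[modality].append(data_id)
    return hits  # e.g. {"image": [...], "schedule": [...], "audio": [...]}
```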
  • For another example, when the user enters the keyword "building A", the mobile phone generates the feature vector T_building of the keyword "building A" through the multi-modal search model, then calculates with the formula above the similarity between T_building and the feature vectors stored in the multi-modal index library, determines that the feature vectors T1, T2, and T3 are close to T_building, and then outputs, according to T1, T2, and T3, the various modal data of building A, namely the hand drawing of building A and the speech and text describing building A.
  • When performing a global search, the user can also input voice through the voice input unit 170a. The mobile phone then generates the feature vector T_voice of the input voice through the multi-modal search model, calculates the similarity between T_voice and the feature vectors in the index library, determines the search results according to the calculation result, and displays them on the negative screen of the mobile phone 100 through the display screen.
  • For example, when the user speaks "meeting arrangement", the multi-modal search model extracts the feature vector T_meeting-arrangement of the phrase, then uses the similarity calculation above to compare T_meeting-arrangement with each feature vector in the index library. Finding that T_meeting-arrangement is close to the feature vectors T150 and T230, the phone determines that the search results for "meeting arrangement" are the meeting recording and the Monday agenda memo, and outputs the meeting recording audio and the Monday agenda memo to the negative screen of the mobile phone 100 through the display screen.
  • The user can also input an image through the image input unit 193a. The mobile phone then generates the feature vector T_image of the input image through the multi-modal search model, calculates the similarity between T_image and the feature vectors in the index library, determines the search result according to the calculation result, and displays it on the negative screen of the mobile phone 100 through the display screen.
  • For example, the user inputs an image of building A through the image input unit 190a. The multi-modal search model extracts the feature vector T_building-A-image of the image, then uses the similarity calculation above to compare it with each feature vector in the index library. Finding that T_building-A-image is close to T1, the phone determines that a search result for "building A" is the hand drawing of building A; and, according to the index relationships between the feature vectors of the various modalities in the index library, the other modal data related to the hand drawing of building A, such as the speech describing building A and the text describing building A, are all output on the mobile phone 100 through the display screen.
  • the solution of the present application is also suitable for global search in the memo of the mobile phone 100.
  • As shown in Figure 6, the user can enter specific text, voice, or a picture in the memo search bar 600 of the mobile phone to search.
  • For example, the user can search for pictures of a woman wearing a hat as follows.
  • When the user searches in the memo search bar 600 of the mobile phone 100, the mobile phone 100 performs the corresponding search according to the modality of the user's input data.
  • If the user enters the text "woman wearing a hat", the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_hat-woman-text of the input text, and uses the calculation method above to compute the similarity between T'_hat-woman-text and the feature vectors T_hat-woman-image, T_hat-woman-text, and T_hat-woman-audio stored in the index library. Having determined that T'_hat-woman-text is similar or close to T_hat-woman-text, the phone then, according to the index relationships in the index library, outputs all the related content corresponding to the closest feature vectors: the memo about the woman wearing a hat, with the image of the woman wearing a hat (or the identifier of the image) and the audio about the woman wearing a hat attached, and displays the number of items that meet the search criteria.
  • Similarly, if the user inputs the voice "woman wearing a hat", the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_hat-woman-audio of the input voice, computes the similarity between T'_hat-woman-audio and the feature vectors T_hat-woman-image, T_hat-woman-text, and T_hat-woman-audio stored in the index library, and determines that T'_hat-woman-audio is similar or close to T_hat-woman-audio. Then, according to the index relationships in the index library, it outputs all the related content corresponding to the closest feature vectors, that is, the memo about the woman wearing a hat, with the image (or its identifier) and the audio attached, and displays the number of items that meet the search criteria.
  • FIG. 7 shows a schematic structural diagram of an electronic device. It can be understood that the specific technical details in the foregoing search method are also applicable to the electronic device. In order to avoid repetition, it will not be repeated here.
  • the electronic device includes:
  • the obtaining module 701 is used to obtain search data input by the user;
  • the feature extraction module 702 is configured to extract features of the search data, and generate a search feature vector of the search data based on the extracted features;
  • a similarity calculation module 703, used to compare the search feature vector with the multiple index feature vectors in the index library, to select the index feature vectors in the index library whose similarity to the search feature vector is greater than the similarity threshold, where in the index library there is a correspondence between the multiple index feature vectors and multiple result data of multiple modalities, and the more similar the extracted features of different result data are, the greater the similarity between the index feature vectors corresponding to those result data; and
  • an output module 704, used to output the result data corresponding to the selected index feature vectors as the search result, where the result data included in the search result has multiple modalities.
  • FIG. 8 shows a schematic structural diagram of an electronic device 800 according to an embodiment of the present application.
  • The electronic device 800 can be used to train the aforementioned multi-modal search model, and can also receive the aforementioned multi-modal model from another electronic device and then perform a global search on the data of various modalities on the electronic device 800 based on that model.
  • The electronic device 800 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, and an antenna 2.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light Sensor 180L, bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 800.
  • the electronic device 800 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, and perform the above-mentioned feature extraction of modal data and training of a multi-modal search model.
  • The processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
  • the mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G and the like applied to the electronic device 800.
  • the mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
  • The wireless communication module 160 can provide wireless communication solutions applied on the electronic device 800, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the electronic device 800 implements a display function through a GPU, a display screen 194, and an application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, and the like.
  • the display screen 194 includes a display panel.
  • The display panel can adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 800 may include one or N display screens 194, and N is a positive integer greater than one.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the electronic device 800 may include one or N cameras 193, and N is a positive integer greater than one.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 800 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • The NPU is a neural-network (NN) computing processor; with it, applications such as intelligent cognition of the electronic device 800 can be realized, for example image recognition, face recognition, voice recognition, text understanding, and image clustering.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program (such as a sound playback function, an image playback function, etc.) required by at least one function, and the like.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 800.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the electronic device 800 by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • The internal memory 121 can also store the various modal data on the mobile phone 100, the multi-modal search model transplanted to the mobile phone 100, the model's intermediate calculation data, the model parameters, the index library, and so on.
  • the electronic device 800 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 800 can listen to music through the speaker 170A, or listen to a hands-free call.
  • The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the electronic device 800 answers a call or voice message, it can receive the voice by bringing the receiver 170B close to the human ear.
  • The microphone 170C, also called a "mic" or "mouthpiece", is used to convert sound signals into electrical signals. The user can speak with the mouth close to the microphone 170C to input a sound signal into it.
  • the electronic device 800 may be provided with at least one microphone 170C. In other embodiments, the electronic device 800 can be provided with two microphones 170C, which can implement noise reduction functions in addition to collecting sound signals. In some other embodiments, the electronic device 800 may also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • the earphone interface 170D is used to connect wired earphones.
  • The earphone interface 170D may be a USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • FIG. 9 is a software structure block diagram of an electronic device in an embodiment of the present application.
  • the electronic device 900 can be used to train the aforementioned multi-modal search model, and can also receive the aforementioned multi-modal model from other electronic devices, and then perform a global search on the data of various modalities on the electronic device 900 based on the aforementioned multi-modal model.
  • As Figure 9 shows, the software system of the electronic device can adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture.
  • the embodiment of the present invention takes an Android system with a layered architecture as an example to exemplarily illustrate the software structure of a terminal device.
  • The layered architecture divides the software into several layers, each with a clear role and division of labor; the layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications such as phone, camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and so on.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, and so on.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the telephone manager is used to provide the communication function of the terminal device. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, and so on.
  • the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window.
  • For example, prompting with a text message in the status bar, sounding a prompt tone, vibrating the terminal device, flashing the indicator light, and so on.
  • Android Runtime includes a core library and a virtual machine, and is responsible for the scheduling and management of the Android system.
  • The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An electronic device, a search method therefor, and a medium. The method comprises: obtaining search data input by a user; extracting features of the search data, and generating a search feature vector of the search data on the basis of the extracted features; comparing the search feature vector with a plurality of index feature vectors in an index library to select the index feature vectors in the index library whose similarity to the search feature vector is greater than a similarity threshold, wherein there is a correspondence in the index library between the plurality of index feature vectors and a plurality of result data of a plurality of modalities; and outputting the result data corresponding to the selected index feature vectors as a search result, wherein the result data comprised in the search result has the plurality of modalities, so as to achieve a multi-modal global search function.

Description

Electronic device and search method and medium for the electronic device

This application claims priority to the Chinese patent application No. 202010164088.6, entitled "Electronic Device and Search Method and Medium for the Electronic Device", filed on March 10, 2020, the entire content of which is incorporated herein by reference.
Technical Field

This application relates to the field of artificial intelligence, and in particular to an electronic device and a search method and medium for the electronic device.

Background

In recent years, the rapid development of machine learning and deep learning has greatly advanced search functions. At present, mobile phones offer a global search function that can search for pictures in the gallery, search the Internet through a browser, and search within other applications. Most existing search technologies are content-based. Taking image search as an example, the main process is: images are labeled manually or automatically by a pre-trained model, the labels are saved in a database, and the keywords entered by the user are matched against the label text in the database to return the search results. Moreover, most searches operate on single-modal media data, for example searching the gallery for pictures via text input, or searching for pictures by image on some Internet services.
Summary of the Invention

The embodiments of this application provide an electronic device and a search method and medium for the electronic device, which map the feature vectors of multi-modal data into a unified high-dimensional vector space, thereby realizing a multi-modal global search on the electronic device through a single model.

In a first aspect, the embodiments of the present application provide an electronic device and a search method for the electronic device, the method including:

obtaining search data input by a user; extracting low-level features of the search data, and generating a search feature vector of the search data based on the extracted low-level features; comparing the search feature vector with multiple index feature vectors in an index library to select the index feature vectors in the index library whose similarity to the search feature vector is greater than a similarity threshold, where in the index library there is a correspondence between the multiple index feature vectors and multiple result data of multiple modalities; and outputting the result data corresponding to the selected index feature vectors as the search result, where the result data included in the search result has multiple modalities. That is, the method first obtains the search data input by the user, for example image data, and extracts its low-level features, such as the color, texture, and gray level of the image; it then generates the feature vector corresponding to these low-level features, compares that feature vector for similarity with the index feature vectors stored in the index library to find the vectors highly similar to it, and finally, according to the index relationships in the index library, determines the matching feature vectors and the result data corresponding to them.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The similarity between the search feature vector and an index feature vector is calculated by the following formula:
$$d = \sqrt{\sum_{i}\left(x_{i} - y_{i}\right)^{2}}$$
where d represents the closeness between the search feature vector and a feature vector stored in the index library, x_i denotes the i-th component of the feature vector of the input data, y_i denotes the i-th component of a feature vector stored in the index library, and i runs over the dimensions of the feature vectors.
That is, the similarity between the search feature vector and an index feature vector can be calculated as a Euclidean distance. It can be understood that the similarity can also be calculated in other ways, for example by the Pearson correlation coefficient.
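As an illustration of this comparison, the sketch below implements the Euclidean-distance measure of the formula above together with the Pearson-correlation alternative just mentioned; this is example code under assumed inputs, not part of the application itself.

```python
import numpy as np

def euclidean_distance(x, y):
    """d = sqrt(sum_i (x_i - y_i)^2); a smaller d means a higher similarity."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_similarity(x, y):
    """Alternative measure mentioned in the text: Pearson correlation in [-1, 1]."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Example: a query vector against two stored index vectors
query = [0.9, 0.1, 0.4]
print(euclidean_distance(query, [0.8, 0.2, 0.5]))  # small distance -> similar
print(euclidean_distance(query, [0.1, 0.9, 0.9]))  # large distance -> dissimilar
```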
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The electronic device has the index library, and the plurality of result data of the plurality of modalities corresponding to the plurality of index feature vectors in the index library are data on the electronic device.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The electronic device is a mobile terminal. It should be noted that the electronic device is not limited to a mobile terminal such as a mobile phone; it may also be an electronic device such as a server or a PC.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The user inputs the search data on the negative one screen of the mobile terminal. That is, the multi-modal search on the electronic device can be applied to the negative one screen of a mobile terminal such as a mobile phone.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The user inputs the search data in the memo of the mobile terminal. That is, the multi-modal search on the electronic device can be applied to the memo of a mobile terminal such as a mobile phone.
In a possible implementation of the foregoing first aspect, the foregoing method further includes:
The plurality of modalities includes images, video, audio, text, and detection data from sensors of the electronic device. A modality refers to the source form or the form of existence of data, so data of multiple modalities includes images, text, video, audio, and other data.
In a second aspect, an embodiment of the present application provides an electronic device, and the electronic device includes:
an obtaining module, configured to obtain search data input by a user;
a feature extraction module, configured to extract features of the search data, and to generate a search feature vector of the search data based on the extracted features;
a similarity calculation module, configured to compare the search feature vector with a plurality of index feature vectors in an index library, to select an index feature vector from the index library whose similarity to the search feature vector is greater than a similarity threshold,
where, in the index library, there is a correspondence between the plurality of index feature vectors and a plurality of result data of a plurality of modalities; and
an output module, configured to output the result data corresponding to the selected index feature vector as a search result, where the result data included in the search result has a plurality of modalities.
In a third aspect, an embodiment of the present application provides a machine-readable medium having instructions stored thereon; when executed on a machine, the instructions cause the machine to perform any one of the possible methods of the foregoing first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory configured to store instructions to be executed by one or more processors of a system, and a processor, which is one of the processors of the system, configured to perform any one of the possible methods of the foregoing first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device that has the function of implementing the foregoing search method. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the foregoing function.
Description of the drawings
Fig. 1 shows a multi-modal global search scenario 10 according to some embodiments of the present application.
Fig. 2 shows a schematic diagram of a method for performing a global search on the mobile phone 100 according to some embodiments of the present application.
Fig. 3 shows a schematic diagram of a process of generating a multi-modal search model according to some embodiments of the present application.
Fig. 4 shows a schematic diagram of a process of building an index library according to some embodiments of the present application.
Fig. 5a shows a negative one screen of a mobile phone 100 according to some embodiments of the present application.
Fig. 5b shows a schematic diagram of performing a global search on the negative one screen of a mobile phone according to some embodiments of the present application.
Fig. 6 shows a schematic diagram of performing a global search in a mobile phone memo according to some embodiments of the present application.
Fig. 7 shows a schematic structural diagram of an electronic device according to some embodiments of the present application.
Fig. 8 shows a schematic structural diagram of another electronic device according to some embodiments of the present application.
Fig. 9 shows a software structure block diagram of an electronic device according to some embodiments of the present application.
Detailed description
Illustrative embodiments of the present application include, but are not limited to, an electronic device, and a search method, a medium, and a system of the electronic device.
It can be understood that the terms "first", "second", and the like used in this application may be used herein to describe various elements, but unless otherwise specified, these elements are not limited by these terms. These terms are only used to distinguish one element from another.
It can be understood that, in the embodiments of the present application, a modality refers to the source or the form of each kind of information: different kinds of information such as images, speech, text, and video represent different modalities, and test results from radar, infrared sensors, accelerometers, and the like can also be regarded as different modalities because they come from different sources.
The embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a multi-modal global search scenario 10 according to some embodiments of the present application. Specifically, as shown in Fig. 1, the scenario 10 includes an electronic device 100 and an electronic device 200. The electronic device 100 can use a multi-modal search model trained by the electronic device 200 to implement multi-modal search: when data of a certain modality is input, the multi-modal search model generates a feature vector corresponding to that data, the generated feature vector is compared with the feature vectors in a multi-modal feature vector index library, and the data of each modality corresponding to the feature vectors in the index library that satisfy a predetermined condition is output as the search result. The electronic device 200 can train the multi-modal search model by using data of multiple modalities and the features of each kind of modal data; in this multi-modal search model, modal data of the same or similar type generate feature vectors that are the same or similar. Here, two feature vectors being similar means that the difference between them is less than a similarity threshold, and the difference between feature vectors can be represented by the Euclidean distance between them: the larger the Euclidean distance, the larger the difference between the vectors and the smaller their similarity. In addition, the electronic device 200 can not only train the multi-modal search model, but can also use the multi-modal search model it has trained itself to implement various search functions.
It can be understood that, in this application, the electronic device 100 and the electronic device 200 may include, but are not limited to, a laptop computer, a desktop computer, a tablet computer, a mobile phone, a wearable device, a head-mounted display, a server, a mobile email device, a portable game console, a portable music player, a reader device, a television with one or more processors embedded or coupled therein, or another electronic device capable of accessing a network.
In the following, the technical solution of the present application is described with reference to Figs. 2-6, taking as an example the case where the electronic device 100 is a mobile phone and the electronic device 200 is a server.
As described above, in some embodiments of the present application, a multi-modal search model capable of searching multi-modal data may first be trained on the server 200, and the multi-modal search model may then be ported to the mobile phone 100 to implement a global search over the data of each modality on the mobile phone 100. Fig. 2 shows, according to some embodiments of the present application, a technical solution in which the server 200 trains a multi-modal search model and the trained multi-modal search model is ported to the mobile phone 100 for global search. Specifically, as shown in Fig. 2:
(1) Training of the multi-modal search model
A) Generating initial feature vectors
When training the multi-modal search model, the server 200 first needs to perform feature extraction on the sample data used for training. It can be understood that, in this application, the sample data may include data of multiple modalities, for example images, speech, text, video, and sensor test data. Such sample data (for example, an image, a piece of speech, or a piece of text) are generally unstructured data of differing structures, characterized by high dimensionality, widely varying forms of expression, and a large amount of redundant information. It is therefore necessary to extract initial feature vectors that can characterize the sample data. It can be understood that these initial feature vectors may be one-dimensional or multi-dimensional. For example, a person's overall ranking can be jointly represented by the person's Chinese score, mathematics score, and English score; the initial feature vector of the ranking then has three dimensions, namely (Chinese score, mathematics score, English score). As another example, the feature vector of a single character may be one-dimensional, namely the character's code value, while a sentence such as "Xiaobai is a dog" can be jointly characterized by several one-dimensional initial feature vectors: the initial feature vector of the word "Xiaobai", that of the word "is", and that of the word "dog" together represent the sentence "Xiaobai is a dog".
Further, assuming that building A can be described simultaneously by three mutually correlated modalities of data, namely a hand drawing, speech, and text, feature extraction algorithm 1, feature extraction algorithm 2, and feature extraction algorithm 3 can be used respectively to generate the initial feature vectors of these three kinds of modal data.
As shown in Fig. 3, for example, the initial feature vector T_1 of the hand drawing of building A can be generated by the residual network ResNet-34; for instance, T_1 may be (h_1, h_2), where h_1 is a gray-scale value of the hand drawing and h_2 is a value describing the size of the hand drawing. The initial feature vector T_2 of the speech describing building A can then be generated by the speech feature extraction algorithm of Mel-frequency cepstral coefficients (MFCC); for example, T_2 may be (h_3, h_4), where h_3 and h_4 are values representing certain characteristics of the speech describing building A, such as its frequency and pitch. The initial feature vector T_3 of the text describing building A can then be generated by an attention-based bidirectional long short-term memory network (BiLSTM+Attention); for example, T_3 may be (h_5), where h_5 is a feature value of the character encoding of the text describing building A. For a clearer understanding, the initial feature vectors T_1, T_2, and T_3 of the hand drawing, speech, and text describing building A in the above example can thus be written as T_1 (hand-drawing gray-scale feature value, hand-drawing size feature value), T_2 (speech frequency feature value, speech pitch feature value), and T_3 (text encoding feature value).
In addition, other algorithms may also be used to generate the initial feature vectors of each kind of modal data; for example, the difference of Gaussians (DoG) may be used to generate the initial feature vector of an image, and the bag-of-words model, a text feature extraction algorithm, may be used to generate the initial feature vector of a text.
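As a rough illustration of this per-modality extraction step, the sketch below derives an initial feature vector for an image with a pretrained ResNet-34 and for audio with MFCCs. The use of torchvision and librosa, and the 512-dimensional image vector (rather than the toy two-component (h_1, h_2) above), are assumptions made for the example; the application does not prescribe particular libraries or dimensions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
import librosa
from PIL import Image

# Image branch: a ResNet-34 with its classification head removed yields a 512-d initial vector
resnet = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def image_initial_vector(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(x).squeeze(0).numpy()  # initial feature vector of the image (cf. T_1)

def audio_initial_vector(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # time-averaged MFCCs as the initial feature vector (cf. T_2)
```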
It can be understood that the feature extraction algorithms mentioned here are all part of the multi-modal search model.
B) Clustering of feature vectors
Feature vector clustering means that, after sample data that are correlated with each other are input into the multi-modal search model, the final feature vectors that are output are similar or identical to each other. Here, correlation means that the content represented by the data of each modality is the same or similar. For example, multiple low-level features can be extracted from each piece of sample data, and the features extracted from two pieces of data may be considered similar when a certain proportion of the extracted features are the same or similar, or when the difference between the feature values of the same feature extracted from the two pieces of data is less than a predetermined threshold. For example, suppose 10 and 12 features are extracted from image A and audio data B respectively; if 9 features of image A are the same as 9 features of audio data B, image A and audio data B can be considered similar. As another example, if the features extracted from image A are "shepherd dog" and "adult dog" while the features extracted from image B are "husky" and "adult dog", then in some applications that only require recognition of the animal species, image A and image B both represent dogs and their features can be considered similar, whereas in applications that require recognition of the dog breed, the features of image A and image B can be considered dissimilar.
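The following toy function illustrates this overlap-based notion of relatedness; the overlap ratio used as the threshold is an assumption for the example, not a value fixed by the application.

```python
def are_related(features_a, features_b, min_overlap_ratio=0.75):
    """Two samples are treated as related when enough extracted features coincide."""
    features_a, features_b = set(features_a), set(features_b)
    shared = len(features_a & features_b)
    return shared / min(len(features_a), len(features_b)) >= min_overlap_ratio

# Example: 9 of image A's 10 features also occur among audio B's 12 features
image_a = {f"f{i}" for i in range(10)}
audio_b = {f"f{i}" for i in range(1, 13)}
print(are_related(image_a, audio_b))  # True: 9/10 >= 0.75
```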
Referring now to Fig. 3, and taking the three kinds of modal data describing building A above as an example, the clustering process of the feature vectors is explained. As shown in Fig. 3, after the initial feature vectors T_1, T_2, and T_3 of the hand drawing, speech, and text describing building A are obtained by the technique in A), these initial feature vectors, having been generated independently by different feature extraction algorithms, do not lie in a unified feature vector representation space; their vector values can differ greatly from one another, so the correlation between the feature vectors cannot be judged. For example, the initial feature vector of the hand drawing is represented by the gray-scale value of the hand drawing (for example, 222), while the speech is represented by a frequency feature value of the speech (for example, 20 Hz); although the hand drawing and the speech both describe building A, the vector values of their initial feature vectors differ greatly. It is therefore necessary to cluster the feature vectors so that the feature vectors of the two become similar; for example, a long short-term memory (LSTM) network model is trained so that it can cluster initial feature vectors lying in different vector spaces into similar vectors, that is, map them into the same vector space.
As shown in Fig. 3, the output of the LSTM model at each time step depends on its output at the previous time step. Specifically, for example, when a piece of text "I love my motherland" is input into the long short-term memory network model shown in Fig. 3, the model outputs a feature vector at each time step in sequence: if "I" is input into the network at time t_1, the network generates the feature vector H_1 (h_6) corresponding to "I" at t_1, where h_6 is the encoded feature value of "I"; if "love" is input at time t_2, the network generates the feature vector H_2 (h_6, h_7) corresponding to "love" at t_2, where h_7 is the encoded feature value of "love"; and if "motherland" is input at time t_3, the network generates the feature vector H_3 (h_6, h_7, h_8) corresponding to "motherland" at t_3, where h_8 is the encoded feature value of "motherland". Finally, since the feature vector H_3 carries all the features of "I love my motherland", H_3 is used to represent the sentence. For data such as images, which have no sequential dependency, the LSTM model can be considered to output the vector value representing their features at one time step, with vector values of 0 at the other time steps; for example, the feature vector H_4 (h_9) output at time t_1 represents the gray-scale value, while the feature vectors output at t_2 and t_3 are H_5 (h_9, 0) and H_6 (h_9, 0, 0).
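A minimal PyTorch sketch of such a mapping network is shown below: per-modality initial vectors are fed through a shared LSTM so that correlated inputs can be driven toward nearby points of one unified space. The layer sizes and the final projection layer are assumptions for illustration; the application does not fix a specific architecture.

```python
import torch
import torch.nn as nn

class UnifiedSpaceLSTM(nn.Module):
    """Maps initial feature vectors of any modality into one shared vector space."""
    def __init__(self, input_dim=128, hidden_dim=64, unified_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, unified_dim)

    def forward(self, x):
        # x: (batch, seq_len, input_dim); a sentence is a sequence of word vectors,
        # while an image can be a length-1 sequence (its features at a single time step,
        # assuming initial vectors have been padded/projected to a common input_dim)
        out, _ = self.lstm(x)
        return self.project(out[:, -1, :])  # the last time step carries the whole input's features
```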
Specifically, the training process of the LSTM model is as follows:
I) A plurality of sample data of different modalities is prepared in advance. These data may represent different objects or describe different events, but the data describing the same object or the same event are correlated with each other. The initial feature vectors of these sample data are then generated by means of A) above.
II) The initial feature vectors of the mutually correlated modal data describing the same object or the same event are input into the LSTM model to obtain the final feature vectors output by the LSTM (the LSTM at this point can be regarded as part of the multi-modal search model). Whether these final feature vectors are similar or identical is then calculated; if they are not, the model parameters of the LSTM are adjusted, the foregoing initial feature vectors are input into the LSTM again, and whether the output final feature vectors are the same or similar is calculated again.
This operation is repeated until the final feature vectors output by the LSTM model are similar or identical to one another; the initial feature vectors, which lay in different vector spaces because of the different modalities of the initial data, have then been mapped into the same vector space, and the training of the LSTM model is complete.
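Steps I) and II) can be sketched as the training loop below, which keeps adjusting the LSTM's parameters until the pairwise Euclidean distances between the unified vectors of correlated samples fall below the threshold. The optimizer choice (Adam) and the distance-sum loss are assumptions made for the example; the application does not prescribe them.

```python
import itertools
import torch

def train_clustering(model, groups, threshold=0.1, lr=1e-3, max_epochs=1000):
    """groups: list of tensors, each of shape (n_modalities, seq_len, input_dim),
    holding the initial feature vectors of one object's correlated modal data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        max_dist = 0.0
        for group in groups:
            vecs = model(group)  # one unified vector per modality of the same object/event
            pairs = list(itertools.combinations(range(len(vecs)), 2))
            # pull the vectors of correlated modal data toward each other
            loss = sum(torch.dist(vecs[i], vecs[j]) for i, j in pairs)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                max_dist = max(max_dist,
                               max(torch.dist(vecs[i], vecs[j]).item() for i, j in pairs))
        if max_dist < threshold:  # all correlated vectors are now "similar": training done
            break
    # a fuller setup would also push apart vectors of *different* objects
    # (cf. the building A vs. music C example later in the text)
```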
In some embodiments, whether two feature vectors are similar can be calculated by the following formula:
$$d = \sqrt{\sum_{i}\left(x_{i} - y_{i}\right)^{2}}$$
where d represents the closeness between the two feature vectors, x_i denotes the i-th component of the feature vector of the input data, y_i denotes the i-th component of a feature vector stored in the index library, and i runs over the dimensions of the feature vectors. That is, d is the Euclidean distance between the two feature vectors, where a larger Euclidean distance corresponds to a smaller similarity between them.
For example, after the initial feature vectors T_1, T_2, and T_3 of the three modalities (hand drawing, speech, text) describing building A above are input into the LSTM model, intermediate feature vectors T_1', T_2', and T_3' are obtained.
Here, for the feature vectors T_1', T_2', and T_3', assume T_1' = (a_1, a_2, a_3), T_2' = (b_1, b_2, b_3), and T_3' = (c_1, c_2, c_3). Then the closeness between T_1' and T_2' can be expressed as:
$$d_{1} = \sqrt{\left(a_{1} - b_{1}\right)^{2} + \left(a_{2} - b_{2}\right)^{2} + \left(a_{3} - b_{3}\right)^{2}}$$
The closeness between T_2' and T_3' can be expressed as:
$$d_{2} = \sqrt{\left(b_{1} - c_{1}\right)^{2} + \left(b_{2} - c_{2}\right)^{2} + \left(b_{3} - c_{3}\right)^{2}}$$
When both d_1 and d_2 are less than the predetermined similarity threshold, T_1', T_2', and T_3' are considered the same or similar, and the intermediate feature vectors T_1', T_2', and T_3' at this point are taken as the final feature vectors.
It can be understood that, in actual model training, the similarity threshold can be set according to actual needs; this is not limited in this application.
In addition, in some embodiments, a loss function may also be used to calculate the similarity between the intermediate feature vectors obtained after the initial feature vectors of sample data of different modalities are input into the LSTM model. Specifically, the initial feature vectors of sample data of the same or similar type are input into the LSTM model to obtain intermediate feature vectors; a loss function is used to calculate the error between the input initial feature vectors and the output intermediate feature vectors; partial derivatives are computed from this error; and the model parameters of the LSTM model are then adjusted based on the computed partial derivatives.
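As a brief illustration of this loss-driven adjustment, the fragment below computes a loss, derives the partial derivatives by automatic differentiation, and applies a gradient step by hand. The mean-squared-error loss is an assumed example, since the application does not name a specific loss function.

```python
import torch

loss_fn = torch.nn.MSELoss()

def adjust_once(model, inputs, targets, lr=1e-3):
    """One update: loss -> partial derivatives (autograd) -> parameter adjustment."""
    preds = model(inputs)
    loss = loss_fn(preds, targets)
    model.zero_grad()
    loss.backward()              # computes d(loss)/d(parameter) for every model parameter
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad     # adjust the LSTM's parameters along the gradient
    return loss.item()
```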
To make the above training process clearer, a simple example based on the building A mentioned above is given below. For example, in addition to building A, there are an album cover image, text, and audio data describing music C, where the initial feature vector corresponding to the album cover image describing music C is T_4, the initial feature vector corresponding to the text describing music C is T_5, and the initial feature vector corresponding to the audio describing music C is T_6. The initial feature vectors T_1, T_2, T_3 of the three modalities describing building A (hand drawing, speech, text) and the initial feature vectors T_4, T_5, T_6 of the three modalities describing music C (album cover image, text, and audio) are then input into the LSTM model as training data. By adjusting the relevant parameters of the long short-term memory network, the pairwise Euclidean distances between the finally obtained T_1', T_2', and T_3' are made smaller than the predetermined similarity threshold, so that the feature vectors of the modal data describing building A are clustered together; in the same way, the pairwise Euclidean distances between the final feature vectors T_4', T_5', and T_6' corresponding to the initial feature vectors T_4, T_5, T_6 describing music C are made smaller than the predetermined similarity threshold, so that the feature vectors of the modal data describing music C are clustered together.
In addition, it can be understood that, in other embodiments of the present application, the similarity between two feature vectors may also be determined in other ways, not limited to the Euclidean distance of the above formula and the loss function; for example, the similarity between feature vectors may also be calculated by cosine similarity or the Pearson correlation coefficient.
It can be understood that, although the server 200 is used to train the multi-modal search model in the foregoing embodiments, other computer devices may also be used to train the multi-modal search model in other embodiments; this is not limited here.
(2) Establishing the index relationship between the multi-modal data on the mobile phone 100 and the feature vectors
Continuing to refer to Fig. 2, after the multi-modal search model has been trained on the server 200, an Android project can be created; the model is read and parsed through the model reading interface of the project, then compiled to generate an APK (Android application package) file, which is installed on the mobile phone 100 to complete the porting of the multi-modal search model. The data of the various modalities on the mobile phone 100 (images, speech, text, video, and the like) are then input into the multi-modal search model to obtain the feature vector corresponding to each piece of data, and the index relationship between each piece of data and its feature vector is established, yielding the index library.
For example, image 1 to image 100, speech 1 to speech 50, and text 1 to text 80 on the mobile phone 100 can all be input into the multi-modal search model to obtain feature vectors T_1 to T_100 corresponding to image 1 to image 100, feature vectors T_101 to T_150 corresponding to speech 1 to speech 50, and feature vectors T_151 to T_230 corresponding to text 1 to text 80. An index relationship between each feature vector obtained above and an identifier of the corresponding data can then be established. For example, the data identifier may be an identifier set for the image, text, or speech file, or the name of the data on the mobile phone 100, or the entire source data may be used as the identifier. For instance, an index relationship is established between T_1 and the name "20200107adefeg" of image 1, and this index relationship is then stored in the index library.
It can be understood that the index library may exist in the multi-modal search model in the form of a database, and the feature vectors and the identifiers of the corresponding data may be stored in the database in the form of fields.
In some embodiments, the specific process of building the index library on the mobile phone 100 is shown in Fig. 4: the trained multi-modal search model is ported to the mobile phone 100; the feature vectors of the data of the various modalities on the mobile phone 100, such as images, speech, and text (in other embodiments, video, sensor detection data, and the like may also be included), are generated by the multi-modal search model and stored on the mobile phone 100; at the same time, the index relationships between the data on the mobile phone 100 and their feature vectors are established, yielding the index library. For example, the hand drawing, speech, and text of building A, together with other data such as a meeting recording and a Monday schedule memo, are passed through the multi-modal search model to generate their respective feature vectors T_1, T_2, T_3, T_150, and T_230; these feature vectors are stored on the mobile phone 100, and at the same time the index relationships between these data and their respective feature vectors are stored in the index library.
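The following sketch shows one way such an index library could be materialized as a small database: each device file is run through the ported model, and the resulting vector is stored against the file's identifier. The table layout and the model call are illustrative assumptions; the application only requires that feature vectors and data identifiers be kept in correspondence.

```python
import sqlite3
import numpy as np

def build_index_library(db_path, device_files, model):
    """device_files: iterable of (identifier, modality, data); model(data, modality) -> vector."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS index_library ("
                "data_id TEXT PRIMARY KEY, modality TEXT, feature_vector BLOB)")
    for data_id, modality, data in device_files:
        vec = np.asarray(model(data, modality), dtype=np.float32)
        con.execute("INSERT OR REPLACE INTO index_library VALUES (?, ?, ?)",
                    (data_id, modality, vec.tobytes()))
    con.commit()
    con.close()

# e.g. build_index_library("index.db",
#                          [("20200107adefeg", "image", image1_bytes), ...],
#                          multimodal_model)
```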
(3) Global search
Continuing to refer to Fig. 2, after the index library has been built on the mobile phone 100, a global search can be implemented on the mobile phone 100. That is, at search time, data of various modalities, for example images, text, speech, video, and sensor detection data, can be input; the multi-modal search model converts these data into feature vectors, and the search result is obtained by comparing the feature vector corresponding to the converted search input data with the feature vectors in the index library, where the search result may include data of various modalities on the mobile phone.
In some embodiments, the user can perform a global search by entering a search keyword in the search box on the negative one screen of the mobile phone 100. Specifically, as shown in Fig. 5a, when performing a global search on the negative one screen of the mobile phone 100, the user can search by entering a keyword: for example, the user enters the search keyword in the search box on the negative one screen of the mobile phone 100, and the mobile phone 100 generates the feature vector of the keyword entered by the user through the multi-modal search model ported into the global search. The similarity between the feature vector of the keyword and the feature vectors in the index library built above can then be calculated by the above formula for the Euclidean distance between vectors; for the feature vectors in the index library whose similarity to the keyword's feature vector is greater than the similarity threshold, the data of the various modalities corresponding to these feature vectors are obtained according to the index relationships in the index library, and these data are displayed on the negative one screen of the mobile phone 100.
For example, as shown in Fig. 5b, the user enters the content to be searched for, such as "woman wearing a hat", in the search box on the negative one screen of the mobile phone; the mobile phone 100 performs a global search based on the specific keywords such as "wearing a hat" and "woman", and then displays the search results (for example, all images containing the above keywords, schedule memos containing the above keywords, and the specific speech and speech transcripts containing the above keywords) below the search bar. The user can also enter the desired content in the search bar on the negative one screen of the mobile phone by means such as voice or picture input.
More specifically, the multi-modal search model in the mobile phone 100 extracts the features "wearing a hat" and "woman" of the above search text, generates the feature vector T_searchtext corresponding to the search text, calculates the similarity between T_searchtext and the feature vectors in the index library by the above formula, selects the feature vectors whose similarity is greater than the similarity threshold, and then finds the data corresponding to those feature vectors and outputs them as the search result; for example, the search result includes image 1, image 2, and image 3 showing a woman wearing a hat, the related content in the mobile phone schedule, and the related audio files in the voice memos. In the search results, the images themselves may be displayed directly, or thumbnails of the images, or the names and thumbnails of the images.
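Building on the index-library sketch above, the following fragment shows how such a query could be answered: the query vector is compared with every stored vector, and entries within the distance threshold are returned regardless of their modality. Again, the threshold value and the helper names are assumptions for illustration.

```python
import sqlite3
import numpy as np

def global_search(db_path, query_vec, distance_threshold=0.5):
    """Return (data_id, modality) pairs whose index vectors lie close to the query vector."""
    query_vec = np.asarray(query_vec, dtype=np.float32)
    con = sqlite3.connect(db_path)
    hits = []
    for data_id, modality, blob in con.execute(
            "SELECT data_id, modality, feature_vector FROM index_library"):
        vec = np.frombuffer(blob, dtype=np.float32)
        d = np.linalg.norm(query_vec - vec)
        if d < distance_threshold:   # close in the unified space, hence a relevant result
            hits.append((d, data_id, modality))
    con.close()
    # closest matches first; results may mix images, schedule entries, audio, ...
    return [(data_id, modality) for d, data_id, modality in sorted(hits)]
```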
As another example, when the keyword entered by the user is "building A", the mobile phone generates the feature vector T_building of the keyword "building A" through the multi-modal search model, calculates the similarity between T_building and the feature vectors stored in the multi-modal feature vector index database by the above formula, determines that the feature vectors T_1, T_2, and T_3 are close to the feature vector T_building of the keyword "building A", and then outputs, according to the feature vectors T_1, T_2, and T_3, the data of each modality of building A, namely the hand drawing of building A and the speech and text describing building A.
Similarly, when performing a global search, the user may also input speech through the speech input unit 170a; the mobile phone then generates the feature vector T_speech of the speech input by the user through the multi-modal search model, calculates the similarity between T_speech and the feature vectors in the above index library, determines the search result according to the calculation result, and displays the search result on the negative one screen of the mobile phone 100 through the display screen. For example, the user inputs the speech "meeting arrangement" through 170a; the multi-modal search model extracts the feature vector T_meeting of "meeting arrangement", the similarity between T_meeting and each feature vector in the index library is calculated using the above similarity calculation method, and upon finding that T_meeting is close to the feature vectors T_150 and T_230, the mobile phone determines that the search results for "meeting arrangement" are the meeting recording and the Monday schedule memo, and outputs the meeting recording audio and the Monday schedule memo on the negative one screen of the mobile phone 100 through the display screen.
In addition, the user may also input an image through the image input unit 193a; the mobile phone then generates the feature vector T_image of the image input by the user through the multi-modal search model, calculates the similarity between T_image and the feature vectors in the above index library, determines the search result according to the calculation result, and displays the search result on the negative one screen of the mobile phone 100 through the display screen. For example, the user inputs an image of "building A" through the image input unit 193a; the multi-modal search model extracts the feature vector T_buildingA_image of "building A", the similarity between T_buildingA_image and each feature vector stored in the index library is calculated using the above similarity calculation method, and upon finding that T_buildingA_image is close to T_1, the mobile phone determines that the search result for "building A" is the hand drawing of building A; moreover, according to the index relationships between the modal feature vectors in the index library, the other modal data correlated with the hand drawing of building A, such as the speech describing building A and the text describing building A, are all output on the negative one screen of the mobile phone 100 through the display screen.
Further, in another embodiment of the present application, the solution of the present application is also applicable to a global search in the memo of the mobile phone 100. Specifically, as shown in Fig. 6, the user can search by entering specific text, speech, or a picture in the search bar 600 of the mobile phone memo.
More specifically, the process by which the user searches for a picture of a woman wearing a hat can be as follows:
When the user searches in the memo search bar 600 of the mobile phone 100, the mobile phone 100 performs a corresponding search according to the modality of the user's input data:
(a) When the user enters the text "woman wearing a hat", the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_text of the input text, calculates, using the above calculation method, the similarity between T'_text and the feature vectors T_image, T_text, and T_audio of the "woman wearing a hat" data stored in the index library, and determines that T'_text is similar or close to T_text. Then, according to the index relationships in the index library, the mobile phone 100 outputs all related content corresponding to the feature vector closest to T'_text, that is, the memo about the woman wearing a hat, or the image (or the image's ID) and the audio of the woman wearing a hat in the attachments, and at the same time displays the number of matching items found (for example, "1 item found"). In the search results, the images themselves may be displayed directly, or thumbnails of the images, or the names and thumbnails of the images.
(b) When the user inputs the speech "woman wearing a hat" through the speech input unit 170a during the search, the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_audio of the input speech, calculates the similarity between T'_audio and the feature vectors T_image, T_text, and T_audio stored in the index library, and determines that T'_audio is similar or close to T_audio. Then, according to the index relationships in the index library, the mobile phone 100 outputs all related content corresponding to the feature vector closest to T'_audio, that is, the memo about the woman wearing a hat, or the image (or the image's ID) and the audio of the woman wearing a hat in the attachments, and at the same time displays the number of matching items found (for example, "1 item found"). In the search results, the images themselves may be displayed directly, or thumbnails of the images, or the names and thumbnails of the images.
(c) When the user inputs an image of a "woman wearing a hat" through the image input unit 193a during the search, the mobile phone 100 extracts, through the multi-modal search model, the feature vector T'_image of the input image, calculates the similarity between T'_image and the feature vectors T_image, T_text, and T_audio stored in the index library, and determines that T'_image is similar or close to T_image. Then, according to the index relationships in the index library, the mobile phone 100 outputs all related content corresponding to the feature vector closest to T'_image, that is, the memo about the woman wearing a hat, or the image (or the image's ID) and the audio of the woman wearing a hat in the attachments, and at the same time displays the number of matching items found (for example, "1 item found"). In the search results, the images themselves may be displayed directly, or thumbnails of the images, or the names and thumbnails of the images.
In addition, corresponding to the above search method, Fig. 7 shows a schematic structural diagram of an electronic device. It can be understood that the specific technical details of the above search method also apply to this electronic device; to avoid repetition, they are not described again here.
As shown in Fig. 7, the electronic device includes:
an obtaining module 701, configured to obtain search data input by a user;
a feature extraction module 702, configured to extract features of the search data, and to generate a search feature vector of the search data based on the extracted features;
a similarity calculation module 703, configured to compare the search feature vector with a plurality of index feature vectors in an index library, to select an index feature vector from the index library whose similarity to the search feature vector is greater than a similarity threshold,
where, in the index library, there is a correspondence between the plurality of index feature vectors and a plurality of result data of a plurality of modalities, and the more similar the extracted features of different result data are, the greater the similarity between the index feature vectors corresponding to the different result data; and
an output module 704, configured to output the result data corresponding to the selected index feature vector as a search result, where the result data included in the search result has a plurality of modalities.
In addition, Fig. 8 shows a schematic structural diagram of an electronic device 800 according to an embodiment of the present application. The electronic device 800 can be used to train the above multi-modal search model; it can also receive the above multi-modal model from another electronic device and then perform, based on that model, a global search over the data of the various modalities on the electronic device 800. The electronic device 800 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the electronic device 800. In other embodiments of the present application, the electronic device 800 may include more or fewer components than shown, or combine certain components, or split certain components, or have a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units and performs the above feature extraction of modal data and training of the multi-modal search model. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and the like. The different processing units may be independent devices or may be integrated into one or more processors.
The controller can generate operation control signals according to instruction operation codes and timing signals, completing the control of instruction fetching and instruction execution.
The charging management module 140 is configured to receive charging input from a charger, which may be a wireless charger or a wired charger.
The power management module 141 is configured to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, the wireless communication module 160, and the like.
移动通信模块150可以提供应用在电子设备800上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。The mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G and the like applied to the electronic device 800. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
无线通信模块160可以提供应用在电子设备800上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system, GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。The wireless communication module 160 can provide applications on the electronic device 800 including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), bluetooth (BT), and global navigation satellites. System (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication technology (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions. The wireless communication module 160 may be one or more devices integrating at least one communication processing module.
电子设备800通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。The electronic device 800 implements a display function through a GPU, a display screen 194, and an application processor. The GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations and is used for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备800可以包括1个或N个显示屏194,N为大于1的正整数。The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel can adopt liquid crystal display (LCD), organic light-emitting diode (OLED), active matrix organic light-emitting diode or active-matrix organic light-emitting diode (active-matrix organic light-emitting diode). AMOLED, flexible light-emitting diode (FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diode (QLED), etc. In some embodiments, the electronic device 800 may include one or N display screens 194, and N is a positive integer greater than one.
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备800可以包括1个或N个摄像头193,N为大于1的正整数。The camera 193 is used to capture still images or videos. The object generates an optical image through the lens and is projected to the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal. ISP outputs digital image signals to DSP for processing. DSP converts digital image signals into standard RGB, YUV and other formats of image signals. In some embodiments, the electronic device 800 may include one or N cameras 193, and N is a positive integer greater than one.
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备800在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 800 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备800的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解、图像聚类等。NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example, the transfer mode between human brain neurons, it can quickly process input information, and it can also continuously self-learn. Through the NPU, applications such as intelligent cognition of the electronic device 800 can be realized, such as image recognition, face recognition, voice recognition, text understanding, image clustering, and so on.
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备800使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行电子设备800的各种功能应用以及数据处理。同时,内部存储器121也可以存储手机100中的各种模态数据、以及移植至手机100的多模态搜索模型并且存储模型的中间计算数据,存储模型参数,索引库等。The internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions. The internal memory 121 may include a storage program area and a storage data area. Among them, the storage program area can store an operating system, an application program (such as a sound playback function, an image playback function, etc.) required by at least one function, and the like. The data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 800. In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like. The processor 110 executes various functional applications and data processing of the electronic device 800 by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor. At the same time, the internal memory 121 can also store various modal data in the mobile phone 100, a multi-modal search model transplanted to the mobile phone 100 and store intermediate calculation data of the model, store model parameters, an index library, etc.
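Purely as an illustrative sketch (not part of the application itself), the following Python fragment shows one way such an on-device index library could pair index feature vectors with multi-modal result data; the class names, fields, and the URI-based addressing are hypothetical assumptions:

    from dataclasses import dataclass, field

    @dataclass
    class IndexEntry:
        vector: list[float]   # index feature vector in the unified vector space
        modality: str         # e.g. "image", "video", "audio", "text", "sensor"
        uri: str              # where the corresponding result data lives on the device

    @dataclass
    class IndexLibrary:
        entries: list[IndexEntry] = field(default_factory=list)

        def add(self, vector: list[float], modality: str, uri: str) -> None:
            # Each stored vector keeps a correspondence to its result data,
            # which is what allows a single query to return results of
            # several modalities.
            self.entries.append(IndexEntry(vector, modality, uri))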
The electronic device 800 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 800 can play music or hands-free calls through the speaker 170A.
The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 800 answers a call or a voice message, the voice can be heard by bringing the receiver 170B close to the ear.
The microphone 170C, also called a "mic", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 800 may be provided with at least one microphone 170C. In other embodiments, the electronic device 800 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 800 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.
The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
Reference is now made to FIG. 9, which is a software structure block diagram of an electronic device in an embodiment of the present application. The electronic device 900 can be used to train the aforementioned multi-modal search model, or it can receive the model from another electronic device and then perform a global search over data of various modalities on the electronic device 900 based on the model. The software system of the electronic device may adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present invention takes an Android system with a layered architecture as an example to illustrate the software structure of the terminal device.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 9, the application packages may include applications such as Phone, Camera, Gallery, Calendar, Call, Map, Navigation, WLAN, Bluetooth, Music, Video, and Messages.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 9, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and so on.
The window manager is used to manage window programs. The window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
The content provider is used to store and retrieve data and to make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, the phone book, and so on.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may be composed of one or more views. For example, a display interface that includes a short-message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the terminal device, for example management of the call state (connected, hung up, and so on).
The resource manager provides applications with various resources, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages that disappear automatically after a short stay without user interaction, for example notifications of download completion and message reminders. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is sounded, the terminal device vibrates, or an indicator light flashes.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core libraries consist of two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine performs functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include multiple functional modules, for example a surface manager, media libraries, a three-dimensional graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager is used to manage the display subsystem and provides fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as static image files. The media libraries can support a variety of audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
Although the present application has been illustrated and described with reference to certain preferred embodiments thereof, those of ordinary skill in the art will understand that various changes in form and detail may be made therein without departing from the spirit and scope of the present application.

Claims (10)

  1. A search method for an electronic device, characterized by comprising:
    obtaining search data input by a user;
    extracting features of the search data, and generating a search feature vector of the search data based on the extracted features;
    comparing the search feature vector with a plurality of index feature vectors in an index library, to select an index feature vector in the index library whose similarity with the search feature vector is greater than a similarity threshold,
    wherein, in the index library, there is a correspondence between the plurality of index feature vectors and a plurality of result data of a plurality of modalities; and
    outputting the result data corresponding to the selected index feature vector as a search result, wherein the result data included in the search result has a plurality of modalities.
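As an informal illustration only (a sketch, not the claimed implementation), the steps of claim 1 can be read as the following Python flow; extract_features, similarity, and the threshold value are hypothetical placeholders, and index_library refers to the illustrative IndexLibrary structure sketched in the description above:

    def search(query, index_library, extract_features, similarity, threshold=0.8):
        # Obtain the search data input by the user and generate its
        # search feature vector from the extracted features.
        search_vector = extract_features(query)
        results = []
        # Compare the search feature vector with every index feature
        # vector in the index library.
        for entry in index_library.entries:
            if similarity(search_vector, entry.vector) > threshold:
                # The selected index vectors map back to result data,
                # which may span several modalities.
                results.append((entry.modality, entry.uri))
        return results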
  2. The search method according to claim 1, characterized in that the similarity between the search feature vector and the index feature vector is calculated by the following formula:
    [Formula: see image PCTCN2021079905-appb-100001]
    where d denotes the similarity between the search feature vector and the plurality of feature vectors stored in the index library, x_i denotes the feature vector of the input data, y_i denotes a feature vector stored in the index library, and i denotes a dimension of the feature vector of the input data or of the feature vectors stored in the index library.
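The claimed formula itself appears only as an image in the published text and is not reproduced here. Purely as a hedged illustration, the sketch below computes cosine similarity over the dimensions i, which is consistent with selecting vectors whose similarity exceeds a threshold; whether the claimed formula is cosine similarity or some other measure is an assumption of this sketch:

    import math

    def similarity(x, y):
        # d: similarity between the search feature vector x and a stored
        # index feature vector y, accumulated over each dimension i.
        # Cosine similarity is assumed here; the actual claimed formula is
        # shown only in the unreproduced image above.
        dot = sum(x[i] * y[i] for i in range(len(x)))
        norm_x = math.sqrt(sum(v * v for v in x))
        norm_y = math.sqrt(sum(v * v for v in y))
        denom = norm_x * norm_y
        return dot / denom if denom else 0.0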
  3. The search method according to claim 1 or 2, characterized in that the index library resides on the electronic device, and the plurality of result data of the plurality of modalities that correspond to the plurality of index feature vectors in the index library are data on the electronic device.
  4. The search method according to claim 3, characterized in that the electronic device is a mobile terminal.
  5. The search method according to claim 4, characterized in that the user inputs the search data on the negative one screen of the mobile terminal.
  6. The search method according to claim 4, characterized in that the user inputs the search data in a memo of the mobile terminal.
  7. The search method according to any one of claims 1 to 6, characterized in that the plurality of modalities include image, video, audio, text, and detection data of a sensor of the electronic device.
  8. An electronic device, characterized by comprising:
    an obtaining module, configured to obtain search data input by a user;
    a feature extraction module, configured to extract features of the search data and to generate a search feature vector of the search data based on the extracted features;
    a similarity calculation module, configured to compare the search feature vector with a plurality of index feature vectors in an index library, to select an index feature vector in the index library whose similarity with the search feature vector is greater than a similarity threshold,
    wherein, in the index library, there is a correspondence between the plurality of index feature vectors and a plurality of result data of a plurality of modalities; and
    an output module, configured to output the result data corresponding to the selected index feature vector as a search result, wherein the result data included in the search result has a plurality of modalities.
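Again as an illustrative sketch only (the module names mirror the claim, while the internals and constructor arguments are assumptions), the claimed modules could be wired together as follows, reusing the hypothetical helpers from the earlier sketches:

    class ElectronicDevice:
        def __init__(self, index_library, extract_features, similarity, threshold):
            self.index_library = index_library
            self.extract_features = extract_features  # feature extraction module
            self.similarity = similarity              # similarity calculation module
            self.threshold = threshold

        def obtain(self, user_input):
            # Obtaining module: receive the user's search data.
            return user_input

        def search(self, user_input):
            query = self.obtain(user_input)
            vector = self.extract_features(query)
            # Output module: the returned result data may have several modalities.
            return [(e.modality, e.uri)
                    for e in self.index_library.entries
                    if self.similarity(vector, e.vector) > self.threshold]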
  9. A machine-readable medium, characterized in that instructions are stored on the machine-readable medium, and the instructions, when executed on a machine, cause the machine to perform the method according to any one of claims 1 to 7.
  10. An electronic device, comprising: a memory for storing instructions to be executed by one or more processors of a system, and a processor, being one of the processors of the system, configured to perform the method according to any one of claims 1 to 7.
PCT/CN2021/079905 2020-03-10 2021-03-10 Electronic device and search method thereof, and medium WO2021180109A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010164088.6A CN111460231A (en) 2020-03-10 2020-03-10 Electronic device, search method for electronic device, and medium
CN202010164088.6 2020-03-10

Publications (1)

Publication Number Publication Date
WO2021180109A1 true WO2021180109A1 (en) 2021-09-16

Family

ID=71678431

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/079905 WO2021180109A1 (en) 2020-03-10 2021-03-10 Electronic device and search method thereof, and medium

Country Status (2)

Country Link
CN (1) CN111460231A (en)
WO (1) WO2021180109A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460231A (en) * 2020-03-10 2020-07-28 华为技术有限公司 Electronic device, search method for electronic device, and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060233325A1 (en) * 2005-04-14 2006-10-19 Cheng Wu System and method for management of call data using a vector based model and relational data structure
CN101334796A (en) * 2008-02-29 2008-12-31 浙江师范大学 Personalized and synergistic integration network multimedia search and enquiry method
CN103955543A (en) * 2014-05-20 2014-07-30 电子科技大学 Multimode-based clothing image retrieval method
CN109783655A (en) * 2018-12-07 2019-05-21 西安电子科技大学 A kind of cross-module state search method, device, computer equipment and storage medium
CN110851629A (en) * 2019-10-14 2020-02-28 信阳农林学院 Image retrieval method
CN111460231A (en) * 2020-03-10 2020-07-28 华为技术有限公司 Electronic device, search method for electronic device, and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102346778B (en) * 2011-10-11 2013-08-21 北京百度网讯科技有限公司 Method and equipment for providing searching result
BR112019021201A8 (en) * 2017-04-10 2023-04-04 Hewlett Packard Development Co MACHINE LEARNING IMAGE SEARCH
CN110309324B (en) * 2018-03-09 2024-03-22 北京搜狗科技发展有限公司 Searching method and related device
CN109740077B (en) * 2018-12-29 2021-02-12 北京百度网讯科技有限公司 Answer searching method and device based on semantic index and related equipment thereof

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210294840A1 (en) * 2020-03-19 2021-09-23 Adobe Inc. Searching for Music
US11461649B2 (en) * 2020-03-19 2022-10-04 Adobe Inc. Searching for music
US20230097356A1 (en) * 2020-03-19 2023-03-30 Adobe Inc. Searching for Music
US11636342B2 (en) 2020-03-19 2023-04-25 Adobe Inc. Searching for music
CN116089368A (en) * 2022-08-01 2023-05-09 荣耀终端有限公司 File searching method and related device
CN116089368B (en) * 2022-08-01 2023-12-19 荣耀终端有限公司 File searching method and related device

Also Published As

Publication number Publication date
CN111460231A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
WO2021180109A1 (en) Electronic device and search method thereof, and medium
US11783191B2 (en) Method and electronic device for providing text-related image
WO2021036906A1 (en) Picture processing method and apparatus
WO2023125335A1 (en) Question and answer pair generation method and electronic device
CN111985240B (en) Named entity recognition model training method, named entity recognition method and named entity recognition device
CN110209784B (en) Message interaction method, computer device and storage medium
WO2022100221A1 (en) Retrieval processing method and apparatus, and storage medium
WO2021254411A1 (en) Intent recognigion method and electronic device
WO2024040865A1 (en) Video editing method and electronic device
US20240105159A1 (en) Speech processing method and related device
WO2021147421A1 (en) Automatic question answering method and apparatus for man-machine interaction, and intelligent device
CN114281956A (en) Text processing method and device, computer equipment and storage medium
CN112765387A (en) Image retrieval method, image retrieval device and electronic equipment
CN111444321B (en) Question answering method, device, electronic equipment and storage medium
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
WO2021238371A1 (en) Method and apparatus for generating virtual character
CN110245334A (en) Method and apparatus for output information
KR20210120203A (en) Method for generating metadata based on web page
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
CN111597823A (en) Method, device and equipment for extracting central word and storage medium
CN116431838B (en) Document retrieval method, device, system and storage medium
WO2023168997A1 (en) Cross-modal retrieval method and related device
CN117076702B (en) Image searching method and electronic equipment
WO2022143083A1 (en) Application search method and device, and medium
WO2024012171A1 (en) Binary quantization method, neural network training method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21768297

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21768297

Country of ref document: EP

Kind code of ref document: A1