WO2019169872A1

WO2019169872A1 - Method and device for searching for content resource, and server

Info

Publication number: WO2019169872A1
Application number: PCT/CN2018/111433
Authority: WO
Inventors: 董维山; 王园; 毛妤; 袁洁; 陈曼仪; 杨茗名
Original assignee: 北京百度网讯科技有限公司
Priority date: 2018-03-09
Filing date: 2018-10-23
Publication date: 2019-09-12
Also published as: CN108416028A; CN108416028B

Abstract

The present application discloses a method and a device for searching for a content resource, and a server. The method comprises: acquiring a preset image and at least one candidate content resource, and separately extracting text features and visual features of the preset image and text features and visual features of the candidate content resource; determining text similarities between the text features of the preset image and the text features of the candidate content resource; determining visual similarities between the visual features of the preset image and the visual feature of the candidate content resource; and determining, according to the determined text similarities and visual similarities, a target content resource from the at least one candidate content resource. The technical solution provided in embodiments of the present application combines text similarities and visual similarities in a content searching process, thereby enabling accurate searches for required content resources.

Description

Method, device and server for searching content resources

The present application claims priority to the Chinese patent application entitled "Method, Apparatus and Server for Searching for Content Resources", filed with the Chinese Patent Office on March 9, 2018, Application No. 201101195551.6, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of computer network technologies, and in particular, to a method, an apparatus, and a server for searching for content resources.

Background technique

With the development of computer technology, many new types of content resources have appeared, such as panoramic images, panoramic videos, three-dimensional (3D) models, and three-dimensional animations, as well as their presentation in virtual reality (VR) and augmented reality (AR) scenarios. At the same time, photographic techniques (such as fisheye lenses), modeling techniques, and programming tools are constantly evolving, making it easier to generate such content resources, and these new types of content resources are appearing more and more on the Internet. Compared with traditional text, 2D pictures, ordinary video, and audio, these content resources offer advantages such as coherence, multi-linearity, multiple viewing angles, a sense of presence, large space, high interactivity, immediacy of information, and linkage between online and offline.

Traditional Internet search technology mainly uses text information to index massive amounts of web content. Typically, TF-IDF (term frequency-inverse document frequency) and word vector (word2vec) techniques are used to build a text index over a webpage library and to search for pages whose content matches the user's text query. With the emergence of large amounts of picture and video content and the development of deep neural network technology, image search, voice search, and music search have also appeared.

However, for the new types of content resources mentioned above, because their form transcends the expression space of text, ordinary two-dimensional pictures, videos, and music, it is difficult for users to search for these content resources conveniently and quickly using current search engine technology.

Summary of the invention

The embodiment of the present application provides a method, an apparatus, and a server for searching for a content resource, to solve or alleviate one or more technical problems in the prior art, and at least provide a beneficial choice.

In a first aspect, the embodiment of the present application provides a method for searching for a content resource, including:

Obtaining a preset picture and at least one candidate content resource, and separately extracting text features and visual features of the preset picture and text features and visual features of the candidate content resource;

Determining a text similarity between a text feature of the preset picture and a text feature of the candidate content resource;

Determining a visual similarity between a visual feature of the preset picture and a visual feature of the candidate content resource;

The target content resource is determined from the at least one candidate content resource based on the determined text similarity and visual similarity.

With reference to the first aspect, in the first implementation manner of the first aspect, the present application determines, according to the determined text similarity and visual similarity, the target content resource from the at least one candidate content resource, including:

Obtaining an overall similarity between the preset picture and the candidate content resource according to the determined text similarity and visual similarity;

The target content resource is determined from the at least one candidate content resource based on the obtained overall similarity.

With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, obtaining the overall similarity between the preset picture and the candidate content resource according to the determined text similarity and visual similarity includes:

Selecting the determined text similarity and visual similarity based on a preset threshold;

Obtaining an overall similarity between the preset picture and the candidate content resource according to the selected text similarity and visual similarity.

With reference to the first aspect, in the third implementation manner of the first aspect, the text feature of the preset picture is extracted, including:

Identifying a picture content included in the preset picture by using a preset picture classification model to extract a text feature from the preset picture; or

Obtaining corresponding webpage content according to the uniform resource locator of the preset image, and extracting text features of the preset image from the webpage content.

With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, or the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, determining the text similarity between the text feature of the preset picture and the text feature of the candidate content resource includes:

Obtaining a content resource in a preset content resource library as the at least one candidate content resource, and determining a text similarity between the text feature of the preset picture and a text label of the candidate content resource, where the preset content resource library includes a plurality of content resources and their corresponding text labels.

With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, or the third implementation manner of the first aspect, in a fifth implementation manner of the first aspect, determining the visual similarity between the visual feature of the preset picture and the visual feature of the candidate content resource includes:

Sampling the candidate content resource to obtain at least one sample picture corresponding to the content resource;

For each sampled picture, determining a visual similarity between a visual feature of the sampled picture and a visual feature of the predetermined picture;

Determining the visual similarity between the visual feature of the candidate content resource and the visual feature of the preset picture according to the visual similarity between the visual feature of each sampled picture corresponding to the candidate content resource and the visual feature of the preset picture.

With reference to the fifth implementation manner of the first aspect, in the sixth implementation manner of the first aspect, the sampling, by the candidate content resource, includes:

In the visible space, the candidate content resources are subjected to perspective sampling in a preset observation manner and a sampling manner;

Wherein, the preset observation mode is based on at least one of an observation position, an angle, and a visible range, and a combination thereof.

In a second aspect, an embodiment of the present application provides an apparatus for searching for a content resource, including:

An acquiring module, configured to acquire a preset picture and at least one candidate content resource, and separately extract text features and visual features of the preset picture and text features and visual features of the candidate content resource;

a first determining module, configured to determine a text similarity between a text feature of the preset picture and a text feature of the candidate content resource;

a second determining module, configured to determine a visual similarity between a visual feature of the preset picture and a visual feature of the candidate content resource;

The target determining module is configured to determine the target content resource from the at least one candidate content resource according to the determined text similarity and visual similarity.

With reference to the second aspect, in the first implementation manner of the second aspect, the target determining module includes:

a first calculation submodule configured to obtain an overall similarity between the preset picture and the candidate content resource according to the determined text similarity and visual similarity;

The target determining submodule is configured to determine the target content resource from the at least one candidate content resource according to the obtained overall similarity.

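With reference to the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the first calculation submodule is further configured to select the determined text similarity and visual similarity based on a preset threshold, and to obtain the overall similarity between the preset picture and the candidate content resource according to the selected text similarity and visual similarity, as in the second implementation manner of the first aspect.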

With reference to the second aspect, in the third implementation manner of the second aspect, the acquiring module includes:

an identification submodule configured to identify the picture content included in the preset picture by using a preset picture classification model, so as to extract a text feature from the preset picture; or

The extracting sub-module is configured to obtain a corresponding webpage content according to the uniform resource locator of the preset image, and extract a text feature of the preset image from the webpage content.

With reference to the second aspect, the first embodiment of the second aspect, the second embodiment of the second aspect, or the third embodiment of the second aspect, in the fourth embodiment of the second aspect of the present application, The first determining module includes:

a first determining submodule configured to obtain a content resource in a preset content resource library as the at least one candidate content resource, and determine a text similarity between the text feature of the preset picture and a text label of the candidate content resource, where the preset content resource library includes a plurality of content resources and their corresponding text labels.

With reference to the second aspect, the first embodiment of the second aspect, the second embodiment of the second aspect, or the third embodiment of the second aspect, in a fifth implementation manner of the second aspect of the present application, The second determining module includes:

a sampling sub-module configured to sample the candidate content resource to obtain at least one sample picture corresponding to the content resource;

a second determining submodule configured to determine a visual similarity between a visual feature of the sampled picture and a visual feature of the preset picture for each sampled picture;

a third determining submodule configured to determine the visual similarity between the visual feature of the candidate content resource and the visual feature of the preset picture according to the visual similarity between the visual feature of each sampled picture corresponding to the candidate content resource and the visual feature of the preset picture.

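With reference to the fifth implementation manner of the second aspect, in a sixth implementation manner of the second aspect, the sampling submodule is further configured to perform perspective sampling on the candidate content resource in the visible space in a preset observation manner and a sampling manner, as in the sixth implementation manner of the first aspect.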

In a third aspect, an embodiment of the present application provides a server, where the server includes:

One or more processors;

a storage device for storing one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described above.

In a fourth aspect, an embodiment of the present application provides a computer readable storage medium configured to store computer software instructions used by the apparatus for searching for a content resource, including a program for executing the method for searching for a content resource in the foregoing first aspect.

The foregoing technical solutions have the following advantages or beneficial effects: in the search solution of the embodiments of the present application, text features and visual features of the preset picture are extracted; a text similarity is determined between the text feature of the preset picture and the text feature of each content resource; a visual similarity is determined between the visual feature of the preset picture and the visual feature of each content resource; and the target content resource is then determined from the content resources according to the determined text similarity and visual similarity. Because text similarity and visual similarity are combined in the search process, the desired content resources can be searched for accurately.

The above summary is for the purpose of illustration only and is not intended to be limiting. Further aspects, embodiments, and features of the present application will be readily apparent from the Detailed Description of the Drawings.

DRAWINGS

In the drawings, the same reference numerals are used to refer to the The drawings are not necessarily to scale. It is to be understood that the appended drawings are not intended to

FIG. 1 is a flowchart of a method for searching for content resources according to Embodiment 1 of the present application;

FIG. 2 is a flowchart of a method for searching for content resources according to Embodiment 2 of the present application;

FIG. 3 is a schematic diagram of performing perspective sampling on content resources in a method for searching for content resources according to Embodiment 2 of the present application;

FIG. 4 is a schematic diagram of a comparison of visual features of a preset picture and a content resource in a method for searching for a content resource according to Embodiment 2 of the present application;

FIG. 5 is a flowchart of a method for searching for content resources according to Embodiment 3 of the present application;

FIG. 6 is a schematic diagram of an apparatus for searching for content resources according to Embodiment 4 of the present application;

FIG. 7 is a schematic diagram of an apparatus for searching for content resources according to Embodiment 5 of the present application;

FIG. 8 is a schematic diagram of a server according to Embodiment 6 of the present application.

Detailed description

In the following, only certain exemplary embodiments are briefly described. The described embodiments may be modified in various different ways, without departing from the spirit and scope of the invention. Accordingly, the drawings and description are to be regarded as illustrative rather

Embodiment 1

The embodiment of the present application provides a method for searching for content resources. FIG. 1 is a flowchart of a method for searching for content resources according to an embodiment of the present application. The method for searching for content resources in the embodiment of the present application includes the following steps:

S101. Acquire a preset picture and at least one candidate content resource, and respectively extract text features and visual features of the preset picture and text features and visual features of the candidate content resource.

In the embodiment of the present application, the preset picture may include, but is not limited to, a network picture, a picture stored in an album, a picture taken by a camera, or a hand-drawn sketch.

In the embodiments of the present application, a picture sent by a client may be received as the preset picture through API (Application Programming Interface) endpoints over protocols such as HTTP (HyperText Transfer Protocol) and HTTPS, or the preset picture may be obtained from a web address of the picture entered by the user.

The text feature of the preset picture may be obtained by analyzing the preset picture to obtain one or more short texts that describe or represent the picture content, thereby converting a query condition in picture form into a query condition in text form. A specific analysis method may include constructing a picture classifier with a machine learning algorithm and then inputting the preset picture into the picture classifier. The picture classifier analyzes the preset picture to identify the picture content and outputs descriptive text for the preset picture. For example, if a picture of a Tyrannosaurus is input into the picture classifier, the classifier can output the text "T-Rex".

S102. Determine a text similarity between a text feature of the preset picture and a text feature of the candidate content resource.

Content resources related to embodiments of the present application include, but are not limited to, normal video, panoramic picture, panoramic video, three-dimensional (3D) model, three-dimensional animation, and their presentation in virtual reality (VR) and augmented reality (AR) scenarios.

For example, a panoramic photo (panorama) is a photo that covers the normal effective viewing angle of the human eyes (approximately 90 degrees horizontally and 70 degrees vertically), or the range that includes peripheral vision of both eyes (approximately 180 degrees horizontally and 90 degrees vertically), or even a complete 360-degree scene.

Wherein, the content resource may be obtained by a web crawler from the Internet or by a content producer. For example, in order to facilitate searching for content resources, web crawler technology can be used to obtain content resources. Content resource producers can also create content resources and build them into content repositories. Content resources can be tagged with text to facilitate classification, management, and retrieval. In addition, the content repository can be updated at preset intervals. In this way, when searching for content resources, you can search in the content resource library to improve search efficiency.

In the implementation of the present application, the text feature of the preset picture may be compared one by one with the text feature of each content resource in the content resource library to determine the text similarity between the preset picture and each content resource. For example, suppose the text feature of the preset picture is "Tiananmen". If a content resource is a panorama whose text feature is "Tiananmen Square", comparing "Tiananmen" with "Tiananmen Square" shows that the text similarity is high. If the text feature of another content resource is "Nanjing", comparing "Tiananmen" with "Nanjing" shows that the text similarity is low.

S103. Determine a visual similarity between a visual feature of the preset picture and a visual feature of the candidate content resource.

The visual feature may be attribute data that represents semantics of the image, such as color, texture, and the like of the image.
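As a non-limiting illustration of a color-based visual feature, the following Python sketch computes a normalized color histogram from a list of RGB pixels. The function name `color_histogram`, the pixel representation, and the bin count are assumptions made for illustration; the embodiments do not prescribe a particular feature.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized joint RGB histogram (hypothetical visual feature)."""
    if bins_per_channel < 1:
        raise ValueError("bins_per_channel must be at least 1")
    histogram = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins_per_channel-1.
        r_bin = r * bins_per_channel // 256
        g_bin = g * bins_per_channel // 256
        b_bin = b * bins_per_channel // 256
        index = (r_bin * bins_per_channel + g_bin) * bins_per_channel + b_bin
        histogram[index] += 1.0
    total = len(pixels)
    if total == 0:
        return histogram
    # Normalize so that pictures of different sizes are comparable.
    return [count / total for count in histogram]
```

Two pictures with similar color distributions yield histograms that are close to each other, so such a vector can be passed to any vector similarity measure.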

S104. Determine a target content resource from the at least one candidate content resource according to the determined text similarity and visual similarity.

In step S101 of the embodiment of the present application, the text feature and the visual feature of the preset picture may be extracted simultaneously or separately; the order in which they are extracted is not limited. For example, step S102 may be performed to determine the text similarity after the text feature of the preset picture is extracted, and step S103 may then be performed to determine the visual similarity after the visual feature of the preset picture is extracted. Conversely, step S103 may be performed first after the visual feature is extracted, followed by step S102 after the text feature is extracted. Alternatively, the extraction of the two features and the corresponding similarity determinations may be performed in parallel.
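As a non-limiting illustration of the parallel option, the following Python sketch runs a text branch and a visual branch concurrently with the standard library's thread pool. The name `run_branches` and the idea of passing each branch as a zero-argument callable are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def run_branches(
    text_branch: Callable[[], float],
    visual_branch: Callable[[], float],
) -> Tuple[float, float]:
    """Run both feature-extraction-plus-similarity branches in parallel.

    Each callable is a hypothetical stand-in for one branch
    (text: S101 + S102, visual: S101 + S103) and returns its similarity.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(text_branch)
        visual_future = pool.submit(visual_branch)
        # result() blocks until the corresponding branch has finished.
        return text_future.result(), visual_future.result()
```

Because neither branch depends on the other's output, the results are the same as in either sequential order.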

The search solution provided by the embodiment of the present application extracts the text feature and the visual feature of the preset picture, determines the text similarity between the text feature of the preset picture and the text feature of each content resource, determines the visual similarity between the visual feature of the preset picture and the visual feature of each content resource, and then determines the target content resource from the content resources according to the determined text similarity and visual similarity. Because text similarity and visual similarity are combined in the search process, the required content resources can be searched for accurately. The solution is suitable for searching various content resources such as panoramic pictures, panoramic videos, three-dimensional models, and three-dimensional animations, as well as their presentation in virtual reality and augmented reality scenarios.

Embodiment 2

On the basis of the first embodiment, the embodiment of the present application provides a method for searching for content resources. FIG. 2 is a flowchart of a method for searching for content resources according to an embodiment of the present application. The method for searching for content resources in the embodiment of the present application includes the following steps:

S201. Acquire a preset picture and at least one candidate content resource, and respectively extract text features and visual features of the preset picture and text features and visual features of the candidate content resource.

In an embodiment, a preset picture classification model may be used to identify the picture content of the preset picture and extract the text feature from the preset picture. For example, the picture classification model may be trained with a convolutional neural network algorithm per vertical category; when a picture of a Tyrannosaurus is input into the picture classification model, the model can output the text classification label "Tyrannosaurus Rex".

In another embodiment, the text feature of the preset picture may be extracted by acquiring the corresponding webpage content according to the Uniform Resource Locator (URL) of the preset picture and extracting the text feature of the preset picture from the webpage content. For example, when the preset picture comes from the Internet, or when a web page contains a picture identical to the preset picture, the URL of the preset picture or of the identical picture can be obtained first. The content of the webpage indicated by the URL is then processed to extract the text feature of the preset picture. For example, given a preset picture of Tiananmen and the web address of that picture, the webpage content is processed to generate and output the short text "Tiananmen".
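As a non-limiting illustration of extracting a short text from webpage content, the following Python sketch takes the text of the page's `<title>` element using the standard library HTML parser. Using the title as the text feature, and the names `TitleExtractor` and `text_feature_from_html`, are assumptions for illustration; fetching the page from the URL is omitted, and the embodiments do not specify how the webpage content is processed.

```python
from html.parser import HTMLParser
from typing import List


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def text_feature_from_html(html: str) -> str:
    """Return the page title with whitespace collapsed, or '' if absent."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.title_parts).split())
```

A fuller pipeline might also consider the text surrounding the picture or its `alt` attribute; the title is used here only to keep the sketch short.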

S202. The content resource in the preset content resource library is obtained as the at least one candidate content resource, and the text similarity between the text feature of the preset image and the text label of the candidate content resource is determined.

For ease of management, a text label may be set for a content resource when it is generated. The text similarity determination may be implemented, for example, with a "keyword text similarity calculation" module based on natural language processing technology, as commonly used by web search engines.

The keyword text similarity calculation works as follows: given two keywords (both short texts), a text similarity calculation model scores their semantic similarity. The model is constructed from user keyword data and click log data and is pre-trained offline (for example, based on a neural network or a bag-of-words model). The higher the score, the closer the two keywords are semantically, and vice versa. Taking a calculation method based on cosine similarity as an example, the output score lies in the range [-1, 1]. For example, if the similarity score of "Tiananmen" and "Tiananmen Square" is s, the value of s should be close to 1, and the similarity score of "Tiananmen" and "Nanjing Road" should be significantly lower than s.
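As a non-limiting illustration of cosine-based scoring, the following Python sketch compares two short texts using character-bigram count vectors. The bigram representation and the names `char_bigrams`, `cosine_similarity`, and `text_similarity` are assumptions that stand in for the pre-trained model described above. Because count vectors are non-negative, this sketch yields scores in [0, 1] rather than the full [-1, 1] range of learned embeddings.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Bag of character bigrams; a stand-in for a learned text representation."""
    normalized = text.lower().replace(" ", "")
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector has no direction; treat as dissimilar.
    return dot / (norm_a * norm_b)


def text_similarity(query_text: str, label_text: str) -> float:
    """Score a picture's text feature against a content resource's text label."""
    return cosine_similarity(char_bigrams(query_text), char_bigrams(label_text))
```

With this stand-in, `text_similarity("Tiananmen", "Tiananmen Square")` scores higher than `text_similarity("Tiananmen", "Nanjing Road")`, matching the ordering described above, although a surface-level bigram measure does not capture semantics the way a trained model would.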

S203. Sample the candidate content resource to obtain at least one sample picture corresponding to the content resource.

In step S203, perspective sampling may be performed on the content resource in the visible space in a preset observation manner and a sampling manner.

The sampling manner may include equal interval sampling, random sampling, sampling based on user interaction history record distribution, and the like.

Taking equal-interval sampling as an example, when a content resource is sampled, the user's way of observing can be simulated to sample viewing angles across the entire visible space: the content resource is projected onto a plane at a simulated observation point to obtain the corresponding sampled picture, and the simulated observation point is then adjusted, that is, the observation position is changed. This perspective sampling method applies to all types of content resources. When sampling in a preset observation manner, the sampling interval must take into account factors such as the amount of calculation, storage space, precision, and recall. For panoramic videos and 3D animations that contain animated content, frame sampling must additionally be applied, that is, output pictures are generated along the time axis; here too the sampling interval must take into account the amount of calculation, storage space, precision, and recall.

For example, as shown in FIG. 3, the content resource is a 3D model of a Tyrannosaurus Rex. Different observation manners are used, for example rotating by a certain planar angle each time, and the 3D model is sampled once per observation to obtain sample picture 1, sample picture 2, sample picture 3, ..., sample picture n.
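As a non-limiting illustration of the sampling manners, the following Python sketch generates observation angles for equal-interval and random sampling, plus frame indices for sampling along the time axis. The function names, the use of a single rotation angle in degrees, and the fixed random seed are assumptions for illustration; rendering the planar projection at each observation point is omitted.

```python
import random
from typing import List


def equal_interval_angles(n_samples: int) -> List[float]:
    """Observation angles in degrees, evenly spaced over a full rotation."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return [i * 360.0 / n_samples for i in range(n_samples)]


def random_angles(n_samples: int, seed: int = 0) -> List[float]:
    """Observation angles in degrees drawn uniformly from [0, 360)."""
    rng = random.Random(seed)  # Seeded so that sampling is reproducible.
    return [rng.random() * 360.0 for _ in range(n_samples)]


def frame_indices(total_frames: int, interval: int) -> List[int]:
    """Frame sampling along the time axis for video or animated resources."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return list(range(0, total_frames, interval))
```

A smaller angular or frame interval yields more sampled pictures, which tends to raise recall at the cost of more calculation and storage, as noted above.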

S204. Determine, for each sample picture, a visual similarity between a visual feature of the sampled picture and a visual feature of the preset picture.

In a specific implementation, first, a picture feature extractor is used to extract a visual feature of a preset picture.

This step can be implemented with a visual-feature-based "similar image retrieval" module of the kind commonly used by image search engines in conventional technology. The similar image retrieval process includes taking the preset picture and applying a pre-trained picture feature extractor (for example, one based on a convolutional neural network) to extract visual features from the preset picture.

When extracting the visual features of each sampled picture, the above method can also be used to extract the visual features of the sampled picture.

Then, the visual feature of the preset picture is compared with the visual feature of each sample picture to obtain the visual similarity between the preset picture and each sample picture.

S205. Determine the visual similarity between the visual feature of the candidate content resource and the visual feature of the preset picture according to the visual similarity between the visual feature of each sample picture corresponding to the candidate content resource and the visual feature of the preset picture.

In step S205, multiple sampled pictures have been obtained for the candidate content resource. The preset picture is compared with each sampled picture of the candidate content resource, the visual similarity between the preset picture and each sampled picture is calculated, and the visual similarities for all sampled pictures are output.

According to the visual features extracted in step S204, the visual similarity between the visual feature of the preset picture and the visual feature of each sample picture of the candidate content resource is determined. A higher visual similarity indicates that the preset picture is closer to the sampled picture in visual semantics, and vice versa.
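As a non-limiting illustration of comparing the preset picture with each sampled picture, the following Python sketch scores feature vectors with cosine similarity. Cosine similarity is an assumed choice here (the embodiments do not name a measure for visual features), and the names `cosine` and `sample_similarities` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_similarities(
    preset_feature: Sequence[float],
    sample_features: Sequence[Sequence[float]],
) -> List[float]:
    """One visual similarity per sampled picture of a candidate resource."""
    return [cosine(preset_feature, feature) for feature in sample_features]
```

The resulting list holds one value per sampled picture, in sampling order, and can then be combined into a resource-level result.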

S206. Obtain an overall similarity between the preset picture and the candidate content resource according to the determined text similarity and visual similarity.

Step S206 includes: A, selecting among the determined text similarity and visual similarities based on a preset threshold; B, obtaining the overall similarity between the preset picture and the candidate content resource according to the selected text similarity and visual similarities.

Methods for calculating the overall similarity include, but are not limited to, linear weighting, multiplication, and value-range normalization. Taking linear weighting as an example: suppose a content resource has one text feature, so the preset picture has one text similarity value with respect to that content resource. When the visual similarity is calculated, the content resource may be sampled to obtain a plurality of sampled pictures, yielding a corresponding plurality of visual similarities. The text similarity and each visual similarity are each taken as a term in the formula, each term is multiplied by its corresponding weight, and the terms are summed to obtain the overall similarity. The formula is as follows:

$$Q = aS_0 + bS_1 + cS_2 + \dots + nS_n$$

where $Q$ represents the overall similarity; $a, b, c, \dots, n$ are the weights; $S_0$ represents the text similarity; and $S_1, S_2, \dots, S_n$ represent the visual similarities.
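As a non-limiting illustration of the linear weighting formula, the following Python sketch computes Q from one text similarity and a list of visual similarities. The function and parameter names are hypothetical, and the weight values in the test are arbitrary; the embodiments do not specify how the weights are chosen.

```python
from typing import Sequence


def overall_similarity(
    text_similarity: float,
    visual_similarities: Sequence[float],
    text_weight: float,
    visual_weights: Sequence[float],
) -> float:
    """Linear weighting: Q = a*S_0 + b*S_1 + c*S_2 + ... + n*S_n."""
    if len(visual_similarities) != len(visual_weights):
        raise ValueError("need exactly one weight per visual similarity")
    # a * S_0: the text term.
    total = text_weight * text_similarity
    # b * S_1 + c * S_2 + ...: one term per sampled picture.
    total += sum(w * s for w, s in zip(visual_weights, visual_similarities))
    return total
```

If the weights are non-negative and sum to 1, Q stays within the range of the input similarities, which makes a fixed threshold on Q easier to interpret.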

In addition, it is sometimes necessary to consider the impact of other factors on the overall similarity, such as content quality assessment indices (for example, high or low quality, resolution, and model fineness), user historical click records, and prohibitions under laws and regulations.

Determining the target content resource can be done in the following ways:

First, a content resource whose overall similarity is greater than a first preset threshold is determined as the target content resource. For example, if the first preset threshold is 80%, the content resources whose overall similarity is greater than 80% are output from among all the searched content resources.

Second, the searched content resources may be sorted by overall similarity, and the top-ranked content resources are determined as the target content resources. For example, the content resources are sorted in order of overall similarity, highest first, using a ranking function rank, and the first five content resources are retained.
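As a non-limiting illustration of the two ways of determining the target content resource, the following Python sketch selects resource identifiers either by threshold or by rank. The dictionary of scores keyed by a hypothetical resource ID, and the function names, are assumptions for illustration.

```python
from typing import Dict, List


def select_by_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep resources whose overall similarity is greater than the threshold."""
    return [resource_id for resource_id, score in scores.items() if score > threshold]


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Sort by overall similarity, highest first, and keep the first k IDs."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [resource_id for resource_id, _ in ranked[:k]]
```

Thresholding returns a variable number of results, possibly none, while ranking always returns up to k results regardless of how similar they are.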

In the embodiment of the present application, in order to improve calculation efficiency and avoid processing too many content resources, the text similarity and/or the visual similarity may each be filtered by value after they are calculated. In one embodiment, content resources whose text similarity is less than the first preset threshold are filtered out, and content resources whose visual similarity is less than the second preset threshold are filtered out. This reduces the number of content resources for which the overall similarity must be calculated, thereby reducing the amount of calculation and improving calculation efficiency.
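A minimal sketch of this pre-filtering step is given below. The data layout and names are hypothetical, and one interpretation is assumed: a resource is dropped when its text similarity is below the text threshold, or when none of its sampled pictures reaches the visual threshold.

```python
from typing import Dict, List, Tuple

# resource ID -> (text similarity, visual similarities of its sampled pictures)
Candidates = Dict[str, Tuple[float, List[float]]]


def prefilter(candidates: Candidates,
              text_threshold: float,
              visual_threshold: float) -> Candidates:
    """Drop low-scoring entries before the overall similarity is computed."""
    kept: Candidates = {}
    for rid, (text_sim, visual_sims) in candidates.items():
        if text_sim < text_threshold:
            continue  # text similarity too low: skip the whole resource
        strong = [s for s in visual_sims if s >= visual_threshold]
        if not strong:
            continue  # assumed rule: no sampled picture is close enough
        kept[rid] = (text_sim, strong)
    return kept


# Hypothetical scores for three candidate resources.
candidates = {
    "res_a": (0.9, [0.2, 0.7]),
    "res_b": (0.1, [0.9]),
    "res_c": (0.8, [0.1]),
}
remaining = prefilter(candidates, text_threshold=0.5, visual_threshold=0.5)
```

Only the surviving resources, with their surviving visual similarities, would then enter the overall similarity calculation.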

S207. Determine a target content resource from the at least one candidate content resource according to the obtained overall similarity.

As shown in FIG. 4, the preset picture is a picture of a certain scenic spot, and the candidate content resource is a panoramic picture. In this case, the panoramic picture is sampled to obtain a plurality of sampled pictures, and the scenic spot picture is then compared with each sampled picture. If the scenic spot picture matches a sampled picture, the panoramic picture corresponding to that sampled picture is a content resource matching the scenic spot picture, and the panoramic picture is output.

In a specific implementation, each content resource in the content resource library is used as a candidate content resource; sometimes, however, to reduce the amount of calculation, only content resources of a certain category are used as candidate content resources. The above steps are then repeated to obtain the overall similarity between the preset image and each candidate content resource, and the candidate content resources are sorted by overall similarity. The higher the overall similarity value, the more similar the preset image and the content resource.

In addition, each content resource has a corresponding identifier. For ease of implementation, when a searched content resource is output, the content resource itself is not output directly; instead, its identifier (ID) is output. For example, the overall similarity and the identifier (ID) of the corresponding content resource are output, the content resources are sorted by overall similarity, and the IDs of the first n content resources are then output.

Then, when the content resource is displayed on the client, the storage address of the content resource can be obtained according to the ID. The user then selects, through some interaction mode in the browser interface (for example, selecting a remote player, etc.), the multimedia file to be displayed from the candidate multimedia file list, that is, the target content resource.

The technical solution of the embodiment of the present application samples the content resources using different observation manners and sampling intervals, so that the visual features of the preset image can be comprehensively matched against the visual features of the content resources, and the accuracy of searching for content resources is therefore higher.

Embodiment 3

On the basis of the second embodiment, the embodiment of the present application provides a method for searching for content resources. As shown in FIG. 5, it is a flowchart of a method for searching for content resources according to an embodiment of the present application. The method for searching for content resources in the embodiment of the present application includes:

1 Preset picture (query): the initial input of the system, generated by the browser client and containing the picture content and the URL of the picture.

The form of the picture is not limited; it can be a picture file uploaded by the user, a picture taken by a camera, or a hand-drawn sketch.

2 Network interface: receives and parses the preset picture sent by the client, and returns the content resource search result to the client. Possible implementations include, but are not limited to, various types of API definitions based on protocols such as HTTP and HTTPS.

3 Picture word guessing: the input is the preset picture passed on by the network interface, and the output is one or more short text segments that can describe or represent the content of the picture. The role of picture word guessing is to convert the preset picture, which is in picture form, into keywords in text form.

This step can be implemented by using the "figure map" module commonly used by image search engines. Typically, the functions of this module include:

a) Use identical-picture matching or URL information to aggregate and extract the text information on the picture's source pages on the Internet (where the picture has been reproduced, there may be multiple source pages) and generate text output. For example, given a picture of Tiananmen Gate as input, the module outputs the short text "Tiananmen".

b) Use a picture classifier pre-trained for a vertical category (for example, a picture classifier based on a convolutional neural network algorithm) to identify the picture content and output the classification label text. For example, given a picture of a Tyrannosaurus rex as input, the module outputs the short text "Tyrannosaurus rex".

4 Text similarity calculation: the inputs are the output of 3 (the picture word-guessing result) and the set of text labels carried by each resource in the content resource library. This step pairs the word-guessing result text with every content resource text label to obtain multiple text matching pairs, calculates the text similarity of each text matching pair, and outputs the text similarity scores.

This step can be implemented by a "query text similarity calculation" module, based on natural language processing technology, of the kind commonly used by web search engines. Typically, the text similarity calculation works as follows: given two keywords (both short texts), a text similarity model pre-trained offline on user query data and click logs (e.g., based on a neural network or a bag-of-words model) scores the semantic similarity of the two keywords. The higher the score, the closer the two keywords are semantically, and vice versa. Taking a calculation method based on cosine similarity as an example, the output score lies in the range [-1, 1]. For example, if the similarity score of "Tiananmen" and "Tiananmen Square" is denoted s, the value of s should be close to 1, and the similarity score of "Tiananmen" and "Nanjing Road" should be significantly lower than s.
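The cosine similarity mentioned above can be sketched as follows. The embedding vectors are made-up placeholders standing in for the output of a pre-trained text model, which is not shown; only the scoring step is illustrated.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; the result lies in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical keyword embeddings (a real system would use a trained model).
tiananmen = [0.9, 0.1, 0.0]
tiananmen_square = [0.8, 0.2, 0.1]
nanjing_road = [0.1, 0.9, 0.3]

s = cosine_similarity(tiananmen, tiananmen_square)
other = cosine_similarity(tiananmen, nanjing_road)
```

With these placeholder vectors, `s` comes out close to 1 and `other` is clearly lower, which mirrors the behaviour described for the keyword pairs.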

5 Content resource library: the content resource library is a collection of various resources, obtained by search engine crawlers or provided by content producers; each resource carries a text label for classification, management, and retrieval.

6 View sampling: the input is any resource in the content resource library, and the output is several sampled pictures. Referring to the embodiment shown in FIG. 3, for a given content resource, view sampling can be performed over the entire visible space by simulating and varying the user's observation manner, including but not limited to the observation position, angle, and visual range; this yields multiple pictures, each of which is a planar projection of the content onto the simulated observation point. View sampling can be used for panoramic/3D/AR/VR content. The sampling intervals of the observation position, angle, and visual range can be traded off among computation, storage space, and precision and recall. For panoramic video and 3D animation that contain animated content, frame sampling is additionally used to generate output sampled pictures along the time axis, and the sampling time interval is likewise traded off among computation, storage space, and precision and recall. Typical sampling techniques include, but are not limited to, equally spaced sampling, random sampling, sampling based on the distribution of user interaction history, and the like.
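As an illustration of equally spaced sampling, the sketch below only enumerates the simulated viewing parameters; rendering the planar projection for each view is outside its scope. The `View` type, the use of yaw and pitch angles in whole degrees, and the default visual range are assumptions made for the example.

```python
from typing import List, NamedTuple


class View(NamedTuple):
    yaw_deg: float    # horizontal viewing angle
    pitch_deg: float  # vertical viewing angle
    fov_deg: float    # visual range (field of view)


def equally_spaced_views(yaw_step: int, pitch_step: int,
                         fov_deg: float = 90.0) -> List[View]:
    """Enumerate viewing directions at fixed angular intervals (in degrees)."""
    if yaw_step <= 0 or pitch_step <= 0:
        raise ValueError("sampling intervals must be positive")
    return [View(float(yaw), float(pitch), fov_deg)
            for pitch in range(-90, 91, pitch_step)
            for yaw in range(0, 360, yaw_step)]


coarse = equally_spaced_views(yaw_step=90, pitch_step=45)
fine = equally_spaced_views(yaw_step=45, pitch_step=45)
```

Halving the yaw interval doubles the number of sampled views, which is the trade-off between computation and storage on one side and precision and recall on the other.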

7 Visual similarity calculation: the inputs are the preset picture and the output of 6 (the set of sampled pictures obtained by applying the view sampling step to each content resource in the content resource library). This module pairs the preset picture with every content resource sampled picture to obtain a plurality of picture matching pairs, calculates the visual similarity of each picture matching pair, and outputs the visual similarity scores.

This step can be implemented by a visual-feature-based "similar picture retrieval" module of the kind commonly used by image search engines. Typically, similar picture retrieval works as follows: given a preset picture, a pre-defined or offline pre-trained picture feature extractor (e.g., based on a convolutional neural network) extracts visual features from the preset picture; the extracted features are compared with the features of each picture in the picture library, and the similarity of the visual features is scored. The higher the score, the smaller the visual difference between the preset picture and the picture in the library, and vice versa.
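The comparison of the preset picture's features against the features of each sampled picture might be sketched as below. Feature extraction itself is not shown; the feature vectors are placeholders, and converting Euclidean distance into a score with 1 / (1 + d) is an assumption chosen for the example rather than the measure used by the application.

```python
import math
from typing import Dict, List, Sequence, Tuple


def visual_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Map Euclidean distance d to a score in (0, 1]; identical features give 1."""
    return 1.0 / (1.0 + math.dist(u, v))


def rank_samples(query_feature: Sequence[float],
                 sample_features: Dict[str, Sequence[float]]
                 ) -> List[Tuple[str, float]]:
    """Score every sampled picture against the query, best match first."""
    scored = [(view_id, visual_similarity(query_feature, feat))
              for view_id, feat in sample_features.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Hypothetical features for three sampled views of one panoramic resource.
samples = {"yaw_0": [1.0, 0.0], "yaw_90": [0.0, 1.0], "yaw_180": [0.6, 0.8]}
ranked = rank_samples([0.0, 1.0], samples)
```

Keeping the view identifier next to each score is what allows a match to be reported for a specific viewing angle.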

8 Overall similarity calculation: the inputs are the outputs of 4 and 7, that is, the text similarity scores of the text matching pairs and the visual similarity scores of the picture matching pairs between the preset picture and the resources in the content resource library; the output is the overall similarity and the corresponding candidate content resource ID. The higher the overall similarity score, the higher the correlation between the preset picture and the corresponding candidate content resource.

The overall similarity calculation is based on a combination of text and visual similarity scores, and possible implementations include, but are not limited to, linear weighting, product, range normalization, and the like. At the same time, additional factors may be considered, including but not limited to content quality assessment indices (high quality, low quality, resolution, model sophistication, etc.), user history click records, laws and regulations, and the like.

In order to speed up the calculation in this module and avoid dealing with too many matching pairs, the text similarity scores and the visual similarity scores entering the calculation may each be filtered. For example, similarity scores below a certain threshold are filtered out directly and do not enter the overall similarity calculation, which reduces the amount of calculation.

9 Top-k sorting: the input is the output of 8, namely the overall similarity and the corresponding candidate content resource ID, and the output is the first k candidate content resource IDs arranged in descending order of overall similarity score.

10 Client display: based on the output of 9, the user selects, through some interaction mode in the browser interface, the content resource to be displayed from the candidate content resource list, and the browser client displays it.

Among the above modules, 3-9 can be pre-calculated offline, thereby accelerating the online search process. For example, the picture library of the whole web may be processed in advance offline: the similarity scores are calculated and sorted offline, and a static lookup table is built that associates any picture on any web page with content resources. When searching online, matching content can be obtained quickly by looking up the table. This lookup table can be updated by incremental calculation. If the user's preset picture is not in the web-wide picture library, 3-9 can be calculated online. Both the online and offline calculation processes can be accelerated by techniques such as parallel computing. A typical matching result is shown in the embodiment of FIG. 3: the matching can be accurate to a specific viewing angle, the matching precision is high, and the user experience is good.
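The offline lookup table with an online fallback can be sketched as follows. The class name `CachedSearch`, the use of a string key to identify a picture, and the stand-in pipeline function are all hypothetical; the real steps 3-9 are represented by a single callable.

```python
from typing import Callable, Dict, List

# Stand-in for steps 3-9: picture key -> ranked content resource IDs.
SearchFn = Callable[[str], List[str]]


class CachedSearch:
    def __init__(self, pipeline: SearchFn) -> None:
        self._pipeline = pipeline
        self._table: Dict[str, List[str]] = {}

    def precompute(self, picture_keys: List[str]) -> None:
        """Offline step; calling it again with new keys is an incremental update."""
        for key in picture_keys:
            self._table[key] = self._pipeline(key)

    def search(self, picture_key: str) -> List[str]:
        """Online step: table lookup first, full calculation only on a miss."""
        hit = self._table.get(picture_key)
        if hit is not None:
            return hit
        return self._pipeline(picture_key)


calls: List[str] = []


def fake_pipeline(key: str) -> List[str]:
    calls.append(key)  # record how often the expensive path runs
    return [f"resource_for_{key}"]


engine = CachedSearch(fake_pipeline)
engine.precompute(["pic_1", "pic_2"])
cached = engine.search("pic_1")   # answered from the lookup table
fresh = engine.search("pic_3")    # not in the table: calculated online
```

In this sketch an online miss is not written back to the table; whether to do so is a design choice the description leaves open.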

Embodiment 4

An embodiment of the present application provides an apparatus for searching for a content resource. FIG. 6 is a schematic diagram of an apparatus for searching for content resources according to an embodiment of the present application. The device for searching for content resources in the embodiment of the present application includes:

The obtaining module 61 is configured to acquire a preset picture and at least one candidate content resource, and respectively extract text features and visual features of the preset picture and text features and visual features of the candidate content resource;

The first determining module 62 is configured to determine a text similarity between the text feature of the preset picture and the text feature of the candidate content resource;

The second determining module 63 is configured to determine a visual similarity between a visual feature of the preset picture and a visual feature of the candidate content resource;

The target determining module 64 is configured to determine the target content resource from the at least one candidate content resource according to the determined text similarity and visual similarity.

The technical solution of the embodiment of the present application combines the text features and visual features of the preset picture and the content resource, so that the accuracy of searching for content resources is high. The technical effect is the same as that of the first embodiment, and details are not described herein again.
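One way to picture how the four modules cooperate is the sketch below. It is a hypothetical composition, not the structure of FIG. 6: each module is modelled as a plain callable, and the stub behaviours and data are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Dict[str, object]


@dataclass
class SearchDevice:
    obtain: Callable[[], Tuple[Features, Dict[str, Features]]]  # module 61
    text_similarity: Callable[[Features, Features], float]       # module 62
    visual_similarity: Callable[[Features, Features], float]     # module 63
    determine_target: Callable[[Dict[str, Tuple[float, float]]], List[str]]  # module 64

    def run(self) -> List[str]:
        query, candidates = self.obtain()
        sims = {rid: (self.text_similarity(query, cand),
                      self.visual_similarity(query, cand))
                for rid, cand in candidates.items()}
        return self.determine_target(sims)


# Stub modules with made-up behaviour, for illustration only.
device = SearchDevice(
    obtain=lambda: ({"text": "tower", "visual": 1.0},
                    {"r1": {"text": "tower", "visual": 0.9},
                     "r2": {"text": "road", "visual": 0.2}}),
    text_similarity=lambda q, c: 1.0 if q["text"] == c["text"] else 0.0,
    visual_similarity=lambda q, c: 1.0 - abs(q["visual"] - c["visual"]),
    determine_target=lambda sims: [max(sims, key=lambda rid: sum(sims[rid]))],
)
result = device.run()
```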

Embodiment 5

On the basis of the fourth embodiment, the embodiment of the present application provides an apparatus for searching for content resources. FIG. 7 is a schematic diagram of an apparatus for searching for content resources according to an embodiment of the present application. In the device for searching content resources in the embodiment of the present application:

The target determination module 64 includes:

The first calculation sub-module 641 is configured to obtain an overall similarity between the preset picture and the candidate content resource according to the determined text similarity and visual similarity;

The target determining sub-module 642 is configured to determine the target content resource from the at least one candidate content resource according to the obtained overall similarity.

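The first calculation sub-module 641 is further configured to: select the determined text similarity and visual similarity based on a preset threshold; and obtain the overall similarity between the preset picture and the candidate content resource according to the selected text similarity and visual similarity.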

Further, the obtaining module 61 includes:

The identification sub-module 611 is configured to identify a picture content included in the preset picture by using a preset picture classification model to extract a text feature from the preset picture; or

The extraction sub-module 612 is configured to obtain a corresponding webpage content according to the uniform resource locator of the preset image, and extract a text feature of the preset image from the webpage content.

Further, the first determining module 62 includes:

The first determining sub-module 621 is configured to acquire a content resource in the preset content resource library as the at least one candidate content resource, and determine a text similarity between the text feature of the preset image and the text label of the candidate content resource, wherein the preset content resource library includes a plurality of content resources and their corresponding text labels.

Further, the second determining module 63 includes:

The sampling sub-module 631 is configured to sample the candidate content resource to obtain at least one sample picture corresponding to the content resource;

The second determining sub-module 632 is configured to determine, for each sampled picture, a visual similarity between a visual feature of the sampled picture and a visual feature of the preset picture;

The third determining sub-module 633 is configured to determine the visual similarity between the visual feature of the candidate content resource and the visual feature of the preset picture according to the visual similarity between the visual feature of each sampled picture corresponding to the candidate content resource and the visual feature of the preset picture.

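The sampling sub-module 631 is specifically configured to: perform view sampling on the candidate content resource in the visible space according to a preset observation manner and sampling manner, wherein the preset observation manner is based on at least one of an observation position, an angle, and a visible range, or a combination thereof.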

The technical solution of the embodiment of the present application can sample the content resources using multiple observation manners and sampling manners, so that the accuracy of searching for content resources is high. The technical effect is the same as that of the second embodiment, and details are not described herein again.

Embodiment 6

The embodiment of the present application provides a device for searching for content resources. As shown in FIG. 8, the device includes a memory 81 and a processor 82. The memory 81 stores a computer program executable on the processor 82. When executing the computer program, the processor 82 implements the method for searching for content resources in the above embodiments. The number of memories 81 and processors 82 may each be one or more.

The device also includes:

The communication interface 83 is used for communication between the memory 81 and the processor 82 and an external device.

The memory 81 may include a high speed RAM memory and may also include a non-volatile memory such as at least one disk memory.

If the memory 81, the processor 82, and the communication interface 83 are implemented independently, they can be connected to each other through a bus and communicate with each other over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus. The bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in FIG. 8, but this does not mean that there is only one bus or one type of bus.

Optionally, in a specific implementation, if the memory 81, the processor 82, and the communication interface 83 are integrated on one chip, the memory 81, the processor 82, and the communication interface 83 can complete communication with each other through the internal interface.

Embodiment 7

A computer readable storage medium storing a computer program that, when executed by a processor, implements the method of any of the embodiments of FIGS. 1-5.

In the description of the present specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in the specification, as well as the features of those embodiments or examples, may be combined with one another.

Moreover, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more, unless explicitly and specifically defined otherwise.

Any process or method description in the flowcharts or otherwise described herein may be understood to represent a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a particular logical function or process. The scope of the preferred embodiments of the present application also includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present application pertain.

The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (e.g., a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any apparatus that can contain, store, communicate, propagate, or transport a program for use in an instruction execution system, apparatus, or device, or in conjunction with such an instruction execution system, apparatus, or device.

The computer readable medium described in the embodiments of the present application may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. More specific examples (a non-exhaustive list) of computer readable storage media include at least the following: an electrical connection (electronic device) having one or more wires, a portable computer disk cartridge (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer readable storage medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or, if necessary, processing it in another suitable manner, and then stored in computer memory.

In an embodiment of the present application, a computer readable signal medium may include a data signal that is propagated in baseband or as part of a carrier wave and that carries computer readable program code. Such a propagated data signal can take a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing. The computer readable signal medium can also be any computer readable medium other than a computer readable storage medium that can transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), and the like, or any suitable combination of the foregoing.

It should be understood that portions of the application can be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), etc.

One of ordinary skill in the art can understand that all or part of the steps of the methods of the above embodiments can be completed by a program instructing related hardware. The program can be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module. The above integrated modules can be implemented in the form of hardware or in the form of software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as separate products, may also be stored in a computer readable storage medium. The storage medium may be a read only memory, a magnetic disk or an optical disk, or the like.

The above description is only a specific embodiment of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can easily think of various changes or substitutions within the technical scope disclosed in the present application, and these should be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application should be determined by the scope of the claims.

Claims

A method for searching a content resource, the method comprising:

Obtaining a preset picture and at least one candidate content resource, and separately extracting text features and visual features of the preset picture and text features and visual features of the candidate content resource;

Determining a text similarity between a text feature of the preset picture and a text feature of the candidate content resource;

Determining a visual similarity between a visual feature of the preset picture and a visual feature of the candidate content resource;

The target content resource is determined from the at least one candidate content resource based on the determined text similarity and visual similarity.
The method according to claim 1, wherein determining the target content resource from the at least one candidate content resource according to the determined text similarity and visual similarity comprises:

Obtaining an overall similarity between the preset picture and the candidate content resource according to the determined text similarity and visual similarity;

The target content resource is determined from the at least one candidate content resource based on the obtained overall similarity.
The method according to claim 2, wherein the overall similarity between the preset picture and the candidate content resource is obtained according to the determined text similarity and visual similarity, including:

Selecting the determined text similarity and visual similarity based on a preset threshold;

Obtaining an overall similarity between the preset picture and the candidate content resource according to the selected text similarity and visual similarity.
The method according to claim 1, wherein extracting text features of the preset picture comprises:

Identifying a picture content included in the preset picture by using a preset picture classification model to extract a text feature from the preset picture; or

Obtaining corresponding webpage content according to the uniform resource locator of the preset image, and extracting text features of the preset image from the webpage content.
The method according to any one of claims 1 to 4, wherein the determining a text similarity between a text feature of the preset picture and a text feature of the candidate content resource comprises:

Obtaining a content resource in the preset content resource library as the at least one candidate content resource, and determining a text similarity between a text feature of the preset image and a text label of the candidate content resource, wherein the preset content resource library includes a plurality of content resources and their corresponding text labels.
The method according to any one of claims 1 to 4, wherein the determining a visual similarity between a visual feature of the preset picture and a visual feature of the candidate content resource comprises:

Sampling the candidate content resource to obtain at least one sample picture corresponding to the content resource;

For each sampled picture, determining a visual similarity between a visual feature of the sampled picture and a visual feature of the preset picture;

Determining a visual similarity between a visual feature of the candidate content resource and a visual feature of the preset image according to the visual similarity between a visual feature of each sampled picture corresponding to the candidate content resource and a visual feature of the preset image.
The method according to claim 6, wherein sampling the candidate content resources comprises:

In the visible space, the candidate content resources are subjected to perspective sampling in a preset observation manner and a sampling manner;

Wherein, the preset observation mode is based on at least one of an observation position, an angle, and a visible range, and a combination thereof.
An apparatus for searching for a content resource, the apparatus comprising:

An acquiring module, configured to acquire a preset picture and at least one candidate content resource, and separately extract text features and visual features of the preset picture and text features and visual features of the candidate content resource;

a first determining module, configured to determine a text similarity between a text feature of the preset picture and a text feature of the candidate content resource;

a second determining module, configured to determine a visual similarity between a visual feature of the preset picture and a visual feature of the candidate content resource;

The target determining module is configured to determine the target content resource from the at least one candidate content resource according to the determined text similarity and visual similarity.
The apparatus according to claim 8, wherein the target determining module comprises:

a first calculation submodule configured to obtain an overall similarity between the preset picture and the candidate content resource according to the determined text similarity and visual similarity;

The target determining submodule is configured to determine the target content resource from the at least one candidate content resource according to the obtained overall similarity.
The apparatus according to claim 9, wherein the first calculation submodule is further configured to:

Selecting the determined text similarity and visual similarity based on a preset threshold;

Obtaining an overall similarity between the preset picture and the candidate content resource according to the selected text similarity and visual similarity.
The device according to claim 8, wherein the obtaining module comprises:

An identification submodule configured to identify a picture content included in the preset picture by using a preset picture classification model to extract a text feature from the preset picture; or

The extracting sub-module is configured to obtain a corresponding webpage content according to the uniform resource locator of the preset image, and extract a text feature of the preset image from the webpage content.
The apparatus according to any one of claims 8 to 12, wherein the first determining module comprises:

a first determining submodule configured to obtain a content resource in the preset content resource library as the at least one candidate content resource, and determine a text similarity between the text feature of the preset image and the text label of the candidate content resource, wherein the preset content resource library includes a plurality of content resources and their corresponding text labels.
The apparatus according to any one of claims 8 to 12, wherein the second determining module comprises:

a sampling sub-module configured to sample the candidate content resource to obtain at least one sample picture corresponding to the content resource;

a second determining submodule configured to determine a visual similarity between a visual feature of the sampled picture and a visual feature of the preset picture for each sampled picture;

a third determining submodule configured to determine a visual similarity between a visual feature of the candidate content resource and a visual feature of the preset picture according to the visual similarity between a visual feature of each sampled picture corresponding to the candidate content resource and a visual feature of the preset picture.
The apparatus of claim 13, wherein the sampling submodule is further configured to:

In the visible space, the candidate content resources are subjected to perspective sampling in a preset observation manner and a sampling manner;

Wherein, the preset observation mode is based on at least one of an observation position, an angle, and a visible range, and a combination thereof.
A server, wherein the server comprises:

One or more processors;

a storage device for storing one or more programs;

The one or more processors are caused to perform the method of any one of claims 1-7 when the one or more programs are executed by the one or more processors.
A computer readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method of any of claims 1-7.