Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for retrieving multimedia resources, which can more fully retrieve multimedia resources satisfying the retrieval condition, thereby better satisfying the retrieval requirement of the multimedia resources.
Based on the above purpose, the present invention provides a multimedia resource retrieval method, which includes:
receiving a query request sent by a user;
searching in a multimedia resource search library according to the query request, and returning a search result;
wherein multi-modal information of a plurality of multimedia resources is stored in the multimedia resource search library.
Preferably, the multimedia resource search library further stores: cataloging information for each multimedia asset.
Wherein the multimodal information of the multimedia resource comprises textual information; and
the text information is pre-stored in the multimedia resource search library as follows:
identifying text information from a video of the multimedia resource;
and storing the identified text information into the multimedia resource search library.
Wherein the multimodal information of the multimedia resource comprises voice information; wherein the voice information is pre-stored in the multimedia resource search library in an audio compression coding form and/or a text form as follows:
extracting audio from the multimedia resource, performing voice recognition, converting the audio into text content, and storing the text content obtained by conversion into the multimedia resource retrieval library as voice information of the multimedia resource in a text form; and/or
extracting audio from the multimedia resource, further extracting features of the audio, performing compression coding on the extracted audio features to obtain the voice information of the multimedia resource in an audio compression coding form, and storing the obtained voice information into the multimedia resource search library.
Wherein the multimodal information of the multimedia resource comprises image information; wherein the image information is pre-stored in the multimedia resource search library in a pixel compression coding form and/or a text form as follows:
extracting key frames from the video of the multimedia resource, carrying out image content description and/or image object labeling on the key frames, and storing the text content obtained by image content description and/or image object labeling into the multimedia resource search library as image information of the multimedia resource in a text form; and/or
extracting key frames from the video of the multimedia resource, extracting picture pixel features of the key frames, performing compression coding, and storing the resulting image information of the multimedia resource in a pixel compression coding form into the multimedia resource search library.
Wherein, the searching in the multimedia resource search library according to the query request comprises:
analyzing the query request to obtain a keyword set K of the query request;
expanding the keyword set K to obtain an expanded keyword set K';
and searching in the multimedia resource search library according to the expanded keyword set K'.
Or, the retrieving in the multimedia resource retrieval library according to the query request includes:
analyzing the query request to obtain an audio clip in the query request;
and retrieving, according to the audio clip, in the voice information in the audio compression coding form in the multimedia resource search library.
Or, the retrieving in the multimedia resource retrieval library according to the query request includes:
analyzing the query request to obtain a picture in the query request;
and retrieving, according to the picture, in the image information in the pixel compression coding form in the multimedia resource search library.
Further, after the retrieval is performed in the multimedia resource retrieval library according to the query request, the method further includes:
for the same multimedia resource, obtaining the matching degrees of the cataloguing information of the multimedia resource and of the information in each modality with respect to the query request;
carrying out a weighted average of the matching degrees of the cataloguing information and of the information in the different modalities, and taking the obtained weighted average as the score of the multimedia resource for the query request;
sorting the multimedia resources in descending order of their scores;
and taking the sorted result of the multimedia resources as the retrieval result.
The invention also provides a multimedia resource retrieval device, comprising:
the multimedia resource search library is used for storing multi-modal information of a plurality of multimedia resources;
the query request receiving module is used for receiving a query request sent by a user;
and the retrieval module is used for retrieving in the multimedia resource retrieval library according to the query request and returning a retrieval result.
Further, the multimedia resource search library further stores: cataloging information for each multimedia asset.
Wherein the multi-modal information of the multimedia resource comprises at least one of the following information: text information, voice information, image information; the voice information is pre-stored in the multimedia resource search library in an audio compression coding form and/or a text form; the image information is pre-stored in the multimedia resource search library in a pixel compression coding form and/or a text form.
Further, the apparatus further comprises: a multimodal information storage module; and
the multi-modal information storage module comprises at least one of the following units:
the text information storage unit is used for identifying text information from the video of the multimedia resource; storing the identified text information into the multimedia resource search library;
the voice information storage unit is used for extracting audio from the multimedia resources, performing voice recognition on the audio, converting the audio into text contents, and storing the text contents obtained through conversion into the multimedia resource retrieval library as voice information of the multimedia resources in a text form; and/or extracting audio from the multimedia resource, further extracting the characteristics of the audio and performing compression coding on the extracted audio characteristics to obtain voice information of the multimedia resource in an audio compression coding form, and storing the obtained voice information of the multimedia resource in the audio compression coding form into the multimedia resource retrieval library;
the image information storage unit is used for extracting key frames from the video of the multimedia resources, carrying out image content description and/or image object labeling on the key frames, and storing the text content obtained by image content description and/or the text content obtained by image object labeling into the multimedia resource retrieval library as the image information of the text form of the multimedia resources; and/or extracting key frames from the video of the multimedia resources, extracting picture pixel characteristics of the key frames, performing compression coding, and storing image information of the multimedia resources in a pixel compression coding form into the multimedia resource search library.
In the technical scheme of the invention, the multi-mode information of the multimedia resources is stored in the multimedia resource retrieval library, retrieval is carried out in the multimedia resource retrieval library according to the query request, and retrieval can be carried out based on information richer than cataloged information, so that the multimedia resources meeting retrieval conditions can be retrieved more fully, and the retrieval requirements of the multimedia resources are better met.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities that share the same name but are not identical, or two parameters that are not identical. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.
The inventors consider that multi-modal information, such as text, speech, images, etc., is contained in a multimedia asset (video). If the information is utilized during retrieval, the multimedia resources meeting the retrieval conditions can be retrieved more fully, thereby better meeting the retrieval requirements of the multimedia resources.
The technical scheme of the invention is described in detail in the following with reference to the accompanying drawings.
Based on the above thought, in order to utilize the multimodal information of the multimedia resources during the retrieval, in the technical solution of the embodiment of the present invention, the stored multimedia resources are preprocessed, and the multimodal information is extracted from the multimedia resources and stored in the multimedia resource retrieval library. In the multimedia resource search library provided in the embodiment of the present invention, the multi-modal information of each multimedia resource may include at least one of the following information: text information, voice information, image information. The multi-modal information of the multimedia resources is pre-stored in a multimedia resource search library, wherein the voice information is pre-stored in the multimedia resource search library in an audio compression coding mode and/or a text mode; the image information is pre-stored in the multimedia resource search library in a pixel compression coding mode and/or a text mode. How to acquire and store information of multiple modalities will be described in detail later. Of course, it is preferable that the cataloging information of the multimedia assets is also stored in the multimedia asset repository.
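As an illustration of how one entry of such a search library might be organized, the following is a minimal sketch. The class name `ResourceRecord`, its field names, and the sample values are assumptions introduced for illustration only and are not part of the described embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ResourceRecord:
    """One entry of the multimedia resource search library (hypothetical layout)."""
    resource_id: str
    catalog_info: str = ""                                   # optional cataloging information
    text_info: List[str] = field(default_factory=list)       # text recognized in the video
    speech_text: List[str] = field(default_factory=list)     # voice information, text form
    speech_code: Optional[bytes] = None                      # voice information, audio compression coding form
    image_text: List[str] = field(default_factory=list)      # image information, text form
    image_codes: List[bytes] = field(default_factory=list)   # image information, pixel compression coding form

    def modalities(self) -> List[str]:
        """Return the names of the modalities actually stored for this resource."""
        present = []
        if self.text_info:
            present.append("text")
        if self.speech_text or self.speech_code is not None:
            present.append("speech")
        if self.image_text or self.image_codes:
            present.append("image")
        return present


# The library itself is modeled here as a plain dictionary keyed by resource id.
library: Dict[str, ResourceRecord] = {}
record = ResourceRecord(
    resource_id="video-001",
    catalog_info="cooking show, episode 3",
    text_info=["tomato soup recipe"],
    speech_text=["today we cook tomato soup"],
)
library[record.resource_id] = record
print(record.modalities())  # ['text', 'speech']
```

Keeping the text forms and the compression-coded forms in separate fields mirrors the "and/or" storage options described above: a resource may carry either form, both, or neither for a given modality.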
Based on the above multimedia resource search library, the multimedia resource search method provided in the embodiment of the present invention has a process as shown in fig. 1, and includes the following steps:
S101: receiving a query request sent by a user.
In this step, the received query request may include a keyword to be queried, an audio clip to be queried, or a picture to be queried.
S102: retrieving in the multimedia resource search library according to the query request.
In this step, for a query request including a keyword to be queried, the query request may first be analyzed to obtain a keyword set K of the query request; for example, the query request may be analyzed using techniques such as word segmentation (including Chinese word segmentation), named entity recognition, and sentiment analysis, so as to obtain the keyword set K of the query request.
Further, expanding the keyword set K to obtain an expanded keyword set K'; for example, the keyword set K may be expanded by a method such as a knowledge graph or synonym expansion.
Then, searching in the multi-modal information of the multimedia resource search library according to the expanded keyword set K'; or, searching in the multi-modal information and cataloguing information of the multimedia resource search library according to the expanded keyword set K'.
The keyword set is expanded here to improve the completeness of the query. For example, if the user's query request includes a word meaning "tomato", the technical solution of the present invention can also retrieve videos whose content contains a synonym of that word. That is, because the search is performed according to the expanded keyword set, more search results related to the query condition in the query request can be obtained.
Methods for searching according to the keyword set are well known to those skilled in the art, and are not described herein.
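The keyword path of step S102 can be sketched as follows. This is a simplified illustration under stated assumptions: the synonym table `SYNONYMS` stands in for a knowledge graph or thesaurus, whitespace splitting stands in for word segmentation, substring matching stands in for a real text search, and all function names and sample data are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical synonym table standing in for a knowledge graph or thesaurus.
SYNONYMS: Dict[str, Set[str]] = {"tomato": {"love apple"}}


def parse_keywords(query: str) -> Set[str]:
    """Stand-in for word segmentation: lowercase and split on whitespace."""
    return set(query.lower().split())


def expand_keywords(keywords: Set[str], synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Build the expanded set K' by adding known synonyms of each keyword in K."""
    expanded = set(keywords)
    for word in keywords:
        expanded |= synonyms.get(word, set())
    return expanded


def search(expanded: Set[str], library: Dict[str, List[str]]) -> List[str]:
    """Return ids of resources whose stored text contains any expanded keyword."""
    hits = []
    for resource_id, texts in library.items():
        joined = " ".join(texts).lower()
        if any(word in joined for word in expanded):
            hits.append(resource_id)
    return sorted(hits)


# Sample library: resource id -> text-form multi-modal information (assumed data).
library = {
    "v1": ["fresh tomato salad"],
    "v2": ["the love apple harvest"],
    "v3": ["potato stew"],
}
K = parse_keywords("Tomato")
K_expanded = expand_keywords(K, SYNONYMS)
print(search(K_expanded, library))  # ['v1', 'v2']
```

Searching with K alone would return only `v1`; the expanded set K' additionally reaches `v2`, which is the completeness gain described above.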
In this step, for a query request including an audio clip to be queried, the query request is first analyzed to obtain the audio clip in the query request; then, according to the audio clip, retrieval is performed in the voice information in the audio compression coding form in the multimedia resource search library: audio features of the audio clip are extracted and compression-coded, and a clustering algorithm is then used to search the multimedia resource search library for similar voice information in the audio compression coding form.
In this step, for a query request including a picture to be queried, the query request is first analyzed to obtain the picture in the query request; then, according to the picture, retrieval is performed in the image information in the pixel compression coding form in the multimedia resource search library: picture pixel features of the picture are extracted and compression-coded, and a clustering algorithm is then used to search the multimedia resource search library for similar image information in the pixel compression coding form.
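The embodiment does not specify which clustering algorithm is used. The following minimal sketch assumes a plain k-means grouping of the stored codes and assumes that each compression-coded feature is a short numeric vector. The query is coded the same way and the members of its nearest cluster are returned as similar candidates. All names and values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Sequence[float], centroids: List[Vector]) -> int:
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points: List[Vector], k: int, iterations: int = 10) -> List[Vector]:
    """Plain k-means with a deterministic start (the first k points)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster becomes empty
                centroids[i] = tuple(sum(c) / len(group) for c in zip(*group))
    return centroids


def build_index(codes: Dict[str, Vector], k: int):
    """Cluster the stored codes and record which resources fall in each cluster."""
    centroids = kmeans(list(codes.values()), k)
    buckets: Dict[int, List[str]] = {i: [] for i in range(k)}
    for resource_id, code in codes.items():
        buckets[nearest(code, centroids)].append(resource_id)
    return centroids, buckets


def query_similar(query_code: Vector, centroids: List[Vector], buckets: Dict[int, List[str]]) -> List[str]:
    """Return the resources stored in the cluster nearest to the query code."""
    return buckets[nearest(query_code, centroids)]


# Assumed compressed feature codes of four stored resources.
codes = {"a": (0.0, 0.1), "b": (5.0, 5.1), "c": (0.2, 0.0), "d": (5.2, 4.9)}
centroids, buckets = build_index(codes, k=2)
print(query_similar((0.1, 0.1), centroids, buckets))  # ['a', 'c']
```

The same lookup applies to both the audio path and the picture path; only the feature extraction and compression coding that produce the vectors differ.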
Further, after retrieval is performed in the multi-modal information and the cataloguing information of the multimedia resource search library, the following can be obtained for the same multimedia resource: the matching degree between the cataloguing information of the multimedia resource and the query request, and the matching degree between the information of each modality (i.e., text information, voice information, image information) and the query request. These matching degrees are weighted-averaged, and the obtained weighted average is used as the score of the multimedia resource for the query request. The multimedia resources are sorted in descending order of score, and the sorted result is taken as the retrieval result.
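The scoring and ordering described above can be sketched as follows. The weights and matching degrees are made-up values for illustration; the embodiment does not prescribe specific weights, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed weights for the cataloguing information and each modality.
WEIGHTS: Dict[str, float] = {"catalog": 0.4, "text": 0.2, "speech": 0.2, "image": 0.2}


def score(match_degrees: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the matching degrees that are available for one resource."""
    total_weight = sum(weights[name] for name in match_degrees)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * degree for name, degree in match_degrees.items())
    return weighted_sum / total_weight


def rank(all_degrees: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score every resource and sort in descending order of score."""
    scored = [(rid, score(degrees, weights)) for rid, degrees in all_degrees.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Assumed matching degrees (0..1) of each resource against one query request.
degrees = {
    "v1": {"catalog": 1.0, "text": 0.5, "speech": 0.0, "image": 0.5},
    "v2": {"catalog": 0.0, "text": 1.0, "speech": 1.0, "image": 0.0},
    "v3": {"catalog": 0.5, "text": 0.5, "speech": 0.5, "image": 0.5},
}
ranking = rank(degrees, WEIGHTS)
print([rid for rid, _ in ranking])  # ['v1', 'v3', 'v2']
```

Dividing by the sum of the weights actually used means that a resource lacking one modality is scored over the modalities it has, which is one possible design choice among several.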
S103: returning the retrieval result.
After the retrieval result matched with the query condition in the query request is obtained, the retrieval result is returned to the user, and the user can know the multimedia resource meeting the query condition or the multimedia resource meeting the condition similar to the query condition.
The multi-modal information of each multimedia resource in the multimedia resource search library is obtained and stored in advance, wherein a specific method flow for obtaining and storing text information of multimedia resources provided by the embodiment of the present invention is shown in fig. 2, and includes the following steps:
S201: identifying text information from the video of the multimedia resource.
Specifically, the image frames with high similarity in the multimedia resource may be deduplicated, and the image frames of the multimedia resource video after deduplication may be subjected to character recognition.
S202: storing the identified text information into the multimedia resource search library.
In this step, preferably, the identified text information may be subjected to deduplication processing, and the deduplicated text information is stored in the multimedia resource search library. The deduplication processing is beneficial to removing a large amount of redundant information, and the space of the multimedia resource search library is saved.
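The deduplication of recognized text before storage can be sketched as follows. The normalization rule (ignoring case and runs of whitespace) is an assumption made for this sketch; the embodiment does not specify how duplicates are detected.

```python
from typing import List


def deduplicate(lines: List[str]) -> List[str]:
    """Keep the first occurrence of each recognized text line, ignoring case and spacing."""
    seen = set()
    unique = []
    for line in lines:
        key = " ".join(line.lower().split())  # normalized form used only for comparison
        if key and key not in seen:
            seen.add(key)
            unique.append(line.strip())
    return unique


# Assumed character recognition output from consecutive frames showing the same caption.
recognized = ["Breaking News", "breaking  news", "", "Weather Report", "Breaking News "]
print(deduplicate(recognized))  # ['Breaking News', 'Weather Report']
```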
The specific method flow for pre-acquiring and storing the voice information of the multimedia resource provided by the embodiment of the invention is shown in fig. 3, and comprises the following steps:
S301: extracting audio from the multimedia resource.
S302: performing voice recognition on the extracted audio to convert it into text content; and/or further extracting features of the audio and performing compression coding on the extracted audio features to obtain voice information of the multimedia resource in an audio compression coding form.
S303: storing the text content obtained by conversion into the multimedia resource search library as the voice information of the multimedia resource in a text form; and/or storing the voice information of the multimedia resource in the audio compression coding form, obtained after compression coding, into the multimedia resource search library.
In this step, preferably, text summarization is performed on the text content obtained by conversion, and the summarized text content is stored in the multimedia resource search library as the voice information of the multimedia resource; and/or
In this step, the voice information in the form of audio compression coding of the multimedia resource obtained after compression coding in step S302 is stored in the multimedia resource search library.
Generally, the amount of speech content in a multimedia resource is large, but only part of it is useful. Therefore, the converted text content is summarized to remove content without practical significance, and the summarized text content is then added to the multimedia resource search library. In this way, a large amount of redundant information can be removed, and space in the multimedia resource search library is saved.
The specific method flow for acquiring and storing the image information of the multimedia resource in advance provided by the embodiment of the invention is shown in fig. 4, and comprises the following steps:
S401: extracting key frames from the video of the multimedia resource.
The video of a multimedia resource is composed of successive picture frames, and the semantic information contained in these pictures is crucial for understanding the video content. The system therefore first extracts key frames from the video.
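The embodiment does not state how key frames are selected. One simple possibility, shown here purely as an assumption, is to keep a frame whenever it differs sufficiently from the last kept frame. Frames are modeled as flat lists of grayscale pixel values, and the threshold is an arbitrary illustrative value.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Average absolute pixel difference between two frames of equal size."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def extract_key_frames(frames: List[Sequence[int]], threshold: float) -> List[int]:
    """Return indices of frames that differ from the previous key frame by more than threshold."""
    if not frames:
        return []
    key_indices = [0]  # the first frame is always kept
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[key_indices[-1]]) > threshold:
            key_indices.append(i)
    return key_indices


# Five tiny 4-pixel frames: two near-identical dark frames, two bright frames, one dark frame.
frames = [
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [200, 200, 200, 200],
    [201, 199, 200, 200],
    [10, 10, 10, 10],
]
print(extract_key_frames(frames, threshold=20))  # [0, 2, 4]
```

The same comparison also serves to drop highly similar frames before character recognition, so that near-identical frames are not processed repeatedly.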
S402: performing image content description and/or image object labeling on the extracted key frames, and/or extracting picture pixel features of the key frames and performing compression coding.
In this step, image content description is performed on each key frame to generate text content describing the key frame, and/or image object labeling is performed on each key frame to obtain the text content of the image object labels. Specifically, image content description can be performed on the key frame using artificial intelligence techniques such as deep learning, so as to obtain the descriptive text content; image object labeling on the key frame specifically refers to attaching text labels to the objects identified in the key frame; and/or
In this step, the picture pixel features of each key frame are extracted and compression-coded to obtain the image information of the multimedia resource in the pixel compression coding form.
S403: storing the text content obtained by image content description and/or the text content obtained by image object labeling into the multimedia resource search library as the image information of the multimedia resource in the text form, and/or storing the obtained image information of the multimedia resource in the pixel compression coding form into the multimedia resource search library.
In this step, preferably, the text content obtained by image content description and/or the text content obtained by image object labeling may be deduplicated, and the deduplicated text content is stored in the multimedia resource search library as the image information of the multimedia resource in the text form; and/or
In this step, the obtained image information of the multimedia resource in the form of pixel compression coding is stored in the multimedia resource search library.
Based on the above method, an internal block diagram of a multimedia resource retrieval device provided in an embodiment of the present invention is shown in fig. 5, and includes: a multimedia resource search library 501, a query request receiving module 502 and a search module 503.
The multimedia resource search library 501 is used for storing multi-modal information of a plurality of multimedia resources; preferably, the multimedia resource search library 501 may further have stored therein: cataloging information for each multimedia asset. Wherein the multi-modal information of the multimedia resource comprises at least one of the following information: text information, voice information, image information.
The query request receiving module 502 is used for receiving a query request sent by a user.
The retrieval module 503 is configured to perform retrieval in the multimedia resource retrieval library 501 according to the query request received by the query request receiving module 502, and return a retrieval result.
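The cooperation of the search library 501, the query request receiving module 502 and the retrieval module 503 can be sketched as follows. The class names, the in-memory dictionary used as the library, and the substring matching are all assumptions made for illustration and do not represent the actual device.

```python
from typing import Dict, List


class QueryRequestReceiver:
    """Stand-in for the query request receiving module 502."""

    def receive(self, raw_request: str) -> str:
        return raw_request.strip()


class RetrievalModule:
    """Stand-in for the retrieval module 503, using simple substring matching."""

    def __init__(self, library: Dict[str, List[str]]) -> None:
        self.library = library

    def retrieve(self, query: str) -> List[str]:
        needle = query.lower()
        return sorted(
            resource_id
            for resource_id, texts in self.library.items()
            if any(needle in text.lower() for text in texts)
        )


class RetrievalDevice:
    """Wires the library (501), the receiver (502) and the retrieval module (503) together."""

    def __init__(self, library: Dict[str, List[str]]) -> None:
        self.library = library
        self.receiver = QueryRequestReceiver()
        self.retriever = RetrievalModule(library)

    def handle(self, raw_request: str) -> List[str]:
        return self.retriever.retrieve(self.receiver.receive(raw_request))


# Assumed library content: resource id -> stored text-form multi-modal information.
device = RetrievalDevice({"v1": ["Tomato soup recipe"], "v2": ["City traffic report"]})
print(device.handle("  tomato "))  # ['v1']
```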
Preferably, the retrieving module 503 is configured to analyze the query request to obtain a keyword set K of the query request; expanding the keyword set K to obtain an expanded keyword set K'; and searching in the multimedia resource search library according to the expanded keyword set K'. The specific retrieving method of the retrieving module 503 may refer to the content in the step S102, and is not described herein again.
Further, after retrieving in the multi-modal information and the cataloguing information of the multimedia resource search library according to the expanded keyword set K', for the same multimedia resource, the retrieval module 503 may obtain the matching degree between the cataloguing information of the multimedia resource and the query request and the matching degree between the information of each modality and the query request, perform a weighted average of these matching degrees, and take the obtained weighted average as the score of the multimedia resource for the query request. The retrieval results are returned to the user in descending order of score.
Alternatively, the retrieval module 503 may be further configured to analyze the query request and obtain an audio clip in the query request; and, according to the audio clip, to retrieve in the voice information in the audio compression coding form in the multimedia resource search library.
Alternatively, the retrieval module 503 may be further configured to analyze the query request and obtain a picture in the query request; and, according to the picture, to retrieve in the image information in the pixel compression coding form in the multimedia resource search library.
Further, the apparatus for retrieving a multimedia resource provided in an embodiment of the present invention may further include: a multimodal information storage module 504;
the multimodal information storage module 504 includes at least one of the following: a text information storage unit 511, a voice information storage unit 512, and an image information storage unit 513.
The text information storage unit 511 is used for identifying text information from the video of the multimedia resource; the recognized text information is stored in the multimedia resource search library 501. The specific method for acquiring and storing the text information of the multimedia resource by the text information storage unit 511 can refer to the above-mentioned steps shown in fig. 2, and will not be described herein again.
The voice information storage unit 512 is configured to extract audio from the multimedia resource, perform voice recognition, convert the audio into text content, and store the text content obtained through conversion into the multimedia resource search library as voice information of the multimedia resource in a text form; and/or to extract audio from the multimedia resource, further extract features of the audio, perform compression coding on the extracted audio features to obtain the voice information of the multimedia resource in the audio compression coding form, and store the obtained voice information in the audio compression coding form into the multimedia resource search library 501. The specific method for acquiring and storing the voice information of the multimedia resource by the voice information storage unit 512 can refer to the above steps shown in fig. 3, and is not described herein again.
The image information storage unit 513 extracts a key frame from the video of the multimedia resource, performs image content description and/or image object labeling on the key frame, and stores text content obtained through image content description and/or text content obtained through image object labeling into the multimedia resource search library as image information of the text form of the multimedia resource; and/or extracting a key frame from the video of the multimedia resource, extracting picture pixel characteristics of the key frame, performing compression coding, and storing image information of the multimedia resource in a pixel compression coding form into the multimedia resource search library 501. The specific method for acquiring and storing the image information of the multimedia resource by the image information storage unit 513 can refer to the steps shown in fig. 4, which is not described herein again.
In the technical scheme of the invention, the multi-mode information of the multimedia resources is stored in the multimedia resource retrieval library, retrieval is carried out in the multimedia resource retrieval library according to the query request, and retrieval can be carried out based on information richer than cataloged information, so that the multimedia resources meeting retrieval conditions can be retrieved more fully, and the retrieval requirements of the multimedia resources are better met.
Those of skill in the art will appreciate that various operations, methods, steps in the processes, acts, or solutions discussed in the present application may be alternated, modified, combined, or deleted. Further, various operations, methods, steps in the flows, which have been discussed in the present application, may be interchanged, modified, rearranged, decomposed, combined, or eliminated. Further, steps, measures, schemes in the various operations, methods, procedures disclosed in the prior art and the present invention can also be alternated, changed, rearranged, decomposed, combined, or deleted.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.