CN106844410B - Determining quality of a summary of multimedia content - Google Patents


Info

Publication number
CN106844410B
Authority
CN
China
Prior art keywords
content
text
metric
determining
multimedia content
Prior art date
Legal status
Active
Application number
CN201610877283.7A
Other languages
Chinese (zh)
Other versions
CN106844410A
Inventor
N. Modani
V. Subramanian
S. Gupta
P. R. Maneriker
G. Hiranandani
A. R. Sinha
Utpal
Current Assignee
Adobe Inc
Original Assignee
Adobe Systems Inc
Priority date
Filing date
Publication date
Application filed by Adobe Systems Inc filed Critical Adobe Systems Inc
Publication of CN106844410A
Application granted
Publication of CN106844410B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 - Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 - Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 - Indexing; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 - Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/98 - Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G06V10/993 - Evaluation of the quality of the acquired pattern
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47 - Detecting features for summarising video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to determining the quality of a summary of multimedia content. A quality metric for a multimedia summary of a multimedia content item is determined based in part on the semantic similarity between the summary and the content item, rather than solely on word frequency. This is accomplished in some embodiments by identifying the semantic meaning of both the summary and the multimedia content item using vector analysis. The vectors of the summary and the vectors of the multimedia content item are compared to determine semantic similarity. In other examples, the quality metric of the multimedia summary is determined based in part on the coherence between the image portion of the summary and the text portion of the summary.

Description

Determining quality of a summary of multimedia content
Technical Field
The present disclosure relates generally to characterizing multimedia content. In particular, the present disclosure relates to determining the quality of a summary of multimedia content, where both the summary and the multimedia content include text and images.
Background
Multimedia content refers primarily to digital content that includes some combination of different content forms, such as text and images (video, animation, graphics, etc.). Such multimedia content is so pervasive and inexpensive that users are often overwhelmed by the process of selecting multimedia content items for consumption. Because of this, users of multimedia content often rely on summaries of multimedia content items. These summaries may be consumed in place of the multimedia content items themselves or may facilitate the selection of multimedia content items to be consumed. Thus, the quality of a multimedia summary can have a significant impact on an intended reader's decision to consume a given content item. However, there is currently no suitable method for assessing the quality of a multimedia summary.
Drawings
Fig. 1 is a high-level flow diagram illustrating a method for determining a quality metric for a summary corresponding to a multimedia content item according to one embodiment of the present disclosure.
Fig. 2 is a detailed flow diagram illustrating a method for determining a quality metric of a summary corresponding to a multimedia content item according to one embodiment of the present disclosure.
Fig. 3 is a block diagram of a distributed processing environment including a quality metric determination system remotely coupled to a given user's computing device by a communication network according to one embodiment of the present disclosure.
Fig. 4 is a block diagram of a quality metric determination system for determining the quality of a multimedia summary of a multimedia content item according to one embodiment of the present disclosure.
The figures depict various embodiments of the present disclosure for purposes of example only. Many variations, configurations, and other embodiments will become apparent from the following detailed discussion.
Detailed Description
As previously indicated, there is no technique for evaluating the quality of a given multimedia summary. However, such summaries may have a significant impact on the intended user's decisions, including whether to consume the full version of the summarized digital content item. Therefore, techniques for assessing the quality of a summary of a multimedia content item are desirable from a market development perspective. Consider, for example, a digital article having both image and text portions. As will be appreciated in light of this disclosure, a summary of the article with a high degree of coherence between the image and text portions may help the reader reach a better understanding of the article faster than a summary lacking coherence between the image and text portions. In a more general sense, the degree to which the summary represents the corresponding multimedia content item may be quantified as a quality metric. The quality metric of the summary may then be used, for example, to gauge the likelihood that the summary will be effective in causing consumption of the content item itself. Although some available algorithms may be used to evaluate the textual portion of a given multimedia summary (or simply a "summary" herein, for brevity) of a multimedia content item, such algorithms fail to account for the non-textual portion of the summary. In particular, an algorithm used to evaluate the content will likely operate by comparing the word frequencies in the text portion of the multimedia content with the word frequencies in the corresponding summary. The more similar the word frequencies of the summary are to the word frequencies in the multimedia content item, the higher the quality score. Examples of such algorithms include retention rate (which may operate, for example, by dividing the number of unique words in the summary by the number of unique words in the multimedia content item), KL divergence (which may operate, for example, by comparing the distributions of word frequencies in the content and the corresponding summary), bilingual evaluation understudy ("BLEU") (which determines the quality of text machine-translated from one language to another), and recall-oriented understudy for gisting evaluation ("ROUGE") (which uses a human-generated summary as a reference to determine the quality of the summary).
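For illustration, the following minimal sketch (not part of the patented method; the naive whitespace tokenization is an assumption made here) shows how the retention rate and KL divergence baselines described above operate purely on word frequencies:

```python
import math
from collections import Counter

def retention_rate(summary: str, document: str) -> float:
    """Unique words in the summary divided by unique words in the document."""
    return len(set(summary.lower().split())) / len(set(document.lower().split()))

def kl_divergence(summary: str, document: str, eps: float = 1e-12) -> float:
    """KL divergence of the summary's word distribution from the document's."""
    q = Counter(summary.lower().split())   # word frequencies in the summary
    p = Counter(document.lower().split())  # word frequencies in the document
    qn, pn = sum(q.values()), sum(p.values())
    return sum((c / qn) * math.log((c / qn) / (p.get(w, 0) / pn + eps))
               for w, c in q.items())

document = "this girl dislikes cheese . this girl dislikes cheese ."
print(retention_rate("this girl likes cheese", document))  # high despite flipped meaning
```

Note that the flipped negation ("likes" versus "dislikes") barely affects either score, which is precisely the failure mode discussed below.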
However, as will be appreciated in light of this disclosure, the above and similar algorithms are insufficient if used to determine the quality of a summary of a multimedia content item. One reason is that, because these algorithms rely primarily on word frequency, the semantic meaning of the summary is not compared to the semantic meaning of the multimedia (non-text) content item. This word-frequency approach may thus problematically generate a high quality metric value even for summaries having a semantic meaning that is very different from that of the corresponding multimedia content item. Consider, for example, a multimedia content item whose text portion states "this girl dislikes cheese". A corresponding summary with a text portion stating "this girl likes cheese" would score well using a word frequency algorithm, but would be inaccurate given the absence of the negation in the summary. In another example scenario, a multimedia content item that includes a text portion using pronouns to reference an accompanying image portion may yield a high-scoring summary that carries no information. Consider, for example, a multimedia content item that includes a picture of a shirt accompanied by the text caption "this is good". Without analysis of the image portion (the shirt), a summary stating "this is good" may be given a high quality metric because it exactly matches the text portion of the multimedia content item (i.e., there is a high degree of correlation between the text of the summary and the full text). However, if the image is actually considered, the summary could instead be "this shirt is good", which is a much more accurate summary and therefore should score higher than a score based on text alone. Thus, using currently available algorithms, a summary may be misleadingly determined to have a high quality score while failing to accurately reflect the semantic meaning of the multimedia content item.
To this end, techniques are provided herein for determining a quality metric for a multimedia summary of a multimedia content item by considering both the textual and non-textual components of the summary. In some embodiments, the quality metric is based in part on the semantic similarity between the summary and the content item rather than just word frequency. This is accomplished in some embodiments by identifying the semantic meaning of both the summary and the multimedia content using vector analysis. The vectors of the summary and the vectors of the multimedia content item are compared to determine semantic similarity. Note that both textual and non-textual items can readily be represented by vectors, thereby facilitating vector-based comparisons.
In addition to assessing the semantic similarity between a given multimedia content item and its multimedia summary, the present techniques may also include determining the degree of correlation between the textual and non-textual portions of the summary itself. As will be appreciated in light of this disclosure, a high degree of correlation or "coherence" between the text and non-text portions of the summary tends to indicate a higher quality summary. Accordingly, some embodiments of the present disclosure provide methods for determining a quality metric of a multimedia summary of a multimedia content item based in part on the coherence between the image portion of the summary and the text portion of the summary. "Coherence" refers to the semantic similarity between the text portion of the multimedia summary and the image portion of the multimedia summary and is determined according to the methods described below. At a high level, determining coherence is achieved by generating vectors from both the segments of the text portion and the segments of the image portion and projecting the vectors onto a common unit space. The projected vectors are then compared. Vectors that are adjacent to each other in the common unit space correspond to semantically similar information across both the text portion and the image portion of the summary, and thus correspond to a high degree of coherence between those portions. Note that if a given multimedia summary includes video instead of (or in addition to) still images, the video may be treated as a collection of still images (or frames), where each image is evaluated separately against the text portion of the summary in the same manner as a still image. An average or other suitable statistical representation of the individual comparisons can then be calculated to provide a degree of overall coherence between the text portion and the video. For purposes herein, reference to an "image" is intended to include a frame of video content.
One benefit of some embodiments of the present disclosure is improved accuracy of the quality metric. There are several reasons for the improved accuracy. One reason is that some embodiments of the present disclosure analyze both the text portion and the image portion of a multimedia content item and its corresponding summary. This improves the accuracy of the quality metric because the metric thus reflects the semantic meaning conveyed by both the text portion and the image portion of the multimedia content item and the corresponding summary. Another reason for the increased accuracy is that some embodiments analyze and incorporate the coherence between the text portion of the summary and the image portion of the summary. This improves accuracy because summaries whose text and image portions are semantically similar will yield a high quality metric when using embodiments of the present disclosure.
Another benefit of some embodiments of the present disclosure is the ability to customize the weights of the three different contributions to the multimedia quality metric. In particular, by means of user-selectable coefficients, according to some embodiments, the individual contributions of the following may be weighted according to user preferences: (1) the information content of the text portion of the summary relative to the text portion of the multimedia content ("text coverage"); (2) the information content of the image portion of the summary relative to the image portion of the multimedia content item ("image coverage"); and (3) the coherence between the text and the images of the summary. Some embodiments are customized to evaluate a summary for consistency with a set of topics or with user-selected topics and interests. Some embodiments may be customized to improve the accuracy of the comparison between semantic meanings of image portions, text portions, or both.
As used herein, the term multimedia content item refers to a content item that includes a text portion and an image portion. The image portion may be a still image of any format in any type of digital resource (e.g., e-book, web page, mobile application, digital photograph) or a frame of video, as previously explained. The text portion and the image portion comprise text segments and image segments, respectively. A text segment is a sentence, a clause of a sentence, or a word or character (i.e., number, symbol, letter) in a sentence. An image segment is a frame of an image, a portion of a frame, or an object within a frame. The information content of a text portion or text segment refers to the number of words (e.g., nouns, verbs, and adjectives) that can convey meaning, as opposed to words (e.g., conjunctions and articles) that generally do not themselves convey meaning. The information content of an image portion or image segment refers to a frame, portion of a frame, or object within a frame that may convey meaning (e.g., an image of a face as compared to an unfocused background). As indicated above, "coherence" refers to the semantic similarity between the text portion of the summary and the image portion of the summary. The term "quality" as used herein refers to the degree of similarity between the semantic meaning of the summary and the semantic meaning of the corresponding multimedia content item. The higher the value of the quality metric, the closer the summary and the corresponding multimedia content item are in semantic meaning.
Method for determining a quality metric
Fig. 1 is a high-level flow diagram illustrating a method 100 for determining a quality metric for a multimedia summary corresponding to a multimedia content item, according to one embodiment of the present disclosure. The method 100 begins by receiving 104 a multimedia content item and also receiving 108 a multimedia summary corresponding to the multimedia content item. As presented above, the application of the method 100 to a multimedia content item and a multimedia summary is only one embodiment. Other embodiments of the present disclosure are applicable to content items and summaries containing only one or the other of a text portion and an image portion.
Some embodiments of the present disclosure then analyze 112 both the multimedia content item and the multimedia summary. The analysis 112 is described in more detail below in the context of fig. 2. Based on the analysis 112, a quality metric of the multimedia summary is determined 116. The quality metric and its determination 116 are also described in more detail below in the context of fig. 2.
Fig. 2 is a detailed flow diagram illustrating a method 200 for determining a quality metric for a multimedia summary corresponding to a multimedia content item according to one embodiment of the present disclosure. For ease of illustration, the method is illustrated as including three meta-steps (not presented in a particular order): (1) analyzing 204 semantic similarities between sentences of the text portion of the multimedia content item and sentences of the text portion of the summary; (2) analyzing 208 semantic similarities between sentences of the text portion of the summary and images of the image portion of the summary; and (3) analyzing 212 semantic similarity between the image of the image portion of the multimedia content item and the image of the image portion of the summary. For ease of illustration, elements of method 100 related to accepting a multimedia content item and a multimedia summary are omitted from fig. 2.
Meta-step 204 of the method 200 illustrates operations for analyzing similarities between sentences (or sentence fragments) of the text portion of the multimedia content item and sentences (or sentence fragments) of the text portion of the summary. The function and benefit of this analysis 204 operation is to determine the degree to which semantic meaning is comparable between the text portion of the multimedia content item and the text portion of the corresponding summary. This analysis 204 is accomplished by first generating 216 vectors for sentences in the text portions of the multimedia content item and the summary, respectively, to determine whether the text portion of the summary conveys the same (or similar) semantic meaning as that conveyed by the text portion of the multimedia content item. The more similar the semantic meanings conveyed, the higher the contribution of the text portion of the summary to the quality metric.
The vectors are generated 216 by first processing the text portions of both the multimedia content item and the summary using a recursive auto-encoder. An encoding matrix W_e is first trained. Once trained, W_e is used to analyze the sentences of the multimedia content item and the corresponding summary so as to extract their respective semantic meanings and compare them in a common unit space (described in more detail below).
To train the encoding matrix W_e, the recursive auto-encoder first generates a syntax parse tree for at least one training sentence. Semantic vectors are generated for each word and clause within each training sentence. Each non-terminal (i.e., non-leaf) node of the parse tree is generated according to equation 1 below.
s = f(W_e [c_1; c_2] + b)    (Equation 1)
In equation 1, s represents a non-leaf node, W_e is the trained encoding matrix, and c_1 and c_2 (more generally, c_i) are word-to-vector representations. Specifically, each c_i corresponds to a sentence fragment, which is an element of the parse tree. A sentence fragment is a subset of one or more of the training sentences. The term b in equation 1 is a constant. The function f is, in one example, a sigmoid function, which produces values between 0 and 1 when applied to its argument.
To train the matrix W_e, the recursive auto-encoder reconstructs the elements under each node in the parse tree for each sentence of the multimedia content item and the corresponding summary according to equation 2 below.
[x_1'; y_1'] = f(W_d y_2 + b)    (Equation 2)
Equation 2 describes how, based on a decoding matrix W_d, the vector y_2 for a node is decoded into output vectors (x_1' and y_1'), with that output subsequently processed by the sigmoid function f.
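A minimal numerical sketch of equations 1 and 2 follows, assuming toy dimensions and randomly initialized matrices; in the method above, W_e and W_d would instead be trained to minimize the reconstruction error:

```python
import numpy as np

d = 4                                  # toy word-vector dimension (assumption)
rng = np.random.default_rng(0)
W_e = rng.normal(size=(d, 2 * d))      # encoding matrix of equation 1
W_d = rng.normal(size=(2 * d, d))      # decoding matrix of equation 2
b_e, b_d = np.zeros(d), np.zeros(2 * d)

def f(x):
    return 1.0 / (1.0 + np.exp(-x))    # sigmoid, producing values in (0, 1)

def encode(c1, c2):
    """Equation 1: parent node s = f(W_e [c1; c2] + b)."""
    return f(W_e @ np.concatenate([c1, c2]) + b_e)

def decode(s):
    """Equation 2: reconstruction [x1'; y1'] = f(W_d s + b)."""
    return f(W_d @ s + b_d)

c1, c2 = rng.normal(size=d), rng.normal(size=d)   # two child vectors
s = encode(c1, c2)                                # vector for the non-leaf node
error = float(np.sum((decode(s) - np.concatenate([c1, c2])) ** 2))
print(s, error)   # training would adjust W_e, W_d to drive this error down
```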
After training of the matrix W_e is complete, the trained matrix W_e is used to generate a vector representation of the root of the parse tree, which is used as the representation vector for the sentence. The vector generated for each sentence is then used to calculate the cosine similarity between a sentence of the multimedia content item and a corresponding sentence of the summary. The similarity S_T(u, v) between sentences of the text portion of the multimedia content item and sentences of the text portion of the summary is determined based on cosine similarity according to equation 3 below.
S_T(u, v) = (x_u · x_v) / (‖x_u‖ ‖x_v‖)    (Equation 3)

In equation 3, x_u and x_v are the vector representations of a text segment u of the text portion of the summary and a text segment v of the text portion of the multimedia content item, respectively. The cosine similarity quantifies the semantic similarity between sentences of the multimedia content item and of the summary; this similarity later serves as a contribution to the multimedia summary quality metric, as described in more detail below.
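The cosine similarity of equation 3 (and, after projection onto the common unit space, of equations 4 and 5 as well) can be computed as in the following sketch; the example vectors are placeholders:

```python
import numpy as np

def cosine_similarity(x_u, x_v):
    """Equation 3: S_T(u, v) = (x_u . x_v) / (||x_u|| ||x_v||)."""
    return float(np.dot(x_u, x_v) / (np.linalg.norm(x_u) * np.linalg.norm(x_v)))

# e.g. root vectors produced by the trained recursive auto-encoder
x_u = np.array([0.20, 0.90, 0.10])   # summary sentence u
x_v = np.array([0.25, 0.85, 0.05])   # content-item sentence v
print(cosine_similarity(x_u, x_v))   # close to 1.0 for semantically similar sentences
```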
Meta-step 208 of the method 200 illustrates operations for analyzing similarities between sentences of the text portion of the summary and the accompanying image portion of the summary. The function and benefit of this analysis 208 operation is to determine the extent to which the semantic meanings of the text portion of the summary and of the accompanying image portion of the summary correspond to each other. The more semantic similarity between the text and the accompanying images, the higher the quality of the multimedia summary.
In a process similar to that described above, vectors corresponding to the image content and text content of the summary are generated 224 in a manner similar to that described by Karpathy et al. ("Deep Fragment Embeddings for Bidirectional Image Sentence Mapping", 2014, pp. 1889-1897), which is incorporated herein by reference in its entirety. First, the process for generating vectors for the image portion of the summary is described.
The process for generating 224 a vector corresponding to an image portion of a summary includes first identifying a segment of the image portion that may be relevant to the summary. The segments are identified by training a deep neural network auto-encoder, which is then applied to the image to extract relevant image portions. At a high level, this process is accomplished by extracting pixel values from the image and using the pixel values individually or in associated groups to identify higher levels of organization within the image that correspond to objects in the image.
Once the image segments are identified, a regional convolutional neural network (RCNN) is used to generate a vector corresponding to each of the identified image segments. In one embodiment, the RCNN, as described by Girshick et al. (see "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation", Computer Vision and Pattern Recognition, 2014), incorporated herein by reference in its entirety, generates a 4096-dimensional vector corresponding to each identified segment. A 4096-dimensional vector represents a convenient trade-off between the consumption of computational resources and output quality. Since 4096 is equal to 2^12, it is conveniently applied to binary data bits. A lower-dimensional space may be used, but with less discrimination between features. A higher-dimensional space may also be used, but the consumption of computational resources increases.
An intersection between any two vectors is identified. A subset of segments for which vectors are generated is selected based on a likelihood of one of the image segments corresponding to a portion of the image that is semantically related to the summary. In some embodiments, the identified segments are further limited based on the classification determined using the vectors to reduce the risk of over-representation of any image segment in subsequent steps of the analysis.
The vector corresponding to the text portion of the summary is generated 224 using the procedure described above in the context of element 216 of meta-step 204.
The image vector and sentence vector are then projected onto a common unit space by matrix transformation. The matrices used to transform the vectors onto the common unit space have been trained so that semantically similar elements, whether in image parts or text parts, are correspondingly projected onto regions of the common unit space that reflect the semantic similarity.
One benefit of projecting the vectors onto a common unit space is to reduce the impact of extraneous information on the determination of semantic similarity. For example, the vectors as generated may comprise extraneous information (e.g., color, texture, shape) that is not relevant to the semantic meaning of the image or text portion. The effect of this extraneous information is reduced by mapping the vectors to the common unit space.
The cosine similarity between the vectors of the image and text portions of the summary is then determined according to equation 4 below.

S_{T,I}(u, p) = (x_u · x_p) / (‖x_u‖ ‖x_p‖)    (Equation 4)

In equation 4, x_u and x_p are the vector representations, obtained using the methods described above, of a text segment of the text portion u of the summary and an image segment of the image portion p of the summary, respectively.
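The following sketch illustrates the common-unit-space comparison behind equation 4. The projection matrices M_text and M_img, their dimensions, and the random inputs are all assumptions for illustration; in the method above, these matrices are trained so that semantically related fragments project near one another:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 8                                  # size of the common unit space (assumption)
M_text = rng.normal(size=(k, 4))       # projects sentence vectors (toy 4-d here)
M_img = rng.normal(size=(k, 4096))     # projects 4096-d RCNN image-segment vectors

def to_unit_space(M, v):
    z = M @ v
    return z / np.linalg.norm(z)       # unit-normalize after projection

x_u = to_unit_space(M_text, rng.normal(size=4))     # text segment of the summary
x_p = to_unit_space(M_img, rng.normal(size=4096))   # image segment of the summary
print(float(x_u @ x_p))  # equation 4: cosine similarity of unit vectors is a dot product
```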
Meta-step 212 of method 200 illustrates operations for analyzing similarities between image portions of the summary and image portions of the multimedia content item in one embodiment. As explained above in the context of meta-step 208, vectors are determined for the images and projected onto the common unit space. The cosine similarity between the images based on the generated vector is determined according to equation 5 below.
S_I(p, q) = (x_p · x_q) / (‖x_p‖ ‖x_q‖)    (Equation 5)

In equation 5, x_p and x_q are the vector representations of an image segment p of the summary and an image segment q of the image portion of the multimedia content item, respectively.
Having generated similarity scores for various elements of the multimedia content item and corresponding summary as described above in method 200, a multimedia quality metric is determined 116 as shown in fig. 1 and as described in more detail below.
Determining multimedia summary metrics
Referring again to fig. 1, a process for determining 116 a quality metric that quantifies a degree of similarity between a summary and a semantic meaning of a multimedia content item using information determined in the analysis 112 (and corresponding method 200) is described below.
The multimedia summary quality metric is determined according to equation 6 below.

MuSQ = f(IC_text, IC_image, Coh_total)    (Equation 6)

In equation 6, MuSQ is the multimedia summary quality metric; IC_text is a metric describing the proportional amount of information in the text portion of the summary relative to the text portion of the multimedia content item; and IC_image is the proportional amount of information in the image portion of the summary relative to the image portion of the multimedia content item. The term "f" in equation 6, and as used elsewhere in this disclosure, represents a generic function rather than a specific function. Coh_total is the "coherence" between the text portion of the summary and the image portion of the summary. Coherence reflects the degree of semantic similarity between the text portion of the summary and the image portion of the summary, with higher values reflecting more semantic similarity between the text and the images of the summary. In one embodiment, equation 6 is a non-decreasing summation of its variables, as shown below in equation 7.

MuSQ = A·IC_text + B·IC_image + C·Coh_total    (Equation 7)

In equation 7, A, B, and C are positive constants used to change the relative contribution of each variable to MuSQ.
IC_text is defined in equation 8 below.

IC_text = Σ_v R_v · max_u S_T(u, v)    (Equation 8)

In equation 8, S_T is defined above in equation 3, and R_v is the number of terms or words that may contribute to the semantic meaning of the text portion of the multimedia content item (referred to above as "information content"). That is, R_v is the count of nouns, verbs, adjectives, adverbs, and pronouns in a text segment of the text portion. Articles, conjunctions, and the like are omitted in the determination of R_v.
For a given text segment v of the multimedia content item, the "max" function is taken over the text segments u present in the text portion of the summary. The result of the "max" function is the maximum representation of the text segment v present in the summary S. The "max" function also prevents redundant sentences in the summary from increasing the quality metric score, since only the summary sentence or segment that is most relevant to the multimedia content item contributes to the metric. In other words, using this function facilitates selecting, from among a plurality of summary sentences, the sentence with the most information content with respect to a particular semantic. This increases the score of a summary with more extensive coverage of the multimedia content, since repeated sentences contribute nothing (or less) to the score, while sentences and images representing diverse topics are scored as contributing more information content.
The result of the "max" function is multiplied by the information content R_v. Including the information content R_v in equation 8 favors the selection of segments that convey more information (in terms of the number of nouns, adjectives, etc.) over less informative sentences having a lower count of "informative" words of the identified types. Summing this quantity over all text segments v present in the multimedia content item yields a quality indicator of the text portion of the summary relative to the multimedia content item as a whole.
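A sketch of equation 8 follows. The cosine function stands in for S_T, and each document segment is paired with a precomputed content-word count R_v; both pairings are illustrative assumptions:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ic_text(summary_vecs, doc_segments):
    """Equation 8: IC_text = sum over v of R_v * max over u of S_T(u, v)."""
    return sum(R_v * max(cos(u, v) for u in summary_vecs)   # max over summary segments u
               for v, R_v in doc_segments)                  # sum over document segments v

# Each document sentence vector is paired with its information content R_v.
doc_segments = [(np.array([1.0, 0.0]), 5), (np.array([0.0, 1.0]), 6)]
summary_vecs = [np.array([0.9, 0.1]), np.array([0.9, 0.1])]  # a redundant second sentence
print(ic_text(summary_vecs, doc_segments))  # the duplicate adds nothing, per the max function
```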
IC_image is defined below in equation 9.

IC_image = Σ_q R_q^I · max_p S_I(p, q)    (Equation 9)

In equation 9, S_I(p, q), defined above in equation 5, represents the information content of an image segment p (in the summary) in relation to an image segment q (in the multimedia content item). In one embodiment, S_I quantifies the similarity between the image segment p in the summary and the corresponding image segment q in the multimedia content item. The quantification of S_I is determined based on representations of the image segments produced by the regional convolutional neural network (RCNN) analysis, optionally projected onto a common unit space as described above. The term R_q^I is the information content of the image segment q of the multimedia content item. In one embodiment, this term is determined by converting the image segment q into text (specifically, by generating 224 a vector) as described above in the context of meta-step 208, and then measuring the information content of that text using the method described above; R_q^I is thus analogous to the term R_v described above.
In equation 9, for a given image segment q of the multimedia content item, the max function is taken over the image segments p present in the image portion of the summary. The result is the maximum representation of the image segment q present in the image portion of the summary S. Summing over all image segments q present in the multimedia content item provides an indication of how well the image portion of the summary represents the multimedia content item.
Coh_total is defined below in equation 10.

Coh_total = Σ_u Σ_p C_{T,I}(u, p) · R_u · R_p^I    (Equation 10)

In equation 10, C_{T,I}(u, p) represents the coherence between a sentence (or text segment) u of the text portion of the summary S and an image segment p of the image portion I of the summary. As described above in the context of equation 4, C_{T,I} is determined by projecting vectors extracted from the text portion and the image portion of the summary onto a common unit space and comparing them. R_u and R_p^I are the information content of the text and image segments, respectively, as defined above.
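The following sketch assembles MuSQ per equation 7 from the quantities of equations 8 through 10. The pairing of each segment vector with its information content, and the reuse of one helper for both IC_text and IC_image, are simplifying assumptions; this is an illustrative reading of the equations, not the patented implementation itself:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ic(summary_segments, doc_segments):
    """Shared shape of equations 8 and 9: sum over v of R_v * max over u of sim(u, v)."""
    return sum(R_v * max(cos(u, v) for u, _ in summary_segments)
               for v, R_v in doc_segments)

def coh_total(summary_text, summary_images):
    """Equation 10: coherence C_T,I(u, p) weighted by both information contents."""
    return sum(cos(u, p) * R_u * R_p
               for u, R_u in summary_text
               for p, R_p in summary_images)

def musq(summary_text, summary_images, doc_text, doc_images, A=1.0, B=1.0, C=1.0):
    """Equation 7: MuSQ = A*IC_text + B*IC_image + C*Coh_total."""
    return (A * ic(summary_text, doc_text)
            + B * ic(summary_images, doc_images)
            + C * coh_total(summary_text, summary_images))
```

Here every segment is a (vector, information content) pair, with all vectors assumed to lie already in the common unit space.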
Example System
Fig. 3 is a block diagram of a distributed processing environment including a quality metric determination system remotely coupled to a given user's computing device by a communication network according to one embodiment of the present disclosure. The distributed processing environment 300 shown in fig. 3 includes a user device 304, a network 308, and a digest quality determination system 312. In other embodiments, system environment 300 includes different and/or additional components than those shown in FIG. 3.
The user device 304 is a computing device capable of receiving user input as well as transmitting and/or receiving data via the network 308. In one embodiment, the user device 304 is a computer system, such as a desktop or laptop computer. In another embodiment, the user device 304 may be a computer-enabled device, such as a personal digital assistant (PDA), mobile phone, tablet computer, smart phone, or similar device. In some embodiments, the user device 304 is a mobile computing device used for consuming a multimedia content item, a summary corresponding to the multimedia content item, and the output of the methods described herein for determining a summary quality metric for the summary corresponding to the multimedia content item. The user device 304 is configured to communicate with the summary quality determination system 312 via the network 308. In one embodiment, the user device 304 executes an application that allows a user of the user device 304 to interact with the summary quality determination system 312, thus becoming a specialized computing machine. For example, the user device 304 executes a browser application to enable interaction between the user device 304 and the summary quality determination system 312 via the network 308. In another embodiment, the user device 304 interacts with the summary quality determination system 312 through an application programming interface (API) that runs on the native operating system of the user device 304 (e.g., IOS® or ANDROID™).
The user device 304 is configured to communicate via the network 308, which may include any combination of local and/or wide area networks, using both wired and wireless communication systems. In one embodiment, network 308 uses standard communication technologies and/or protocols. Thus, the network 308 may include links using technologies such as the Internet, 802.11, Worldwide Interoperability for Microwave Access (WiMAX), 3G, 4G, CDMA, Digital Subscriber Line (DSL), and so forth. Similarly, networking protocols used on network 308 may include multiprotocol label switching (MPLS), transmission control protocol/internet protocol (TCP/IP), User Datagram Protocol (UDP), hypertext transfer protocol (HTTP), Simple Mail Transfer Protocol (SMTP), and File Transfer Protocol (FTP). Data exchanged over network 308 may be represented using techniques and/or formats including hypertext markup language (HTML) or extensible markup language (XML). Further, all or some of the links may be encrypted using encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), and internet protocol security (IPsec).
Fig. 4 is a block diagram of the system architecture of the summary quality determination system 312 shown in fig. 3. The summary quality determination system 312 is configured to perform some or all of the above-described embodiments upon receiving multimedia content and a corresponding summary, to determine a quality metric indicating the degree of similarity between the overall semantic meaning of the summary and the semantic meaning of the corresponding multimedia content item. The summary quality determination system 312 includes a non-transitory memory 416 and a quality metric determination module 432, the sub-components of which are described below.
The non-transitory memory 416 is depicted as including two different memory elements: a multimedia content item store 420 and a summary store 424. The multimedia content item store 420 stores multimedia content items (and, optionally, content items comprising only one of a text portion or an image portion) for analysis and, optionally, for display or transmission. The summary store 424 stores summaries corresponding to multimedia content items. As with the multimedia content item store 420, the summary store 424 may store any one or more of text summaries, image summaries, and multimedia summaries that include both text portions and image portions. Regardless of the nature of the stored content and summaries, the multimedia content item store 420 and the summary store 424 are in communication with the quality metric determination module 432.
The non-transitory memory 416 may include a computer system memory or random access memory for storing data and computer readable instructions and/or software implementing various embodiments as taught in the present disclosure, such as a persistent disk storage (which may include any suitable optical or magnetic persistent storage device, e.g., RAM, ROM, flash memory, USB devices, or other semiconductor-based storage media), a hard drive, a CD-ROM, or other computer readable medium. The non-transitory memory 416 may also include other types of memory or combinations thereof. The non-transitory memory 416 may be provided as a physical element of the system 312 or the non-transitory memory 416 may be provided separately or remotely from the system 312. The non-transitory memory 416 of the system 312 may store computer-readable and computer-executable instructions or software for implementing various embodiments, including a multimedia content item store 420 and a summary store 424.
In use, the quality metric determination module 432 communicates with the non-transitory memory 416 including the multimedia content item store 420 and the summary store 424 in order to receive and subsequently analyze multimedia content items and corresponding summaries. The quality metric determination module 432 includes a sentence-to-sentence analyzer 432, a sentence-to-image analyzer 436, and an image-to-image analyzer 440. The sentence-to-sentence analyzer analyzes the quality of the sentences (or sentence fragments) in the text portion of the summary relative to the sentences in the text portion of the multimedia content item as described above in the context of fig. 1 and 2. The sentence-to-image analyzer analyzes the quality of the sentences in the text portion of the summary relative to the accompanying image portion of the summary as described above in the context of fig. 1 and 2. The image-to-image analyzer analyzes the quality of the image portions of the summary with respect to the image portions of the corresponding multimedia content item as described above in the context of fig. 1 and 2. Once each of these analyzers 432, 436, 440 completes the analysis, the quality metric determination module receives the output of the respective analysis to determine a summary quality metric as described above.
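A structural sketch of this wiring is shown below; the class and parameter names are assumptions rather than the names used in the actual system 312:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QualityMetricModule:
    sentence_to_sentence: Callable  # produces IC_text   (meta-step 204)
    image_to_image: Callable        # produces IC_image  (meta-step 212)
    sentence_to_image: Callable     # produces Coh_total (meta-step 208)
    A: float = 1.0                  # user-selectable weights of equation 7
    B: float = 1.0
    C: float = 1.0

    def quality_metric(self, content_item, summary) -> float:
        """Combine the three analyzer outputs into the MuSQ of equation 7."""
        return (self.A * self.sentence_to_sentence(content_item, summary)
                + self.B * self.image_to_image(content_item, summary)
                + self.C * self.sentence_to_image(summary))
```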
The web server 444 links the summary quality determination system 312 to the user device 304 via the network 308. The web server 444 serves web pages, as well as other web-related content such as JAVA®, FLASH®, XML, and so forth. The web server 444 may provide functionality to receive and transmit content items and summaries from and to the user device 304, to receive and transmit summary quality metrics from and to user devices, and to otherwise facilitate consumption of content items. Additionally, the web server 444 may transmit data directly to native client device operating systems (such as IOS®, ANDROID™, or RIM®). The web server 444 also provides API functionality for exchanging data with the user device 304.
Summary quality determination system 312 also includes at least one processor 448 for executing computer-readable and computer-executable instructions or software stored in non-transitory memory 416 and other programs for controlling system hardware. Virtualization may be employed so that infrastructure and resources in the digest quality determination system 312 may be dynamically shared. For example, a virtual machine may be provided to handle a process running on multiple processors so that the process appears to use only one computing resource rather than multiple computing resources. Multiple virtual machines may also be used with a processor.
Example applications
The following two examples qualitatively describe the application of the embodiments described herein. In a first example, a multimedia content item contains two distinct sentences. The first sentence, Str_1, comprises a set of unique words w_1 and is repeated n_1 times in the multimedia content item. The second sentence, Str_2, comprises a set of unique words w_2 and is repeated n_2 times. For convenience of explanation, assume w_1 and w_2 have no words in common; this last assumption is expressed mathematically as w_1 ∩ w_2 = ∅. Further, assume for this example that the word counts are |w_1| = 5 and |w_2| = 6, and that the repetition counts are n_1 = 10 and n_2 = 2.
If a summary of only a single sentence is requested, two options are possible: a summary S_1 containing only Str_1, or a summary S_2 containing only Str_2. Because Str_1 is repeated 10 times, five times more frequently than Str_2, summary S_1 is preferred because it captures the information that is dominant in the original multimedia content item. Because w_1 and w_2 have no words in common, the total number of unique words in the multimedia content item is |w_1| + |w_2|. The retention rates of summaries S_1 and S_2 relative to the multimedia content item follow equations 11 and 12:
Retention(S_1) = |w_1| / (|w_1| + |w_2|) = 5/11    (Equation 11)

Retention(S_2) = |w_2| / (|w_1| + |w_2|) = 6/11    (Equation 12)
A retention rate algorithm, such as that presented above, will preferentially select S_2 because it has the highest number of unique words among the analyzed summaries. The retention rate algorithm bases this selection criterion on the assumption that a summary comprising more unique words describes more of the content in the multimedia content item. However, because these methods focus only on word counts, significant semantic differences are ignored. In this example, retention rate would select the summary S_2 with more unique words, even though it represents less of the entire content of the multimedia content item.
According to embodiments of the present disclosure, a summary having more information content and broader coverage of the multimedia content item as a whole (i.e., reflecting the different topics throughout the multimedia content item) is preferred. In contrast to the retention rate example above, consider applying an embodiment of the present disclosure to select between summary 1 (S_1) and summary 2 (S_2). Equations 13 and 14 apply embodiments of the present disclosure to the above scenario.
MuSQ(S_1) = n_1 · |w_1| = 10 · 5 = 50    (Equation 13)

MuSQ(S_2) = n_2 · |w_2| = 2 · 6 = 12    (Equation 14)
In the above example, equation 7 reduces to the form of equations 13 and 14: because this example includes only a text portion, the image-analysis terms of equation 7 (i.e., IC_image and Coh_total) reduce to zero. Thus, the only term remaining from equation 7 is the IC_text term. In this case, IC_text reduces to the number of words contributing to semantic meaning (R_v), because the "max" term is 1. Based on the foregoing, embodiments of the present disclosure select S_1 because it is more representative of the multimedia content item (i.e., selecting the summary S_1, which includes the sentence Str_1 repeated five times more frequently than Str_2).
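The arithmetic of this example can be checked directly; the sketch below assumes, per the text, that only the IC_text term of equation 7 survives and that the max term equals 1:

```python
n1, w1 = 10, 5   # Str_1: repeated 10 times, 5 unique content words
n2, w2 = 2, 6    # Str_2: repeated 2 times, 6 unique content words

print("retention(S1) =", w1 / (w1 + w2))   # 5/11, so retention rate prefers S2
print("retention(S2) =", w2 / (w1 + w2))   # 6/11
print("MuSQ(S1) =", n1 * w1)               # 50, so MuSQ prefers S1 (equation 13)
print("MuSQ(S2) =", n2 * w2)               # 12 (equation 14)
```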
In another example, consider the advantages of embodiments of the present disclosure over KL divergence. Adapting the preceding example, define the summaries S_1 and S_2 as S_1 = {Str_1, Str_2} and S_2 = {Str_1, Str_1}, with |w_1| = 5, |w_2| = 6, and w_1 ∩ w_2 = ∅. Because S_1 includes more information (i.e., both Str_1 and Str_2) than S_2, which comprises only two repetitions of Str_1, S_1 is the preferred summary.
Recall that KL divergence is defined in equation 15 below.

KL(S) = Σ_i q_i · log(q_i / p_i)    (Equation 15)
In equation 15, q_i is the probability of occurrence of the i-th word in the summary, and p_i is the probability of occurrence of the i-th word in the original document. If KL(S_2) < KL(S_1), then summary S_2 will be selected according to KL divergence. Applying known mathematics, the ratio in equation 16 determines the selection criterion.
KL(S_2) < KL(S_1) when n_1 / n_2 > (11/5)^(11/6) ≈ 4.3    (Equation 16)
In this example, n_1 = 10 and n_2 = 2, so n_1 > 4.3 · n_2. For this reason, even though S_2 has less information than S_1, S_2 will still be selected as the preferred summary according to KL divergence.
In contrast, applying an example of the present disclosure, MuSQ(S_1) = n_1 · |w_1| + n_2 · |w_2| = 10 · 5 + 2 · 6 = 62 and MuSQ(S_2) = n_1 · |w_1| = 10 · 5 = 50. Applying this model, S_1 is appropriately selected as the preferred summary due to the diversity of its information.
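A numeric check of this comparison is sketched below, under the assumption that p_i and q_i are relative word frequencies in the document and summary, respectively:

```python
import math

n1, w1, n2, w2 = 10, 5, 2, 6
doc_total = n1 * w1 + n2 * w2          # 62 word occurrences in the document

def kl(pairs):
    """Equation 15 over (q_i, p_i) pairs for the unique words of a summary."""
    return sum(q * math.log(q / p) for q, p in pairs)

# S1 = {Str_1, Str_2}: each of the 11 unique words appears once (q = 1/11)
kl_s1 = kl([(1 / 11, n1 / doc_total)] * w1 + [(1 / 11, n2 / doc_total)] * w2)
# S2 = {Str_1, Str_1}: each of Str_1's 5 unique words appears twice (q = 2/10)
kl_s2 = kl([(2 / 10, n1 / doc_total)] * w1)

print(kl_s1, kl_s2)   # kl_s2 < kl_s1: KL divergence picks S2,
                      # while MuSQ(S1) = 62 > MuSQ(S2) = 50 picks the diverse S1
```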
Further considerations
As will be appreciated in light of this disclosure, the various modules and components of the systems shown in figs. 3 and 4, such as the sentence-to-sentence analyzer 432, the sentence-to-image analyzer 436, and the image-to-image analyzer 440, may be implemented in software, such as a set of instructions (e.g., HTML, XML, C++, Objective-C, JavaScript, Java, BASIC, etc.) encoded on any computer-readable medium or computer program product (e.g., a hard drive, a server, a disk, or other suitable non-transitory memory or collection of memories) that, when executed by one or more processors, causes the various methods provided in this disclosure to be performed. It will be appreciated that, in some embodiments, various functions performed by the user computing system as described in this disclosure may be performed by similar processors and/or databases in different configurations and arrangements, and the depicted embodiments are not intended to be limiting. The various components of this example embodiment may be integrated into, for example, one or more desktop or laptop computers, workstations, tablets, smart phones, game consoles, set-top boxes, or other such computing devices. Other typical components and modules of a computing system, such as a processor (e.g., a central processing unit and co-processor, a graphics processor, etc.), input devices (e.g., keyboard, mouse, touchpad, touchscreen, etc.), and an operating system, are not shown but will be readily apparent.
The foregoing description of embodiments of the disclosure has been presented for purposes of illustration; it is not intended to be exhaustive or to limit the claims to the precise form disclosed. One skilled in the relevant art will recognize that many modifications and variations are possible in light of the above disclosure.
Some portions of the present description describe embodiments in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, when described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. The described operations may be embodied in software, firmware, hardware, or any combination thereof.
Any of the steps, operations, or processes described herein may be performed or implemented by one or more hardware or software modules, either alone or in combination with other devices. In one embodiment, the software modules are implemented with a computer program product comprising a non-transitory computer-readable medium containing computer program code executable by a computer processor for performing any or all of the steps, operations, or processes described.
Example embodiments
In one example, a computer-implemented method for evaluating a summary of a digital multimedia content item includes receiving a multimedia content item including a text portion and an image portion, receiving a summary of the multimedia content, the summary including a text portion and an image portion, and determining a quality metric of the summary relative to the multimedia content item. The determining includes determining at least two of the following content metrics: the method further includes determining a first content metric quantifying an amount of information content in the text portion of the summary that is common with the text portion of the multimedia content item, determining a second content metric quantifying an amount of information content in the image portion of the summary that is common with the image portion of the multimedia content item, and determining a third content metric quantifying an information coherence between the text portion of the summary and the image portion of the summary. The quality metric is based at least in part on the at least two determined content metrics. In one embodiment of this example, determining the quality metric further comprises determining a product of the first content metric, the second content metric, and the third content metric. In one embodiment of this example, determining the first content metric includes determining a cosine similarity between at least one text segment of the text portion of the multimedia summary and a vector representation of at least one text segment of the multimedia content item. A max function may be applied to the cosine similarity determination. In one embodiment of this example, determining the second content metric includes generating a first image vector from the image portion of the summary and generating a second image vector from the image portion of the multimedia content item. In one embodiment of this example, determining the third content metric includes projecting a first text content vector from the text portion of the summary and a second text content vector from the image portion of the summary onto a common unit space. In one embodiment of this example, determining the third content metric includes determining a product of a first content of the text portion of the summary and a second content of the image portion of the summary.
In another example, a computer program product is stored on at least one non-transitory computer-readable medium comprising instructions that, when executed by one or more processors, cause the above computer-implemented method to be performed.
In another example, a system for evaluating a summary of a digital multimedia content item includes various modules, at least one processor, and at least one non-transitory storage medium for determining a quality metric according to the example methods described above.

Claims (20)

1. A computer-implemented method for evaluating a summary of a digital multimedia content item, the method comprising:
receiving the multimedia content item comprising a text portion and an image portion;
receiving the summary of the multimedia content item, the summary comprising a text portion and an image portion;
determining a quality metric of the summary relative to the multimedia content item, the determining comprising:
determining a first content metric quantifying an amount of information content in the text portion of the summary that is common with the text portion of the multimedia content item;
determining a second content metric quantifying an amount of information content in the image portion of the summary that is common with the image portion of the multimedia content item; and
determining a third content metric that quantifies a coherence of information between the text portion of the summary and the image portion of the summary;
wherein the quality metric is based at least in part on the determined first, second, and third content metrics.
2. The method of claim 1, wherein determining the quality metric further comprises determining a product of the first content metric, the second content metric, and the third content metric.
3. The method of claim 1, wherein determining the first content metric comprises determining a cosine similarity between a vector representation of at least one text segment of the text portion of the summary and a vector representation of at least one text segment of the multimedia content item.
4. The method of claim 3, further comprising applying a max function to the cosine similarity.
5. The method of claim 1, wherein determining the second content metric comprises generating a first image vector from the image portion of the summary and a second image vector from the image portion of the multimedia content item.
6. The method of claim 1, wherein determining the third content metric comprises projecting a first text content vector from the text portion of the summary and a second text content vector from the image portion of the summary onto a common unit space.
7. The method of claim 1, wherein determining the third content metric comprises determining a product of a first content of the text portion of the summary and a second content of the image portion of the summary.
8. A computer program product, wherein the computer program product is stored on at least one non-transitory computer-readable medium comprising instructions that, when executed by one or more processors, cause a process to be performed, the process comprising:
receiving a multimedia content item comprising a text portion and an image portion;
receiving a summary of the multimedia content item, the summary comprising a text portion and an image portion;
determining a quality metric of the summary relative to the multimedia content item, the determining comprising:
determining a first content metric quantifying an amount of information content in the text portion of the summary that is common with the text portion of the multimedia content item;
determining a second content metric quantifying an amount of information content in the image portion of the summary that is common with the image portion of the multimedia content item; and
determining a third content metric that quantifies a coherence of information between the text portion of the summary and the image portion of the summary;
wherein the quality metric is based at least in part on the determined first, second, and third content metrics.
9. The computer program product of claim 8, wherein determining the quality metric further comprises determining a product of the first content metric, the second content metric, and the third content metric.
10. The computer program product of claim 8, wherein determining the first content metric comprises determining a cosine similarity between a vector representation of at least one text segment of the text portion of the summary and a vector representation of at least one text segment of the multimedia content item.
11. The computer program product of claim 10, wherein the process further comprises applying a max function to the cosine similarity.
12. The computer program product of claim 8, wherein determining the second content metric comprises generating a first image vector from the image portion of the summary and a second image vector from the image portion of the multimedia content item.
13. The computer program product of claim 8, wherein determining the third content metric comprises projecting a first text content vector from the text portion of the summary and a second text content vector from the image portion of the summary onto a common unit space.
14. The computer program product of claim 8, wherein determining the third content metric comprises determining a product of a first content of the text portion of the summary and a second content of the image portion of the summary.
15. A system for evaluating a summary of a digital multimedia content item, the system comprising:
a multimedia content item repository configured to receive a multimedia content item comprising a text portion and an image portion;
a summary repository configured to receive a summary comprising a text portion and an image portion;
a quality metric determination module configured to determine a quality metric of the summary relative to the multimedia content item, the determination comprising:
determining a first content metric quantifying an amount of information content in the text portion of the summary that is common with the text portion of the multimedia content item;
determining a second content metric quantifying an amount of information content in the image portion of the summary that is common with the image portion of the multimedia content item; and
determining a third content metric that quantifies a coherence of information between the text portion of the summary and the image portion of the summary;
wherein the quality metric is based at least in part on the determined first, second, and third content metrics.
16. The system of claim 15, wherein the quality metric determination module is further configured to determine the quality metric by determining a product of the first content metric, the second content metric, and the third content metric.
17. The system of claim 15, wherein the quality metric determination module is further configured to determine the first content metric by determining a cosine similarity between a vector representation of at least one text segment of the text portion of the summary and a vector representation of at least one text segment of the multimedia content item.
18. The system of claim 17, wherein the quality metric determination module is further configured to determine the first content metric by applying a max function to the cosine similarity.
19. The system of claim 15, wherein the quality metric determination module is further configured to determine the second content metric by generating a first image vector from the image portion of the summary and a second image vector from the image portion of the multimedia content item.
20. The system of claim 15, wherein the quality metric determination module is further configured to determine the third content metric by projecting a first text content vector from the text portion of the summary and a second text content vector from the image portion of the summary onto a common unit space.
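To relate the system claims to the earlier sketch, the following is one possible arrangement of the claimed repositories and quality metric determination module; the class names, data shapes, and storage choices are illustrative assumptions, and the metric functions are the ones defined in the sketch above.

    # Illustrative structure for the system of claims 15-20: two repositories
    # feeding a quality metric determination module. Class names are assumed.
    from dataclasses import dataclass, field
    import numpy as np

    @dataclass
    class MultimediaItem:
        text_segments: list       # vector representations of text segments
        image_vector: np.ndarray  # vector representation of the image portion

    @dataclass
    class Repository:
        # Minimal stand-in for the content item and summary repositories.
        items: dict = field(default_factory=dict)

        def put(self, key, item):
            self.items[key] = item

        def get(self, key):
            return self.items[key]

    class QualityMetricDeterminationModule:
        # Determines the quality metric of a summary relative to its source
        # item, reusing the metric functions from the earlier sketch.
        def __init__(self, W_text, W_image):
            self.W_text, self.W_image = W_text, W_image

        def determine(self, source, summary):
            m1 = first_content_metric(summary.text_segments,
                                      source.text_segments)
            m2 = second_content_metric(summary.image_vector,
                                       source.image_vector)
            m3 = third_content_metric(np.mean(summary.text_segments, axis=0),
                                      summary.image_vector,
                                      self.W_text, self.W_image)
            return quality_metric(m1, m2, m3)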
CN201610877283.7A 2015-12-04 2016-09-30 Determining quality of a summary of multimedia content Active CN106844410B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/959,219 US9454524B1 (en) 2015-12-04 2015-12-04 Determining quality of a summary of multimedia content
US14/959,219 2015-12-04

Publications (2)

Publication Number Publication Date
CN106844410A (en) 2017-06-13
CN106844410B (en) 2022-02-08

Family

ID=56939505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610877283.7A Active CN106844410B (en) 2015-12-04 2016-09-30 Determining quality of a summary of multimedia content

Country Status (5)

Country Link
US (1) US9454524B1 (en)
CN (1) CN106844410B (en)
AU (1) AU2016238832B2 (en)
DE (1) DE102016011905A1 (en)
GB (1) GB2545051A (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11182414B2 (en) * 2017-03-20 2021-11-23 International Business Machines Corporation Search queries of multi-datatype databases
CN109492213B (en) * 2017-09-11 2023-04-07 阿里巴巴集团控股有限公司 Sentence similarity calculation method and device
US10587669B2 (en) * 2017-12-20 2020-03-10 Facebook, Inc. Visual quality metrics
CN110020169A (en) * 2017-12-28 2019-07-16 北京京东尚科信息技术有限公司 Method and apparatus for determining object dependencies
CN108829659B (en) * 2018-05-04 2021-02-09 北京中科闻歌科技股份有限公司 Reference identification method, reference identification equipment and computer-storable medium
CN108985370B (en) * 2018-07-10 2021-04-16 中国人民解放军国防科技大学 Automatic generation method of image annotation sentences
US11093560B2 (en) * 2018-09-21 2021-08-17 Microsoft Technology Licensing, Llc Stacked cross-modal matching
CN109543512A (en) * 2018-10-09 2019-03-29 中国科学院自动化研究所 Evaluation method for image-text summaries
CN109299746A (en) * 2018-10-22 2019-02-01 广州星唯信息科技有限公司 Segment chord similarity calculation method
CN109658938B (en) * 2018-12-07 2020-03-17 百度在线网络技术(北京)有限公司 Method, device and equipment for matching voice and text and computer readable medium
CN111428032B (en) * 2020-03-20 2024-03-29 北京小米松果电子有限公司 Content quality evaluation method and device, electronic equipment and storage medium
US11687514B2 (en) * 2020-07-15 2023-06-27 International Business Machines Corporation Multimodal table encoding for information retrieval systems
US11675822B2 (en) * 2020-07-27 2023-06-13 International Business Machines Corporation Computer generated data analysis and learning to derive multimedia factoids
US20220027578A1 (en) * 2020-07-27 2022-01-27 Nvidia Corporation Text string summarization
CN112528598B (en) * 2020-12-07 2022-04-05 上海交通大学 Automatic text abstract evaluation method based on pre-training language model and information theory
CN112800745A (en) * 2021-02-01 2021-05-14 北京明略昭辉科技有限公司 Method, device and equipment for text generation quality evaluation
WO2022213313A1 (en) 2021-04-08 2022-10-13 Citrix Systems, Inc. Intelligent collection of meeting background information
WO2023039698A1 (en) * 2021-09-14 2023-03-23 Citrix Systems, Inc. Systems and methods for accessing online meeting materials

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030154072A1 (en) * 1998-03-31 2003-08-14 Scansoft, Inc., A Delaware Corporation Call analysis
US20030061022A1 (en) * 2001-09-21 2003-03-27 Reinders James R. Display of translations in an interleaved fashion with variable spacing
US7209875B2 (en) * 2002-12-04 2007-04-24 Microsoft Corporation System and method for machine learning a confidence metric for machine translation
WO2005004370A2 (en) * 2003-06-28 2005-01-13 Geopacket Corporation Quality determination for packetized information
US7778632B2 (en) * 2005-10-28 2010-08-17 Microsoft Corporation Multi-modal device capable of automated actions
KR100995839B1 (en) * 2008-08-08 2010-11-22 주식회사 아이토비 Multi contents display system and method thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7555431B2 (en) * 1999-11-12 2009-06-30 Phoenix Solutions, Inc. Method for processing speech using dynamic grammars
CN101634996A (en) * 2009-08-13 2010-01-27 浙江大学 Individualized video sequencing method based on comprehensive consideration
CN103699591A (en) * 2013-12-11 2014-04-02 湖南大学 Page body extraction method based on sample page
CN103617158A (en) * 2013-12-17 2014-03-05 苏州大学张家港工业技术研究院 Method for generating emotion abstract of dialogue text
CN104199826A (en) * 2014-07-24 2014-12-10 北京大学 Heterogeneous media similarity calculation method and retrieval method based on correlation analysis
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"面向Web图片检索的文本和图片信息融合技术研究";尹湘舟;《中国优秀硕士学位论文全文数据库》;20111215(第S2期);正文第9-15、19-21页 *
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping; A. Karpathy, A. Joulin, F.-F. Li; Advances in Neural Information Processing Systems; 2014-06-22; pp. 1-4 *

Also Published As

Publication number Publication date
GB201616833D0 (en) 2016-11-16
AU2016238832B2 (en) 2021-02-25
GB2545051A (en) 2017-06-07
US9454524B1 (en) 2016-09-27
CN106844410A (en) 2017-06-13
AU2016238832A1 (en) 2017-06-22
DE102016011905A1 (en) 2017-06-08

Similar Documents

Publication Publication Date Title
CN106844410B (en) Determining quality of a summary of multimedia content
JP6916383B2 (en) Image question answering methods, devices, systems and storage media
CN110162593B (en) Search result processing and similarity model training method and device
US9858264B2 (en) Converting a text sentence to a series of images
WO2019200806A1 (en) Device for generating text classification model, method, and computer readable storage medium
US10783395B2 (en) Method and apparatus for detecting abnormal traffic based on convolutional autoencoder
EP3117369B1 (en) Detecting and extracting image document components to create flow document
US9275307B2 (en) Method and system for automatic selection of one or more image processing algorithm
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
US20220383108A1 (en) Information-aware graph contrastive learning
WO2019028990A1 (en) Code element naming method, device, electronic equipment and medium
US11915500B2 (en) Neural network based scene text recognition
US10417578B2 (en) Method and system for predicting requirements of a user for resources over a computer network
CN110955750A (en) Combined identification method and device for comment area and emotion polarity, and electronic equipment
CN112805715A (en) Identifying entity attribute relationships
US20180005248A1 (en) Product, operating system and topic based
Wang et al. Classification with unstructured predictors and an application to sentiment analysis
Boillet et al. Confidence estimation for object detection in document images
US20140372090A1 (en) Incremental response modeling
CN116680401A (en) Document processing method, document processing device, apparatus and storage medium
WO2023155304A1 (en) Keyword recommendation model training method and apparatus, keyword recommendation method and apparatus, device, and medium
CN111488452A (en) Webpage tampering detection method, detection system and related equipment
US20230126022A1 (en) Automatically determining table locations and table cell types
US9122705B1 (en) Scoring hash functions
CN112148902A (en) Data processing method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant