US20230130010A1 - Automated content quality control - Google Patents
- Publication number: US20230130010A1
- Application number: US17/817,798
- Authority: US (United States)
- Prior art keywords
- spectrogram
- audio content
- content
- combined
- end page
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/001—Texturing; Colouring; Generation of texture or colour
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
Definitions
- Promotional content may be generated to promote particular television content (e.g., a particular TV show).
- The promos are generated to be broadcast (e.g., during broadcast of another TV show) on a particular day or week.
- Some promos, referred to as straight promos, do not feature information indicating when and where the promoted show will be broadcast. In contrast, other promos do feature such information.
- Such information may be featured in the promo at an end page of the promo.
- An end page includes a sequence of video frames.
- The video frames indicate when and/or on what station the promoted show will be broadcast.
- The video frames may include audio providing such indication(s).
- FIG. 1 shows a screen capture 100 of a frame of an example end page of an example promo.
- The frame may be accompanied by audio (e.g., voiceover audio) that states, by way of example, “Catch Rachael tomorrow at 2 PM on NBC 10 Boston.”
- The end page of FIG. 1 may be similar to end pages of other promos that promote the same TV show.
- Such other promos may be the same as the promo of FIG. 1, except that such other end pages may feature information indicating, by way of example, a different station and/or a different time of day at which the promoted show will be broadcast.
- Such other promos and the promo of FIG. 1 may all be based on a common promo (e.g., a generic promo).
- Each promo may have been edited to feature a different end page that is customized, for example, for a target broadcast area.
- Such editing of a common promo to generate customized promos may be performed by human operators. This process can be tedious, prone to human error, and very time-consuming when performed to generate large numbers of customized promos.
- Aspects of the present disclosure are directed to comparing first audio content (e.g., of audiovisual content of a reference end page) with second audio content (e.g., of audiovisual content of a promo that includes an end page) in a more autonomous manner. Based on the comparison, it is determined whether a specific reference end page is associated with a specific promo end page. According to a further aspect, the comparison includes determining a degree to which particular audio content included in the promo end page is present in the specific reference end page. The particular audio content may include specific voiceover content that would be found in a reference end page that properly corresponds to the promo end page.
- Although aspects of the present disclosure illustrate techniques of audio comparison within the context of end pages and promos, the audio comparison techniques described may be applied more generally to any audio files and contexts in order to compare the audio content between two sources.
- A method for comparing a first audio content with a second audio content includes: obtaining a first spectrogram representing the first audio content; obtaining a second spectrogram representing the second audio content; generating a combined spectrogram based on the first spectrogram and the second spectrogram; and determining whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
- A machine-readable non-transitory medium has stored thereon machine-executable instructions for comparing a first audio content with a second audio content.
- The instructions include: obtaining a first spectrogram representing the first audio content; obtaining a second spectrogram representing the second audio content; generating a combined spectrogram based on the first spectrogram and the second spectrogram; and determining whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
- An apparatus for comparing a first audio content with a second audio content includes: a network communication unit configured to transmit and receive data; and one or more controllers.
- The one or more controllers are configured to: obtain a first spectrogram representing the first audio content; obtain a second spectrogram representing the second audio content; generate a combined spectrogram based on the first spectrogram and the second spectrogram; and determine whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
- FIG. 1 shows a screen capture of a frame of an example end page.
- FIG. 2 illustrates example naming conventions that may be adopted in generating names of reference end pages and promo end pages.
- FIG. 3 illustrates a flow diagram of generating reference information according to at least one embodiment.
- FIG. 4 illustrates a flow diagram of a process (e.g., quality control process) that includes comparing at least one reference end page with a promo end page according to at least one embodiment.
- FIG. 5 illustrates calculation of a structural similarity index measure (SSIM) index between windows x and y having common dimensions.
- FIGS. 6(a), 6(b) and 6(c) illustrate example spectrograms of audio content of a reference end page.
- FIG. 7 illustrates an example of aligning a spectrogram of audio content in the reference end page with a spectrogram of audio content in the promo end page according to at least one embodiment.
- FIGS. 8(a) and 8(b) illustrate examples of combined spectrograms.
- FIGS. 9(a), 9(b), 9(c), 9(d) and 9(e) illustrate examples of combined spectrograms.
- FIG. 10 illustrates a flowchart of a method of comparing audio content according to at least one embodiment.
- FIG. 11 is an illustration of a computing environment according to at least one embodiment.
- FIG. 12 is a block diagram of a device according to at least one embodiment.
- One or more human operators may edit a common promo to generate customized promos. For example, upon receiving a common promo, one or more editors may identify (e.g., from among a set of reference end pages) a reference end page that is associated with the received common promo.
- The set of reference end pages is used to classify an incoming promo.
- The reference end pages embody information indicating a show with which the promo is to be broadcast, a station (e.g., network affiliate) on which the promo is to be broadcast, the day and time of the broadcast, etc.
- FIG. 2 illustrates example naming conventions that may be adopted in generating a name (e.g., file name) of a reference end page or a promo end page.
- A generated name 200 is composed of multiple fields. For example, one field carries information indicating a city in which the (associated) promo is to be broadcast. As another example, another field carries information indicating the show with which the promo is to be broadcast. As yet another example, another field carries information indicating a day on which the promo is to be broadcast (e.g., tomorrow, today, etc.). As such, the name of a reference end page may be generated to carry such information.
- The name of a promo may likewise be generated to carry such information. Accordingly, analyzing the name of a promo may be utilized to validate that an end page (e.g., an identified reference end page) that is (or has been) edited into a promo corresponds to the promo.
- For example, analyzing the name of the promo may be used to validate that an identified reference end page includes the frame illustrated in the screen capture 100. As illustrated in FIG. 1, the frame illustrated in the screen capture 100 promotes that a particular show is to be broadcast the following day (i.e., “TOMORROW”), and not on the current day (i.e., “TODAY”).
- Aspects of the present disclosure are directed to validating that an end page that is (or has been) edited into a promo correctly corresponds to a given promo (e.g., correctly corresponds to the target of a given promo).
- One or more embodiments are directed to verifying that an end page that is (or has been) appended to a promo is correctly associated with a given promo based on comparing aspects of the reference end page with aspects of the promo end page. For example, it is verified that the reference end page is associated with a corresponding TV show or program, has an appropriate length with respect to time, and/or corresponds to announcing broadcast of a TV show or program on a particular day, time and/or network. Further by way of example, it is determined whether audio content included in the promo end page is present in the reference end page. The audio content may include specific voiceover content that would be found in a reference end page that properly corresponds to the promo end page.
- FIG. 3 illustrates a flow diagram of generating reference information according to at least one embodiment.
- The reference information includes hashes corresponding to respective reference end pages and/or thumbnail images of one or more frames of each reference end page.
- Technical specifications may be validated for the reference end page; subsequently, the reference end page may be hashed (e.g., using a perceptual hash), and a fingerprint of the hash for various reference end pages may be saved to facilitate fast matching.
- A thumbnail image sequence of the reference end page may be exported for fine-grained SSIM comparisons later.
- The reference information may also include spectrogram information corresponding to respective end pages.
- The reference end pages may be analyzed to synthesize a single reference end page from overlapping sequence fragments.
- Unique MATID labels are mapped to the synthesized sequence.
- Normalized audio files are exported and named by their MD5 hashes to reduce duplication, and spectrograms may be generated from the exported audio.
- The model may include MATID labels, file locations, and information about the scale of the reference spectrograms, the longest and shortest reference sequences, and any other data required to ensure that the promos are preprocessed in the same manner as the reference material.
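The MD5-based naming of exported audio described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `content_name` and the `.wav` extension are assumptions.

```python
import hashlib

def content_name(audio_bytes: bytes) -> str:
    """Name an exported, normalized audio file by its MD5 digest.

    Byte-identical exports hash to the same digest, so duplicates
    collapse to a single file name (hypothetical .wav extension).
    """
    return hashlib.md5(audio_bytes).hexdigest() + ".wav"

a = content_name(b"normalized audio payload")
b = content_name(b"normalized audio payload")  # same bytes -> same name
print(a == b)  # True
```

Because the name is derived purely from content, re-exporting the same normalized audio never creates a second copy.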
- The reference information may be used in a process (e.g., a quality control process) that is performed for a promo end page.
- Generating reference information may occur periodically, e.g., weekly or once every six months.
- Reference information may be generated based on a most recent batch of reference end pages and/or a group of reference end pages that are known.
- FIG. 4 illustrates a flow diagram of a quality control process that includes comparing at least one reference end page with a promo end page according to at least one embodiment.
- The process may include a video validation 410 and/or an audio validation 450. If the process includes performing both validations 410 and 450, the video validation 410 and the audio validation 450 may be performed independently of each other, simultaneously, or in series. Although FIG. 4 illustrates the video validation 410 as occurring before the audio validation 450, that order may be switched, such that the audio validation 450 is performed before the video validation 410.
- In some situations, a particular validation (e.g., the video validation 410) may be performed while other validations (e.g., the audio validation 450) are omitted for purposes of saving time and/or reducing effort.
- The video validation 410 will now be described in more detail with reference to at least one embodiment.
- Certain technical specifications (or parameters) of the promo may be validated against corresponding specifications of a reference end page, to determine whether the specifications are aligned.
- Such technical specifications may include frames per second (FPS), pixel resolution, audio frequency, etc.
- Software tools may be used to identify metadata such as the frame rate, dimensions, etc. of the promo.
- Technical specifications (or parameters) of a reference end page may likewise be validated against corresponding specifications of the promo.
- A hash of the promo is generated.
- The hash may be generated based on perceptual hashing. Perceptual hashing is used to determine whether features of particular pieces of multimedia content are similar, e.g., based on image brightness values of individual pixels.
- Hashes for the last N frames of the promo may be generated, where N denotes an integer that is equal to or greater than 1.
- The last N frames may correspond to a length of time.
- N may correspond to the lesser of (1) the number of frames in the longest reference sequence or (2) the number of frames in the promo. For example, it may be determined that the longest acceptable length of an end page may be around 8 seconds. In this situation, if the total length of a promo is 30 seconds, then N may be equal to the number of frames in the last 8 seconds of the promo, which is where the end page is located.
- N may be equal to the total number of frames in the promo. For example, if the total length of the promo is 4 seconds (and is, therefore, shorter than the longest acceptable length of 8 seconds), then N may be equal to the total number of frames in the 4-second promo.
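The selection of N described above can be sketched as follows. The function and parameter names are hypothetical, and a fixed frame rate is assumed for converting lengths in seconds to frame counts.

```python
def trailing_frame_count(promo_seconds: float,
                         longest_ref_seconds: float,
                         fps: float) -> int:
    """N is the lesser of (1) the frame count of the longest reference
    sequence and (2) the total frame count of the promo."""
    longest_ref_frames = int(longest_ref_seconds * fps)
    promo_frames = int(promo_seconds * fps)
    return min(longest_ref_frames, promo_frames)

# 30 s promo, 8 s longest reference end page, 30 fps:
# hash only the last 240 frames, where the end page is located.
print(trailing_frame_count(30, 8, 30))  # 240

# 4 s promo is shorter than the 8 s reference, so hash all 120 frames.
print(trailing_frame_count(4, 8, 30))   # 120
```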
- Hashes of the reference end pages may be generated in a similar manner. Accordingly, a hash of the promo may be compared against a hash of a reference end page, as will be described in more detail below.
- The promo is compared with at least one reference end page based on information described earlier, e.g., hash information. For example, at block 413, a hash of the last frame of the promo may be retrieved. Then, all reference end pages that have a hash that is within a particular Hamming distance threshold (relative to the hash of the last frame of the promo) are identified. For each reference end page that meets such a threshold, a finer-grained analysis may then be performed (see block 414).
- It is determined whether each of a number of ordered frames (e.g., N ordered frames) of the promo end page is sufficiently similar to a corresponding frame of the reference end page.
- The degree of similarity may be based on Hamming distance. For a given pair of end pages, it may be determined that the two end pages are sufficiently similar if, for each of the N frames, the difference between respective hashes does not exceed the Hamming distance threshold.
- The Hamming distance threshold may be an integer between 0 and 3, inclusive.
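The brightness-based hashing and Hamming-distance comparison above can be illustrated with a minimal average-hash (aHash) sketch. This is a common stand-in for perceptual hashing, not necessarily the patent's exact algorithm; the tiny 2x2 "frames" are illustrative.

```python
def average_hash(pixels):
    """Hash a small grayscale frame: one bit per pixel, set when the
    pixel is brighter than the frame's mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

frame_a = [[10, 200], [220, 30]]
frame_b = [[12, 198], [225, 10]]  # visually similar frame
ha, hb = average_hash(frame_a), average_hash(frame_b)
# With a threshold of 0-3 inclusive, these frames count as a match.
print(hamming_distance(ha, hb) <= 3)  # True
```

Because the bits encode only above/below-mean brightness, small pixel-level changes leave the hash unchanged, which is what makes the coarse search fast.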
- The coarser-grained analysis of block 413 may result in identification of one or more reference end pages that potentially match the promo end page.
- The search space of potential matches is likely reduced (or narrowed) based on perceptual hashing.
- A finer-grained analysis is then performed based on an accordingly smaller number of reference end pages.
- A finer-grained analysis is performed to further measure the similarity between respective frames of end pages (e.g., respective frames of a promo end page and a reference end page identified at block 413).
- The analysis of block 414 is based on a structural similarity index measure (SSIM).
- Such matching of reference end pages and promo end pages may thus involve a fast, coarse-grained search for near matches to reduce the search space, followed by a fine-grained framewise comparison (e.g., based on SSIM).
- FIG. 5 illustrates calculation of an SSIM index between two windows x and y having common dimensions.
- The window x may correspond to a frame of a promo end page, and the window y may correspond to a respective frame of a reference end page identified at block 413 of FIG. 4.
- The calculation of FIG. 5 may be applied on luma, on color (e.g., RGB) values, or on chromatic (e.g., YCbCr) values.
- The resultant SSIM index is a decimal value between 0 and 1, where a value of 1 corresponds to a case of two identical sets of data and therefore indicates perfect structural similarity. In contrast, a value of 0 indicates no structural similarity.
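As an illustration of the SSIM index calculation of FIG. 5, the widely published single-window SSIM formula (with its customary default constants) can be computed as follows. This is offered as a sketch of the standard definition, not as the patent's exact calculation.

```python
def ssim(x, y, L=255):
    """Standard SSIM between two equally sized grayscale windows,
    given as flat lists of pixel values with dynamic range L."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((p - mu_x) ** 2 for p in x) / n
    var_y = sum((p - mu_y) ** 2 for p in y) / n
    cov = sum((px - mu_x) * (py - mu_y) for px, py in zip(x, y)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    return (((2 * mu_x * mu_y + c1) * (2 * cov + c2)) /
            ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))

w = [52, 55, 61, 59, 70, 61, 76, 61]
print(ssim(w, w))  # identical windows -> 1.0
```

Identical windows score exactly 1, while any luminance, contrast, or structural difference pulls the index below 1, matching the value range described above.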
- A reference end page is determined to be sufficiently similar to a promo end page if the SSIM-based criterion described above is met for at least a threshold percentage of pairs of frames. For example, if only a particular number of pairs of frames fail to satisfy the SSIM-based criterion and all other pairs of frames do satisfy the criterion, then the reference end page is determined to be sufficiently similar to the promo end page.
- Such a determination is presented in the following pseudocode, in which:
- R denotes a sorted list of SSIM values, from lowest to highest;
- V denotes the value of a single item in R;
- T denotes the minimum value of V to be considered a match;
- N denotes the number of frames that are allowed to be below T in absolute value;
- R[:N] denotes all items in the list between 0 and N−1; and
- R[N:] denotes all items between N and the end of the list.
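The match rule those variables describe can be sketched as follows: after sorting the framewise SSIM values, the N lowest values are allowed to fall below T, and every remaining value must meet T. The specific defaults T=0.95 and N=2 are illustrative assumptions, not values from the patent.

```python
def pages_match(ssim_values, T=0.95, N=2):
    """Return True when, ignoring the N lowest framewise SSIM values,
    every remaining value meets the match threshold T."""
    R = sorted(ssim_values)            # lowest to highest
    # R[:N] are the frames allowed below T; R[N:] must all meet T.
    return all(V >= T for V in R[N:])

# Ten frames, two of which fall below T -> still a match.
print(pages_match([0.99, 0.97, 0.62, 0.98, 0.99,
                   0.88, 0.97, 0.99, 0.96, 0.98]))  # True

# Three below-threshold frames exceed the allowance N=2 -> no match.
print(pages_match([0.99, 0.97, 0.62, 0.98, 0.99,
                   0.88, 0.97, 0.80, 0.96, 0.98]))  # False
```

Sorting first means the rule tolerates a fixed number of bad frames wherever they occur in the sequence, e.g., a brief dissolve or overlay.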
- The name of the promo end page is analyzed (e.g., with respect to the name(s) of one or more reference end pages identified at block 414). For example, it is determined whether the fields of the name of the promo end page (see FIG. 2) are consistent with the names of the identified reference end pages.
- The quality control process of FIG. 4 may include performing the audio validation 450.
- The audio validation 450 may be performed after the video validation 410.
- Alternatively, the video validation 410 and the audio validation 450 may be performed independently (or irrespective) of each other.
- For example, the validations 410 and 450 can be performed in parallel.
- A report is generated.
- The report may be for storage in a “pass” folder (or directory) if the promo meets all criteria described earlier.
- Otherwise, the report may be for storage in a quarantine folder and, therefore, flagged for subsequent review (e.g., human review).
- The report may be generated for monitoring/debugging, and the promo may be moved either to the pass folder or to the quarantine folder.
- The label for the reference end page may include the MATID data from block 430 that may be used for filename verification and as a pointer to associated audio data, if any.
- The audio validation 450 will now be described in more detail according to at least one embodiment.
- In the audio validation, it is determined whether audio content (e.g., spectral content of an audio signal) of a promo end page matches corresponding audio content of a reference end page.
- For example, the audio content of the promo end page may include voiceover audio.
- The voiceover audio may be similar to that which was described earlier with reference to FIG. 1 (e.g., “Catch Rachael tomorrow at 2 PM on NBC 10 Boston”).
- For purposes of illustration, examples will be described with reference to audio signals that are stereo signals, in that each given audio signal carries two individual channels (e.g., a left channel and a right channel).
- FIGS. 6(a), 6(b) and 6(c) illustrate example spectrograms of audio content in the reference end page.
- Each spectrogram is a visual representation of the spectrum of frequencies (see vertical axis) in one or more channels of the audio signal as the signal varies over time (see horizontal axis). More particularly, FIG. 6(a) illustrates a spectrogram 610 of the left channel of the audio signal in the reference end page.
- FIG. 6(c) illustrates a spectrogram 630 of the right channel of the audio signal in the reference end page.
- FIG. 6(b) illustrates a spectrogram 620 of the merged left and right channels of the audio signal in the reference end page. In an aspect, the left and right channels are merged by mixing the channels together to produce a mono audio track.
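The channel merge and spectrogram generation above can be sketched with a minimal short-time Fourier transform. Averaging the two channels is one common way to mix to mono, and the window/hop sizes are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

def spectrogram(signal, win=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are
    time frames, matching the axes of the spectrogram figures."""
    frames = [signal[i:i + win] * np.hanning(win)
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 8000
t = np.arange(sr) / sr
left = np.sin(2 * np.pi * 440 * t)    # 440 Hz tone on the left channel
right = np.sin(2 * np.pi * 880 * t)   # 880 Hz tone on the right channel
mono = (left + right) / 2.0           # mix channels to a mono track
S = spectrogram(mono)
print(S.shape)                        # (frequency bins, time frames)
```

Each column of `S` is the frequency content of one short window, so horizontal position corresponds to time and vertical position to frequency, exactly as in FIGS. 6(a)-6(c).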
- Spectrograms similar to those illustrated in FIGS. 6(a), 6(b) and 6(c) are also obtained for audio content in the promo end page.
- For example, a spectrogram 720 of merged left and right channels of an audio signal in the promo end page is illustrated in FIG. 7.
- FIG. 7 illustrates an example of aligning a spectrogram of audio content in the reference end page with a spectrogram of audio content in the promo end page according to at least one embodiment.
- An attempt is made to align the spectrogram 620 of audio content (merged left and right channels) in the reference end page with the spectrogram 720 of the promo end page.
- The alignment is performed to obtain an approximate match (e.g., a closest match) between the spectrogram 620 and a segment of the spectrogram 720. If the alignment succeeds, then further analysis may be performed based on the spectrogram 620, corresponding to the reference end page, and the matching segment of the spectrogram 720, corresponding to the promo end page.
- For example, the spectrogram 620 may be effectively positioned (or shifted) along the horizontal axis with respect to the spectrogram 720, to obtain a best (or closest) match in spectral content between the spectrograms 620 and 720. Once a best match is obtained, one or more parameters may be captured to record a location (or positioning) of the alignment.
- In an aspect, the spectrograms 620 and 720 are aligned by calculating a homography that maps the spectrogram 620 to the segment of the spectrogram 720.
- One or more parameters may be captured to record a horizontal offset of a positioning of the spectrogram 620 with respect to (e.g., within the bounds of) the spectrogram 720.
- The parameters may include a number in the upper right of the homography matrix (i.e., the horizontal translation component).
- Such an offset may then be used to effectively crop spectrograms of the promo end page.
- For example, such an offset may be applied to a spectrogram of the left channel of the audio signal of the promo end page, to crop that spectrogram.
- Similarly, the offset may be applied to a spectrogram of the right channel of the audio signal of the promo end page.
- In this manner, the spectrogram of the left channel and the spectrogram of the right channel of the promo end page are cropped to have lengths in time that correspond to the lengths of the spectrograms 610 and 630 (see FIGS. 6(a) and 6(c)).
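The shift-and-crop alignment above can be sketched as follows. As a simple stand-in for the homography-based alignment, this sketch slides the shorter reference spectrogram along the promo spectrogram's time axis and scores each offset with a sum of absolute differences; all array shapes and names are illustrative.

```python
import numpy as np

def best_offset(ref, promo):
    """ref, promo: 2-D arrays (frequency bins x time frames) with the
    same number of bins. Returns the time offset (in frames) where the
    reference best matches a segment of the promo."""
    width = ref.shape[1]
    scores = [np.abs(promo[:, off:off + width] - ref).sum()
              for off in range(promo.shape[1] - width + 1)]
    return int(np.argmin(scores))

rng = np.random.default_rng(0)
ref = rng.random((64, 20))            # reference end page spectrogram
promo = rng.random((64, 50))          # longer promo spectrogram
promo[:, 17:37] = ref                 # reference audio embedded at offset 17
off = best_offset(ref, promo)
cropped = promo[:, off:off + ref.shape[1]]  # crop promo to reference length
print(off)  # 17
```

The recovered offset plays the role of the translation parameter described above: once captured, it crops each promo channel's spectrogram to the reference length.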
- Spectrograms of the promo end page are combined with corresponding spectrograms of the reference end page.
- Spectrograms are combined by putting one spectrogram on top of another to produce a combined spectrogram corresponding to a new audio track with two channels, each channel occupying a separate color channel in the spectrogram image.
- The cropped spectrogram of the left channel of the promo end page is combined with the spectrogram 610 of the left channel of the audio signal in the reference end page in FIG. 6(a).
- For example, the cropped spectrogram of the left channel of the promo end page may be effectively overlaid (or superimposed) over the spectrogram 610.
- The overlaying produces a combined spectrogram (e.g., combined spectrogram 810 of FIG. 8(a)).
- Similarly, the cropped spectrogram of the right channel of the promo end page is combined with the spectrogram 630 of the right channel of the audio signal in the reference end page of FIG. 6(c).
- For example, the cropped spectrogram of the right channel of the promo end page may be effectively overlaid (or superimposed) over the spectrogram 630.
- The overlaying produces a combined spectrogram (e.g., combined spectrogram 830 of FIG. 8(b)).
- Coloring is applied to the individual spectrograms before overlaying, such that the combination produces a spectrogram that is generated as a color image.
- The combined spectrogram 810 is produced by combining the cropped spectrogram of the left channel of the promo end page with the spectrogram 610 of FIG. 6(a).
- Coloring is applied to the cropped spectrogram of the left channel of the promo end page, such that all audio content in the spectrogram is placed in a first color channel (e.g., a green channel).
- As such, all spectral content that arises from audio in the cropped spectrogram of the left channel of the promo end page is represented using the color green.
- The represented audio may include both voiceover audio and background music, as well as other types of audio content.
- A different coloring is applied to the spectrogram 610, which corresponds to the left channel of the reference end page.
- All audio content in the spectrogram 610 is placed in a second color channel that is different from the first color channel noted above.
- For example, the second color channel may be a red channel.
- As such, all spectral content that arises from audio in the spectrogram 610 is represented using the color red.
- The represented audio typically includes voiceover audio but not background music, because the reference end page includes voiceover audio but does not include background music.
- The application of coloring to the individual spectrograms results in potentially combined coloring in the combined spectrogram 810.
- The combined coloring may be utilized to identify areas of alignment (or, conversely, non-alignment) between the reference end page and the promo end page. In more detail, it may be determined whether audio content (e.g., voiceover audio) in the reference end page is also present in the promo end page.
- Regions of a combined color (e.g., yellow) may appear where audio content in the two spectrograms aligns. Yellow is the combined color because the colors red and green combine to produce the color yellow.
- the yellow-colored regions result from voiceover audio in the cropped spectrogram of the left channel of the promo end page (represented using the color green) being effectively overlaid or superimposed over matching voiceover audio in the spectrogram 610 (represented using the color red).
- the region 812 is an example of a region where voiceover audio in the cropped spectrogram of the left channel of the promo end page and voiceover audio in the spectrogram 610 align to appear as a yellow-colored region.
- regions of the first color may appear in the RGB image of the combined spectrogram 810 .
- spectral content that arises from background music in the promo end page is represented using the color green.
- the background music may be unique to this specific promo end page, in that different promo end pages may feature different background music and the reference end page does not feature any background music.
- the region 814 is an example of a region where background music in the cropped spectrogram of the left channel of the promo end page does not align with any audio content in the spectrogram 610 . Accordingly, the corresponding, green-colored region does not overlap with a red-color region in the spectrogram 610 , and, therefore, remains green in the combined spectrogram 810 .
- regions of the second color may appear in the RGB image of the combined spectrogram (e.g., combined spectrogram 810 ).
- spectral content that arises from all audio content in the reference end page is represented using the color red.
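The channel-based coloring described above can be sketched in a few lines of NumPy. This is an illustrative sketch only; the function name, and the assumption that both spectrograms are magnitude arrays normalized to [0, 1] and already cropped to the same shape, are not from the publication:

```python
import numpy as np

def combine_spectrograms(ref_spec, promo_spec):
    """Place the reference spectrogram in the red channel and the promo
    spectrogram in the green channel of one RGB image. Where both
    contain energy, red and green combine toward yellow."""
    assert ref_spec.shape == promo_spec.shape
    rgb = np.zeros(ref_spec.shape + (3,))
    rgb[..., 0] = ref_spec    # red channel: reference end page audio
    rgb[..., 1] = promo_spec  # green channel: promo end page audio
    return rgb

# Toy 1x3 "spectrograms": energy in both inputs yields a yellow pixel.
ref = np.array([[1.0, 1.0, 0.0]])
promo = np.array([[1.0, 0.0, 1.0]])
img = combine_spectrograms(ref, promo)
print(img[0, 0])  # [1. 1. 0.] -> yellow: aligned audio
print(img[0, 1])  # [1. 0. 0.] -> red: reference-only audio
print(img[0, 2])  # [0. 1. 0.] -> green: promo-only audio
```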
- a corresponding time range of the misalignment is recorded. For example, one or more timestamps marking a beginning (or start) and/or an end of misalignment with respect to time may be recorded. Information including such timestamps may be provided.
- FIG. 9 ( a ) illustrates an example of misalignment at a beginning of the end pages.
- voiceover audio that is present at a beginning of the reference end page is not present at a beginning of the promo end page. Accordingly, a red-colored region appears at a left (starting) area of the combined spectrogram.
- FIG. 9 ( b ) illustrates an example of misalignment at (or around) a middle of the end pages.
- voiceover audio that is present at a middle of the reference end page is not present at a middle of the promo end page. Accordingly, a red-colored region appears at a center area of the combined spectrogram.
- FIG. 9 ( c ) illustrates an example of misalignment at an end of the end pages.
- voiceover audio that is present at an end of the reference end page is not present at the end of the promo end page. Accordingly, a red-colored region appears at a right (ending) area of the combined spectrogram.
- FIG. 9 ( d ) illustrates an example of a complete (or near complete) misalignment between the end pages over time.
- voiceover audio that is present in the reference end page is simply not present in the promo end page. Accordingly, a red-colored region appears throughout the combined spectrogram.
- FIG. 9 ( e ) illustrates an example of isolated (or scattered) misalignment between the end pages over time.
- voiceover audio in the promo end page may not fully match voiceover audio in the reference end page. Accordingly, scattered red-colored regions appear across the combined spectrogram.
- one or more tools based on machine learning may be utilized to determine whether the reference end page passes or fails with respect to the promo end page. The determination may be based, at least in part, on the presence of red-colored regions in the combined spectrogram. For example, if the (combined) size of red-colored regions is under a particular threshold size, then it may be determined that the reference end page sufficiently matches the promo end page. Otherwise, it may be determined that the reference end page does not sufficiently match the promo end page.
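The threshold test described above might be sketched as follows. The function name, the pixel energy threshold, and the 2% maximum red fraction are hypothetical placeholders (the publication does not specify values), and an actual embodiment may use machine-learning tools rather than a fixed threshold:

```python
import numpy as np

def passes_audio_validation(combined_rgb, energy_thresh=0.5, max_red_fraction=0.02):
    """Flag misalignment in a combined RGB spectrogram: pixels with
    strong red (reference audio) but weak green (no matching promo
    audio) indicate reference content missing from the promo."""
    red = combined_rgb[..., 0] > energy_thresh
    green = combined_rgb[..., 1] > energy_thresh
    red_only = red & ~green
    red_fraction = float(red_only.mean())
    return red_fraction <= max_red_fraction, red_fraction

# A combined spectrogram that is entirely yellow (fully aligned) passes.
aligned = np.ones((4, 4, 3)) * np.array([1.0, 1.0, 0.0])
ok, frac = passes_audio_validation(aligned)
print(ok, frac)  # True 0.0
```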
- automatic speech recognition may be used to eliminate (or reduce) false positives that may arise.
- audio content in promo end pages may intentionally be sped up or modified slightly to meet on-air requirements. Such changes to the audio content may result in identification of areas of misalignment (or non-alignment) during the audio validation that has been described herein with reference to one or more embodiments.
- ASR-based tools may be used to confirm that the voiceover audio in the reference end page is identical (e.g., in substance) to the voiceover audio in the promo end page.
- ASR-based tools may be used to confirm that the substance of the voiceover audio in the reference end page matches that of the voiceover audio in the promo end page, which states “Catch Rachael tomorrow at 2 PM on NBC 10 Boston” (see the example described earlier with reference to FIG. 1 ). Accordingly, the number of false positives may be reduced.
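A minimal sketch of the transcript-level comparison follows. It assumes the ASR step has already produced plain-text transcripts; the normalization and function names are illustrative, not from the publication:

```python
import re

def normalize(transcript):
    """Lowercase and strip punctuation so that pacing or punctuation
    differences do not affect the substance comparison."""
    return re.sub(r"[^a-z0-9 ]", "", transcript.lower()).split()

def same_voiceover(ref_text, promo_text):
    """True when the substance of the two transcripts matches, even if
    the promo audio was sped up to meet on-air timing requirements."""
    return normalize(ref_text) == normalize(promo_text)

ref = "Catch Rachael tomorrow at 2 PM on NBC 10 Boston."
promo = "catch rachael tomorrow at 2 pm on nbc 10 boston"
print(same_voiceover(ref, promo))  # True
```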
- coloring aspects in the combined spectrogram 830 of FIG. 8 ( b ) are similar to those described earlier with reference to the combined spectrogram 810 of FIG. 8 ( a ) , as well as those of FIGS. 9 ( a ), 9 ( b ), 9 ( c ), 9 ( d ) and 9 ( e ) . Accordingly, for purposes of brevity, the coloring aspects in the combined spectrogram 830 of FIG. 8 ( b ) will not be described in more detail below. Further, although red, yellow, and green were chosen in the examples to provide color to the audio channels and to the combined spectrograms, other colors may be used, and the techniques are not limited to a particular color scheme.
- one or more embodiments are directed to comparing aspects of a reference end page and aspects of a promo end page.
- the aspects may relate to video content of the end pages.
- the aspects may relate to audio content of the end pages.
- it is determined whether specific audio content (e.g., voiceover content) that is present in the reference end page is also present in the promo end page.
- the specific audio content may be audio content that is not language-based, e.g., specific tone-based content such as a sequence of chimes or musical tones.
- features described herein may be utilized to determine whether an audio layout of the reference end page sufficiently matches an audio layout of the promo end page.
- the audio layout may relate to a balance between left and right channels.
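The left/right balance check might be sketched as an RMS-energy comparison; the function names and tolerance are assumptions, not from the publication:

```python
import numpy as np

def channel_balance(left, right, eps=1e-9):
    """Ratio of RMS energy in the left channel to RMS energy in the
    right channel; a value near 1.0 indicates a balanced layout."""
    rms_left = np.sqrt(np.mean(left ** 2))
    rms_right = np.sqrt(np.mean(right ** 2))
    return float(rms_left / (rms_right + eps))

def layouts_match(ref_lr, promo_lr, tolerance=0.1):
    """Compare the left/right balance of the reference end page's audio
    with that of the promo end page's audio."""
    return abs(channel_balance(*ref_lr) - channel_balance(*promo_lr)) <= tolerance

t = np.linspace(0.0, 1.0, 1000)
sig = np.sin(2 * np.pi * 440 * t)
# Both stereo signals are center-balanced, so the layouts match.
print(layouts_match((sig, sig), (0.5 * sig, 0.5 * sig)))  # True
```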
- particular audio content may be isolated within a larger audio scape (e.g., a promo end page that includes not only the voiceover content but also other forms of audio content such as background music).
- comparison of the promo end page with a reference end page that includes no background music is facilitated.
- Such a feature serves to distinguish embodiments described herein from an approach that is based merely on analysis of raw audio bytes and that does not serve to isolate a specific type of audio content (e.g., voiceover content) from a different type of audio content (e.g., background music).
- features described herein are distinguishable from approaches that determine audio similarity, for example, based on an audio “fingerprint” that records audio frequencies having largest energies at respective points in time. Such approaches do not utilize, for example, analysis of RGB images such as those described earlier with reference to combined spectrograms 810 and 830 .
- FIG. 10 illustrates a flowchart of a method 1000 of comparing a first audio content with a second audio content according to at least one embodiment.
- a first spectrogram representing the first audio content is obtained.
- the first audio content may be part of a first audiovisual content that includes a reference end page.
- a spectrogram 610 of a left channel of an audio signal in a reference end page is obtained.
- a spectrogram 630 of a right channel of the audio signal in the reference end page is obtained.
- a second spectrogram representing the second audio content is obtained.
- the second audio content may be part of a second audiovisual content that includes a promo end page.
- an offset may be applied to a spectrogram of a left channel (or a right channel) of an audio signal of a promo end page, to crop the spectrogram.
- the offset may be applied to a spectrogram of the right channel of the audio signal of the promo end page.
- the spectrogram of the left channel and the spectrogram of the right channel of the promo end page are cropped to have lengths in time that correspond to the lengths of spectrograms 610 and 630 (see FIGS. 6 ( a ) and 6 ( c ) ).
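The cropping step can be sketched as a simple slice along the time axis; the names and the axis convention (frequency along axis 0, time along axis 1) are assumptions:

```python
import numpy as np

def crop_to_reference(promo_spec, offset, ref_length):
    """Crop a promo-channel spectrogram (2-D array, frequency x time)
    starting at the aligned time offset so that its length in time
    matches the reference spectrogram's length."""
    return promo_spec[:, offset:offset + ref_length]

promo = np.arange(40).reshape(4, 10)   # 4 frequency bins, 10 time frames
cropped = crop_to_reference(promo, offset=3, ref_length=5)
print(cropped.shape)  # (4, 5)
```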
- the second spectrogram is obtained based on a homography that maps the first spectrogram to a segment of a spectrogram representing the second audio content.
- a homography that maps the spectrogram 620 to a segment of the spectrogram 720 is calculated.
- obtaining the second spectrogram includes aligning the first spectrogram to obtain an approximate match between the first spectrogram and a segment of a spectrogram representing the second audio content.
- an attempt is made to align the spectrogram 620 of audio content (merged left and right channels) in the reference end page.
- the alignment is performed to obtain an approximate match (e.g., closest match) between the spectrogram 620 and a segment of the spectrogram 720 .
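One simple way to realize such an alignment search is a sliding-window comparison that scores every candidate time offset and keeps the closest match. This is a sketch of the general idea, not the publication's homography-based method:

```python
import numpy as np

def best_alignment_offset(ref_spec, promo_spec):
    """Slide the (shorter) reference spectrogram across the promo
    spectrogram and return the time offset with the smallest mean
    squared difference -- the approximate, closest match."""
    ref_len = ref_spec.shape[1]
    n_offsets = promo_spec.shape[1] - ref_len + 1
    errors = [np.mean((promo_spec[:, k:k + ref_len] - ref_spec) ** 2)
              for k in range(n_offsets)]
    return int(np.argmin(errors))

rng = np.random.default_rng(0)
promo = rng.random((8, 50))
ref = promo[:, 12:32].copy()      # embed the reference at offset 12
print(best_alignment_offset(ref, promo))  # 12
```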
- a first coloring may be applied to the first spectrogram.
- all audio content in the spectrogram 610 is placed in a particular color channel (e.g., a red channel).
- a second coloring may be applied to the second spectrogram.
- coloring is applied to the cropped spectrogram of the left channel of the promo end page, such that all audio content in the spectrogram is placed in a particular color channel (e.g., a green channel).
- a combined spectrogram is generated based on the first spectrogram and the second spectrogram.
- generating the combined spectrogram includes superimposing one of the first spectrogram or the second spectrogram over the other.
- the cropped spectrogram of the left channel of the promo end page may be effectively overlaid (or superimposed) over the spectrogram 610 .
- the overlaying produces a combined spectrogram (e.g., combined spectrogram 810 of FIG. 8 ( a ) ).
- determining whether the first audio content is misaligned with respect to the second audio content is based on a presence of a third coloring in the combined spectrogram, the third coloring corresponding to a combination of the first coloring and the second coloring.
- regions of a combined color (e.g., yellow) would appear in the RGB image of the combined spectrogram 810 .
- yellow is the combined color because the colors red and green combine to produce the color yellow.
- determining whether the first audio content is misaligned with respect to the second audio content includes identifying a misalignment between the first audio content and the second audio content, and recording a corresponding time range of the misalignment.
- a time range of the misalignment (e.g., a range in time over which the misalignment occurs) may be recorded.
- the time range of the misalignment may be used to calculate a percentage of misalignment based on the time range of the spectrogram.
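The percentage calculation can be sketched directly from the recorded time ranges; the function name is illustrative:

```python
def misalignment_percentage(misaligned_ranges, total_duration):
    """Sum the recorded (start, end) time ranges of misalignment and
    express them as a percentage of the spectrogram's total duration."""
    misaligned = sum(end - start for start, end in misaligned_ranges)
    return 100.0 * misaligned / total_duration

# Two misaligned ranges totaling 3 seconds out of a 15-second end page.
print(misalignment_percentage([(0.0, 1.0), (7.5, 9.5)], 15.0))  # 20.0
```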
- video content of the first audiovisual content may be compared with video content of the second audiovisual content.
- comparing the video content of the first audiovisual content with the video content of the second audiovisual content is based on perceptual hashing.
- a video validation 410 may include generating a hash of the promo.
- the hash may be generated based on perceptual hashing.
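As an illustration of the general idea of perceptual hashing (the publication does not disclose which perceptual hash is used), a toy average hash can be sketched as follows:

```python
import numpy as np

def average_hash(frame, hash_size=8):
    """Toy perceptual 'average hash' of a grayscale frame: downscale by
    block-averaging, then record which cells exceed the overall mean."""
    h, w = frame.shape
    bh, bw = h // hash_size, w // hash_size
    blocks = frame[:bh * hash_size, :bw * hash_size] \
        .reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()

def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small distances mean similar frames."""
    return int(np.count_nonzero(hash_a != hash_b))

rng = np.random.default_rng(1)
frame = rng.random((64, 64))
brighter = frame + 0.1  # a uniform brightness shift leaves the hash unchanged
print(hamming_distance(average_hash(frame), average_hash(brighter)))  # 0
```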
- features described herein, or other aspects of the disclosure may be implemented and/or performed at one or more software or hardware computer systems which may further include (or may be operably coupled to) one or more hardware memory systems for storing information including databases for storing, accessing, and querying various content, encoded data, shared addresses, metadata, etc.
- the one or more computer systems incorporate one or more computer processors and controllers.
- the components of various embodiments described herein may each include a hardware processor of the one or more computer systems, and, in one embodiment, a single processor may be configured to implement the various components.
- the encoder, the content server, and the web server, or combinations thereof may be implemented as separate hardware systems, or may be implemented as a single hardware system.
- the hardware system may include various transitory and non-transitory memory for storing information, wired and wireless communication receivers and transmitters, displays, and input and output interfaces and devices.
- the various computer systems, memory, and components of the system may be operably coupled to communicate information, and the system may further include various hardware and software communication modules, interfaces, and circuitry to enable wired or wireless communication of information.
- the computing environment 1100 of FIG. 11 may include one or more computer servers 1101 .
- the server 1101 may be operatively coupled to one or more data stores 1102 (for example, databases, indexes, files, or other data structures).
- the server 1101 may connect to a data communication network 1103 including a local area network (LAN), a wide area network (WAN) (for example, the Internet), a telephone network, a satellite or wireless communication network, or some combination of these or similar networks.
- One or more client devices 1104 , 1105 , 1106 , 1107 , 1108 may be in communication with the server 1101 , and a corresponding data store 1102 via the data communication network 1103 .
- Such client devices 1104 , 1105 , 1106 , 1107 , 1108 may include, for example, one or more laptop computers 1107 , desktop computers 1104 , smartphones and mobile phones 1105 , tablet computers 1106 , televisions 1108 , or combinations thereof.
- client devices 1104 , 1105 , 1106 , 1107 , 1108 may send and receive data or instructions to or from the server 1101 in response to user input received from user input devices or other input.
- the server 1101 may serve data from the data store 1102 , alter data within the data store 1102 , add data to the data store 1102 , or the like, or combinations thereof.
- the server 1101 may transmit one or more media files including audio and/or video content, encoded data, generated data, and/or metadata from the data store 1102 to one or more of the client devices 1104 , 1105 , 1106 , 1107 , 1108 via the data communication network 1103 .
- the devices may output the audio and/or video content from the media file using a display screen, projector, or other display output device.
- the system 1100 configured in accordance with features and aspects described herein may be configured to operate within or support a cloud computing environment. For example, a portion of, or all of, the data store 1102 and server 1101 may reside in a cloud server.
- Referring to FIG. 12 , an illustration of an example computer 1200 is provided.
- One or more of the devices 1104 , 1105 , 1106 , 1107 , 1108 of the system 1100 may be configured as or include such a computer 1200 .
- the computer 1200 may include a bus 1203 (or multiple buses) or other communication mechanism, a processor 1201 , main memory 1204 , read only memory (ROM) 1205 , one or more additional storage devices 1206 , and/or a communication interface 1202 , or the like or sub-combinations thereof.
- Embodiments described herein may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a selective combination thereof.
- the bus 1203 or other communication mechanism may support communication of information within the computer 1200 .
- the processor 1201 may be connected to the bus 1203 and process information.
- the processor 1201 may be a specialized or dedicated microprocessor configured to perform particular tasks in accordance with the features and aspects described herein by executing machine-readable software code defining the particular tasks.
- Main memory 1204 (for example, random access memory—or RAM—or other dynamic storage device) may be connected to the bus 1203 and store information and instructions to be executed by the processor 1201 .
- Main memory 1204 may also store temporary variables or other intermediate information during execution of such instructions.
- ROM 1205 or some other static storage device may be connected to a bus 1203 and store static information and instructions for the processor 1201 .
- the additional storage device 1206 (for example, a magnetic disk, optical disk, memory card, or the like) may be connected to the bus 1203 .
- the main memory 1204 , ROM 1205 , and the additional storage device 1206 may include a non-transitory computer-readable medium holding information, instructions, or some combination thereof—for example, instructions that, when executed by the processor 1201 , cause the computer 1200 to perform one or more operations of a method as described herein.
- the communication interface 1202 may also be connected to the bus 1203 .
- a communication interface 1202 may provide or support two-way data communication between the computer 1200 and one or more external devices (for example, other devices contained within the computing environment).
- the computer 1200 may be connected (for example, via the bus 1203 ) to a display 1207 .
- the display 1207 may use any suitable mechanism to communicate information to a user of a computer 1200 .
- the display 1207 may include or utilize a liquid crystal display (LCD), light emitting diode (LED) display, projector, or other display device to present information to a user of the computer 1200 in a visual display.
- One or more input devices 1208 (for example, an alphanumeric keyboard, mouse, microphone) may be connected to the bus 1203 to communicate information and commands to the computer 1200 .
- one input device 1208 may provide or support control over the positioning of a cursor to allow for selection and execution of various objects, files, programs, and the like provided by the computer 1200 and displayed by the display 1207 .
- the computer 1200 may be used to transmit, receive, decode, display, etc. one or more video files. In selected embodiments, such transmitting, receiving, decoding, and displaying may be in response to the processor 1201 executing one or more sequences of one or more instructions contained in main memory 1204 . Such instructions may be read into main memory 1204 from another non-transitory computer-readable medium (for example, a storage device).
- execution of one or more sequences of instructions contained in main memory 1204 may cause the processor 1201 to perform one or more of the procedures or steps described herein.
- processors in a multi-processing arrangement may also be employed to execute sequences of instructions contained in main memory 1204 .
- firmware may be used in place of, or in connection with, software instructions to implement procedures or steps in accordance with the features and aspects described herein.
- embodiments in accordance with features and aspects described herein may not be limited to any specific combination of hardware circuitry and software.
- Non-transitory computer readable medium may refer to any medium that participates in holding instructions for execution by the processor 1201 , or that stores data for processing by a computer, and include all computer-readable media, with the sole exception being a transitory, propagating signal.
- Such a non-transitory computer readable medium may include, but is not limited to, non-volatile media, volatile media, and temporary storage media (for example, cache memory).
- Non-volatile media may include optical or magnetic disks, such as an additional storage device.
- Volatile media may include dynamic memory, such as main memory.
- non-transitory computer-readable media may include, for example, a hard disk, a floppy disk, magnetic tape, or any other magnetic medium, a CD-ROM, DVD, Blu-ray or other optical medium, RAM, PROM, EPROM, FLASH-EPROM, any other memory card, chip, or cartridge, or any other memory medium from which a computer can read.
- the communication interface 1202 may provide or support external, two-way data communication to or via a network link.
- the communication interface 1202 may be a wireless network interface controller or a cellular radio providing a data communication network connection.
- the communication interface 1202 may include a LAN card providing a data communication connection to a compatible LAN.
- the communication interface 1202 may send and receive electrical, electromagnetic, or optical signals conveying information.
- a network link may provide data communication through one or more networks to other data devices (for example, client devices as shown in the computing environment 1100 ).
- a network link may provide a connection through a local network of a host computer or to data equipment operated by an Internet Service Provider (ISP).
- An ISP may, in turn, provide data communication services through the Internet.
- a computer 1200 may send and receive commands, data, or combinations thereof, including program code, through one or more networks, a network link, and communication interface 1202 .
- the computer 1200 may interface or otherwise communicate with a remote server (for example, server 1101 ), or some combination thereof.
- certain embodiments described herein may be implemented with separate software modules, such as procedures and functions, each of which performs one or more of the functions and operations described herein.
- the software codes can be implemented with a software application written in any suitable programming language and may be stored in memory and executed by a controller or processor.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Quality & Reliability (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
Systems and methods for providing an environment for comparing a first audio content with a second audio content are disclosed. According to at least one embodiment, a method for comparing a first audio content with a second audio content includes: obtaining a first spectrogram representing the first audio content; obtaining a second spectrogram representing the second audio content; generating a combined spectrogram based on the first spectrogram and the second spectrogram; and determining whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
Description
- Pursuant to 35 U.S.C. § 119(e), this application claims the benefit of U.S. Provisional Patent Application Ser. No. 63/270,934, filed on Oct. 22, 2021, the contents of which are hereby incorporated by reference herein in their entirety.
- Promotional content (or promos) may be generated to promote particular television content (e.g., a particular TV show). The promos are generated to be broadcast (e.g., during broadcast of another TV show) on a particular day or week. Some promos, referred to as straight promos, do not feature information indicating when and where the promoted show will be broadcast. In contrast, other promos do feature such information.
- For example, such information may be featured in the promo at an end page of the promo. An end page includes a sequence of video frames. The video frames indicate when and/or on what station the promoted show will be broadcast. For example, in addition to a displayed graphic, the video frames may include audio providing such indication(s).
- FIG. 1 shows a screen capture 100 of a frame of an example end page of an example promo. The frame may be accompanied by audio (e.g., voiceover audio) that states, by way of example, “Catch Rachael tomorrow at 2 PM on NBC 10 Boston.”
- The end page of FIG. 1 may be similar to end pages of other promos that promote the same TV show. Such other promos may be the same as the promo of FIG. 1 , except that such other end pages may feature information indicating, by way of example, a different station and/or a different time of day at which the promoted show will be broadcast. For example, such other promos and the promo of FIG. 1 may all be based on a common promo (e.g., a generic promo). However, each promo may have been edited to feature a different end page that is customized, for example, for a target broadcast area.
- Such editing of a common promo to generate customized promos may be performed by human operators. This process may be tedious and prone to human error. This process may be very time-consuming when performed to generate large numbers of customized promos.
- Aspects of the present disclosure are directed to comparing first audio content (e.g., of audiovisual content of a reference end page) with second audio content (e.g., of audiovisual content of a promo that includes an end page) in a more autonomous manner. Based on the comparison, it is determined whether a specific reference end page is associated with a specific promo end page. According to a further aspect, the comparison includes determining a degree to which particular audio content included in the promo end page is present in the specific reference end page. The particular audio content may include specific voiceover content that would be found in a reference end page that properly corresponds to the promo end page. Although aspects of the present disclosure illustrate techniques of audio comparison within the context of end pages and promos, the audio comparison techniques described may be applied more generally to any audio files and contexts in order to compare the audio content between two sources.
- According to at least one embodiment, a method for comparing a first audio content with a second audio content includes: obtaining a first spectrogram representing the first audio content; obtaining a second spectrogram representing the second audio content; generating a combined spectrogram based on the first spectrogram and the second spectrogram; and determining whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
- According to at least one embodiment, a machine-readable non-transitory medium has stored thereon machine-executable instructions for comparing a first audio content with a second audio content. The instructions include: obtaining a first spectrogram representing the first audio content; obtaining a second spectrogram representing the second audio content; generating a combined spectrogram based on the first spectrogram and the second spectrogram; and determining whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
- According to at least one embodiment, an apparatus for comparing a first audio content with a second audio content includes: a network communication unit configured to transmit and receive data; and one or more controllers. The one or more controllers are configured to: obtain a first spectrogram representing the first audio content; obtain a second spectrogram representing the second audio content; generate a combined spectrogram based on the first spectrogram and the second spectrogram; and determine whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
- The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
- The above and other aspects and features of the present disclosure will become more apparent upon consideration of the following description of embodiments, taken in conjunction with the accompanying drawing figures.
- FIG. 1 shows a screen capture of a frame of an example end page.
- FIG. 2 illustrates example naming conventions that may be adopted in generating names of reference end pages and that of a promo end page.
- FIG. 3 illustrates a flow diagram of generating reference information according to at least one embodiment.
- FIG. 4 illustrates a flow diagram of a process (e.g., quality control process) that includes comparing at least one reference end page with a promo end page according to at least one embodiment.
- FIG. 5 illustrates calculation of a structural similarity index measure (SSIM) index between windows x and y having common dimensions.
- FIGS. 6(a), 6(b) and 6(c) illustrate example spectrograms of audio content of a reference end page.
- FIG. 7 illustrates an example of aligning a spectrogram of audio content in the reference end page with a spectrogram of audio content in the promo end page according to at least one embodiment.
- FIGS. 8(a) and 8(b) illustrate examples of combined spectrograms.
- FIGS. 9(a), 9(b), 9(c), 9(d) and 9(e) illustrate examples of combined spectrograms.
- FIG. 10 illustrates a flowchart of a method of comparing audio content according to at least one embodiment.
- FIG. 11 is an illustration of a computing environment according to at least one embodiment.
- FIG. 12 is a block diagram of a device according to at least one embodiment.
- In the following detailed description, reference is made to the accompanying drawing figures which form a part hereof, and which show by way of illustration specific embodiments of the present invention. It is to be understood by those of ordinary skill in this technological field that other embodiments may be utilized, and that structural, as well as procedural, changes may be made without departing from the scope of the present invention. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or similar parts.
- As described earlier, one or more human operators may edit a common promo to generate customized promos. For example, upon receiving a common promo, one or more editors may identify (e.g., from among a set of reference end pages), a reference end page that is associated with the received common promo. The set of reference end pages is used to classify an incoming promo. For example, the reference end pages embody information including information indicating a show with which the promo is to be broadcast, information indicating a station (e.g., network affiliate) on which the promo is to be broadcast, information indicating day and time of the broadcast, etc.
-
FIG. 2 illustrates example naming conventions that may be adopted in generating a name (e.g., file name) of a reference end page or a promo end page. As illustrated inFIG. 2 , a generatedname 200 is composed of multiple fields. For example, one field carries information indicating a city in which the (associated) promo is to be broadcast. As another example, another field carries information indicating the show with which the promo is to be broadcast. As yet another example, another field carries information indicating a day in which the promo is to be broadcast (e.g., tomorrow, today, etc.). As such, the name of a reference end page may be generated to carry such information. - Similarly, the name of a promo (e.g., a promo including an end page) may be generated to carry such information. Accordingly, analyzing the name of a promo may be utilized to validate that an end page (e.g., an identified reference end page) that is (or has been) edited into a promo corresponds to the promo. By way of example, analyzing the name of the promo may be used to validate that an identified reference end page includes the frame illustrated in the
screen capture 100. As illustrated in FIG. 1, the frame illustrated in the screen capture 100 is for promoting that a particular show is to be broadcast the following day (i.e., “TOMORROW”), and not to be broadcast on the current day (i.e., “TODAY”). - As will be described in more detail herein, aspects of the present disclosure are directed to validating that an end page that is (or has been) edited into a promo correctly corresponds to a given promo (e.g., correctly corresponds to the target of a given promo). For example, one or more embodiments are directed to verifying that an end page that is (or has been) appended to a promo is correctly associated with a given promo based on comparing aspects of the reference end page with aspects of the promo end page. For example, it is verified that the reference end page is associated with a corresponding TV show or program, has an appropriate length with respect to time, and/or corresponds to announcing broadcast of a TV show or program on a particular day, time and/or network. Further by way of example, it is determined whether audio content included in the promo end page is present in the reference end page. The audio content may include specific voiceover content that would be found in a reference end page that properly corresponds to the promo end page.
-
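The field-based naming validation described above can be sketched as follows. This is an illustrative sketch only: the underscore delimiter and the field names and order (city, show, day) are hypothetical stand-ins, not the actual convention of FIG. 2.

```python
def parse_promo_name(name: str) -> dict:
    """Split an end-page file name into its information fields."""
    fields = name.rsplit(".", 1)[0].split("_")  # drop extension, split fields
    keys = ("city", "show", "day")              # hypothetical field order
    return dict(zip(keys, fields))

def names_consistent(promo_name: str, reference_name: str) -> bool:
    """Validate that a promo's name fields match a reference end page's fields."""
    return parse_promo_name(promo_name) == parse_promo_name(reference_name)
```

A promo whose fields disagree with the identified reference end page (e.g., “TODAY” versus “TOMORROW”) would fail this check and be flagged.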
FIG. 3 illustrates a flow diagram of generating reference information according to at least one embodiment. The reference information includes hashes corresponding to respective reference end pages and/or thumbnail images of one or more frames of each reference end page. For example, referring to FIG. 3, technical specifications may be validated for the reference end page, and subsequently, the reference end page may be hashed (e.g., using a perceptual hash), and a fingerprint of the hash for various reference end pages may be saved to facilitate fast matching. A thumbnail image sequence of the reference end page may be exported for fine-grained SSIM comparisons later. In an aspect, the reference information may also include spectrogram information corresponding to respective end pages. - Optionally, the reference end pages may be analyzed to synthesize a single reference end page from overlapping sequence fragments. Unique MATID labels are mapped to the synthesized sequence. Normalized audio files are exported and named by their MD5 hashes to reduce duplication, and spectrograms may be generated from the exported audio. In an aspect, the model may include MATID labels, file locations, and information about the scale of the reference spectrograms, longest and shortest reference sequences, and any other data required to ensure that the promos are preprocessed in the same manner as the reference material.
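The MD5-based naming of exported audio described above can be sketched as follows; the `.wav` extension is an assumption for illustration.

```python
import hashlib

def md5_audio_name(audio_bytes: bytes, extension: str = ".wav") -> str:
    """Name an exported, normalized audio file by the MD5 hash of its content,
    so identical audio shared by several reference end pages collapses to a
    single stored file (reducing duplication)."""
    return hashlib.md5(audio_bytes).hexdigest() + extension
```

Because the name is a pure function of the bytes, re-exporting the same normalized audio always yields the same file name.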
- As will be described in more detail later with reference to
FIG. 4, the reference information may be used in a process (e.g., quality control process) that is performed for a promo end page. - Generating reference information, as illustrated in
FIG. 3, may occur periodically, e.g., once every six months or weekly. For example, reference information may be generated based on a most recent batch of reference end pages and/or a group of reference end pages that are known. -
FIG. 4 illustrates a flow diagram of a quality control process that includes comparing at least one reference end page with a promo end page according to at least one embodiment. As will be described in more detail herein, the process may include a video validation 410 and/or an audio validation 450. If the process includes performing both validations 410, 450, the video validation 410 and the audio validation 450 may be performed independent of each other, simultaneously, or in series. Although FIG. 4 illustrates the video validation 410 as occurring before the audio validation 450, that order may be switched, such that the audio validation 450 is performed before the video validation 410. Also, it is understood that, if a particular validation (e.g., the video validation 410) either fails or results in a non-match, then performance of other validations (e.g., the audio validation 450) may be omitted. For example, other validations may be omitted for purposes of saving time and/or reducing effort. - The
video validation 410 will now be described in more detail with reference to at least one embodiment. - At
block 411, certain technical specifications (or parameters) of the promo may be validated against corresponding specifications of a reference end page, to determine whether the specifications are aligned. Such technical specifications may include frames per second (FPS), pixel resolution, audio frequency, etc. According to at least one embodiment, software tools may be used to identify metadata such as the frame rate, dimensions, etc. of the promo. Similarly, technical specifications (or parameters) of a reference end page may be validated against corresponding specifications of the promo. - At
block 412, a hash of the promo is generated. According to at least one embodiment, the hash may be generated based on perceptual hashing. Perceptual hashing is used to determine whether features of particular pieces of multimedia content are similar, e.g., based on image brightness values of individual pixels. - As illustrated in
FIG. 4, hashes for the last N frames of the promo may be generated, where N denotes an integer that is equal to or greater than 1. In this regard, the last N frames may correspond to a length of time. In another aspect, N may correspond to the lesser of (1) the number of frames in the longest reference sequence or (2) the number of frames in the promo. For example, it may be determined that the longest acceptable length of an end page may be around 8 seconds. In this situation, if the total length of a promo is 30 seconds, then N may be equal to the number of frames in the last 8 seconds of the promo, which is where the end page is located. Here, it is understood that, if the total length of the promo is shorter than the longest acceptable length of an end page, then N may be equal to the total number of frames in the promo. For example, if the total length of the promo is 4 seconds (and is, therefore, shorter than the longest acceptable length of 8 seconds), then N may be equal to the total number of frames in the 4-second promo. - Hashes of the reference end pages (see
FIG. 3) may be generated in a similar manner. Accordingly, a hash of the promo may be compared against a hash of a reference end page, as will be described in more detail below. - At
blocks 413, 414 of FIG. 4, the promo is compared with at least one reference end page based on information described earlier, e.g., hash information. For example, at block 413, a hash of the last frame of the promo may be retrieved. Then, all reference end pages that have a hash that is within a particular Hamming distance threshold (relative to the hash of the last frame of the promo) are identified. For each reference end page that meets such a threshold, a finer-grained analysis may then be performed (see block 414). - At
block 413, it may be determined whether each of a number of ordered frames (e.g., N ordered frames) of the promo end page is sufficiently similar to a corresponding frame of the reference end page. The degree of similarity may be based on Hamming distance. For a given pair of end pages, it may be determined that the two end pages are sufficiently similar if, for each of the N frames, the difference between respective hashes does not exceed the Hamming distance threshold. For example, the Hamming distance threshold may be an integer between 0 and 3, inclusive. - Accordingly, the coarser-grained analysis of
block 413 may result in identification of one or more reference end pages that potentially match the promo end page. As such, the search space of potential matches is likely reduced (or narrowed) based on perceptual hashing. A finer-grained analysis is then performed based on an accordingly smaller number of reference end pages. - At
block 414, a finer-grained analysis is performed to further measure the similarity between respective frames of end pages (e.g., respective frames of a promo end page and a reference end page identified at block 413). According to at least one embodiment, the analysis of block 414 is based on a structural similarity index measure (SSIM). In sum, sequence matching (or reference and promo end page matching) may be divided into two steps. First, a fast, coarse-grained search for near matches to reduce the search space may be performed. Second, a fine-grained framewise comparison (e.g., SSIM) may be performed to ensure the best match and to verify image quality. -
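The coarse, hash-based first step can be sketched as follows. The average-hash variant shown here is one simple form of perceptual hash, used as an illustrative assumption; a frame is modeled as a small 2-D list of brightness values standing in for a downscaled grayscale video frame.

```python
def average_hash(frame):
    """Perceptual (average) hash: one bit per pixel, 1 if the pixel is at
    least as bright as the frame's mean brightness, else 0."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p >= mean else 0 for p in pixels)

def hamming(h1, h2):
    """Number of bit positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))

def candidate_references(promo_last_frame, reference_hashes, threshold=3):
    """Coarse filter: keep only reference end pages whose stored last-frame
    hash is within the Hamming-distance threshold of the promo's hash."""
    promo_hash = average_hash(promo_last_frame)
    return [name for name, ref_hash in reference_hashes.items()
            if hamming(promo_hash, ref_hash) <= threshold]
```

Only the surviving candidates proceed to the finer-grained SSIM comparison, which keeps the expensive step off the full reference set.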
FIG. 5 illustrates calculation of an SSIM index between two windows x and y having common dimensions. The window x may correspond to a frame of a promo end page, and the window y may correspond to a respective frame of a reference end page identified at block 413 of FIG. 4. The calculation of FIG. 5 may be applied on luma, on color (e.g., RGB) values, or on chromatic (e.g., YCbCr) values. The resultant SSIM index is a decimal value between 0 and 1, where the value of 1 corresponds to a case of two identical sets of data and therefore indicates perfect structural similarity. In contrast, a value of 0 indicates no structural similarity. According to at least one embodiment, if the SSIM index calculated between respective frames of two end pages is approximately equal to (or sufficiently close to) a particular value (e.g., 1), then it is determined that the frames are sufficiently similar. - According to a further embodiment, a reference end page is determined to be sufficiently similar to a promo end page if the SSIM-based criterion described above is met for each of a percentage of pairs of frames. For example, if a particular number of pairs of frames do not satisfy the SSIM-based criterion and all other pairs of frames do satisfy the criterion, then the reference end page is determined to be sufficiently similar to the promo end page. Such a determination is presented in the following pseudo code:
- If all(V/T>=T for V in R[:N]) and all(V>=T for V in R[N:]): Success!
- In the above pseudo code, R denotes a sorted list of SSIM values from lowest to highest, V denotes the value of a single item in R, T denotes the minimum value of V to be considered a match, N denotes the number of frames that are allowed to be below T in absolute value, :N denotes all items in the list between 0 and N−1, and N: denotes all items between N and the end of the list.
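A runnable rendering of the pseudo code above, with illustrative values for T and N (the disclosure does not fix particular values):

```python
def frames_match(ssim_values, T=0.9, N=2):
    """Runnable version of the pseudo code above. R is the list of per-frame
    SSIM values sorted from lowest to highest; the N lowest values are allowed
    to fall below the threshold T provided they still satisfy the relaxed
    check V / T >= T (equivalently, V >= T * T); every remaining value must
    meet T outright. T=0.9 and N=2 are illustrative defaults only."""
    R = sorted(ssim_values)
    return (all(v / T >= T for v in R[:N]) and
            all(v >= T for v in R[N:]))
```

With these defaults, a sequence whose two worst frames score at least 0.81 and whose remaining frames score at least 0.9 is treated as a match.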
- At
block 430, the name of the promo end page is analyzed (e.g., with respect to the name(s) of one or more reference end pages identified at block 414). For example, it is determined whether the fields of the name of the promo end page (see FIG. 2) are consistent with the names of the identified reference end pages. - According to at least one embodiment, the quality control process of
FIG. 4 may include performing audio validation 450. As illustrated in FIG. 4, the audio validation 450 may be performed after the video validation 410. However, it is understood that the video validation 410 and the audio validation 450 may be performed independent (or irrespective) of each other. For example, the validations 410, 450 may be performed simultaneously or in series. - At
block 460, a report is generated. For example, the report may be for storage in a “pass” folder (or directory) if the promo meets all criteria described earlier. Alternatively, the report may be for storage in a quarantine folder and, therefore, flagged for subsequent review (e.g., human review). In an aspect, the report may be generated for monitoring/debugging, and the promo may be moved either to the pass folder or to the quarantine folder. The label for the reference end page may include the MATID data from block 430 that may be used for filename verification and as a pointer to associated audio data, if any. - Returning to block 450, an
audio validation 450 will now be described in more detail according to at least one embodiment. - Examples of audio verification according to at least one embodiment will now be described in more detail. In this regard, it is determined whether audio content (e.g., spectral content of an audio signal) of a promo end page matches corresponding audio content of a reference end page. Examples will be described with reference to audio content of the promo end page that includes voiceover audio. The voiceover audio may be similar to that which was described earlier with reference to
FIG. 1 (e.g., “Catch Rachael tomorrow at 2 PM on NBC 10 Boston”). Also, examples will be described with reference to audio signals that are stereo signals, in that each given audio signal carries two individual channels (e.g., left channel and right channel). -
FIGS. 6(a), 6(b) and 6(c) illustrate example spectrograms of audio content in the reference end page. Each spectrogram is a visual representation of the spectrum of frequencies (see vertical axis) in one or more channels of the audio signal as the signal varies over time (see horizontal axis). More particularly, FIG. 6(a) illustrates a spectrogram 610 of the left channel of the audio signal in the reference end page. FIG. 6(c) illustrates a spectrogram 630 of the right channel of the audio signal in the reference end page. FIG. 6(b) illustrates a spectrogram 620 of the merged left and right channels of the audio signal in the reference end page. In an aspect, the left and right channels are merged by mixing the channels together to produce a mono audio track. - Spectrograms similar to those illustrated in
FIGS. 6(a), 6(b) and 6(c) are also obtained for audio content in the promo end page. For example, a spectrogram 720 of merged left and right channels of an audio signal in the promo end page is illustrated in FIG. 7. -
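The channel merge described above (mixing the left and right channels into a mono track before computing the merged-channel spectrogram) can be sketched as a per-sample average; representing samples as plain floats is an assumption for illustration.

```python
def mix_to_mono(left, right):
    """Merge the left and right channels of a stereo signal into a mono
    track by averaging corresponding samples. Samples are floats in
    [-1.0, 1.0]; averaging keeps the result within that range."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]
```

The resulting mono track is what a spectrogram such as the merged-channel spectrogram 620 would be computed from.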
FIG. 7 illustrates an example of aligning a spectrogram of audio content in the reference end page with a spectrogram of audio content in the promo end page according to at least one embodiment. For example, an attempt is made to align the spectrogram 620 of audio content (merged left and right channels) in the reference end page with the spectrogram 720. The alignment is performed to obtain an approximate match (e.g., closest match) between the spectrogram 620 and a segment of the spectrogram 720. If the alignment succeeds, then further analysis may be performed based on the spectrogram 620, corresponding to the reference end page, and the matching segment of the spectrogram 720, corresponding to the promo end page. - According to at least one embodiment, the
spectrogram 620 may be effectively positioned (or shifted) along the horizontal axis with respect to the spectrogram 720, to obtain a best (or closest) match in spectral content between the spectrograms 620, 720. - According to at least one embodiment, the
spectrograms 620, 720 may be aligned based on calculating a homography that maps the first spectrogram 620 to the segment of the spectrogram 720. Once a best match is obtained, then one or more parameters may be captured to record a horizontal offset of a positioning of the spectrogram 620 with respect to (e.g., within the bounds of) the spectrogram 720. For example, the parameters may include a number in an upper right of a homography matrix. - Such an offset may then be used to effectively crop spectrograms of the promo end page. For example, such an offset may be applied to a spectrogram of the left channel of the audio signal of the promo end page, to crop the spectrogram. Similarly, the offset may be applied to a spectrogram of the right channel of the audio signal of the promo end page. In this manner, the spectrogram of the left channel and the spectrogram of the right channel of the promo end page are cropped to have lengths in time that correspond to the lengths of
spectrograms 610 and 630 (see FIGS. 6(a) and 6(c)). - To further analyze similarities/differences between audio content of the promo end page and that of the reference end page, spectrograms of the promo end page are combined with corresponding spectrograms of the reference end page. In an aspect, the spectrograms are combined by putting one spectrogram on top of another spectrogram to produce a combined spectrogram corresponding to a new audio track with two channels, each channel occupying a separate color channel in the spectrogram image.
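The offset-and-crop step described above can be sketched as follows, modeling a spectrogram as a list of time columns; reading the horizontal translation from position [0][2] of a 3×3 homography matrix corresponds to the “number in an upper right” of the matrix noted above.

```python
def crop_to_reference(promo_spec, H, reference_length):
    """Crop a promo-end-page spectrogram (a list of time columns) to the
    reference end page's length, starting at the horizontal offset recorded
    in the homography matrix H (H[0][2] is the horizontal translation)."""
    offset = int(round(H[0][2]))
    return promo_spec[offset:offset + reference_length]
```

The same offset would be applied to both the left-channel and right-channel spectrograms of the promo end page, as described above.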
- For example, the cropped spectrogram of the left channel of the promo end page is combined with the
spectrogram 610 of the left channel of the audio signal in the reference end page in FIG. 6(a). In this regard, the cropped spectrogram of the left channel of the promo end page may be effectively overlaid (or superimposed) over the spectrogram 610. The overlaying produces a combined spectrogram (e.g., combined spectrogram 810 of FIG. 8(a)). - Similarly, the cropped spectrogram of the right channel of the promo end page is combined with the
spectrogram 630 of the right channel of the audio signal in the reference end page of FIG. 6(c). In this regard, the cropped spectrogram of the right channel of the promo end page may be effectively overlaid (or superimposed) over the spectrogram 630. The overlaying produces a combined spectrogram (e.g., combined spectrogram 830 of FIG. 8(b)). - According to at least one embodiment, coloring is applied to individual spectrograms before overlaying, such that the combination produces a spectrogram that is generated as a color image.
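The channel-per-color combination can be sketched as follows; assigning the reference to the red channel and the promo to the green channel follows the red/green choices used in the examples herein, and the 0–255 intensity scale is an assumption for illustration.

```python
def combine_spectrograms(reference_spec, promo_spec):
    """Overlay two same-sized grayscale spectrograms (2-D lists of
    intensities, 0-255) into one RGB image: the reference occupies the red
    channel and the promo the green channel, so matching energy renders
    yellow, promo-only energy green, and reference-only energy red."""
    return [[(r, g, 0) for r, g in zip(ref_row, promo_row)]
            for ref_row, promo_row in zip(reference_spec, promo_spec)]
```

Each pixel of the result carries both channels at once, which is what makes the combined colors (e.g., yellow) directly readable as alignment.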
- This will now be described in more detail with reference to the combined
spectrogram 810 of FIG. 8(a). As described earlier, the combined spectrogram 810 is produced by combining the cropped spectrogram of the left channel of the promo end page with the spectrogram 610 of FIG. 6(a). - According to at least one embodiment, coloring is applied to the cropped spectrogram of the left channel of the promo end page, such that all audio content in the spectrogram is placed in a first color channel (e.g., a green channel). As such, all spectral content that arises from audio in the cropped spectrogram of the left channel of the promo end page is represented using the color green. The represented audio may include both voiceover audio and background music, as well as other types of audio content.
- In addition, a different coloring is applied to the
spectrogram 610, which corresponds to the left channel of the reference end page. For example, all audio content in the spectrogram 610 is placed in a second color channel that is different from the first color channel noted above. By way of example, the second color channel may be a red channel. As such, all spectral content that arises from audio in the spectrogram 610 is represented using the color red. The represented audio typically includes voiceover audio but not background music, because the reference end page includes voiceover audio but does not include background music. - The application of coloring to the individual spectrograms results in potentially combined coloring in the combined
spectrogram 810. The combined coloring may be utilized to identify areas of alignment (or, conversely, non-alignment) between the reference end page and the promo end page. In more detail, it may be determined whether audio content (e.g., voiceover audio) in the reference end page is also present in the promo end page. - For example, if the voiceover audio content in the reference end page aligns with (e.g., matches or is identical to) the voiceover audio content in the promo end page, then regions of a combined color (e.g., yellow) would appear in the RGB image of the combined
spectrogram 810. In this situation, yellow is the combined color because the colors red and green combine to produce the color yellow. The yellow-colored regions result from voiceover audio in the cropped spectrogram of the left channel of the promo end page (represented using the color green) being effectively overlaid or superimposed over matching voiceover audio in the spectrogram 610 (represented using the color red). With reference to FIG. 8(a), the region 812 is an example of a region where voiceover audio in the cropped spectrogram of the left channel of the promo end page and voiceover audio in the spectrogram 610 align to appear as a yellow-colored region. - If audio content in the promo end page does not align with audio content in the reference end page, then regions of the first color (e.g., green) may appear in the RGB image of the combined
spectrogram 810. As described earlier, spectral content that arises from background music in the promo end page is represented using the color green. The background music may be unique to this specific promo end page, in that different promo end pages may feature different background music and the reference end page does not feature any background music. With reference to FIG. 8(a), the region 814 is an example of a region where background music in the cropped spectrogram of the left channel of the promo end page does not align with any audio content in the spectrogram 610. Accordingly, the corresponding green-colored region does not overlap with a red-colored region in the spectrogram 610 and, therefore, remains green in the combined spectrogram 810. - If audio content in the reference end page does not align with audio content in the promo end page, then regions of the second color (e.g., red) may appear in the RGB image of the combined spectrogram (e.g., combined spectrogram 810). As described earlier, spectral content that arises from all audio content in the reference end page is represented using the color red.
- According to at least one embodiment, after such a misalignment between the reference end page and the promo end page is identified, a corresponding time range of the misalignment is recorded. For example, one or more timestamps marking a beginning (or start) and/or an end of misalignment with respect to time may be recorded. Information including such timestamps may be provided.
-
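The timestamp recording described above can be sketched by scanning the time columns of the combined RGB image for reference-only (red) energy; the energy threshold and the seconds-per-column scale below are illustrative assumptions.

```python
def misaligned_ranges(combined, energy=32, seconds_per_column=0.01):
    """Scan a combined RGB spectrogram (a 2-D list of (r, g, b) pixels) for
    time columns containing reference-only (red) energy with no matching
    promo (green) energy, and report contiguous (start, end) time ranges."""
    n_cols = len(combined[0])
    red_only = [any(px[0] >= energy and px[1] < energy
                    for px in (row[c] for row in combined))
                for c in range(n_cols)]
    ranges, start = [], None
    for c, flag in enumerate(red_only):
        if flag and start is None:
            start = c                                  # misalignment begins
        elif not flag and start is not None:
            ranges.append((start * seconds_per_column,
                           c * seconds_per_column))    # misalignment ends
            start = None
    if start is not None:                              # runs to the end
        ranges.append((start * seconds_per_column, n_cols * seconds_per_column))
    return ranges
```

A range at the start, middle, or end of the output corresponds to the misalignment patterns of FIGS. 9(a), 9(b) and 9(c), respectively.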
FIG. 9(a) illustrates an example of misalignment at a beginning of the end pages. Here, it is possible that voiceover audio that is present at a beginning of the reference end page is not present at a beginning of the promo end page. Accordingly, a red-colored region appears at a left (starting) area of the combined spectrogram. -
FIG. 9(b) illustrates an example of misalignment at (or around) a middle of the end pages. Here, it is possible that voiceover audio that is present at a middle of the reference end page is not present at a middle of the promo end page. Accordingly, a red-colored region appears at a center area of the combined spectrogram. -
FIG. 9(c) illustrates an example of misalignment at an end of the end pages. Here, it is possible that voiceover audio that is present at an end of the reference end page is not present at the end of the promo end page. Accordingly, a red-colored region appears at a right (ending) area of the combined spectrogram. -
FIG. 9(d) illustrates an example of a complete (or near complete) misalignment between the end pages over time. Here, it is possible that voiceover audio that is present in the reference end page is simply not present in the promo end page. Accordingly, a red-colored region appears throughout the combined spectrogram. -
FIG. 9(e) illustrates an example of isolated (or scattered) misalignment between the end pages over time. Here, voiceover audio in the promo end page may not fully match voiceover audio in the reference end page. Accordingly, scattered red-colored regions appear across the combined spectrogram. - According to at least one embodiment, one or more tools based on machine learning may be utilized to determine whether the reference end page passes or fails with respect to the promo end page. The determination may be based, at least in part, on the presence of red-colored regions in the combined spectrogram. For example, if the (combined) size of red-colored regions is under a particular threshold size, then it may be determined that the reference end page sufficiently matches the promo end page. Otherwise, it may be determined that the reference end page does not sufficiently match the promo end page.
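The threshold-based pass/fail decision described above can be sketched as follows; the fixed red-area threshold and energy cutoff are illustrative assumptions (the disclosure contemplates that a machine-learning tool may make this determination instead).

```python
def audio_match(combined, energy=32, max_red_fraction=0.05):
    """Decide pass/fail from a combined RGB spectrogram (a 2-D list of
    (r, g, b) pixels): fail when the fraction of pixels carrying
    reference-only (red) energy exceeds a threshold."""
    pixels = [px for row in combined for px in row]
    red_only = sum(1 for px in pixels if px[0] >= energy and px[1] < energy)
    return red_only / len(pixels) <= max_red_fraction
```

A promo that fails this check would be routed to the quarantine folder for human review, per block 460.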
- According to at least one embodiment, automatic speech recognition (ASR) may be used to eliminate (or reduce) false positives that may arise. For example, audio content in promo end pages may intentionally be sped up or modified slightly to meet on-air requirements. Such changes to the audio content may result in identification of areas of misalignment (or non-alignment) during the audio validation that has been described herein with reference to one or more embodiments. In this regard, ASR-based tools may be used to confirm that the voiceover audio in the reference end page is identical (e.g., in substance) to the voiceover audio in the promo end page. For example, ASR-based tools may be used to confirm that the substance of the voiceover audio in the reference end page matches that of the voiceover audio in the promo end page, which states “Catch Rachael tomorrow at 2 PM on
NBC 10 Boston” (see the example described earlier with reference to FIG. 1). Accordingly, the number of false positives may be reduced. - It is understood that coloring aspects in the combined
spectrogram 830 of FIG. 8(b) are similar to those described earlier with reference to the combined spectrogram 810 of FIG. 8(a), as well as those of FIGS. 9(a), 9(b), 9(c), 9(d) and 9(e). Accordingly, for purposes of brevity, the coloring aspects in the combined spectrogram 830 of FIG. 8(b) will not be described in more detail below. Further, although red, yellow, and green were chosen in the examples to provide color to the audio channels and to the combined spectrograms, other colors may be used, and the techniques are not limited to a particular color scheme. - As described herein, one or more embodiments are directed to comparing aspects of a reference end page and aspects of a promo end page. The aspects may relate to video content of the end pages. Alternatively (or in addition), the aspects may relate to audio content of the end pages. As described earlier with respect to at least one embodiment, it is determined whether specific audio content (e.g., voiceover content) that is present in the reference end page is also present in the promo end page. However, it is understood that the specific audio content may be audio content that is not language-based. For example, it may be determined whether specific tone-based content (e.g., a sequence of chimes or musical tones) that is present in the reference end page is also present in the promo end page.
- Also, it is understood that features described herein may be utilized to determine whether an audio layout of the reference end page sufficiently matches an audio layout of the promo end page. The audio layout may relate to a balance between left and right channels.
- Based on features that have been described herein, particular audio content (e.g., voiceover content) may be isolated within a larger audio scape (e.g., a promo end page that includes not only the voiceover content but also other forms of audio content such as background music). As such, comparison of the promo end page with a reference end page that includes no background music is facilitated. Such a feature serves to distinguish embodiments described herein from an approach that is based merely on analysis of raw audio bytes and that does not serve to isolate a specific type of audio content (e.g., voiceover content) from a different type of audio content (e.g., background music).
- In addition, features described herein are distinguishable from approaches that determine audio similarity, for example, based on an audio “fingerprint” that records audio frequencies having largest energies at respective points in time. Such approaches do not utilize, for example, analysis of RGB images such as those described earlier with reference to combined
spectrograms 810, 830. -
FIG. 10 illustrates a flowchart of a method 1000 of comparing a first audio content with a second audio content according to at least one embodiment. - At
block 1002, a first spectrogram representing the first audio content is obtained. The first audio content may be part of a first audiovisual content that includes a reference end page. - For example, with reference to
FIG. 6(a), a spectrogram 610 of a left channel of an audio signal in a reference end page is obtained. Alternatively (or in addition), with reference to FIG. 6(c), a spectrogram 630 of a right channel of the audio signal in the reference end page is obtained. - At
block 1004, a second spectrogram representing the second audio content is obtained. The second audio content may be part of a second audiovisual content that includes a promo end page. - For example, as described earlier with reference to
FIG. 7, an offset may be applied to a spectrogram of a left channel (or a right channel) of an audio signal of a promo end page, to crop the spectrogram. Similarly, the offset may be applied to a spectrogram of the right channel of the audio signal of the promo end page. In this manner, the spectrogram of the left channel and the spectrogram of the right channel of the promo end page are cropped to have lengths in time that correspond to the lengths of spectrograms 610 and 630 (see FIGS. 6(a) and 6(c)). - In an aspect, the second spectrogram is obtained based on a homography that maps the first spectrogram to a segment of a spectrogram representing the second audio content.
- For example, as described earlier with reference to
FIG. 7, a homography that maps the spectrogram 620 to a segment of the spectrogram 720 is calculated. - In another aspect, obtaining the second spectrogram includes aligning the first spectrogram to obtain an approximate match between the first spectrogram and a segment of a spectrogram representing the second audio content.
- For example, as described earlier with reference to
FIG. 7, an attempt is made to align the spectrogram 620 of audio content (merged left and right channels) in the reference end page with the spectrogram 720. The alignment is performed to obtain an approximate match (e.g., closest match) between the spectrogram 620 and a segment of the spectrogram 720. - At
block 1006, a first coloring may be applied to the first spectrogram. - For example, as described earlier, all audio content in the
spectrogram 610 is placed in a particular color channel (e.g., a red channel). - At
block 1008, a second coloring may be applied to the second spectrogram. - For example, as described earlier, coloring is applied to the cropped spectrogram of the left channel of the promo end page, such that all audio content in the spectrogram is placed in a particular color channel (e.g., a green channel).
- At
block 1010, a combined spectrogram is generated based on the first spectrogram and the second spectrogram. - According to a further embodiment, generating the combined spectrogram includes generating a combined spectrogram by superimposing one of the first spectrogram or the second spectrogram over the other.
- For example, as described earlier with reference to
FIG. 8(a), the cropped spectrogram of the left channel of the promo end page may be effectively overlaid (or superimposed) over the spectrogram 610. The overlaying produces a combined spectrogram (e.g., combined spectrogram 810 of FIG. 8(a)). - At
block 1012, it is determined whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram. - According to a further embodiment, determining whether the first audio content is misaligned with respect to the second audio content is based on a presence of a third coloring in the combined spectrogram, the third coloring corresponding to a combination of the first coloring and the second coloring.
- For example, as described earlier with reference to
FIG. 8(a), if the voiceover audio content in the reference end page aligns with (e.g., matches or is identical to) the voiceover audio content in the promo end page, then regions of a combined color (e.g., yellow) would appear in the RGB image of the combined spectrogram 810. In this situation, yellow is the combined color because the colors red and green combine to produce the color yellow. - According to a further embodiment, determining whether the first audio content is misaligned with respect to the second audio content includes identifying a misalignment between the first audio content and the second audio content, and recording a corresponding time range of the misalignment.
- For example, as described earlier with reference to
FIG. 9(a), an example of a misalignment at a beginning of the end pages is identified. In this regard, a time range of the misalignment (e.g., a range in time over which the misalignment occurs) may be recorded. In an aspect, the time range of the misalignment may be used to calculate a percentage of misalignment relative to the full time range of the spectrogram. - At
block 1014, video content of the first audiovisual content may be compared with video content of the second audiovisual content. - In an aspect, comparing the video content of the first audiovisual content with the video content of the second audiovisual content is based on perceptual hashing.
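The perceptual-hash comparison at block 1014 could be sketched with a basic average hash over grayscale frames. A production system might instead use pHash, dHash, or a library implementation, so treat this as an assumed minimal stand-in:

```python
import numpy as np

def average_hash(frame, hash_size=8):
    """Downscale a grayscale frame by block averaging, then threshold each
    block against the global mean to form a compact perceptual hash."""
    h, w = frame.shape
    bh, bw = h // hash_size, w // hash_size
    small = frame[:bh * hash_size, :bw * hash_size].reshape(
        hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def hamming_distance(h1, h2):
    """Smaller distance means perceptually closer frames."""
    return int(np.count_nonzero(h1 != h2))
```

Because the threshold is the frame's own mean, a uniform brightness shift leaves the hash unchanged, which is the kind of robustness that makes perceptual hashing suitable for comparing promo video against reference video.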
- For example, as described earlier with reference to
FIG. 4, a video validation 410 may include generating a hash of the promo. According to at least one embodiment, the hash may be generated based on perceptual hashing. - In at least some embodiments, features described herein, or other aspects of the disclosure (e.g., the
method 1000 of FIG. 10) may be implemented and/or performed at one or more software or hardware computer systems which may further include (or may be operably coupled to) one or more hardware memory systems for storing information including databases for storing, accessing, and querying various content, encoded data, shared addresses, metadata, etc. In hardware implementations, the one or more computer systems incorporate one or more computer processors and controllers. - The components of various embodiments described herein may each include a hardware processor of the one or more computer systems, and, in one embodiment, a single processor may be configured to implement the various components. For example, in one embodiment, the encoder, the content server, and the web server, or combinations thereof, may be implemented as separate hardware systems, or may be implemented as a single hardware system. The hardware system may include various transitory and non-transitory memory for storing information, wired and wireless communication receivers and transmitters, displays, and input and output interfaces and devices. The various computer systems, memory, and components of the system may be operably coupled to communicate information, and the system may further include various hardware and software communication modules, interfaces, and circuitry to enable wired or wireless communication of information.
- In selected embodiments, features and aspects described herein may be implemented within a
computing environment 1100, as shown in FIG. 11, which may include one or more computer servers 1101. The server 1101 may be operatively coupled to one or more data stores 1102 (for example, databases, indexes, files, or other data structures). The server 1101 may connect to a data communication network 1103 including a local area network (LAN), a wide area network (WAN) (for example, the Internet), a telephone network, a satellite or wireless communication network, or some combination of these or similar networks. - One or
more client devices may be in communication with the server 1101, and a corresponding data store 1102, via the data communication network 1103. Such client devices may include one or more laptop computers 1107, desktop computers 1104, smartphones and mobile phones 1105, tablet computers 1106, televisions 1108, or combinations thereof. In operation, such client devices may send and receive data or instructions to or from the server 1101 in response to user input received from user input devices or other input. In response, the server 1101 may serve data from the data store 1102, alter data within the data store 1102, add data to the data store 1102, or the like, or combinations thereof. - In selected embodiments, the
server 1101 may transmit one or more media files including audio and/or video content, encoded data, generated data, and/or metadata from the data store 1102 to one or more of the client devices via the data communication network 1103. The devices may output the audio and/or video content from the media file using a display screen, projector, or other display output device. In certain embodiments, the system 1100 configured in accordance with features and aspects described herein may be configured to operate within or support a cloud computing environment. For example, a portion of, or all of, the data store 1102 and server 1101 may reside in a cloud server. - With reference to
FIG. 12, an illustration of an example computer 1200 is provided. One or more of the devices of the system 1100 may be configured as or include such a computer 1200. - In selected embodiments, the
computer 1200 may include a bus 1203 (or multiple buses) or other communication mechanism, a processor 1201, main memory 1204, read only memory (ROM) 1205, one or more additional storage devices 1206, and/or a communication interface 1202, or the like or sub-combinations thereof. Embodiments described herein may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a selective combination thereof. In all embodiments, the various components described herein may be implemented as a single component, or alternatively may be implemented in various separate components. - The
bus 1203 or other communication mechanism, including multiple such buses or mechanisms, may support communication of information within the computer 1200. The processor 1201 may be connected to the bus 1203 and process information. In selected embodiments, the processor 1201 may be a specialized or dedicated microprocessor configured to perform particular tasks in accordance with the features and aspects described herein by executing machine-readable software code defining the particular tasks. Main memory 1204 (for example, random access memory, or RAM, or other dynamic storage device) may be connected to the bus 1203 and store information and instructions to be executed by the processor 1201. Main memory 1204 may also store temporary variables or other intermediate information during execution of such instructions. -
ROM 1205 or some other static storage device may be connected to the bus 1203 and store static information and instructions for the processor 1201. The additional storage device 1206 (for example, a magnetic disk, optical disk, memory card, or the like) may be connected to the bus 1203. The main memory 1204, ROM 1205, and the additional storage device 1206 may include a non-transitory computer-readable medium holding information, instructions, or some combination thereof, for example, instructions that, when executed by the processor 1201, cause the computer 1200 to perform one or more operations of a method as described herein. The communication interface 1202 may also be connected to the bus 1203. A communication interface 1202 may provide or support two-way data communication between the computer 1200 and one or more external devices (for example, other devices contained within the computing environment). - In selected embodiments, the
computer 1200 may be connected (for example, via the bus 1203) to a display 1207. The display 1207 may use any suitable mechanism to communicate information to a user of a computer 1200. For example, the display 1207 may include or utilize a liquid crystal display (LCD), light emitting diode (LED) display, projector, or other display device to present information to a user of the computer 1200 in a visual display. One or more input devices 1208 (for example, an alphanumeric keyboard, mouse, microphone) may be connected to the bus 1203 to communicate information and commands to the computer 1200. In selected embodiments, one input device 1208 may provide or support control over the positioning of a cursor to allow for selection and execution of various objects, files, programs, and the like provided by the computer 1200 and displayed by the display 1207. - The
computer 1200 may be used to transmit, receive, decode, display, etc. one or more video files. In selected embodiments, such transmitting, receiving, decoding, and displaying may be in response to the processor 1201 executing one or more sequences of one or more instructions contained in main memory 1204. Such instructions may be read into main memory 1204 from another non-transitory computer-readable medium (for example, a storage device). - Execution of sequences of instructions contained in
main memory 1204 may cause the processor 1201 to perform one or more of the procedures or steps described herein. In selected embodiments, one or more processors in a multi-processing arrangement may also be employed to execute sequences of instructions contained in main memory 1204. Alternatively, or in addition thereto, firmware may be used in place of, or in connection with, software instructions to implement procedures or steps in accordance with the features and aspects described herein. Thus, embodiments in accordance with features and aspects described herein may not be limited to any specific combination of hardware circuitry and software. - Non-transitory computer readable medium may refer to any medium that participates in holding instructions for execution by the
processor 1201, or that stores data for processing by a computer, and includes all computer-readable media, with the sole exception being a transitory, propagating signal. Such a non-transitory computer readable medium may include, but is not limited to, non-volatile media, volatile media, and temporary storage media (for example, cache memory). Non-volatile media may include optical or magnetic disks, such as an additional storage device. Volatile media may include dynamic memory, such as main memory. Common forms of non-transitory computer-readable media may include, for example, a hard disk, a floppy disk, magnetic tape, or any other magnetic medium, a CD-ROM, DVD, Blu-ray or other optical medium, RAM, PROM, EPROM, FLASH-EPROM, any other memory card, chip, or cartridge, or any other memory medium from which a computer can read. - In selected embodiments, the
communication interface 1202 may provide or support external, two-way data communication to or via a network link. For example, the communication interface 1202 may be a wireless network interface controller or a cellular radio providing a data communication network connection. Alternatively, the communication interface 1202 may include a LAN card providing a data communication connection to a compatible LAN. In any such embodiment, the communication interface 1202 may send and receive electrical, electromagnetic, or optical signals conveying information. - A network link may provide data communication through one or more networks to other data devices (for example, client devices as shown in the computing environment 1100). For example, a network link may provide a connection through a local network of a host computer or to data equipment operated by an Internet Service Provider (ISP). An ISP may, in turn, provide data communication services through the Internet. Accordingly, a
computer 1200 may send and receive commands, data, or combinations thereof, including program code, through one or more networks, a network link, and the communication interface 1202. Thus, the computer 1200 may interface or otherwise communicate with a remote server (for example, server 1101), or some combination thereof. - The various devices, modules, terminals, and the like described herein may be implemented on a computer by execution of software comprising machine instructions read from a computer-readable medium, as discussed above. In certain embodiments, several hardware aspects may be implemented using a single computer; in other embodiments, multiple computers, input/output systems and hardware may be used to implement the system.
- For a software implementation, certain embodiments described herein may be implemented with separate software modules, such as procedures and functions, each of which performs one or more of the functions and operations described herein. The software codes can be implemented with a software application written in any suitable programming language and may be stored in memory and executed by a controller or processor.
- The foregoing described embodiments and features are merely exemplary and are not to be construed as limiting the present invention. The present teachings can be readily applied to other types of apparatuses and processes. The description of such embodiments is intended to be illustrative, and not to limit the scope of the claims. Many alternatives, modifications, and variations will be apparent to those skilled in the art.
Claims (20)
1. A method for comparing a first audio content with a second audio content, the method comprising:
obtaining a first spectrogram representing the first audio content;
obtaining a second spectrogram representing the second audio content;
generating a combined spectrogram based on the first spectrogram and the second spectrogram; and
determining whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
2. The method of claim 1 , wherein the second spectrogram is obtained based on a homography that maps the first spectrogram to a segment of a spectrogram representing the second audio content.
3. The method of claim 1 , wherein obtaining the second spectrogram comprises aligning the first spectrogram to obtain an approximate match between the first spectrogram and a segment of a spectrogram representing the second audio content.
4. The method of claim 1 , wherein:
the first audio content is part of a first audiovisual content that comprises a reference end page; and
the second audio content is part of a second audiovisual content that comprises a promo end page.
5. The method of claim 4 , further comprising:
comparing video content of the first audiovisual content with video content of the second audiovisual content.
6. The method of claim 5 , wherein comparing the video content of the first audiovisual content with the video content of the second audiovisual content is based on perceptual hashing.
7. The method of claim 1 , wherein determining whether the first audio content is misaligned with respect to the second audio content comprises at least one of:
identifying a misalignment at a beginning of the combined spectrogram with respect to time;
identifying a misalignment at or around a middle of the combined spectrogram with respect to time;
identifying a misalignment at an end of the combined spectrogram with respect to time;
identifying a complete misalignment across the combined spectrogram with respect to time; or
identifying a plurality of scattered misalignments across the combined spectrogram with respect to time.
8. The method of claim 1 , further comprising:
applying a first coloring to the first spectrogram;
applying a second coloring to the second spectrogram; and
generating the combined spectrogram comprises superimposing one of the first spectrogram or the second spectrogram over the other,
wherein determining whether the first audio content is misaligned with respect to the second audio content is based on a presence of a third coloring in the combined spectrogram, the third coloring corresponding to a combination of the first coloring and the second coloring.
9. The method of claim 1 , wherein determining whether the first audio content is misaligned with respect to the second audio content is performed using a machine learning model.
10. The method of claim 1 , wherein determining whether the first audio content is misaligned with respect to the second audio content comprises:
identifying a misalignment between the first audio content and the second audio content; and
recording a corresponding time range of the misalignment.
11. A machine-readable non-transitory medium having stored thereon machine-executable instructions for comparing a first audio content with a second audio content, the instructions comprising:
obtaining a first spectrogram representing the first audio content;
obtaining a second spectrogram representing the second audio content;
generating a combined spectrogram based on the first spectrogram and the second spectrogram; and
determining whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
12. The machine-readable non-transitory medium of claim 11 , wherein the second spectrogram is obtained based on a homography that maps the first spectrogram to a segment of a spectrogram representing the second audio content.
13. The machine-readable non-transitory medium of claim 11 , wherein obtaining the second spectrogram comprises aligning the first spectrogram to obtain an approximate match between the first spectrogram and a segment of a spectrogram representing the second audio content.
14. The machine-readable non-transitory medium of claim 11 , wherein:
the first audio content is part of a first audiovisual content that comprises a reference end page; and
the second audio content is part of a second audiovisual content that comprises a promo end page.
15. The machine-readable non-transitory medium of claim 14 , wherein the instructions further comprise:
comparing video content of the first audiovisual content with video content of the second audiovisual content.
16. The machine-readable non-transitory medium of claim 15 , wherein comparing the video content of the first audiovisual content with the video content of the second audiovisual content is based on perceptual hashing.
17. The machine-readable non-transitory medium of claim 11 , wherein determining whether the first audio content is misaligned with respect to the second audio content comprises at least one of:
identifying a misalignment at a beginning of the combined spectrogram with respect to time;
identifying a misalignment at or around a middle of the combined spectrogram with respect to time;
identifying a misalignment at an end of the combined spectrogram with respect to time;
identifying a complete misalignment across the combined spectrogram with respect to time; or
identifying a plurality of scattered misalignments across the combined spectrogram with respect to time.
18. The machine-readable non-transitory medium of claim 11 , wherein the instructions further comprise:
applying a first coloring to the first spectrogram;
applying a second coloring to the second spectrogram; and
generating the combined spectrogram comprises superimposing one of the first spectrogram or the second spectrogram over the other,
wherein determining whether the first audio content is misaligned with respect to the second audio content is based on a presence of a third coloring in the combined spectrogram, the third coloring corresponding to a combination of the first coloring and the second coloring.
19. The machine-readable non-transitory medium of claim 11 , wherein determining whether the first audio content is misaligned with respect to the second audio content comprises:
identifying a misalignment between the first audio content and the second audio content; and
recording a corresponding time range of the misalignment.
20. An apparatus for comparing a first audio content with a second audio content, the apparatus comprising:
a network communication unit configured to transmit and receive data; and
one or more controllers configured to:
obtain a first spectrogram representing the first audio content;
obtain a second spectrogram representing the second audio content;
generate a combined spectrogram based on the first spectrogram and the second spectrogram; and
determine whether the first audio content is misaligned with respect to the second audio content based on the combined spectrogram.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/817,798 US20230130010A1 (en) | 2021-10-22 | 2022-08-05 | Automated content quality control |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163270934P | 2021-10-22 | 2021-10-22 | |
US17/817,798 US20230130010A1 (en) | 2021-10-22 | 2022-08-05 | Automated content quality control |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230130010A1 true US20230130010A1 (en) | 2023-04-27 |
Family
ID=86057134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/817,798 Pending US20230130010A1 (en) | 2021-10-22 | 2022-08-05 | Automated content quality control |
Country Status (1)
Country | Link |
---|---|
US (1) | US20230130010A1 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9241229B2 (en) * | 2006-10-20 | 2016-01-19 | Adobe Systems Incorporated | Visual representation of audio data |
US11308329B2 (en) * | 2020-05-07 | 2022-04-19 | Adobe Inc. | Representation learning from video with spatial audio |
Non-Patent Citations (1)
Title |
---|
Knospe et al., Privacy-enhanced Perceptual Hashing of Audio Data, IEEE, Cited portions of text (Year: 2013) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: NBCUNIVERSAL MEDIA, LLC, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:RAYLES, JASON;REEL/FRAME:060733/0760 Effective date: 20220705 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |