EP4335111A1 - Computer-implemented method for on-demand delivery of audiovisual media - Google Patents

Computer-implemented method for on-demand delivery of audiovisual media

Info

Publication number
EP4335111A1
Authority
EP
European Patent Office
Prior art keywords
sequence
digital video
markers
descriptors
playlist
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP22748259.3A
Other languages
English (en)
French (fr)
Inventor
Boris BORZIC
Elmahdi SADOUNI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Centre National de la Recherche Scientifique CNRS
CY Cergy Paris Universite
Ecole Nationale Superieure de l'Electronique et de ses Applications ENSEA
Original Assignee
Centre National de la Recherche Scientifique CNRS
CY Cergy Paris Universite
Ecole Nationale Superieure de l'Electronique et de ses Applications ENSEA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Centre National de la Recherche Scientifique CNRS, CY Cergy Paris Universite and Ecole Nationale Superieure de l'Electronique et de ses Applications ENSEA
Publication of EP4335111A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/432Content retrieval operation from a local storage medium, e.g. hard-disk
    • H04N21/4325Content retrieval operation from a local storage medium, e.g. hard-disk by playing back content from the storage medium
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47202End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors

Definitions

  • the present invention relates to the field of identification and automated processing of digital data, in particular digital video files.
  • the invention relates more specifically to a computerized process for the audiovisual de-linearization of digital video files.
  • a large number of video files cannot be structured a priori. This is the case, for example, of events filmed live, the course of which cannot be predicted before the production of the digital video file.
  • the indexing defined a priori by the producer may not be relevant from the point of view of the user whose search criteria are not always known a priori either.
  • the practice is therefore to label the digital video file as a whole, so that the metadata associated with a digital video file are global, such as name, creation date, file format, viewing time.
  • a set of metadata provides access to a digital video file as a whole when a search for audiovisual content is performed. These metadata are therefore “global”.
  • the difficulty with video content is that it is not self-descriptive, unlike text media.
  • European patent document EP3252770A1 proposes a process for the identification and automatic post-processing of audiovisual content.
  • a formal description of the content of the digital video file is provided by an operator, such as a script in the case of a film.
  • After extracting the image (i.e. containing visual data) and audio streams from the audiovisual data, these two parts of the audiovisual data are broken down into a set of successive fragments.
  • the formal description of the digital video file is broken down into logical parts.
  • a dialog pattern is generated from the audio stream only.
  • An association of the audiovisual data with the corresponding formal description is achieved by associating logical parts of the formal description to the set of audiovisual data fragments, using the dialogue pattern.
  • a digital video file can then be indexed and then manipulated based on this association.
  • US document US6714909B1 is another example, in which a method for automating the multimodal indexing process is proposed.
  • a process comprises the following steps:
  • the method described in document EP3252770A1 has the disadvantage of requiring the provision of a formal description of the digital video file.
  • the method described in document US6714909B1 has the disadvantage of requiring that the content of the audio and/or text streams of the digital video file be semantically structured, that is to say that it must be possible to reconstitute audio content which makes sense by extracting and aggregating footage from a given video. It therefore cannot be implemented to aggregate sequences from different video files, or for semantically weakly structured video files.
  • the invention thus aims to propose an automated method for the analysis, indexing and editing of a set of possibly weakly structured digital video files, on criteria defined by the user and without a priori indexing of the content of these files.
  • the invention relates to a computerized process for audiovisual de-linearization allowing sequencing of one or more digital video files and indexing of the sequences resulting from the sequencing, by virtually cutting by time stamping the digital video file(s) into virtual sequences, each virtual sequence being defined by two sequence time stamps and associated descriptors.
  • the method comprises the following steps: a. receiving one or more digital video files to be analyzed; b. indexing each of the digital video files in a primary index by means of associated primary endogenous descriptors making it possible to identify each digital video file; c. automatic extraction of audio, image, and text data streams from each digital video file; d.
  • a multimodal candidate sequence time marker, mathematically related to the at least two unimodal sequence markers, is created; f. for each of said analyzed digital video files, according to a lower limit and an upper limit defined to determine the minimum duration and the maximum duration of each sequence with respect to the typology of the digital video file(s),
  • these pairs of sequence markers being associated with the descriptors associated with the said selected candidate sequence temporal markers, these descriptors therefore being referred to as “secondary endogenous descriptors”; g. indexing, in a secondary index which is in an inheritance relationship with respect to said primary index, of all the pairs of sequence markers and of the associated descriptors allowing the identification of each sequence, the virtual sequences being identifiable and searchable at least by the secondary endogenous descriptors and the primary endogenous descriptors.
  • it is thus possible to sequence a digital video file into sequences presenting a semantic coherence according to one to four different modalities, in the form of virtual sequences delimited by pairs of sequence time markers and indexed by the secondary descriptors associated with these sequence time markers as well as by the primary descriptors associated with the digital video file from which the sequences originate.
  • the space in memory used for these sequences corresponds to the space necessary to store the pairs of temporal markers and the associated secondary descriptors. It is in this sense that the sequencing is said to be virtual.
  • the computerized process for audiovisual de-linearization is characterized in that a video extract associated with a virtual sequence, obtained by viewing the file fragment delimited by the two sequence markers of the virtual sequence has a unit of meaning (in other words a semantic coherence) which results from the automatic analysis of each digital video file according to the four modalities and from the virtual cutting in relation to this analysis.
  • the virtual sequences can be extracted, and the video extracts corresponding to the virtual sequences can be viewed by a user, who will perceive their semantic coherence and will be able to attribute an overall meaning to them.
  • At least one of the two sequence markers of each pair of sequence markers selected in step f is a plurimodal candidate sequence temporal marker and is then called a plurimodal sequence marker; advantageously, each sequence marker of each selected pair of sequence markers is a plurimodal sequence marker.
  • the so-called endogenous descriptors are derived from the same modality, or from one or more modalities different from the modality or modalities from which the start and end sequence temporal cutting markers of the video extract are derived.
  • In step f, two types of plurimodal sequence markers are distinguished:
  • a plurimodal sequence marker created from four unimodal temporal cutting markers resulting from the four different modalities, separated two-by-two by a time interval less than the main predetermined duration, is called a main plurimodal sequence marker; and
  • a plurimodal sequence marker created from two or three unimodal temporal cutting markers resulting from as many modalities among the four, separated two-by-two by a time interval less than the main predetermined duration, is called a secondary plurimodal sequence marker.
  • At least one of the markers of each pair of sequence markers is a main plurimodal sequence marker.
  • the action modality is a modality of at least one of the two sequence markers of the pair of sequence markers selected.
  • the semantic coherence of a sequence is at least underpinned by the action modality, which plays a special role in many video files.
  • the sequence obtained will be coherent from the point of view of sporting actions.
  • weights are assigned to each of the modalities for the production of candidate sequence markers in step e and/or the selection of sequence markers in step f.
  • the semantic coherence of a sequence can be underpinned in various proportions, possibly adapted to video typologies, by the four modalities. For example, in the field of sport, we can assign a higher weight to the action modality. In the field of online courses, we can assign a higher weight to the text modality.
  • the weight of the action modality is greater than that of the image modality, which is itself greater than the weight of the text and audio modalities.
  • the weight of the text modality is greater than that of the other three modalities. Thanks to this arrangement, the semantic coherence of a sequence can be adapted to a video typology such as a video in the field of sports or to a video with high informational content such as a documentary or an online course.
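  • As a purely illustrative sketch (the numeric values, names and weighting rule below are assumptions, not values given by the invention), such weight profiles per video typology could be expressed and used to weight the modalities supporting a candidate sequence marker as follows:

    # Illustrative weight profiles per video typology; all values are assumptions.
    MODALITY_WEIGHTS = {
        "sport":         {"action": 0.40, "image": 0.30, "text": 0.15, "audio": 0.15},
        "online_course": {"text": 0.40, "audio": 0.25, "image": 0.20, "action": 0.15},
    }

    def weighted_support(modalities, typology):
        """Weighted vote of the modalities that support a candidate sequence marker."""
        weights = MODALITY_WEIGHTS[typology]
        return round(sum(weights[m] for m in modalities), 3)

    print(weighted_support({"action", "image"}, "sport"))          # 0.7
    print(weighted_support({"text", "audio"}, "online_course"))    # 0.65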
  • a weight is assigned to the secondary endogenous descriptors as well as to the primary endogenous descriptors to characterize their importance in the sequences, and this weight is greater for the secondary endogenous descriptors than that of the primary endogenous descriptors.
  • the different weights of the endogenous and exogenous descriptors make it possible, when a sequence search query is formulated later, for these two types of descriptors to play different roles.
  • When the weight of endogenous descriptors is greater than that of exogenous descriptors, the results of a sequence search will be based more on endogenous descriptors than on exogenous descriptors.
  • the secondary endogenous descriptors are said to be “unimodal” when they correspond to a single modality and are said to be “multimodal” when they are detected for several modalities.
  • information on the unimodal or multimodal character of a given secondary endogenous descriptor is kept during the indexing process. For example, if the image modality gives the “thermodynamics” descriptor and the text modality also gives the “thermodynamics” descriptor, then a plurimodal “thermodynamics” descriptor can be created (coming from the two previous descriptors, it is therefore more robust as to the interest of viewing this extract if one is interested in thermodynamics).
  • step f of the method presents the following sub-steps, for each digital video file, to produce the sequences: i) selection of a last end-of-sequence marker, in particular plurimodal, from the end of the digital video file, the last sequence start marker being designated by subtracting the upper limit from the time code of the last end-of-sequence marker selected; ii) step i) is repeated to select a penultimate sequence start marker, the sequence start marker selected at the end of the previous step i) playing the role of the last sequence end marker at the start of the new step i); iii) sub-step ii) is repeated in this way until the start of the digital video file is reached.
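  • The following minimal sketch illustrates one possible reading of sub-steps i) to iii); the choice of the admissible candidate closest to the upper limit, and the fallback used when no candidate is admissible, are assumptions made for illustration only, not the rule fixed by the invention:

    # Walk backwards from the end of the file, pairing an end-of-sequence marker
    # with a start-of-sequence marker so that the duration lies between the limits.
    def select_sequence_pairs(candidates, file_end, lower, upper):
        pairs = []
        end = file_end                      # i) last end-of-sequence marker
        while end > lower:                  # iii) repeat until (nearly) the start of the file
            window = [t for t in candidates if end - upper <= t <= end - lower]
            # assumption: prefer the admissible candidate closest to (end - upper);
            # if none exists, fall back to subtracting the upper limit
            start = (min(window, key=lambda t: abs(t - (end - upper)))
                     if window else max(end - upper, 0.0))
            pairs.append((start, end))
            end = start                     # ii) the selected start becomes the next end
        return list(reversed(pairs))

    print(select_sequence_pairs([12.0, 55.0, 118.0, 170.0],
                                file_end=180.0, lower=30.0, upper=120.0))
    # [(12.0, 118.0), (118.0, 180.0)]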
  • the main predetermined duration is less than 5 seconds, and optionally the maximum duration of each selected sequence is equal to two minutes, so that the candidate sequence markers are close enough in time and the sequencing is fine enough.
  • When the sequencing is fine enough, it is possible to constitute virtual sequences whose duration is limited by a relatively low upper limit.
  • the duration of the selected virtual sequences is limited by an upper limit.
  • the time between the two markers of a sequence marker pair is less than 2 minutes, 1 minute, or 30 seconds.
  • At least one additional step of enriching the indexing of the virtual sequences by exogenous secondary descriptors is carried out in step g.
  • the sequencing can be repeated to end up with finer sequencing, since additional - exogenous - information has been added.
  • the secondary descriptors by means of which the identified sequences are indexed are enriched with a numerical or lettered indicator, such as an overall score of a digital collection card, calculated for each sequence from the secondary descriptors of the virtual sequence and/or the primary descriptors of the digital video file in which the sequence was identified.
  • the results of a subsequent sequence search in the secondary index can be ordered on the basis of this numerical or lettered indicator.
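  • For illustration only (the weighting rule and field names below are assumptions, not the rule defined by the invention), such an indicator could be computed per virtual sequence and used to order search results:

    # Sketch: an overall score per virtual sequence, derived from its secondary
    # descriptors and from the primary descriptors of its source file.
    def overall_score(secondary, primary, w_secondary=2.0, w_primary=1.0):
        return w_secondary * len(secondary) + w_primary * len(primary)

    sequences = [
        {"id": "seq-1", "secondary": ["goal", "header"], "primary": ["football"]},
        {"id": "seq-2", "secondary": ["corner"], "primary": ["football", "2021"]},
    ]
    for s in sequences:
        s["score"] = overall_score(s["secondary"], s["primary"])
    # order a subsequent search result on the basis of this indicator
    print(sorted(sequences, key=lambda s: s["score"], reverse=True))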
  • the action modality comprises the sub-modalities {detection of shot changes, detection of action according to a typology of digital video files}, and each of the sub-modalities of the action modality makes it possible to generate a particular set of unimodal cutting time markers.
  • the analysis according to the audio modality comprises noise detection, music detection and/or transcription of speech into a text stream.
  • the analysis according to the image modality includes the sub-modalities {shape or object recognition; shot aggregation; optical character recognition}, and each of the sub-modalities of the image modality makes it possible to generate a particular set of unimodal descriptors.
  • the invention also relates to a computerized method for the automatic production of an ordered playlist of video extracts from digital video files, with a data transmission stream, the digital video files being indexed in a primary index stored in a documentary database containing the digital video files with primary descriptors, the digital video files having been, beforehand and by means of the computerized de-linearization process according to one of the preceding embodiments, cut virtually by time stamping into virtual sequences which are defined by two sequence time markers forming a pair of sequence markers and by associated secondary descriptors, the pairs of virtual sequence markers and the associated secondary descriptors being stored in a secondary index stored in a documentary database, the secondary index being in an inheritance relation with the primary index, these indexes being accessible via a graphical interface.
  • the computerized process of searching and automatic production of a playlist of video extracts includes:
  • the stored digital video files have been sequenced, and the virtual sequences of the digital video files have been indexed in the secondary index before the search criteria are formulated and before the search result is received by the client by means of the sequencing process as described above;
  • the ordered automatic playlist is a list of video sequences of the digital video file(s), each corresponding to a virtual sequence of a digital video file, in an order which is a function of the secondary descriptors associated with each sequence and of the primary descriptors associated with each digital video file. Thanks to this arrangement, it is possible to select one or more sequences of digital video files obtained at the end of the sequencing process, that is to say in an automated manner, without requiring the user to view one or more digital video files in their entirety.
  • This selection can be made by means of a search query, the search being carried out in the secondary index containing the secondary descriptors of the sequences, which is linked to the primary index containing the primary descriptors of the digital video files from which the sequences originate.
  • the method determines according to the search query and the descriptors of the virtual sequence(s), whether the virtual sequences are essential (the number of descriptors is relevant) or ornamental (the number of descriptors is not relevant with respect to the criterion defined for the essential virtual sequences);
  • the method produces via the transmission stream an exhaustive playlist of video extracts associated with all the essential virtual sequences, or a summary with a selection of video extracts associated with the essential virtual sequences according to criteria specified by the user,
  • the method produces via the transmission stream a playlist of video extracts associated with the so-called “zapping” virtual sequences of these digital files, with a selection of the essential virtual sequences associated with the video extracts according to criteria specified by the user.
  • the method produces via the transmission stream a summary playlist with a selection of video extracts from this digital video file according to criteria specified by the user during his search,
  • the method produces via the transmission stream a playlist of video extracts associated with the so-called “zapping” virtual sequences of these digital files, with a selection of video extracts according to criteria specified by the user during their search.
  • the computerized method for automatically producing a playlist of video extracts allows, after automatic production of an ordered playlist of video extracts from digital video files, the following navigation operations from the virtual remote control and from the data transmission stream:
  • this comprises a single navigation bar for all the video extracts arranged one after the other on the playlist, in the order of the sequence markers resulting from the user's request (which presents the descriptors associated with the cutting markers in the secondary index).
  • the method for automatically producing an ordered playlist of video extracts from digital video files allows the following additional operation: d. a new temporary exit from the viewing of the original digital video file of the extract being played in operation c), in order to view during step d) a summary created automatically, and prior to this viewing, from this single original digital file.
  • the method for automatically producing an ordered playlist of video extracts from digital video files allows the following additional operation: e. recording of browsing history on the playlist of video sequences and creation of a new digital file which is this browsing history.
  • the search query formulated in step 1 is multi-criteria and combines a full-text search and a faceted search, and the criteria for ordering the automatic playlist include chronological and/or semantic and/or relevance criteria.
  • This arrangement makes it possible to formulate search queries as varied as possible, including with suggestions based on facets or criteria, and to obtain an ordered list of results.
  • the search query formulated in step 1 is carried out automatically on the basis of one or more criteria specified by the user chosen from a list comprising: the desired duration of an automatic playlist as well as semantic criteria.
  • the search query formulated in step 1 is carried out by a conversational robot.
  • the computerized method for automatically producing an ordered playlist of video extracts from digital video files comprises a viewing step in which the user displays on a first screen a video extract from the playlist, and descriptors of the virtual sequence associated with the video extract on a second screen synchronized with the video extract.
  • the computerized method for automatically producing an ordered playlist of video extracts from digital video files comprises a viewing step in which the descriptors associated with the virtual sequences are displayed on the extracts. Thanks to these arrangements, the user can view, at the same time as the video extracts, the descriptors on the basis of which the method has considered the sequence as relevant with respect to the search query. In this way, the user can both assign a global meaning to the video extract and compare it to the global meaning which could be attributed to it on the basis of the descriptors which have been automatically associated with it.
  • the technology used is ElasticSearch®.
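  • As a hedged illustration of what such a multi-criteria search (full text plus facets, with an ordering criterion) could look like with ElasticSearch®, where the index name, field names and example values are assumptions chosen for illustration and not those of the invention:

    import json

    # Sketch of a query body against an assumed "sequences_secondary" index:
    # a full-text clause, a faceted filter, a facet aggregation and an ordering.
    query = {
        "query": {
            "bool": {
                "must": [{"multi_match": {"query": "offensive forehand",
                                          "fields": ["descriptors", "transcript"]}}],
                "filter": [{"term": {"sport": "table tennis"}}],
            }
        },
        "aggs": {"by_player": {"terms": {"field": "player.keyword"}}},
        "sort": [{"overall_score": {"order": "desc"}},
                 {"start_timecode": {"order": "asc"}}],
    }
    print(json.dumps(query, indent=2))   # body for e.g. POST /sequences_secondary/_search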
  • access to the video files is done in “streaming” mode.
  • the invention further relates to an automatic list of pairs of sequence markers and associated descriptors resulting from the computerized method of automatically producing an ordered playlist of video extracts from digital video files, presenting endogenous and exogenous descriptors consistent with the search request.
  • all the virtual sequences have, as end-of-sequence marker, at least one main multimodal sequence marker or sequence marker resulting from three modalities.
  • the end of sequence marker of each pair of sequence time markers corresponding to each virtual sequence is derived at least from the action modality.
  • the sequence time markers are determined by an approach multimodal by automatic analysis, file by file, of each of said one or more digital video files, according to at least two of the four modalities: image modality, audio modality, text modality, action modality.
  • At least two sequence time markers are determined randomly or unimodally.
  • the invention also relates to a computerized editing method with virtual cutting, without creating a digital video file, based on the computerized method of automatic production of an ordered playlist of video extracts from digital video files, comprising the following steps:
  • the computerized method of editing with virtual cutting comprises the following steps:
  • the playlist of video extracts is generated automatically by a computerized method of searching and automatically producing a playlist having ordered video extracts according to one of the embodiments described above.
  • the invention further relates to the use of video extracts or of a playlist of video extracts, obtained by the computerized method of searching and automatic production of a playlist or by the editing method according to one of the embodiments described above, in a social network or in a search engine, or to constitute a new digital video file.
  • the invention finally relates to a computerized system comprising:
  • At least one acquisition module for one or more digital video files;
  • At least one sequencing module generating sequences of indexed digital video files;
  • At least one search module comprising a client making it possible to formulate a search query for the implementation of the steps:
  • One or more digital video files to be analyzed are received via the acquisition module;
  • Each of said digital video files is automatically indexed in a primary index, based on the endogenous, so-called primary, descriptors of said digital video file;
  • the audio, image and text data streams are extracted from each of the digital video files
  • a file-by-file analysis is carried out of each of said one or more digital video files according to the four modalities: image modality, audio modality, text modality, action modality, the analysis automatically producing one or more unimodal cutting time markers for each of the modalities, one or more descriptors being associated with each of the unimodal cutting time markers;
  • candidate sequence time markers are provided, with the aim of determining virtual sequences, together with the descriptors associated with these candidate sequence time markers, which are:
  • the time codes corresponding to said unimodal cutting time markers are compared and, each time that at least two unimodal cutting time markers resulting from different analysis modalities are separated by a time interval less than a main predetermined duration, a plurimodal candidate sequence temporal marker, in mathematical connection with the at least two unimodal cut markers, is created;
  • a lower limit and an upper limit are defined according to the type of said digital video file for the duration of a sequence, and pairs of sequence markers, called start-of-sequence and end-of-sequence markers, are automatically selected from the candidate sequence markers, each pair of sequence markers having a start-of-sequence marker and an end-of-sequence marker, such that the duration of each retained sequence is between said lower and upper limits, these pairs of sequence markers being associated with the descriptors associated with the said selected candidate sequence temporal markers, these descriptors therefore being referred to as “secondary endogenous descriptors”;
  • a search query for sequences of digital video files is formulated using the search module; each of the modules comprising the necessary calculation means, each of the modules other than the dispatcher module communicating with the dispatcher module and the dispatcher module managing the distribution of the calculations between the other modules.
  • this system further comprises at least one module for enriching the primary descriptors of the digital video files and/or the secondary descriptors of the virtual sequences of digital video files by exogenous complementary descriptors.
  • this system further comprises a video editor module communicating with the search module.
  • Fig. 1 represents a flowchart of a device making it possible to implement the method of analysis, sequencing and indexing of the sequences of a digital video file.
  • Fig. 2a represents a first step in sequencing a digital video file according to the four modalities: image, audio, text and action.
  • Fig. 2b represents a second step of sequencing a digital video file according to the four modalities: image, audio, text and action.
  • Fig. 2c represents a third step of sequencing a digital video file according to the four modalities: image, audio, text and action.
  • Fig. 3 represents the different interactions between the modules and the services of the computerized process in connection with the possible actions of the user.
  • Fig. 4 represents the steps of an iteration of the method for sequencing a video file on the basis of four modalities.
  • Fig. 5a represents a graphical interface 55 for editing or viewing a playlist.
  • Fig. 5b shows another embodiment of a graphical interface for editing or viewing a playlist.
  • Fig. 6 schematically represents the effect of the manipulation of the virtual remote control on the playlist.
  • Fig. 7a shows a third embodiment of a graphical interface 55.
  • Fig. 7b shows a fourth embodiment of a graphical interface 55.
  • Fig. 8 shows a fifth embodiment of a graphical interface 55.
  • Fig. 9 shows a sixth embodiment of a graphical interface 55.
  • Fig. 10 shows a seventh embodiment of a graphical interface 55.
  • Fig. 11 shows an eighth embodiment of a graphical interface 55.
  • Fig. 12 shows a ninth embodiment of a graphical interface 55.
  • the invention relates to a method for the analysis, sequencing and multimodal indexing of digital audiovisual data.
  • the format of the audiovisual data is not limited a priori.
  • the digital video file formats MPEG, MP4, AVI, WMV of the ISO/IEC standard can be considered.
  • the audiovisual data may be available on the Internet, on a public or private digital video library, or even provided individually or in a group by a particular user.
  • Metadata is integrated into the audiovisual document, in particular technical metadata (compression level, file size, number of pixels, format, etc.) and cataloging metadata (title, year of production, director, etc.).
  • This metadata will be referred to as "global” metadata insofar as it is associated with the digital video file as a whole.
  • a digital video file without any cataloging metadata can be sequenced automatically by the method according to the invention without human intervention. This is one of the strengths of the method compared to the sequencing methods of the prior art.
  • While the audiovisual de-linearization process can be implemented on structured digital video files, such as those used in “broadcast”-type distribution processes, it is particularly relevant in the case of unstructured or weakly structured digital video files, such as those generally available on the Internet or used in “multicast”-type broadcasting processes, for example YouTube® videos.
  • the method comprises several steps traversed in a non-linear manner, requiring its implementation on a computerized device 8 for sequencing a digital video file, an embodiment of which is shown in FIG. 1, comprising several modules:
  • An acquisition module 1 allowing the recovery of one or more video files from various sources and their indexing by means of so-called primary descriptors in a primary index;
  • a sequencing module 5 generating virtual sequences (or even virtual fragments) of the digital video file(s) and indexing them in a secondary index by means of secondary descriptors;
  • a search module 6 comprising the client making it possible to carry out a search on the sequences generated by module 5 for one or more digital video files;
  • an enrichment module 4;
  • a video editor module 7 comprising a graphical interface allowing manipulation of the virtual sequences produced by module 5, following a search for virtual sequences.
  • a virtual sequence of a digital video file designates a virtual fragment of the initial digital video file, of shorter duration than that of the initial file, in which the succession of images between the beginning and the end of the fragment is exactly the same as that of the initial digital video file (or original file, i.e. the one in which the virtual sequence was identified) between the two corresponding instants, without a new digital video file specific to the sequence being constituted at the physical level.
  • a virtual sequence of a digital video file is therefore constituted solely by the data of a pair of sequence time markers, comprising a start of sequence marker and an end of sequence marker.
  • Each time stamp corresponds to a particular timecode in the original digital video file.
  • a virtual digital video file sequence is therefore systematically indexed by means of one or more semantic descriptors, called secondary descriptors.
  • the space in storage memory used to memorize these "virtual" sequences corresponds to the space necessary to store the pairs of temporal markers and the associated secondary descriptors. This is why the sequencing is said to be virtual.
  • the sequencing and indexing method according to the invention is therefore particularly inexpensive in terms of memory.
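  • A minimal sketch of what such a virtual sequence amounts to in memory (field names and values are illustrative assumptions): only the identifier of the original file, the two time markers and the associated descriptors are stored, and viewing the extract later simply means playing the original file between the two markers:

    from dataclasses import dataclass, field

    @dataclass
    class VirtualSequence:
        source_file: str                 # original digital video file (primary index)
        start: float                     # start-of-sequence time marker, in seconds
        end: float                       # end-of-sequence time marker, in seconds
        descriptors: list = field(default_factory=list)   # secondary endogenous descriptors

    seq = VirtualSequence("match_2021.mp4", 612.4, 655.0, ["goal", "penalty area"])
    # no new video file is created: the extract is obtained by streaming the
    # original file between the two markers
    print(f"play {seq.source_file} from {seq.start}s to {seq.end}s  {seq.descriptors}")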
  • a virtual sequence of a digital video file then allows, at a later stage and in particular according to the needs of the user, the extraction of a "real" fragment of a digital video file, that is to say the constitution of a "video extract" of a digital video file.
  • the constitution of a video extract from a digital video file can for example take the form of modifications in the random access memory of a processor by viewing the content between the two sequence markers of the chosen virtual sequence, in particular in streaming, in particular after a decompression stage.
  • This visualization of the video extract does not require the constitution of a new digital video file and directly calls up the passage or the fragment of the original digital video file thanks to the virtual sequence.
  • the constitution of a video extract can possibly in certain cases materialize in a storage memory by the recording of the fragment of digital video file associated with the virtual sequence in the form of a new digital video file which can be of smaller size than that of the digital video file in which the corresponding virtual sequence has been identified.
  • the acquisition module 1 makes it possible to copy from various storage sources and to record on a suitable storage device one or more digital video files that one wishes to analyze.
  • the storage device may contain other files already acquired and its content is increased as the device is used.
  • the storage device allows access to the video file in “streaming” mode.
  • the set of digital video files acquired by the module 1 can be homogeneous from a content point of view or heterogeneous.
  • the process can be implemented in any field (sport, online courses, scientific conferences, television news, amateur videos, cinema, etc.) or even in several fields at the same time.
  • a domain or even a typology can in particular be described using semantic descriptors.
  • the different modules are made up of physical or virtual machines, and therefore of one or more processors.
  • the machines are organized into farms (“cluster” in English).
  • the device comprises at least one master node (“master” in English) which interacts with a plurality of “worker” nodes called “workers”.
  • Each of the nodes, master and “workers”, encapsulates at least the applications, storage resources, means of calculation necessary for the realization of the task or tasks to which it is dedicated.
  • Any container orchestration solution that automates the deployment and scaling of the management of containerized applications can be considered for the creation of this “cluster”.
  • the ElasticSearch® technology available in Open Source, may be used.
  • the digital video files acquired by module 1 are therefore stored, for example in a documentary database, and they are further indexed in a so-called "primary" index, making it possible to find and access each of the digital video files as a whole.
  • the primary index is for example contained in the documentary database.
  • the indexing of a given digital video file in the primary index is done by means of so-called “primary” descriptors. This is for example all or part of the metadata of the digital video file.
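  • A hedged illustration of this two-level indexing (index and field names are assumptions; in an ElasticSearch® deployment these could be two indices, or a parent/child join, the secondary documents pointing back to the primary document from which they inherit):

    # Primary index: one document per digital video file, identified by its
    # primary (global) descriptors / metadata.
    primary_doc = {
        "_index": "videos_primary",
        "_id": "video-0001",
        "title": "Amateur football match",
        "format": "MP4",
        "duration_s": 5400,
    }

    # Secondary index: one document per virtual sequence, in an inheritance
    # relation with the primary index through the parent identifier.
    secondary_doc = {
        "_index": "sequences_secondary",
        "_id": "video-0001#seq-0042",
        "parent_video": "video-0001",
        "start": 612.4,
        "end": 655.0,
        "descriptors": ["goal", "penalty area"],
    }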
  • the database is document-based, as opposed to relational, in the sense that searching the database is not based on a relational model or limited to an SQL-like language based on algebraic operators, as will be described later.
  • Each digital video file acquired by the acquisition module 1 is transmitted to the dispatcher module 2 which is a master node.
  • the dispatcher module 2 receives and distributes the requests on the "worker" nodes suitable for the execution of the requests and available for this execution.
  • the dispatcher module 2 can launch a preliminary and optional step of enriching the metadata at the level of the enrichment module 4.
  • the enrichment module 4 which is a "worker” node, is in particular connected to external databases, such as databases (4a) that are free to access and use (Open Data), web services (4b) or other databases (4c), private in particular.
  • this preliminary step is not essential for the implementation of the method and it may not be executed or may not result in any effective enrichment of the metadata initially associated with the digital video file.
  • the method is based on techniques of automatic de-linearization of the digital video file based on the content.
  • By de-linearization is meant the discovery and/or recognition of underlying structures in a digital file, in particular a digital video file, without human intervention.
  • the de-linearization is, in the context of the invention, based on the content of the digital file, including the metadata, enriched or not beforehand.
  • the dispatcher module 2 can initially trigger four analyzes at the level of the multimodal analysis module 3.
  • Multimodal analysis module 3 is a “worker” node on which four different computerized devices are implemented, each implementing an automatic learning algorithm. These are, for example, four different neural networks. These neural networks analyze the digital video file with different viewpoints in parallel.
  • Each of these neural networks is chosen appropriately to extract temporal markers of potential cutting of the digital video file into sequences having coherence, i.e. meaning, with respect to a particular point of view of analysis.
  • the image stream (equivalently video stream) of the digital video file can be considered, among other things, as an ordered collection of images. We can therefore assign a sequence number to each image, allowing it to be found within the digital video file.
  • a cutting time marker corresponds to a sequence number, or equivalently to a given instant during the viewing of the video, the dates being able to be identified with respect to the initial instant corresponding to the first image digital video file.
  • a cutting marker is associated with a time code (“timecode”).
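  • For instance, assuming a constant frame rate of 25 images per second (an assumption made only for this example), the correspondence between an image's sequence number and its timecode is straightforward:

    FPS = 25.0   # assumed constant frame rate

    def frame_to_timecode(frame_number: int) -> float:
        return frame_number / FPS        # seconds elapsed since the first image

    def timecode_to_frame(seconds: float) -> int:
        return round(seconds * FPS)

    print(frame_to_timecode(15375))      # 615.0 s
    print(timecode_to_frame(615.0))      # frame 15375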
  • the neural networks used may in particular be convolutional neural networks (CNN) and/or recurrent neural networks.
  • Each of these neural networks contains several successive layers of neurons, so as to be able to undergo a deep learning type learning phase, unsupervised, semi-supervised or supervised, preferably pre-trained before being implemented in device 8.
  • the role of supervision may be more or less important depending on the method of analysis.
  • the analysis of the text and sound streams may, in one non-limiting embodiment, be carried out by a neural network having undergone an unsupervised learning phase, and the analysis of the image stream may implement a neural network that has undergone a supervised or semi-supervised learning phase.
  • the number and type of layers are chosen according to the type of analysis to be performed.
  • a digital video file includes image (or equivalently video), sound (or equivalently audio) and text components (also called "streams") placed in a container.
  • a digital video file may contain several audio streams and/or several image streams.
  • the text-type stream contains, for example, metadata, subtitles, a transcription of the audio stream as text where possible, etc.
  • the first neural network, called the analyzer according to the image modality (3a), is configured to carry out an analysis of the image stream, image by image. It can in particular carry out analyses of the type: detection of objects, shapes, color, texture, detection of similar images, optical character recognition.
  • the analyzer according to the image modality (3a) analyzes the content of each image of the file to be analyzed pixel by pixel. It is, among other things, equipped with an object detection algorithm, preferably capable of analyzing a video stream in real time while maintaining good predictive performance (algorithm available under the name “Yolo3” for example).
  • the analyzer following the image modality (3a) extracts a set of primitives which take into account certain representations such as the contour, the texture, the shape and the color, then it aggregates the results into a single signature allowing similarity calculations, in particular through a hybridization between Deep Learning and unsupervised clustering algorithms ("K Nearest Neighbors", KNN).
  • the algorithm aggregates the results into a signature allowing similarity calculations, in particular through a hybridization between Deep Learning algorithms and unsupervised clustering (KNN) (shot aggregation).
  • the image modality gives rise to an analysis according to at least three sub-modalities:
  • the second neural network is a so-called sound analyzer network (3b) or equivalently an analyzer according to the audio modality or according to the sound modality. It is equipped with an audio track separator and an activity detector for speech, noise, music, ...
  • the third neural network (3c) is a text stream analyzer, or equivalently an analyzer according to the text modality, which processes for example metadata, subtitles when available, text obtained after a "speech to text" extraction on the basis of known voice recognition technologies, or even "video tagging" information described later.
  • the analyzer following the text modality (3c) cuts sentences and paragraphs into units of meaning reflecting a change of subject or the continuation of an argument, according to models of discourse analysis.
  • the analyzer following the text modality (3c) can also, via an automatic language processing (NLP) platform, possibly Open Source, extract semantic metadata to feed structured fields from the full text coming from module 4, for example from web sources and/or social networks.
  • the fourth neural network (3d) is an analyzer of the video stream as a whole, in order to create cutting markers based on dynamic notions, such as the notion of action or shot changes.
  • This modality of analysis will be called equivalently action modality or event modality.
  • the actions could include the phases of actual play as opposed to the phases during which the players are not playing, for example: waiting for the next serve, picking up the ball, ...
  • the analyzer following the action modality (3d) first detects shot changes. It should be noted that shot changes are generally not made randomly by an editor, so they can carry rich information, which can be recovered at least partially thanks to this detection of shot changes.
  • the characteristic images of each shot are then sent to the analyzer according to the image modality (3a).
  • the information returned by the analyzer according to the image modality (3a) is analyzed in the analyzer according to the action modality (3d) by an action detection algorithm.
  • a dense pose estimation system can be implemented, which associates the pixels of two successive images based on the intensities of the different pixels in order to match them with one another.
  • Such a system can perform “video tracking” without sensors having been positioned on the animated objects/subjects present in the video content.
  • a bank of actions can be set up with a view to a supervised learning phase, thanks in particular to this estimation.
  • the analysis of a player's arm gesture on a set of digital video files each containing a sequence of well-identified offensive forehands allows the neural network to recognize, based on the successive positions of a player's arm, an offensive forehand in a video file that was not used for training.
  • Actions can be defined outside the context of sport.
  • a handshake between two subjects can be an action in the sense of the invention, and a neural network can learn to recognize such an action.
  • the analyzer following the action modality (3d) can also exploit the sound associated with the images.
  • an interruption in the flow of the speaker can be indicative of a change of action in the sense of these videos, that is to say the passage from one sequence of the course to another sequence.
  • the analyzer following the action modality (3d) can also exploit "video tagging" information, i.e. metadata of the keyword type added manually to the digital video file, when it is relevant from the point of view of the actions that have been identified.
  • the action modality gives rise to at least two sub-modalities:
  • the first sub-modality is the analysis (or equivalently the detection) of shot changes
  • the second sub-modality is action detection in the sense of a typology, such as a typology of digital video files, or of gesture or motion.
  • the method can include a phase of training the neural networks on a set of video files associated with a particular domain, for example a set of video files relating to a particular sport or a particular scientific field. It can also be implemented on neural networks previously trained for a domain chosen by the user, for example.
  • the analyzers according to the image (3a) and action (3d) modalities can provide sets of unimodal temporal markers according to several sub-modalities.
  • different unimodal cutting temporal markers can be identified according to one or more of the sub-modalities: change of shots,
  • a descriptor is a term, which may be a common noun or a proper noun, an adjective, a verb, a phrase, a compound word or a group of words, and which represents a concept. Only descriptors or combinations of descriptors can be used for indexing. Non-descriptors may, however, be used in the formulation of the search request at the level of the search and editing module 6.
  • descriptors can optionally be defined in a thesaurus specific to the device or come from existing thesaurus.
  • a descriptor therefore makes it possible, in documentary language, to specify the content of the digital video file when it is associated with the digital video file as a whole, or of a sequence of digital video file when it is associated with the latter.
  • the analysis step can be performed based on minimal metadata.
  • the following schematic example helps to understand the different steps of the process. Let's assume that a user of the device wants to analyze a video:
  • - whose audio track does not allow the extraction of significant textual content. For example, it contains only noise without identifiable words, or background music without words and unrelated to the image content.
  • the example digital video file is an "example 1" amateur video file, made during a football match and in a very noisy sound environment, so that individual words cannot be distinguished in the background noise.
  • a first analysis by the multimodal analysis module 3 makes it possible to bring out a few descriptors of the type ball, football, jersey (and their colors), names of certain players, football stadium soundscape, corresponding to a relatively coarse sequencing after processing of the results of the multimodal analysis module 3 by the sequencing module 5, which will be described later.
  • the dispatcher module 2 can optionally enrich the unimodal descriptors identified and associated with the unimodal cutting time markers by exogenous descriptors, either by transmitting them to the enrichment module 4, or from the descriptors already identified and stored in the device itself, especially in the primary and secondary indexes.
  • exogenous descriptors such as “match, goal, half-time, ...” may be added.
  • exogenous descriptors can also be found on the device's database if it has already analyzed other video files such as football matches.
  • the dispatcher restarts an analysis step by the multimodal analysis module 3 on the basis of these enriched descriptors.
  • This new step generates more unimodal cut-out time markers and/or more adapted to the analyzed video.
  • a second stage of analysis of the "example 1" video, following the enrichment of the descriptors by the enrichment module 4, will make it possible to obtain a sequencing on the basis of the two halves and of the goals scored, if these events are identified.
  • the multimodal analysis module 3 used can a priori be "generalist", i.e. adapted to digital video files whose content is as varied as possible, or specialized by training on an ad hoc set of videos.
  • a multimodal analysis module 3 dedicated to and trained in this area, or even in a specific sport, can be implemented. It is also possible to analyze the same video with several multimodal analysis modules 3 dedicated to several different domains in order to obtain different sequencings, or to use a set of modules 3 and change the choice of multimodal analysis module 3 as the metadata are enriched, so as to move towards a multimodal analysis module 3 increasingly adapted to the content of the digital video file, for which the device had no a priori knowledge of the domain of the content.
  • each of the modules 3 of multimodal analysis being adapted to a particular and/or general field.
  • the multimodal analysis module 3 may analyze the file according to only two modalities, for example if one of the streams of the file is not usable, or if one wishes to favor these two modalities.
  • the temporal markers of unimodal cutting and the endogenous, and possibly exogenous, associated unimodal descriptors are transmitted by the dispatcher to the module 5 of sequencing.
  • Sequencing module 5 is also a “worker” module. The sequencer synthesizes all the information collected by the dispatcher to create homogeneous, coherent and relevant sequences, if possible according to several of the points of view used in module 3 of multimodal analysis at the same time.
  • the horizontal axis represents the time axis for the digital video file, that is to say the order of appearance of the various images which constitute it;
  • the unimodal cutting time markers associated with the image modality are for example represented on the top line, the unimodal cutting time markers associated with the audio modality on the line just below, then below again the unimodal cutting time markers associated with the textual modality, and finally the unimodal cutting time markers associated with the action modality are represented on the bottom line.
  • the sequencing module 5 proposes candidate sequence time markers.
  • a candidate sequence time stamp is:
  • To create a plurimodal candidate sequence temporal marker, one proceeds as follows: if at least two unimodal cutting temporal markers from different modalities are identified as temporally close, a plurimodal candidate sequence temporal marker, in mathematical relation with these unimodal temporal cutting markers, is created.
  • the temporal proximity is defined with respect to a time criterion T2 specified beforehand: two (or more) unimodal cutting temporal markers are considered temporally close if they are separated two-by-two by a duration less than a predetermined duration T2, called the main duration.
  • a plurimodal sequence temporal marker is created in mathematical connection with the unimodal cutout markers which underlie its creation according to a rule fixed beforehand.
  • For example, the plurimodal candidate sequence time marker is identical to the unimodal cutting time marker from the audio modality. Or again, it can correspond to the time marker closest to the mean of the time codes of the n unimodal cutting time markers identified as temporally close.
  • a unimodal candidate sequence time marker is created on the basis of a single modality. In this case, it is said to be a unimodal candidate sequence time marker and is identical to the identified unimodal cutting time marker.
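  • A minimal sketch of this creation of candidate sequence markers (the chaining-based grouping, the mean-timecode merge rule and the data layout below are simplifying assumptions for illustration only, not the exact rule fixed by the invention):

    # unimodal: list of (timecode_seconds, modality, descriptors); T2 is the main
    # predetermined duration defining temporal proximity.
    def candidate_markers(unimodal, t2):
        events = sorted(unimodal, key=lambda e: e[0])
        candidates, group = [], [events[0]]
        for ev in events[1:]:
            if ev[0] - group[-1][0] < t2:    # temporally close to the current group
                group.append(ev)
            else:
                candidates.append(fuse(group))
                group = [ev]
        candidates.append(fuse(group))
        return candidates

    def fuse(group):
        modalities = {m for _, m, _ in group}
        timecode = sum(t for t, _, _ in group) / len(group)     # assumed merge rule: mean
        descriptors = sorted({d for _, _, ds in group for d in ds})
        kind = ("main plurimodal" if len(modalities) == 4
                else "secondary plurimodal" if len(modalities) >= 2 else "unimodal")
        return {"timecode": round(timecode, 2), "kind": kind, "descriptors": descriptors}

    markers = [(12.0, "image", {"ball"}), (12.8, "audio", {"whistle"}),
               (13.1, "action", {"goal"}), (13.4, "text", {"goal"}),
               (47.0, "image", {"jersey"})]
    print(candidate_markers(markers, t2=5.0))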
  • Figure 2a represents the decomposition of a digital video file according to the four modalities: image, audio, text and action.
  • two plurimodal candidate sequence time markers 21 are detected in this case according to the four modalities.
  • Candidate sequence markers are therefore said to be “main” when they come from the four modalities.
  • the two candidate sequence temporal markers 21 of FIG. 2a are therefore main plurimodal.
  • Endogenous plurimodal descriptors, called “main” because they come from the four modalities, are associated with each of the identified main plurimodal candidate sequence temporal markers 21.
  • FIG. 2b represents the breakdown of the same digital video file as for FIG. 2a according to the four modalities: image, audio, text and action.
  • This decomposition leads initially to the detection of three main candidate sequence temporal markers 21, resulting from four different modalities.
  • Plurimodal candidate sequence temporal markers 22 stemming from only three modalities can also be identified.
  • This plurimodal candidate sequence marker is said to be secondary because it is plurimodal but stems from fewer than four modalities.
  • the secondary plurimodal candidate sequence marker is associated with endogenous plurimodal descriptors, called secondary because they are plurimodal but come from fewer than four modalities.
  • a plurimodal candidate sequence marker, whether main or secondary, can be associated with endogenous plurimodal (or equivalently multimodal) descriptors, derived from the unimodal descriptors associated with the unimodal cut-out temporal markers of all the modalities which made it possible to select the plurimodal marker.
  • the descriptors are said to be "endogenous" when they come from the sequencing of the digital video file by the sequencing module (5) but not from an enrichment step by the module (4) using information exogenous to the digital video file.
  • Two secondary plurimodal candidate sequence markers 22 resulting from three modalities can be seen in Figure 2b.
  • a proximity threshold can be predetermined;
  • in a second step, a plurimodal candidate sequence marker called "secondary", because it is plurimodal but results from fewer than four modalities, is identified, and endogenous plurimodal descriptors, likewise called secondary because they are plurimodal but result from fewer than four modalities, are associated with it.
  • This case is represented in FIG. 2c, still for the same digital video file as in FIG. 2a.
  • the sequencing allows the detection in a first stage of main plurimodal candidate sequence markers 21, in a second stage of secondary plurimodal candidate sequence markers 22 resulting from three modalities, then in a third stage of secondary plurimodal candidate sequence markers 23.
  • the plurimodal candidate sequence markers are therefore initially chosen by temporal proximity across the four modalities, which leads to the choice of the main plurimodal candidate sequence markers 21.
  • secondary multimodal sequence markers 22 or 23 can be selected based on a combination of two or three modalities.
  • the sequencing is considered “insufficient” according to automatically assessable criteria. For example, if at least one time interval separating two successive candidate sequence markers has a duration greater than a predetermined duration, called the threshold duration T1, defined for example in relation to the total duration of the digital video file or in absolute terms, the sequencing is insufficient (a minimal sketch of this test follows the next item).
  • T1: a predetermined threshold duration.
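  • A minimal sketch of this automatic insufficiency test (Python, hypothetical names; treating the file start and end as implicit markers is an assumption, consistent with the default markers mentioned further below):

    def sequencing_is_insufficient(candidate_times, file_duration, t1):
        """True if at least one gap between successive candidate sequence
        markers (file start and end included) exceeds the threshold T1."""
        times = sorted([0.0, *candidate_times, file_duration])
        return any(later - earlier > t1 for earlier, later in zip(times, times[1:]))
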
  • Once the candidate sequence temporal markers have been identified, a selection is made from among these candidate sequence markers to constitute one or more pairs of sequence markers, each comprising a start-of-sequence marker and an end-of-sequence marker.
  • To do this, the duration of a sequence is limited by a minimum duration D1 and by a maximum duration D2, which depend on the type of digital video file to be sequenced.
  • to initialize the constitution of pairs of sequence markers, a last end-of-sequence marker can be placed starting from the end of the digital video file, either exactly at the end of the file or, for example, at a candidate sequence temporal marker provided it is separated from the end of the file by a time interval less than a predetermined threshold.
  • a plurimodal candidate sequence marker separated by a duration between D1 and D2 from the last end-of-sequence marker is then sought. If it exists, it is retained as the last start-of-sequence marker and associated with the last end-of-sequence marker to constitute the last pair of sequence markers, which delimits the last virtual sequence.
  • if a plurimodal candidate sequence marker is found at a duration less than D1 from the last end-of-sequence marker, it can thus be decided not to retain it, because the sequencing would otherwise result in sequences that are too short to be of real interest.
  • otherwise, a unimodal candidate sequence marker separated by a duration between D1 and D2 from the last end-of-sequence marker is sought. If it exists, it is selected as the last start-of-sequence marker and combined with the last end-of-sequence marker to form the last pair of sequence markers, which delimits the last virtual sequence.
  • if no candidate sequence marker is found, a last start-of-sequence marker is created, separated by a duration D2 from the identified end-of-sequence marker, so as to ensure the convergence of the process (a sketch of this backward pairing follows this item).
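  • The backward constitution of pairs of sequence markers can be sketched as follows (Python, hypothetical names; the preference for plurimodal candidates, the [D1, D2] window and the D2 fallback follow the steps above, while the choice of the closest admissible candidate and the reuse of each start marker as the next end marker are assumptions):

    def build_pairs_backwards(candidates, file_end, d1, d2):
        """candidates: list of {"time": float, "kind": "main plurimodal" |
        "secondary plurimodal" | "unimodal"}.  Returns (start, end) pairs
        of sequence marker times, built from the end of the file."""
        pairs, end = [], file_end
        while end > 0.0:
            window = [c for c in candidates if d1 <= end - c["time"] <= d2]
            plurimodal = [c for c in window if "plurimodal" in c["kind"]]
            if plurimodal:
                start = max(c["time"] for c in plurimodal)   # shortest admissible sequence
            elif window:
                start = max(c["time"] for c in window)       # unimodal fallback
            else:
                start = max(end - d2, 0.0)                   # forced marker, ensures convergence
            pairs.append((start, end))
            end = start
        return list(reversed(pairs))
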
  • At least one of the sequence markers of each pair of sequence markers is multimodal.
  • the two sequence markers of each pair of sequence markers are multimodal.
  • This arrangement makes it possible to ensure that the identified sequences have a semantic coherence defined by several modalities.
  • At least one of the sequence markers of each pair of sequence markers is main multimodal.
  • weights can be assigned to the different modalities according to the typology of the digital video file. For example, for “sport” type videos, the action modality can play a more important role in the sequencing if its weight is higher.
  • the weights of the different modalities can optionally be chosen according to the nature of the content analyzed (known a priori or detected as the iterations progress) and/or the video file search criterion formulated by a user of the device 8.
  • Each virtual sequence of a digital video file can be indexed in a secondary index by means of the endogenous and, where appropriate, exogenous descriptors associated with the start-of-sequence marker, as well as those associated with the end-of-sequence marker.
  • descriptors associated with the start of sequence marker and/or with the end of sequence marker are said to be “secondary” in the sense that they are associated with a digital video file sequence and no longer with the digital video file as a whole. They allow the sequence marker pair to be indexed in the secondary index.
  • the secondary index is in a relationship of inheritance with the primary index so that the primary endogenous descriptors, associated with the digital video file, are also associated with the identified sequence.
  • the sequences of a digital video file are "daughters" of this digital file in the sense that, if the digital video file is indexed by means of endogenous and, where appropriate, exogenous primary descriptors, the sequence inherits these primary descriptors and can therefore be searched in the index not only on the basis of the secondary descriptors which characterize it but also on the basis of the primary descriptors which characterize the digital video file of which it is a "daughter" (an illustrative sketch of this inheritance follows this item).
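  • A minimal sketch of the parent/child relationship between the two indexes (Python, hypothetical structures; the actual index technology is not fixed by the method):

    # primary index: one entry per digital video file
    primary_index = {
        "video_42": {"descriptors": {"football", "championship", "2021"}},
    }

    # secondary index: one entry per validated virtual sequence
    secondary_index = {
        ("video_42", 2710.0, 2745.0): {"descriptors": {"goal", "penalty"},
                                       "parent": "video_42"},
    }

    def searchable_descriptors(sequence_key):
        """A sequence is searchable on its own secondary descriptors and on
        the primary descriptors inherited from its parent digital video file."""
        entry = secondary_index[sequence_key]
        return entry["descriptors"] | primary_index[entry["parent"]]["descriptors"]

    # searchable_descriptors(("video_42", 2710.0, 2745.0))
    # -> {"goal", "penalty", "football", "championship", "2021"}
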
  • the minimum duration of a video file sequence is not fixed a priori but a video file sequence (or equivalently a pair of sequence time stamps) is retained in the secondary index only if it is associated with a sufficient number of descriptors, for example for there to be a significant probability of finding this sequence at the end of a search query.
  • unimodal sequence markers can also be selected, for example before an enrichment step and a new iteration of the sequencing process.
  • Unimodal sequence markers then play the same role as multimodal sequence markers in the indexing process, i.e. the corresponding sequences are indexed on the basis of the associated unimodal descriptors. This scenario is not sought in itself, but makes it possible to ensure the convergence of the sequencing process.
  • information on the unimodal or multimodal character of a given secondary endogenous descriptor is kept during the indexing process. Thanks to this arrangement, it is possible to distinguish the multimodal secondary descriptors from the unimodal descriptors, which can be useful when searching for a video file sequence in which it is desired to make these two types of descriptors play different roles.
  • in a variant, the analysis of a digital video file is not carried out backwards, but by starting by selecting a first start-of-sequence marker, then a first end-of-sequence marker, and so on until the file has been completely scanned starting from its beginning.
  • the sequencer therefore indexes in a secondary index all the validated virtual sequences, that is to say all the virtual sequences identified and delimited by a start-of-sequence marker and an end-of-sequence marker retained by the sequencing module 5, each of which is associated with a set of endogenous and, where appropriate, exogenous secondary semantic descriptors.
  • a sequence time marker can be associated by default with the first image and/or the last image, so as to ensure the sequencing of the entire file.
  • a preliminary step of reducing the digital video file can be carried out so as to proceed with the sequencing only on the fragments of digital video file of interest.
  • the secondary descriptors selected at the end of the sequencing step are called secondary because they are not associated with a digital video file in its entirety, like “global” metadata or, generally, like “primary” descriptors, but are associated with a particular sequence.
  • the sequencing module 5 may optionally be a cluster of sequencers, this arrangement making it possible to distribute the requests to the various sequencers of the cluster according to the increase in load of the device.
  • the process is iterative, i.e. the secondary descriptors associated with a virtual sequence can be enriched by a search for so-called "exogenous" secondary descriptors, such as sequence descriptors already existing in the descriptor database of the device and/or obtained through the enrichment module 4, before a new sequencing is restarted in order to achieve a finer sequencing on the basis of the endogenous and exogenous primary and secondary descriptors identified. It is also possible to proceed, before the sequencing of a digital video file, to a step of enrichment of the primary endogenous descriptors of this digital video file by exogenous descriptors, also called primary, by means of the enrichment module 4. A digital video file is therefore indexed in the primary index by means of endogenous and, where appropriate, exogenous primary descriptors.
  • information on the exogenous or endogenous character of a given primary or secondary descriptor is kept during the indexing process. Thanks to this arrangement, it is possible to distinguish the endogenous descriptors from the exogenous descriptors, which can be useful when searching for a video file sequence in which one wishes to make these two types of descriptors play different roles.
  • For example, if the sequences have been defined at the end of a first sequencing step on the basis of the times identified for the goals and half-time, it is possible to find the corresponding match on the Internet and to enrich the endogenous secondary descriptors of each sequence on the basis of textual information about this match.
  • Fig. 4 gives a schematic representation of the steps of an iteration of the sequencing process of a video file on the basis of four modalities.
  • the process of indexing digital video file sequences is of the parent/child type: the dispatcher's index points to the general information of the digital video file, hence the so-called "primary" index, while the sequencer creates an inherited "secondary" index.
  • the primary and secondary indexes are multi-field and feed each other at each iteration. For example, a step of sequencing the video of a football match can cause N sequences to emerge, the k-th of which is associated with a “half-time” descriptor. The “half-time” information is relevant both for the sequence k and for the entire video file (an illustrative sketch of this mutual feeding is given a few items below).
  • the primary indexing of the video file can therefore be enriched with the half-time information and the date of this half-time in the file.
  • wildcard information can populate the primary index from the secondary index
  • character information initially identified as generic and becoming particularly relevant to a particular sequence can populate the secondary index from the primary index
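  • The mutual feeding of the two indexes can be illustrated as follows (Python, reusing the hypothetical primary_index and secondary_index structures of the previous sketch; the half-time example is the one given above):

    def promote_to_primary(primary_index, secondary_index, sequence_key, descriptor):
        """A descriptor identified for one sequence (e.g. "half-time") also
        enriches the primary entry of the parent digital video file."""
        parent = secondary_index[sequence_key]["parent"]
        primary_index[parent]["descriptors"].add(descriptor)

    def push_to_secondary(primary_index, secondary_index, file_id, sequence_key, descriptor):
        """Conversely, a primary descriptor that turns out to be specific to
        one sequence enriches that sequence's secondary entry."""
        if descriptor in primary_index[file_id]["descriptors"]:
            secondary_index[sequence_key]["descriptors"].add(descriptor)
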
  • thanks to this indexing process, the invention therefore makes it possible to go down to a much finer granularity when searching for content in digital video files than is permitted by the indexing processes currently implemented for this type of file, and also offers a two-level sequence search according to the two nested dimensions created by the two indexes.
  • this secondary indexing is dynamic, that is to say that it can be enriched and refined: as the analyses of videos of the same domain are carried out, the corpus of relevant descriptors associated with this domain, on the basis of which the multimodal analysis module 3 can analyze a digital video file, increases. As a result, the first analyzed digital video file can be re-analyzed after analyzing N other digital video files to refine its sequencing.
  • the secondary indexing can be carried out from various points of view, depending on the video search requests carried out by the user on the video library already analyzed.
  • an initial point of view chosen for secondary indexing is not absolutely limiting and can always be modified on the basis of a particular search.
  • a digital video file could have been created manually by aggregating two video files to give a digital video file containing a football sequence including, among other things, a spectacular football goal, followed by a rugby sequence including, among other things, a spectacular rugby action. Analyzing this digital video file in a generic “sport” mode would yield two sequences, one sequence (a) for football and one sequence (b) for rugby, but there is no reason why the sequencing should be better suited to football than to rugby or vice versa.
  • the dispatcher can then relaunch an analysis of the video (a) with descriptors adapted to football, to obtain a sequencing and an indexing better adapted to this particular sport. It can repeat the same process at another time in the context of rugby.
  • the search module 6 contains a “client”, which allows a user to access the various sequences of the analyzed video files by formulating a search query.
  • the search module 6 therefore constitutes the so-called “front-end” level of the device, that is to say the level through which the end user interacts with the device, while modules 1 to 5 constitute the so-called “back-end” level, i.e. not visible to the end user of the device.
  • the search module 6 can communicate with a video editor module 7, comprising an interface for creating, editing and viewing video extracts corresponding to virtual sequences.
  • the search module 6 allows the user at least to formulate a search query and to visualize the result.
  • a search is carried out on the sequences of video files thanks to the {primary index, secondary index} association based on an inheritance link, and thanks to the sets of descriptors that have been associated with each sequence of each digital video file during secondary indexing.
  • the query is not an a priori query based on a relational database language, although this possibility could be envisaged.
  • This is a query of the type used by search engines, i.e. the query can combine a full-text search, a faceted search based on the descriptors present in the primary and secondary indexes, and numerical criteria (for example, sorting can be done on chronological criteria).
  • the search query can be formulated by a user in a user interface or else by a conversational robot (“chatbot” in English).
  • the search result is then displayed in the graphical interface of the search and editing module 6; it does not appear in the form of a list of video files but of a list of sequences of video files, classified in order of relevance (a minimal sketch of such a query evaluation follows this item).
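  • A minimal sketch of how such a query, combining full-text terms and facets, could be evaluated against the indexed sequences (Python, hypothetical names; a production implementation would more likely rely on a dedicated search engine, which the method does not impose):

    def matches(descriptors, full_text_terms, required_facets):
        """descriptors: the searchable descriptors of one sequence (secondary
        descriptors plus inherited primary ones)."""
        descriptors = {d.lower() for d in descriptors}
        if not all(f.lower() in descriptors for f in required_facets):
            return False
        return any(t.lower() in descriptors for t in full_text_terms)

    def search(index, full_text_terms, required_facets=()):
        """index: mapping {sequence key: set of searchable descriptors}.
        Returns the keys of the matching sequences (not of whole files)."""
        return [key for key, descs in index.items()
                if matches(descs, full_text_terms, required_facets)]
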
  • Fig. 3 represents the different interactions between the modules and the services of the computerized process in connection with the possible actions of the user.
  • the principle is therefore that implemented for website search engines, which allow direct access to the pages that make up the websites, or for the constitution of playlists from a set of audio files in which tracks or chapters are predefined.
  • while this principle is natural for these two types of media, which are highly structured and designed to be indexed, it is not used for digital video files in general, for which the choice has historically been made to index them in their globality owing to the complexity of their sequencing.
  • in summary, the device makes it possible to constitute a search engine for digital video file sequences, the sequencing of the video files on which the search is carried out being dynamic, that is to say created, modified or adapted following the formulation of a new search query.
  • the search result may include several sequences from several different video files and/or several sequences from the same digital video file.
  • the temporal consistency of the original sequences may not be respected, even in the case where the sequences forming the list returned in response to the search query come from the same original digital video file, since it is the relevance of the sequences with respect to the search criterion which fixes their order of appearance in this list.
  • the relevance of the sequences in relation to the search criterion is for example evaluated according to logical and mathematical criteria, which make it possible to assign a score to each sequence according to a query.
  • the sequences are then presented in descending order of score.
  • Prior filtering steps (language, geographical origin, dates, etc.) may be provided.
  • a higher weight is assigned to the secondary descriptors than to the primary descriptors, so that the search result is based more on the content of the sequence than on the content of the digital video file as a whole (a sketch of such a weighted scoring follows this item).
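  • One possible scoring rule reflecting this weighting (Python; the weight values and the exact scoring function are assumptions, not fixed by the method):

    def score(secondary_descriptors, primary_descriptors, query_terms,
              w_secondary=2.0, w_primary=1.0):
        """Secondary (sequence-level) matches count more than primary
        (file-level) matches, so the ranking favors the sequence content."""
        terms = {t.lower() for t in query_terms}
        return (w_secondary * len(terms & {d.lower() for d in secondary_descriptors})
                + w_primary * len(terms & {d.lower() for d in primary_descriptors}))

    def rank(sequences, query_terms):
        """sequences: {key: (secondary descriptors, primary descriptors)}.
        Returns the keys in descending order of score (order of relevance)."""
        return sorted(sequences,
                      key=lambda k: score(*sequences[k], query_terms),
                      reverse=True)
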
  • primary and secondary indexing architecture
  • a user can therefore perform several tasks dynamically from full-text search functionalities, semantic concepts, themes or multi-criteria filters/facets.
  • the search module 6 can comprise a user interface, such as a computer, a tablet or a smartphone, for example.
  • the video editor module 7 can include a user interface, such as a computer, a tablet or a smartphone, for example.
  • the user interface can be common to modules 6 and 7.
  • via one or other of these interfaces, the user can in particular extract each virtual sequence from the digital video file to produce a video extract that he can view, for example by streaming, or save as a new digital video file.
  • In the case where a video extract is displayed, the interface can optionally simultaneously display the endogenous and/or, where appropriate, exogenous, secondary and/or primary descriptors associated with the extracted sequence.
  • the dashboard can also present other information, such as definitions or "find out more" from the encyclopedic web, geographical maps, graphs...
  • the user interface can comprise a graphical interface 55 comprising a zone 52 dedicated to formulating the search query and displaying its results, a zone for viewing video extracts (screen 1, reference 53), a second display zone (or screen 2, reference 54) synchronized with screen 1, and a virtual remote control zone 51.
  • each end-of-sequence marker of each virtual sequence associated with an extract from the playlist is main plurimodal or secondary plurimodal.
  • This arrangement makes it possible to increase the semantic consistency of the playlist as a whole and its consistency with respect to the search criterion formulated.
  • Navigation can, thanks to the primary and secondary indexing system, be extended outside the selected playlist: it is for example possible, from a given sequence of the playlist, to extend the playback of the digital video file from which the sequence originates beyond this sequence by moving the start and/or end-of-sequence markers.
  • Visual effects such as, in a non-exhaustive way, slow motion, enlargements, repetitions, addition of text or freeze frames can be applied to the playlist, either during viewing or for the editing of a new digital video file.
  • Sound effects such as, but not limited to, modifying a background sound or adding a commentary or another sound can be applied to the playlist, either during viewing or for the editing of a new digital video file. Building a playlist or editing a new video can be fully automated from the formulation of the search query. However, as the system behaves like a virtual playhead which moves dynamically from sequence to sequence, the user can at any time, if the graphical interface of module 6 gives him the possibility, act on the playlist or on the new video.
  • the graphical interface of the video editor module 7 thus offers navigation options in the form of an improved video player which allows access to a summary when the search result is an entire video, or interactive zapping within the selected and aggregated sequences.
  • a graphical interface 55 for editing or viewing a playlist can be seen in FIG. 5a. Selectable descriptors are positioned to the left of playlist viewing screen 1, the playlist can be displayed above screen 1, and the descriptors related to the user's search are displayed above the playlist.
  • Virtual remote control 51 is located below the playlist.
  • a second screen linked to the video extract corresponding to the virtual sequence being viewed is located to the right of the playlist and allows you to display graphics or other useful information linked to the playlist.
  • Fig. 5b shows another embodiment of the graphical interface of the device 8 in which selectable descriptors are positioned to the left of the screen for viewing the playlist, the playlist is viewed in screen 1 (reference 53), the descriptors related to the user's search are located above the playlist and the virtual remote 51 is located below the playlist
  • Fig. 6 represents the actions performed when using each button of the virtual remote control on an example of a playlist created from three digital video files, the playlist being composed by way of example of three different extracts.
  • the virtual remote control comprises for example at least 5 virtual buttons.
  • the a1 button allows viewing of the video extract corresponding to the current sequence and stopping viewing.
  • When button a2 is pressed, the playback of the video extract corresponding to the sequence being viewed is extended in the original digital video file beyond the duration provided for this sequence; a second press of button a2, while viewing has not yet exceeded the time limit provided for the sequence, cancels the first press of button a2; a second press of button a2 when viewing the digital video file outside the time limit provided stops the viewing of the original digital video file and resumes the playlist at the next sequence.
  • Button a3 allows you to return to the start of the sequence preceding the sequence currently being viewed.
  • the a4 button allows you to return to the start (at the timecode of the start marker) of the sequence currently being viewed.
  • Button a5 stops viewing the current sequence and starts playing the next sequence.
  • a “-N s” button allows you to go back N seconds in the digital video file of the current sequence, allowing you to review a sequence or to see N seconds before the start marker of the current virtual sequence;
  • symmetrically, another button allows you to advance N seconds in the digital video file of the current sequence, allowing you to skip ahead or to see N seconds (for example 10 seconds) after the end marker of the current virtual sequence.
  • the virtual remote control therefore allows flexible navigation within the automatic playlist of video extracts from digital files, the user being able to view the selected extracts at will in the order of the playlist or in an order that suits him better, or even to extend the viewing of an extract before or after its sequence markers, without the files associated with each extract being created or having to be opened and/or closed to switch from one extract to another (a sketch of such a virtual playhead follows this item).
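  • The behaviour of the virtual remote control can be sketched as a virtual playhead moving over the playlist without any per-extract file ever being created, opened or closed (Python, hypothetical names; only buttons a3, a4, a5 and the N-second jumps are shown):

    class VirtualPlayhead:
        """playlist: list of extracts, each pointing into its original file,
        e.g. {"file": "video_42", "start": 2710.0, "end": 2745.0}."""

        def __init__(self, playlist, n_seconds=10.0):
            self.playlist, self.n = playlist, n_seconds
            self.index = 0
            self.position = playlist[0]["start"]

        def a3_previous_sequence(self):
            self.index = max(self.index - 1, 0)
            self.position = self.playlist[self.index]["start"]

        def a4_restart_sequence(self):
            self.position = self.playlist[self.index]["start"]

        def a5_next_sequence(self):
            self.index = min(self.index + 1, len(self.playlist) - 1)
            self.position = self.playlist[self.index]["start"]

        def back_n_seconds(self):
            # may move before the start marker, into the original digital video file
            self.position -= self.n

        def forward_n_seconds(self):
            # may move past the end marker, into the original digital video file
            self.position += self.n
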
  • the comfort and browsing potential are therefore considerably improved compared to what is possible with a “static” playlist within the meaning of the prior art.
  • Figs. 7a and 7b represent two examples of the graphical interface 55.
  • FIG. 7a represents a graphic interface of the computerized method, comprising a first screen 53 for viewing the playlist, a second screen 54 for a graphic linked to the sequence being viewed and a virtual remote control 51 located below the two screens to navigate in the playlist (in which the video extracts are arranged one after the other), as well as a button used to put the playlist in full screen.
  • FIG. 7b represents a graphic interface 56 of the computerized method, comprising a first screen 53 for viewing the playlist, a second screen 54 for putting messages in connection with the video or for communicating with other users, a virtual remote control 51 located below the two screens to navigate in the playlist and a button used to put the playlist in full screen.
  • the playlist made up of extracts based on this search result can be exhaustive. It may also contain only extracts considered essential with respect to search criteria specified by the user.
  • a score can be defined to classify the virtual sequences of digital video files into two categories: "essential” and “ornamental” according to the number of descriptors found.
  • the playlist made up of extracts based on this search result may contain only the extracts associated with virtual sequences identified as essential with respect to the search criteria specified by the user (a sketch of this classification follows this item).
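  • A sketch of the essential/ornamental classification by number of descriptors found (Python, hypothetical names; the threshold value is an assumption):

    def classify(sequence_descriptors, query_terms, essential_threshold=3):
        """Label a virtual sequence "essential" if enough of the query terms
        are found among its descriptors, "ornamental" otherwise."""
        found = len({t.lower() for t in query_terms}
                    & {d.lower() for d in sequence_descriptors})
        return "essential" if found >= essential_threshold else "ornamental"

    def essential_extracts(index, query_terms):
        """index: {sequence key: set of descriptors}.  Keep only the extracts
        whose sequences are classified as essential."""
        return [key for key, descs in index.items()
                if classify(descs, query_terms) == "essential"]
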
  • the concept of summary can be defined in relation to a particular domain.
  • the summary can be built from keywords provided by the user or defined beforehand, for example {goal, yellow card, red card, change of player, half-time}, the relevant sequences being presented in the temporal order of the initial digital video file from which they originate.
  • the search is possible in “full-text” mode and in “faceted” search mode, with optional semi-automatic completion. Faceted answers help refine the search criteria and are combined with the full-text words.
  • thanks to the inheritance indexing system, the video files (in the previous example, the matches) from which the sequences originate are known. It is therefore possible to provide an option to view all or part of the original video files of the sequences if necessary.
  • module 6 "front-end” and the "back-end” level composed of modules 1 to 5 can be done whatever the support of module 6 (computer, tablet, smartphone, etc.) possibly without use a proprietary application. That is in particular achievable with technologies accessible in Open Source, such as the React JavaScript library.
  • the device can be integrated into a social network, and offer two user profiles: the creators of video files by editing using the video editor module 7 and the viewers (“followers”) who follow these creators.
  • the browsing history on a playlist of excerpts from digital video files obtained according to the invention can be recorded. It can then be shared in a social network or used to semi-automatically edit a new digital video file.
  • Fig. 8 represents a graphic interface of the device 8 comprising a screen for the representation of a mental map ("mindmap" in English) of a directory of sequences or automatic lists or extracts or playlist recorded by the user, a part of the backups being public and the other part private, below this screen several tabs are selectable: Mindmap, Chatbot, Search by facet, Social network and video editor.
  • Fig. 9 represents a graphic interface 56 of the device 8, comprising a screen for the representation of the interactive Chatbot making it possible to carry out a search for playlists or sequences through a discussion by keyword, below this screen several tabs are selectable: Mindmap, Chatbot, Facet Search, Social Network and Video Editor.
  • Fig. 10 represents a graphic interface of the device 8, comprising a screen for the representation of the search by facet, grouping descriptors under other more general descriptors, making it possible to search by tree structure, below this screen several tabs are selectable: Mindmap, Chatbot, Facet Search, Social Network and Video Editor.
  • Fig. 11 represents a graphic interface of the device 8, comprising a screen for the social network integrated into the invention, in which the users share the playlists found or created; below this screen several tabs are selectable: Mindmap, Chatbot, Facet Search, Social Network and Video Editor.
  • Fig. 12 represents a graphic interface of the computerized device 8, comprising a screen for editing video, the user can modify the order of the extracts and integrate the extracts he wishes into a playlist, below this screen several tabs are selectable: Mindmap, Chatbot, Facet Search, Social Network and Video Editor.
  • 3: multimodal analysis module; 3a: analyzer according to the image modality; 3b: analyzer according to the audio modality; 3c: analyzer according to the text modality; 3d: analyzer according to the action modality; 4: enrichment module

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)
EP22748259.3A 2021-07-08 2022-07-06 Computerimplementiertes verfahren zur lieferung von audiovisuellen medien auf anfrage Pending EP4335111A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR2107439A FR3125193A1 (fr) 2021-07-08 2021-07-08 Procédé informatisé de dé-linéarisation audiovisuelle
PCT/EP2022/068798 WO2023280946A1 (fr) 2021-07-08 2022-07-06 Procede informatise de de-linearisation audiovisuelle

Publications (1)

Publication Number Publication Date
EP4335111A1 true EP4335111A1 (de) 2024-03-13

Family

ID=78649350

Family Applications (1)

Application Number Title Priority Date Filing Date
EP22748259.3A Pending EP4335111A1 (de) 2021-07-08 2022-07-06 Computerimplementiertes verfahren zur lieferung von audiovisuellen medien auf anfrage

Country Status (3)

Country Link
EP (1) EP4335111A1 (de)
FR (1) FR3125193A1 (de)
WO (1) WO2023280946A1 (de)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233104B (zh) * 2023-05-10 2023-07-21 广州耐奇电气科技有限公司 基于Elasticsearch的物联网大数据热力监控系统及其监控装置
CN116646911B (zh) * 2023-07-27 2023-10-24 成都华普电器有限公司 应用于数字化电源并联模式的电流均流分配方法及系统
CN117478824B (zh) * 2023-12-27 2024-03-22 苏州元脑智能科技有限公司 会议视频生成方法、装置、电子设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6714909B1 (en) 1998-08-13 2004-03-30 At&T Corp. System and method for automated multimedia content indexing and retrieval
EP1495603B1 (de) * 2002-04-02 2010-06-16 Verizon Business Global LLC Verbindungsherstellung über instant-communications-clients
US10331661B2 (en) * 2013-10-23 2019-06-25 At&T Intellectual Property I, L.P. Video content search using captioning data
US9253511B2 (en) * 2014-04-14 2016-02-02 The Board Of Trustees Of The Leland Stanford Junior University Systems and methods for performing multi-modal video datastream segmentation
BE1023431B1 (nl) 2016-06-01 2017-03-17 Limecraft Nv Automatische identificatie en verwerking van audiovisuele media

Also Published As

Publication number Publication date
WO2023280946A1 (fr) 2023-01-12
FR3125193A1 (fr) 2023-01-13

Similar Documents

Publication Publication Date Title
Amato et al. AI in the media and creative industries
EP1859614B1 (de) Verfahren zum auswählen von teilen eines audiovisuellen programms und einrichtung dafür
US8799253B2 (en) Presenting an assembled sequence of preview videos
US9342596B2 (en) System and method for generating media bookmarks
WO2023280946A1 (fr) Procede informatise de de-linearisation audiovisuelle
US8156114B2 (en) System and method for searching and analyzing media content
US8219513B2 (en) System and method for generating a context enhanced work of communication
Lokoč et al. Is the reign of interactive search eternal? findings from the video browser showdown 2020
US20060122984A1 (en) System and method for searching text-based media content
US20120239690A1 (en) Utilizing time-localized metadata
EP1368756A1 (de) Verfahren zur navigation durch berechnung von dokumentengruppen, empfänger zur durchführung des verfahrens, und grafische schnittstelle zur anzeige des verfahrens
EP2104937B1 (de) Verfahren zur erzeugung einer neuen zusammenfassung eines audiovisuellen dokuments, das bereits eine zusammenfassung und meldungen enthält, und empfänger, der das verfahren implementieren kann
US20140115622A1 (en) Interactive Video/Image-relevant Information Embedding Technology
EP2524324A1 (de) Verfahren zur navigation von identifikatoren in bereichen und empfänger zur ausführung des verfahrens
US20100281046A1 (en) Method and web server of processing a dynamic picture for searching purpose
Saravanan Segment based indexing technique for video data file
US20240364960A1 (en) Computerized method for audiovisual delinearization
Knauf et al. Produce. annotate. archive. repurpose-- accelerating the composition and metadata accumulation of tv content
TWI780333B (zh) 動態處理並播放多媒體內容的方法及多媒體播放裝置
Reboud Towards automatic understanding of narrative audiovisual content
Zavesky et al. Searching visual semantic spaces with concept filters
Anilkumar et al. Sangati—a social event web approach to index videos
Smeaton et al. Interactive searching and browsing of video archives: Using text and using image matching
Peronikolis et al. Personalized Video Summarization: A Comprehensive Survey of Methods and Datasets
WO2024120646A1 (en) Device and method for multimodal video analysis

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20231208

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)