WO2022134701A1 - Video processing method and apparatus - Google Patents

Video processing method and apparatus

Info

Publication number
WO2022134701A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
target
sub
text
video
Prior art date
Application number
PCT/CN2021/120390
Other languages
English (en)
French (fr)
Inventor
谢畅
李佩易
Original Assignee
上海幻电信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海幻电信息科技有限公司 filed Critical 上海幻电信息科技有限公司
Priority to EP21908702.0A priority Critical patent/EP4207772A4/en
Publication of WO2022134701A1 publication Critical patent/WO2022134701A1/zh
Priority to US18/298,243 priority patent/US20230245455A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234336Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/26603Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel for automatically generating descriptors from content, e.g. when it is not made available by its provider, using content analysis techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/47815Electronic shopping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • the present application relates to the field of computer technology, and in particular, to a video processing method.
  • the present application also relates to a video processing apparatus, a computing device, a computer-readable storage medium and a computer program product.
  • Overall, video has a large information capacity: among the kinds of information people can process in the same amount of time, video carries the most.
  • Video information is multi-modal, that is, it naturally has multiple dimensions: it includes not only the image information of each frame and the text information carried by subtitles, but also the audio information carried in the audio track, and so on.
  • Video information also has temporal associations: the information carried in each frame or segment of a video is usually related to the preceding and following video content. Video not only carries information in each frame, but also carries deeper and more complex information through contextual associations.
  • Viewers need to receive as much of this video information as possible. But faced with hundreds of millions of videos, not everyone is interested in the information contained in every frame. In specific tasks such as video retrieval, summarization, recommendation, and review, only the parts of interest need to be extracted from the video information. How to extract effective information from videos to complete such tasks has become an urgent problem to be solved.
  • embodiments of the present application provide a video processing method.
  • The present application also relates to a video processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, so as to overcome the omissions and errors that arise when extracting valid information from videos in the prior art.
  • a video processing method including:
  • an object list corresponding to the target object contained in the target video is determined.
  • a video processing apparatus including:
  • an extraction module, configured to extract at least two types of modal information from a received target video;
  • an extraction module, configured to extract text information from the at least two types of modal information according to extraction manners corresponding to the at least two types of modal information;
  • a matching module, configured to determine an object list corresponding to a target object contained in the target video by matching preset object information of the target object with the text information.
  • a computing device, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the video processing method when executing the instructions.
  • a computer-readable storage medium, which stores computer instructions that, when executed by a processor, implement the steps of the video processing method.
  • a computer program product wherein when the computer program product is executed in a computer, the computer is caused to execute the steps of the video processing method.
  • The video processing method provided by the present application includes: extracting at least two types of modal information from a received target video; extracting text information from the at least two types of modal information according to extraction manners corresponding to the at least two types of modal information; and determining, by matching preset object information of a target object with the text information, an object list corresponding to the target object contained in the target video. Text information is thus extracted from the multi-modal information of the target video, and the target object contained in the target video and its corresponding object list are determined by information matching, which improves the accuracy of determining the target object and its object list, allows the target objects contained in the target video to be understood quickly, and also facilitates tasks such as searching, recommending, summarizing, and reviewing the target video based on the object information.
  • FIG. 1 is an example diagram of a specific application scenario of a video processing method provided by an embodiment of the present application
  • FIG. 2 is a flowchart of a video processing method provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of extracting text information corresponding to voice information contained in a video provided by an embodiment of the present application
  • FIG. 5 is a flowchart of extracting text information corresponding to subtitle information contained in a video provided by an embodiment of the present application
  • FIG. 6 is a process flow diagram of a video processing method applied to a commodity video scene provided by an embodiment of the present application
  • FIG. 7 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application.
  • FIG. 8 is a structural block diagram of a computing device provided by an embodiment of the present application.
  • OCR: Optical Character Recognition.
  • Object detection is the task of finding all objects of interest in an image; it comprises the two sub-tasks of object localization and object classification, determining the category and location of objects at the same time.
  • Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language.
  • ASR: Automatic Speech Recognition.
  • A language model is an abstract mathematical model of a language built on the objective facts of that language; it is a correspondence relationship.
  • The relationship between a language model and the objective facts of a language is like the relationship between an abstract line and a concrete line in mathematics.
  • Word embedding is a method of converting the words in a text into numeric vectors, so that they can be analyzed with standard machine learning algorithms, which require numeric vectors as input.
  • Multimodal: each source or form of information can be called a modality. For example, people have touch, hearing, vision, and smell; information media include voice, video, text, etc.; and there are various sensors, such as radar, infrared, and accelerometers. Each of the above can be called a modality. Multimodality refers to the ability to process and understand multi-modal information through specific methods.
  • Text smoothing refers to automatically removing disfluent words from automatic speech recognition (ASR) results through algorithms, so as to obtain more natural and fluent sentences.
  • Faster-RCNN: Faster Region-based Convolutional Neural Network.
  • SSD: Single Shot MultiBox Detector.
  • BERT (Bidirectional Encoder Representations from Transformers) is a model for natural language processing; fine-tuned through an additional output layer, it is suitable for building state-of-the-art models for a wide range of tasks, such as question answering and language inference.
  • TextCNN is an algorithm that uses convolutional neural networks to classify text.
  • A Convolutional Neural Network (CNN) is a kind of feedforward neural network with a deep structure that includes convolution computations.
  • In the present application, a video processing method is provided, and the present application also relates to a video processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, which are described in detail one by one in the following embodiments.
  • Fig. 1 shows an example diagram of a specific application scenario of a video processing method provided by an embodiment of the present application.
  • the server receives a product video (that is, the target video), and extracts the multi-modal information in the product video.
  • the product video can be understood as the introduction video of the product, the live product video, etc.
  • The extracted multi-modal information includes voice information, subtitle information, image information, etc.; text information corresponding to the voice information, text information corresponding to the subtitle information, and text information corresponding to the image information are extracted from it.
  • According to the commodity names contained in the commodity information database, the text information is queried to determine the product names it contains (a product name refers to a specific item name, such as sports shoes or shampoo). Within the retrieval range corresponding to each product name, brand information matching the product name (referring to the product's trademark) is searched for. The product name and its corresponding brand information are then matched against the product information in the commodity information database to determine the product category corresponding to the name and brand (a product category refers to the class a product belongs to; for example, a juicer belongs to kitchen appliances, and kitchen appliances belong to household appliances). A brand-category-product list corresponding to the products contained in the product video is thereby obtained, so that task processing such as searching, recommending, video summarization, and review can be performed on the product video according to its product list.
  • The video processing method provided in this embodiment of the present application determines the product list corresponding to the products contained in a product video by analyzing the multi-modal information in the product video. It realizes processing of the product video and extraction of its product information, avoids extracting the commodity information of products in the video from a single-feature information source, and improves the accuracy of the extracted commodity information.
  • FIG. 2 shows a flowchart of a video processing method provided according to an embodiment of the present application, which specifically includes the following steps:
  • Step 202: Extract at least two types of modal information from the received target video.
  • The target video includes, but is not limited to, live video, commodity video, TV video, movie video, animation video, entertainment video, etc.
  • Modal information: each source or form of information can be called a modality.
  • For example, people have tactile, auditory, visual, and olfactory senses; information media include voice, video, text, etc.; and there are various sensors, such as radar, infrared, and accelerometers.
  • Modal information is varied, and correspondingly, the acquisition methods for different modal information are also varied.
  • By combining multiple modalities, the information is understood more comprehensively.
  • In practice, the extraction of the at least two types of modal information from the received target video is implemented in the following ways:
  • Image information is extracted from the target video according to preset extraction rules, and subtitle information contained in the image information is extracted.
  • Extracting the voice information from the received target video can be understood as separating the audio track from the target video, thereby obtaining the voice information contained in the audio track. An audio track refers to one of the parallel "tracks" seen in sequencer software.
  • Each track defines its own properties, such as the track's timbre, sound bank, number of channels, input/output ports, volume, etc.
  • the image information can be understood as an image frame
  • The preset extraction rule refers to a rule for extracting image frames from all the image frames included in the target video, such as extracting one image frame every five frames or every two frames, which is not limited here, so as to obtain an image sequence composed of the extracted image frames, that is, the image information.
  • the target video may also contain subtitle information, and the subtitle itself can also reflect some video features.
  • The text information (i.e., subtitle information) contained in the image frames can be identified by performing text recognition on the extracted image frames.
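  • As an illustrative sketch of this step (not part of the patent text), assuming Python with OpenCV and a system ffmpeg binary are available, audio-track separation and every-N-frame sampling might look like the following; the function names and the five-frame interval are illustrative choices.

```python
import subprocess
import cv2  # OpenCV

def separate_audio_track(video_path: str, audio_path: str) -> None:
    # Strip the video stream (-vn) and save the audio track as 16-bit PCM WAV.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-acodec", "pcm_s16le", audio_path],
        check=True,
    )

def sample_frames(video_path: str, every_n: int = 5) -> list:
    # Keep one image frame every `every_n` frames (the preset extraction rule).
    frames, index = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if index % every_n == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```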
  • Step 204: Extract text information from the at least two types of modal information according to the extraction manners corresponding to the at least two types of modal information.
  • In a specific implementation, extracting text information from the at least two types of modal information according to the corresponding extraction manners includes:
  • the first text information, the second text information and/or the third text information form the text information.
  • The extraction manner corresponding to the voice information can be understood as speech recognition, that is, a method of converting the voice information into text information.
  • The extraction manner corresponding to the image information can be understood as performing object detection on the target object in the image frames, thereby obtaining the object information of the target object.
  • The subtitle information may contain invalid content such as on-screen reward messages or typos, which can be removed by text processing.
  • The first text information extracted from the voice information, the second text information extracted from the image information, and/or the third text information extracted from the subtitle information are then combined in time order to form the text information.
  • In a specific implementation, extracting the corresponding first text information from the voice information according to the extraction manner corresponding to the voice information includes:
  • the voice information is input into a voice recognition model for voice recognition to obtain the initial text information contained in the voice information;
  • the initial text information is adjusted based on the text smoothing model and the text correction model to obtain first text information corresponding to the voice information.
  • the speech recognition model may be an ASR model.
  • The speech recognition performed by the ASR model can be understood as inputting the voice information into the speech recognition model for encoding and feature extraction; the extracted features are queried against an acoustic model library to obtain single words or Chinese characters, and then against a language model library to obtain the best-matching words or characters, thus forming the initial text information.
  • The initial text information may contain disfluent sentences, verbal tics, and the like.
  • Therefore, a text smoothing model is used to perform text smoothing on the initial text information, that is, to delete the disfluent words in the automatic speech recognition (ASR) results, so as to obtain more natural and fluent sentences.
  • The text correction model can be a natural language processing (NLP) model.
  • For example, the initial text information obtained by speech recognition is: "I went to a hotel today and asked the front desk clerk how much a bowl of dumplings costs."
  • The sentence is grammatically fluent but semantically puzzling, because hotels usually do not sell dumplings (in Chinese, "dumplings", 水饺 shuǐjiǎo, is a near-homophone of "sleep", 睡觉 shuìjiào, a typical ASR confusion).
  • The natural language processing model will correct the sentence to: "I went to a hotel today and asked the front desk clerk how much one night costs."
  • In practice, the server receives a video (i.e., the target video), separates the audio track in the video, and inputs the audio information contained in the audio track into the speech recognition (ASR) module for speech recognition to obtain the initial text information.
  • The initial text information is smoothed by the text smoothing module to obtain the smoothed text information, which is then corrected by the natural language correction (NLP) module to obtain the corrected voice information (text), that is, the first text information.
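  • A minimal sketch of this pipeline follows (not from the patent; the model interfaces `transcribe`, `smooth`, and `correct` are hypothetical placeholders for any concrete ASR, text-smoothing, and NLP correction models):

```python
def extract_first_text(audio, asr_model, smoothing_model, correction_model) -> str:
    # Step 1: speech recognition yields the initial text information.
    initial_text = asr_model.transcribe(audio)
    # Step 2: the text smoothing model removes disfluent words and fillers.
    smoothed_text = smoothing_model.smooth(initial_text)
    # Step 3: the correction model fixes semantically implausible sentences,
    # e.g. "how much a bowl of dumplings" -> "how much one night costs".
    return correction_model.correct(smoothed_text)
```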
  • Performing object detection on the image frames with an object recognition model determines the attribute information of the target object contained in the image frames (that is, the text information corresponding to the image information), which improves the accuracy of confirming the attribute information of the target object in the target video.
  • In a specific implementation, the extraction of the corresponding second text information from the image information according to the extraction manner corresponding to the image information is implemented in the following manner:
  • Each image frame in the image information is respectively input into the object recognition model, the attribute information of the target object contained in each image frame is obtained, and the attribute information is used as the second text information.
  • The attribute information of the target object can be understood as the specific object information appearing in a single image. The objects appearing in an image, and their categories, can be detected by an object detection model. The objects appearing in an image largely reflect the information the image needs to convey, so they are one of the dimensions of information extraction.
  • The object recognition model can be understood as an object detection network such as YOLO, Faster-RCNN, or SSD, which is used to detect attribute information such as the object name and/or object location of the target object contained in an image frame; this attribute information is used as the second text information.
  • In practice, the server receives a video (i.e., the target video), extracts frames from the video to obtain image frames, and inputs the extracted image frames into the object detection module (i.e., the object recognition model) to obtain the attribute information of the target object contained in each frame, that is, the target information (text), which is the second text information.
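  • As a hedged example (not the patent's mandated implementation), one concrete choice for the object recognition model is torchvision's pretrained Faster-RCNN; the score threshold below is an illustrative value.

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)
from torchvision.transforms.functional import to_tensor

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]  # label id -> object name

def detect_attributes(frame_rgb, score_threshold: float = 0.6) -> list:
    # Returns (object name, bounding box) pairs: the attribute information
    # of the target objects in one image frame, i.e. the second text information.
    with torch.no_grad():
        pred = model([to_tensor(frame_rgb)])[0]
    return [
        (categories[label], box.tolist())
        for label, box, score in zip(pred["labels"], pred["boxes"], pred["scores"])
        if score >= score_threshold
    ]
```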
  • the subtitle information is adjusted by the language processing model and the text processing model, which improves the accuracy of the text information corresponding to the subtitle information (that is, the third text information).
  • In a specific implementation, generating the third text information corresponding to the subtitle information is specifically implemented in the following manner:
  • the subtitle information is adjusted based on the language processing model and the text processing model to obtain third text information corresponding to the subtitle information.
  • The language processing model can be understood as a natural language processing (NLP) model, which denoises the subtitle information by correcting it.
  • Because the interval between adjacent extracted frames may be small, the subtitle information contained in adjacent image frames may be identical; therefore, a text processing model is also needed to deduplicate the repeated text content, finally obtaining the deduplicated text information, that is, the third text information.
  • In practice, the server receives a video (i.e., the target video), extracts frames from the video to obtain image frames, and inputs the extracted image frames into a text recognition (OCR) model for text recognition to obtain the subtitle information contained in the image frames. The subtitle information is corrected by the language model to obtain the corrected text information, which is deduplicated by the text deduplication module to obtain the deduplicated subtitle information (text), that is, the third text information.
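  • A minimal sketch of the OCR-plus-deduplication step (assuming Tesseract via pytesseract; `chi_sim` is the simplified-Chinese language pack, an illustrative choice):

```python
import pytesseract

def extract_third_text(frames) -> list:
    # OCR each sampled frame, then drop consecutive duplicate subtitles that
    # come from adjacent frames showing the same caption.
    lines, previous = [], None
    for frame in frames:
        text = pytesseract.image_to_string(frame, lang="chi_sim").strip()
        if text and text != previous:
            lines.append(text)
        if text:
            previous = text
    return lines
```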
  • Step 206: Determine an object list corresponding to the target object included in the target video by matching the preset object information of the target object with the text information.
  • the target objects include: commodities, characters, animals, virtual items, regulations, etc.
  • the object information includes: commodity information, character information, animal information, virtual item information, sensitive word information, etc.
  • the list of described objects includes: commodity list, character list, animal list, virtual item list, sensitive word list, etc.
  • The commodity list can be expressed as a commodity name list, a commodity brand list, a commodity category list, or a commodity brand-commodity category-commodity name list composed of commodity names, commodity categories, and commodity brands, which will not be enumerated here.
  • Specifically, the preset object information of the target object is searched for in the text information, so as to determine whether the target video contains the corresponding object information; the contained object information then forms an object list, which is used to indicate the target objects contained in the target video.
  • In a specific implementation, determining the object list corresponding to the target object contained in the target video by matching the preset object information of the target object with the text information is specifically implemented in the following manner:
  • the text information within the preset retrieval range corresponding to the target first sub-information is retrieved, and the target second sub-information corresponding to the target first sub-information is determined;
  • an object list corresponding to the target object included in the target video is determined.
  • The first sub-information can be understood as name information such as a commodity name, a person name, or an animal name. Searching the text information according to the first sub-information can determine which first sub-information is contained in the text information; the first sub-information contained in the text information is used as the target first sub-information.
  • The second sub-information can be understood as information such as a commodity brand, a person's skin color, or an animal's color. The preset retrieval range refers to a preset search range extending through the preceding and following text from the position where the target first sub-information is located; specifically, it can be expressed as a number of words or a number of sentences forward or backward from that position, for example, 20 words forward or backward, or two sentences forward or backward.
  • the text message is: "Hello, friends, I bought a pair of sneakers at the A1 official school last week.”
  • the first sub-information of the target is sports shoes
  • the second sub-information is A1.
  • the context threshold that is, the pre- Set the search range
  • the sneakers can be successfully matched.
  • In practice, retrieving the text information within the preset retrieval range corresponding to the target first sub-information means retrieving the second sub-information in the context around the position where the target first sub-information appears in the text information. The retrieved second sub-information is used as the target second sub-information corresponding to the target first sub-information, and further, based on the target first sub-information and its corresponding target second sub-information, the object list of the target objects contained in the target video — that is, a summary list of the contained target objects — is determined.
  • If no second sub-information can be retrieved within the preset retrieval range corresponding to a target first sub-information, that target first sub-information is not processed further, that is, its retrieval result is discarded.
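  • A sketch of this two-stage matching (illustrative only; a 20-character window stands in for the preset retrieval range):

```python
def match_first_and_second(text: str, first_infos, second_infos, window: int = 20):
    # For each first sub-information (e.g. product name) found in the text,
    # search the surrounding window for second sub-information (e.g. brand).
    results = []
    for first in first_infos:
        pos = text.find(first)
        while pos != -1:
            context = text[max(0, pos - window): pos + len(first) + window]
            hits = [second for second in second_infos if second in context]
            if hits:  # otherwise discard this retrieval result (see above)
                results.append((first, hits, pos))
            pos = text.find(first, pos + 1)
    return results
```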
  • In this way, the target first sub-information and target second sub-information matching the first sub-information and second sub-information in the object information are retrieved explicitly, and the object list of the target objects contained in the target video is determined based on them. Determining the target object contained in the target video from multiple pieces of information and then forming its object list improves the accuracy of determining the target object, and describing the contained target objects through the object list realizes effective extraction of the information of the target objects contained in the target video.
  • In a specific implementation, retrieving the text information within the preset retrieval range corresponding to the target first sub-information and determining the target second sub-information corresponding to the target first sub-information includes:
  • retrieving, according to the second sub-information, the text information within the preset retrieval range corresponding to the target first sub-information; and
  • determining the target second sub-information corresponding to the target first sub-information.
  • The distance can be understood as the number of words between two pieces of information.
  • Among the retrieved third sub-information (the candidate second sub-information), the one closest to the target first sub-information is taken as the target second sub-information corresponding to that target first sub-information, which increases the accuracy of the determined target second sub-information.
  • For example, the target first sub-information is the commodity name "sports shoes", and two commodity brands (second sub-information) are retrieved: A1 and A2, where A1 is 2 words away from "sports shoes" and A2 is 10 words away; then A1 is determined as the commodity brand corresponding to the sports shoes.
  • the method further includes:
  • the target second sub-information corresponding to the target first sub-information is determined.
  • The number of times of matching can be understood as the number of times the same third sub-information is retrieved. For example, if five pieces of third sub-information matching the second sub-information are retrieved and selected by voting, and three of them are A1 (that is, A1 is matched three times) while two of them are A2 (that is, A2 is matched twice), then the third sub-information A1, matched the greater number of times, is used as the target second sub-information. This increases the accuracy of determining the target second sub-information corresponding to the target first sub-information.
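  • A sketch combining both selection rules described above (vote on match counts first, then break ties by word distance; the data shapes are assumptions):

```python
from collections import Counter

def select_target_second(candidates):
    # candidates: list of (third_sub_info, distance_in_words) retrieved near
    # one target first sub-information.
    votes = Counter(info for info, _ in candidates)
    top = max(votes.values())
    tied = {info for info, count in votes.items() if count == top}
    # Among the most frequently matched candidates, take the closest one.
    return min((dist, info) for info, dist in candidates if info in tied)[1]

# e.g. select_target_second([("A1", 2), ("A2", 10), ("A1", 7)]) -> "A1"
```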
  • In a specific implementation, determining the object list corresponding to the target video based on the target first sub-information and the target second sub-information includes:
  • an object list corresponding to the target object included in the target video is determined.
  • the preset information mapping table can be understood as an information mapping table including the first sub-information and the second sub-information.
  • the preset information mapping table also includes other object information.
  • The preset information mapping table can be provided by a third party, or can be obtained by scraping data (including the first sub-information and the second sub-information) from the network and then manually cleaning and labeling the data.
  • Based on the target first sub-information and the target second sub-information, a mapping relationship record (that is, an object entry constituting the object list) can be uniquely determined, and the multiple mapping relationship records so determined can be composed into the object list, so that the target objects contained in the target video can be quickly understood through the object list.
  • In addition, the target objects contained in the target video can be retrieved through the information contained in the object list, which filters out the unimportant information in the target video and improves the retrieval efficiency for the target video.
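  • A sketch of composing the object list from a preset information mapping table (the table contents here are made-up placeholders):

```python
# Hypothetical mapping table: (second sub-info, first sub-info) -> category.
MAPPING_TABLE = {
    ("A1", "sports shoes"): "footwear",
    ("B1", "juicer"): "kitchen appliances",
}

def build_object_list(matched_pairs) -> list:
    # matched_pairs: (target second sub-info, target first sub-info) tuples.
    object_list = []
    for second, first in matched_pairs:
        category = MAPPING_TABLE.get((second, first))
        if category is not None:  # each record is one uniquely determined entry
            object_list.append({"brand": second, "category": category, "name": first})
    return object_list
```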
  • the method further includes:
  • Marking the information between the target first sub-information and the target second sub-information in the text information, and determining the unmarked text information, can be understood as marking the part of the text information that has already matched the object information (that is, the processed information and its context), so as to determine the text information that has not matched the object information, and performing information matching again on that unmatched text.
  • This increases the amount of matched information in the text information and, moreover, increases the comprehensiveness and accuracy of the object information contained in the object list.
  • the text information is: "Hello friends, I bought a pair of sports shoes at the A1 official school store last week".
  • the first sub-information of the target is sports shoes, and the second sub-information of the target is A1.
  • the text between A1 and the sneakers is marked, and the unmarked text information is "Hello, everyone, last week.”
  • The second sub-information is then retrieved from the unmarked text information, so as to determine the target second sub-information contained in the unmarked text information. Because the entire text information has already been retrieved according to the first sub-information, the unmarked text information can be understood as text in which no first sub-information was retrieved; therefore, the renewed retrieval is performed according to the second sub-information, thereby determining the target second sub-information contained in the unmarked text information. This is because, after the scan according to the first sub-information, there may be first sub-information that did not appear in the text information, so the portion where the first sub-information did not appear is retrieved again.
  • Word segmentation is then performed on the unmarked text information within the preset processing range of the target second sub-information. Specifically, the unmarked text information within the preset processing range may be split into sentences, and each sentence is segmented to obtain the phrases (that is, word segments) that make up the sentence; the segments are then converted into first word vectors (that is, word embeddings).
  • The first sub-information in the preset object information is also converted into second word vectors.
  • The specific implementation of converting the first sub-information into the second word vectors is similar to the conversion into the first word vectors described above, and will not be repeated here.
  • If the similarity comparison result is greater than the similarity threshold, it indicates that a first word vector is similar to a second word vector, that is, the first sub-information is similar to a word segment in the unmarked text information, and that word segment is used as target first sub-information; if the similarity comparison result is less than or equal to the similarity threshold, it indicates that the first word vector is not similar to the second word vector, that is, the first sub-information differs from the word segment in the unmarked text information, and no processing is needed.
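  • A sketch of the similarity comparison (the `embed` function stands in for any word-embedding lookup; the 0.8 threshold is illustrative):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rematch_unmarked(segments, first_infos, embed, threshold: float = 0.8):
    # segments: word segments from the unmarked text (first word vectors);
    # first_infos: first sub-information from the preset object information
    # (second word vectors). Similar pairs become new target first sub-info.
    matches = []
    for segment in segments:
        seg_vec = embed(segment)
        for info in first_infos:
            if cosine_similarity(seg_vec, embed(info)) > threshold:
                matches.append((segment, info))
    return matches
```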
  • the video processing method further includes:
  • the target video and the object list are displayed as the query result corresponding to the query instruction.
  • a query can be made through keywords (object information).
  • When the matching succeeds, the target video corresponding to the object list is displayed; that is, through object information matching, it is quickly determined whether the target video contains the target object, and the target video is then displayed, which improves the query efficiency for target videos containing the target object.
  • In addition, an object list may also be displayed, so that the querying user can quickly understand the target objects contained in the target video.
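  • A sketch of serving such a query against pre-built object lists (the index layout is an assumption):

```python
def query_target_videos(keyword: str, video_index):
    # video_index: iterable of (video_id, object_list) pairs, where each entry
    # in an object list is a dict such as {"brand": ..., "category": ..., "name": ...}.
    results = []
    for video_id, object_list in video_index:
        if any(keyword in entry.values() for entry in object_list):
            results.append((video_id, object_list))  # show the video and its list
    return results
```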
  • In summary, the video processing method provided by the present application extracts at least two types of modal information from a received target video; extracts text information from the at least two types of modal information according to the corresponding extraction manners; and determines, by matching preset object information of a target object with the text information, the object list corresponding to the target object contained in the target video. Text information is thus extracted from the multi-modal information of the target video, and the target object and its corresponding object list are determined by information matching, which improves the accuracy of determining the target object in the target video and its object list, allows the target objects contained in the target video to be understood quickly, and also facilitates tasks such as searching, recommending, summarizing, and reviewing the target video based on the object information.
  • FIG. 6 shows a processing flowchart of a video processing method applied to a commodity video scene provided by an embodiment of the present application, which specifically includes the following steps:
  • Step 602: Receive a product video (i.e., the target video) containing the target product.
  • Step 604: Extract voice information (text), subtitle information (text), and target information (text) from the product video, and form the text information from the extracted voice information (text), subtitle information (text), and target information (text).
  • The target information refers to product information, such as the product name of the target product, contained in the product video.
  • Step 606: Query the text information according to the brand information in the commodity information database, and determine the brands contained in the text information.
  • Step 608: Perform word segmentation on the text information within the context where the brand is located, obtain the resulting word segments, and convert the segments into first word vectors (i.e., the context information encoding).
  • Step 610: Convert the commodity names in the commodity information database into second word vectors (i.e., the commodity encoding).
  • The execution order of step 608 and step 610 can be interchanged.
  • Step 612: Determine the product corresponding to the brand contained in the text information by comparing the similarity between the first word vectors and the second word vectors, determine the category to which the product belongs according to the brand and the product, and obtain the brand-category-product list (i.e., the product list of the products included in the product video).
  • Conventionally, video product recommendation mainly relies either on manual marking, that is, manual review of the products appearing in a video, or on single-source extraction, that is, extracting information of a single dimension from the video to obtain the product names appearing in it.
  • The first method has high labor cost and low efficiency.
  • The second method has low fault tolerance and is prone to omission and misjudgment. Therefore, how to accurately extract and mine product information from massive videos has become a problem that needs to be solved in video recommendation.
  • The video processing method provided by the present application extracts three types of modal information from a received product video; extracts text information from the three types of modal information according to the corresponding extraction manners; and determines, by matching preset commodity information of a target commodity with the text information, the commodity list corresponding to the target commodity contained in the commodity video. Text information is thus extracted from the multi-modal information of the commodity video, and the target commodity and its corresponding commodity list are determined by information matching, which improves the accuracy of determining the target commodity in the commodity video and its commodity list, allows the target commodity to be understood quickly, and also facilitates tasks such as searching, recommending, video summarization, and review of product videos based on the commodity information.
  • FIG. 7 shows a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application.
  • the device includes:
  • an extraction module 702, configured to extract at least two types of modal information from a received target video;
  • an extraction module 704, configured to extract text information from the at least two types of modal information according to the extraction manners corresponding to the at least two types of modal information;
  • the matching module 706 is configured to determine an object list corresponding to the target object contained in the target video by matching the preset object information of the target object with the text information.
  • the matching module 706 includes:
  • a first determination sub-module configured to perform retrieval in the text information according to the first sub-information in the preset object information of the target object, and determine the target first sub-information included in the text information
  • a second determination sub-module, configured to retrieve, according to the second sub-information in the object information, the text information within the preset retrieval range corresponding to the target first sub-information, and determine the target second sub-information corresponding to the target first sub-information;
  • the determining list submodule is configured to determine an object list corresponding to the target object contained in the target video based on the target first sub-information and the target second sub-information.
  • the second determination submodule is further configured as:
  • the text information within the preset retrieval range corresponding to the first sub-information of the target is retrieved;
  • the target second sub-information corresponding to the target first sub-information is determined.
  • the second determination submodule is further configured as:
  • the target second sub-information corresponding to the target first sub-information is determined.
  • In an optional implementation, determining the object list corresponding to the target video based on the target first sub-information and the target second sub-information includes:
  • an object list corresponding to the target object included in the target video is determined.
  • the matching module 706 is further configured to:
  • the extraction module 702 is further configured to:
  • Image information is extracted from the target video according to preset extraction rules, and subtitle information contained in the image information is extracted.
  • the extraction module 704 includes:
  • a first extraction submodule configured to extract corresponding first text information from the voice information according to an extraction method corresponding to the voice information
  • a second extraction submodule configured to extract corresponding second text information from the image information according to an extraction method corresponding to the image information
  • a generating submodule configured to perform text processing on the subtitle information, and generate third text information corresponding to the subtitle information
  • the first text information, the second text information and/or the third text information form the text information.
  • the first extraction submodule is further configured as:
  • the initial text information is adjusted based on the text smoothing model and the text correction model to obtain first text information corresponding to the voice information.
  • the second extraction submodule is further configured as:
  • the generating submodule is further configured as:
  • the subtitle information is adjusted based on the language processing model and the text processing model to obtain third text information corresponding to the subtitle information.
  • the video processing device further includes:
  • an instruction receiving module configured to receive a query instruction for the target object
  • an information matching module configured to match the object information of the target object carried in the query instruction with the object information in the object list
  • the display module is configured to display the target video and the object list as a query result corresponding to the query instruction when the matching is successful.
  • The video processing apparatus provided by the present application extracts at least two types of modal information from a received target video; extracts text information from the at least two types of modal information according to the corresponding extraction manners; and determines, by matching preset object information of a target object with the text information, the object list corresponding to the target object contained in the target video. Text information is thus extracted from the multi-modal information of the target video, and the target object and its corresponding object list are determined by information matching, which improves the accuracy of determining the target object in the target video and its object list, allows the target objects contained in the target video to be understood quickly, and also facilitates tasks such as searching, recommending, summarizing, and reviewing the target video based on the object information.
  • The above is a schematic solution of the video processing apparatus of this embodiment. It should be noted that the technical solution of the video processing apparatus and the technical solution of the above video processing method belong to the same concept; for details not described in the technical solution of the video processing apparatus, refer to the description of the technical solution of the above video processing method.
  • FIG. 8 shows a structural block diagram of a computing device 800 provided according to an embodiment of the present specification.
  • Components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820 .
  • The processor 820 is connected to the memory 810 through a bus 830, and a database 850 is used for saving data.
  • Computing device 800 also includes access device 840 that enables computing device 800 to communicate via one or more networks 860 .
  • Examples of the network 860 include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 840 may include one or more of any type of network interface (e.g., a network interface card (NIC)), wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
  • The above components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, for example, through a bus.
  • It should be noted that the structural block diagram of the computing device shown in FIG. 8 is for illustrative purposes only and does not limit the scope of this specification. Those skilled in the art can add or replace other components as required.
  • Computing device 800 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs.
  • Computing device 800 may also be a mobile or stationary server.
  • the processor 820 implements the steps of the video processing method by executing computer instructions.
  • The above is a schematic solution of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above video processing method belong to the same concept; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the above video processing method.
  • An embodiment of the present application further provides a computer-readable storage medium, which stores computer instructions, and when the instructions are executed by a processor, implements the steps of the aforementioned video processing method.
  • the above is a schematic solution of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above-mentioned video processing method belong to the same concept. For details not described in detail in the technical solution of the storage medium, refer to the description of the technical solution of the above-mentioned video processing method.
  • An embodiment of the present specification further provides a computer program product, wherein when the computer program product is executed in a computer, the computer is caused to execute the steps of the above video processing method.
  • the computer instructions include computer program product code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program product code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory) ), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunication signals, and software distribution media, etc.
  • the content contained in the computer-readable media may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction, for example, in some jurisdictions, according to legislation and patent practice, the computer-readable media Electric carrier signals and telecommunication signals are not included.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a video processing method and apparatus, wherein the video processing method includes: extracting at least two types of modal information from a received target video; extracting text information from the at least two types of modal information according to extraction methods corresponding to the at least two types of modal information; and determining, by matching preset object information of a target object against the text information, an object list corresponding to the target object contained in the target video.

Description

Video processing method and apparatus
This application claims priority to Chinese patent application No. CN202011529552.3, filed with the China Patent Office on December 22, 2020 and entitled "Video processing method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a video processing method. This application also relates to a video processing apparatus, a computing device, a computer-readable storage medium, and a computer program product.
Background
With the advance of Internet technology, the gradual improvement of Internet infrastructure, and the continuous innovation of mobile devices, the world has entered the mobile Internet era. Video, as one of the important carriers of information on mobile devices, has become an indispensable part of people's daily life, study, work, and entertainment. The amount and complexity of the information carried by video are beyond the reach of pictures, text, audio, and other means of transmission.
Overall, video has a large information capacity: of all the information people can finish processing in the same amount of time, video carries the most. Video information is also multi-modal, i.e., it naturally has multiple dimensions: it contains the image information of each frame and the text information carried by subtitles, as well as the audio information carried in the audio track, and so on. In addition, video information is temporally correlated: the information carried by each frame or segment is usually related to the content of the preceding and following segments. Video not only carries information in every frame, but also carries deeper and more complex information through contextual correlation.
For a viewer, as much of this video information as possible needs to be received. But faced with hundreds of millions of videos, not everyone is interested in the information contained in every frame. In specific tasks such as video retrieval and summarization, video recommendation, and review, only the parts of interest in the video information need to be extracted. How to extract the useful information in a video to accomplish such tasks has become a problem in urgent need of a solution.
Summary
In view of this, embodiments of the present application provide a video processing method. The present application also relates to a video processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, so as to overcome the defect in the prior art that the useful information extracted from a video is incomplete or erroneous.
According to a first aspect of the embodiments of the present application, a video processing method is provided, including:
extracting at least two types of modal information from a received target video;
extracting text information from the at least two types of modal information according to extraction methods corresponding to the at least two types of modal information; and
determining, by matching preset object information of a target object against the text information, an object list corresponding to the target object contained in the target video.
According to a second aspect of the embodiments of the present application, a video processing apparatus is provided, including:
an extraction module configured to extract at least two types of modal information from a received target video;
a text extraction module configured to extract text information from the at least two types of modal information according to extraction methods corresponding to the at least two types of modal information; and
a matching module configured to determine, by matching preset object information of a target object against the text information, an object list corresponding to the target object contained in the target video.
According to a third aspect of the embodiments of the present application, a computing device is provided, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the video processing method when executing the instructions.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer instructions that, when executed by a processor, implement the steps of the video processing method.
According to a fifth aspect of the embodiments of the present specification, a computer program product is provided, wherein when the computer program product is executed in a computer, the computer is caused to execute the steps of the video processing method.
The video processing method provided by the present application includes: extracting at least two types of modal information from a received target video; extracting text information from the at least two types of modal information according to the extraction methods corresponding to them; and determining, by matching preset object information of a target object against the text information, the object list corresponding to the target object contained in the target video. Text information is thus extracted from the multi-modal information of the target video, and the target object contained in the target video and its corresponding object list are determined by information matching, which improves the accuracy of determining the target object in the target video and its corresponding object list, makes it possible to quickly understand the target objects contained in the target video, and facilitates tasks such as search, recommendation, video summarization, and review of the target video based on the object information.
Brief Description of the Drawings
FIG. 1 is a diagram of an example application scenario of a video processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of a video processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of extracting the text information corresponding to the speech information contained in a video according to an embodiment of the present application;
FIG. 4 is a flowchart of extracting the text information corresponding to the image information contained in a video according to an embodiment of the present application;
FIG. 5 is a flowchart of extracting the text information corresponding to the subtitle information contained in a video according to an embodiment of the present application;
FIG. 6 is a processing flowchart of a video processing method applied to a product-video scenario according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
FIG. 8 is a structural block diagram of a computing device according to an embodiment of the present application.
Detailed Description
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from the essence of the present application; the present application is therefore not limited by the specific implementations disclosed below.
The terms used in one or more embodiments of the present application are for the purpose of describing particular embodiments only and are not intended to limit the one or more embodiments of the present application. The singular forms "a", "said", and "the" used in one or more embodiments of the present application and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of the present application refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", etc. may be used in one or more embodiments of the present application to describe various kinds of information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, "first" may also be referred to as "second", and similarly, "second" may also be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
First, the terms involved in one or more embodiments of the present application are explained.
Optical Character Recognition (OCR): the process by which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using character recognition methods.
Object Detection: finding all objects of interest in an image; it includes the two sub-tasks of object localization and object classification, determining both the category and the position of each object.
Natural Language Processing (NLP): an important direction in the fields of computer science and artificial intelligence, studying theories and methods that enable effective natural-language communication between humans and computers.
Automatic Speech Recognition (ASR): the process of automatically converting the sound of human speech into text; ASR technology is a technology that converts human speech into text.
Language Model: an abstract mathematical model of language built on objective linguistic facts; it is a correspondence relationship. The relationship between a language model and objective linguistic facts is like that between an abstract straight line in mathematics and a concrete straight line.
Word Embedding: a method of converting words in text into numeric vectors; in order to analyze them with standard machine-learning algorithms, these vectors need to be provided as numeric input.
Multimodal: each source or form of information can be called a modality. For example, humans have touch, hearing, vision, and smell; media of information include speech, video, text, etc.; and there are all kinds of sensors, such as radar, infrared, and accelerometers. Each of the above can be called a modality. Multimodality refers to the ability to process and understand multi-source modal information through specific methods.
Feature: originally, a characteristic that distinguishes one thing from another; in this document and in the literature of the field, an abstract property that can characterize some kind of information.
Disfluency Detection (text smoothing): automatically deleting the disfluent words in automatic speech recognition (ASR) results by means of an algorithm, so as to obtain more natural and fluent sentences.
YOLO (You Only Look Once): a one-stage object-detection network structure used for object detection.
Faster-RCNN (Faster Region-Based Convolutional Neural Network): a two-stage object-detection network structure used for object detection.
SSD (Single Shot MultiBox Detector): a one-stage object-detection network structure used for object detection.
BERT (Bidirectional Encoder Representations from Transformers): a natural-language-processing model that, fine-tuned through an additional output layer, is suitable for building state-of-the-art models for a wide range of tasks, such as question answering and language inference.
TextCNN: an algorithm that classifies text using a convolutional neural network, where a Convolutional Neural Network (CNN) is a class of feed-forward neural networks that involve convolution computation and have a deep structure.
The present application provides a video processing method, and also relates to a video processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, which are described in detail one by one in the following embodiments.
Referring to FIG. 1, FIG. 1 shows a diagram of an example application scenario of a video processing method according to an embodiment of the present application.
In the application scenario of FIG. 1, a server receives a product video (i.e., a target video) and extracts its multi-modal information. Specifically, a product video can be understood as a product introduction video, a live-streamed product video, etc. The extracted multi-modal information includes speech information, subtitle information, image information, and the like, from which the text information corresponding to the speech information, the text information corresponding to the subtitle information, and the text information corresponding to the image information are extracted.
On the basis of the extracted text information, the product names contained in a product information library are looked up in the text information corresponding to the speech information, the subtitle information, and/or the image information, to determine the product names contained in the various text information (a product name refers to a concrete item name, such as sneakers or shampoo). Within the search range corresponding to a product name, the brand information matching that product name (the product's trademark) is searched for. The product name and its corresponding brand information are then matched against the product information in the product information library to determine the product category to which they correspond (the product category is the class of goods a product belongs to; for example, a juicer belongs to kitchen appliances, and kitchen appliances belong to household appliances). A brand-category-product list for the products contained in the product video is thus obtained, so that tasks such as search, recommendation, video summarization, and review can be performed on the product video according to the product list.
The video processing method provided by this embodiment of the present application analyzes the multi-modal information in a product video to determine the product list corresponding to the products it contains. This processes the product video, extracts the information of interest, avoids extracting the product information of the products in the video from single-feature video information, and improves the accuracy of the extracted product information.
FIG. 2 shows a flowchart of a video processing method according to an embodiment of the present application, which specifically includes the following steps:
Step 202: extract at least two types of modal information from a received target video.
The target video includes, but is not limited to, live video, product video, television video, film video, animation video, entertainment video, etc., without limitation here. As for the modal information: each source or form of information is called a modality. For example, humans have touch, hearing, vision, and smell; media of information include speech, video, text, etc.; there are all kinds of sensors, such as radar, infrared, and accelerometers. Each of the above can be called a modality, and the information obtained through such a modality is modal information.
In practice, understanding video information through a single feature may be biased, leading to omissions or inaccuracies in recommending the video content (e.g., products).
In specific implementations, modal information is diverse, and so are the ways of acquiring different modal information. Acquiring information of multiple modalities of a video helps to understand the information conveyed in the target video more comprehensively. Specifically, extracting at least two types of modal information from the received target video is implemented as follows:
extracting speech information from the received target video;
extracting image information from the target video according to a preset extraction rule; and/or
extracting image information from the target video according to a preset extraction rule, and extracting the subtitle information contained in the image information.
Specifically, extracting speech information from the received target video can be understood as separating the audio track from the target video to obtain the speech information it contains. An audio track is one of the parallel "tracks" seen in sequencer software; each track defines its own properties, such as timbre, sound bank, number of channels, input/output ports, volume, etc.
The image information can be understood as image frames. The preset extraction rule is the rule for extracting image frames from all the image frames of the target video, e.g., extracting one frame every five frames, or one frame every two frames, without limitation here, so as to obtain an image sequence composed of the extracted frames, i.e., the image information.
In addition, the target video may also contain subtitle information; subtitles themselves also reflect some video features. Specifically, text recognition can be performed on the extracted image frames to recognize the text information (i.e., subtitle information) they contain.
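As an illustration of this extraction step, the following is a minimal sketch, assuming ffmpeg and OpenCV as off-the-shelf tools, of how the audio track might be separated and frames sampled; the sampling interval, audio format, and file paths are illustrative assumptions, not values fixed by this application.

```python
import subprocess

import cv2  # OpenCV, used here to decode the video and sample frames


def extract_audio_track(video_path: str, wav_path: str) -> None:
    """Separate the audio track as 16 kHz mono WAV, a common ASR input format."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )


def extract_image_frames(video_path: str, every_n: int = 5) -> list:
    """Sample one frame every `every_n` frames (the preset extraction rule)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % every_n == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```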
Step 204: extract text information from the at least two types of modal information according to the extraction methods corresponding to them.
In practice, because the modal information differs, the methods of extracting text information from it also differ. There are many related technologies, such as object detection, speech recognition, 3D convolution, anomaly detection, text recognition, object tracking, and so on. These technologies play a large role in discovering and understanding the information in video; they can replace manual work in many tasks, obtain the key information present in the video, and assist judgment.
On the basis of extracting multiple types of modal information, the corresponding text information is further extracted from each type of modal information, so as to unify the information contained in the video in textual form and improve the comparability between the types of modal information. Optionally, extracting text information from the at least two types of modal information according to their corresponding extraction methods includes:
extracting corresponding first text information from the speech information according to the extraction method corresponding to the speech information;
extracting corresponding second text information from the image information according to the extraction method corresponding to the image information; and/or
performing text processing on the subtitle information to generate third text information corresponding to the subtitle information;
wherein the first text information, the second text information, and/or the third text information form the text information.
Specifically, the extraction method corresponding to the speech information can be understood as speech recognition, i.e., converting speech information into text information. The extraction method corresponding to the image information can be understood as performing object detection on the target objects in the image frames to obtain their object information. The subtitle information may contain tipping messages, typos, etc., so text processing can remove the invalid information in it. Further, the first text information extracted from the speech information, the second text information extracted from the image information, and/or the third text information extracted from the subtitle information are combined in temporal order to form the text information.
Further, on the basis of performing speech recognition on the speech information, the speech recognition result is also adjusted by a first text processing model and a second text processing model, improving the accuracy of the text information. Extracting the corresponding first text information from the speech information according to its extraction method includes:
inputting the speech information into a speech recognition model for speech recognition, to obtain the initial text information contained in the speech information; and
adjusting the initial text information based on a text smoothing model and a text correction model, to obtain the first text information corresponding to the speech information.
The speech recognition model may be an ASR model. Specifically, performing speech recognition with an ASR model can be understood as inputting the speech information into the model for encoding and feature extraction, querying the extracted features against an acoustic model library to obtain individual words or characters, and then querying a language model library to obtain the best-matching words or characters, thereby forming the initial text information.
However, the initial text information may contain disfluent sentences, verbal tics, etc., so it needs text smoothing and text correction. Specifically, the text smoothing model is used to smooth the initial text information, i.e., delete the disfluent words in the automatic speech recognition (ASR) result to obtain more natural and fluent sentences. The text correction model may be a natural language processing (NLP) model used to correct the initial/smoothed text information. For example, the initial text obtained by speech recognition might be: "I went to a hotel today and asked the front desk how much a bowl of dumplings (shuijiao) costs." The sentence is grammatical but semantically puzzling, because hotels usually do not sell dumplings. The natural language processing model corrects it to: "I went to a hotel today and asked the front desk how much a night's sleep (shuijiao) costs."
For example, as shown in FIG. 3, the server receives a video (i.e., a target video), separates its audio track, inputs the speech information in the audio track into a speech recognition module (ASR) to obtain the initial text information, smooths the initial text information with a text smoothing module to obtain the smoothed text information, and further corrects the smoothed text information with a natural-language correction module (NLP) to obtain the corrected speech information (text), i.e., the first text information.
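The FIG. 3 pipeline can be summarized in code. The sketch below assumes three hypothetical model objects, an ASR model, a text smoothing (disfluency-removal) model, and an NLP correction model, whose method names are invented for illustration; it only shows the order of operations described above.

```python
class SpeechToTextPipeline:
    """A minimal sketch of the FIG. 3 flow: ASR -> smoothing -> correction."""

    def __init__(self, asr_model, smoothing_model, correction_model):
        self.asr = asr_model               # produces the initial text information
        self.smoother = smoothing_model    # deletes disfluent words ("um", restarts)
        self.corrector = correction_model  # NLP model fixing e.g. homophone errors

    def transcribe(self, wav_path: str) -> str:
        raw = self.asr.recognize(wav_path)       # initial text information
        smoothed = self.smoother.smooth(raw)     # smoothed text information
        return self.corrector.correct(smoothed)  # first text information
```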
In specific implementations, the attribute information of the target objects contained in the image frames (i.e., the text information corresponding to the image information) is determined by performing object detection on the image frames with an object recognition model, improving the accuracy of identifying the attribute information of the target objects in the target video. Optionally, extracting the corresponding second text information from the image information according to its extraction method is implemented as follows:
inputting each image frame in the image information into an object recognition model to obtain the attribute information of the target objects contained in each image frame, and using the attribute information as the second text information.
Specifically, the attribute information of a target object can be understood as the concrete object information appearing in a single image. An object detection model can detect the objects appearing in an image and their categories. The objects appearing in an image largely reflect the information the image needs to convey, and are therefore one of the dimensions of information extraction.
The object recognition model can be understood as an object detection network such as YOLO, Faster-RCNN, or SSD, used to detect attribute information such as the object names and/or object positions of the target objects contained in the image frames; this attribute information serves as the second text information.
For example, as shown in FIG. 4, the server receives a video (i.e., a target video), samples frames from it, performs object detection on the sampled frames with an object detection module (i.e., the object recognition model), and obtains the attribute information of the target objects contained in the frames, i.e., the object information (text), which is the second text information.
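As one concrete possibility for this detection step, the sketch below uses a COCO-pretrained Faster-RCNN from torchvision as a stand-in for whichever YOLO/Faster-RCNN/SSD network is actually deployed; the abbreviated label map and the 0.5 score threshold are illustrative assumptions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Abbreviated COCO label map, for illustration only.
COCO_NAMES = {1: "person", 3: "car", 44: "bottle"}

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()


def detect_objects(frame_bgr, score_threshold: float = 0.5) -> list:
    """Return the names of objects detected in one frame (the attribute information)."""
    rgb = frame_bgr[:, :, ::-1].copy()   # OpenCV gives BGR; the model expects RGB
    with torch.no_grad():
        out = model([to_tensor(rgb)])[0]  # dict with 'boxes', 'labels', 'scores'
    return [
        COCO_NAMES.get(int(label), f"class_{int(label)}")
        for label, score in zip(out["labels"], out["scores"])
        if float(score) >= score_threshold
    ]
```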
The subtitle information is adjusted by a language processing model and a text processing model, improving the accuracy of the text information corresponding to the subtitle information (i.e., the third text information). Optionally, performing text processing on the subtitle information to generate the third text information corresponding to the subtitle information is implemented as follows:
adjusting the subtitle information based on a language processing model and a text processing model, to obtain the third text information corresponding to the subtitle information.
Specifically, the language processing model can be understood as a natural language processing (NLP) model that corrects and denoises the subtitle information. In addition, because the interval between sampled frames may be small, adjacent sampled frames may contain identical subtitle information; the text processing model is therefore also used to deduplicate identical text content, finally obtaining the deduplicated text information, i.e., the third text information.
For example, as shown in FIG. 5, the server receives a video (i.e., a target video), samples frames from it, inputs the sampled frames into a text recognition model (OCR) for text recognition to obtain the subtitle information contained in the frames, corrects the subtitle information with a language model to obtain the corrected text information, deduplicates the corrected text information with a text deduplication module, and obtains the deduplicated subtitle information (text), i.e., the third text information.
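A minimal sketch of the FIG. 5 subtitle path, assuming Tesseract (via pytesseract) as the OCR engine and a simple consecutive-duplicate filter standing in for the text deduplication module; a production system would use the trained OCR and language-correction models described above.

```python
import pytesseract  # assumes the Tesseract OCR engine is installed locally


def extract_subtitles(frames) -> list:
    """OCR each sampled frame, then drop consecutive duplicates caused by
    adjacent frames showing the same on-screen subtitle."""
    lines, previous = [], None
    for frame in frames:
        text = pytesseract.image_to_string(frame).strip()
        if text and text != previous:  # simple consecutive-duplicate filter
            lines.append(text)
        previous = text
    return lines
```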
Step 206: determine, by matching the preset object information of the target object against the text information, the object list corresponding to the target object contained in the target video.
Optionally, the target object includes: a product, a person, an animal, a virtual item, a regulation, etc. Correspondingly, the object information includes: product information, person information, animal information, virtual-item information, sensitive-word information, etc., and the object list includes: a product list, a person list, an animal list, a virtual-item list, a sensitive-word list, etc.
The product list may take the form of a product-name list, a product-brand list, a product-category list, or a product brand-category-name list composed of product names, product categories, and product brands, which is not elaborated here.
Specifically, matching the preset object information of the target object against the text information can be understood as looking up the preset object information in the text information to determine whether the target video contains the corresponding object information, forming the contained object information into an object list, and using the object list to indicate that the target video contains the target object.
Further, determining the object list corresponding to the target object contained in the target video by matching the preset object information of the target object against the text information is specifically implemented as follows:
searching the text information according to first sub-information in the preset object information of the target object, to determine the target first sub-information contained in the text information;
searching the text information within a preset search range corresponding to the target first sub-information according to second sub-information in the object information, to determine the target second sub-information corresponding to the target first sub-information; and
determining, based on the target first sub-information and the target second sub-information, the object list corresponding to the target object contained in the target video.
The first sub-information can be understood as name information such as a product name, a person name, or an animal name. Searching the text information according to the first sub-information determines which first sub-information the text information contains; the first sub-information contained in the text information is taken as the target first sub-information.
The second sub-information can be understood as information such as a product brand, a person's skin color, or an animal's color. The preset search range is the pre-set range searched from the preceding to the following text around the position of the target first sub-information; specifically, it can be expressed as a range of characters or sentences before or after that position, e.g., 20 characters forward or backward, or two sentences forward or backward.
For example, the text information is: "Hello everyone, last week I bought a pair of sneakers at the A1 official flagship store." The target first sub-information is "sneakers" and the second sub-information is "A1". Suppose the context threshold (i.e., the preset search range) is set to 20, i.e., at most 20 characters are searched forward and at most 20 backward. When "A1" is retrieved, it can therefore be successfully matched to "sneakers".
Specifically, searching the text information within the preset search range corresponding to the target first sub-information means searching for the second sub-information in the context around the position where the first sub-information appears in the text information, taking the retrieved second sub-information as the target second sub-information corresponding to the target first sub-information, and further determining, based on the target first sub-information and its corresponding target second sub-information, the object list of the target objects contained in the target video, i.e., a summary list of the information of the contained target objects.
In addition, there is also the case where no second sub-information is detected in the text information within the preset search range corresponding to the target first sub-information; the retrieved target first sub-information is then not processed, i.e., the above retrieval result for the first sub-information is discarded.
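The window matching described above can be sketched as follows; the product names (first sub-information), brands (second sub-information), and 20-character window are the illustrative values from the example, and `str.find` stands in for a full scan over every occurrence. Tie-breaking among several candidate brands is handled in the sketches further below.

```python
PRODUCT_NAMES = ["sneakers", "shampoo"]  # first sub-information (illustrative)
BRANDS = ["A1", "A2"]                    # second sub-information (illustrative)
WINDOW = 20                              # preset search range, in characters


def match_products(text: str) -> list:
    """Return (brand, product) pairs found within the context window."""
    pairs = []
    for product in PRODUCT_NAMES:
        start = text.find(product)
        if start < 0:
            continue  # product name absent from the text information
        lo = max(0, start - WINDOW)
        hi = min(len(text), start + len(product) + WINDOW)
        context = text[lo:hi]
        hits = [brand for brand in BRANDS if brand in context]
        if hits:  # otherwise: discard this retrieval result, as stated above
            pairs.append((hits[0], product))
    return pairs
```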
In this embodiment of the present application, by explicitly searching for the target first sub-information and target second sub-information matching the first sub-information and second sub-information in the object information, and further determining, based on them, the object list of the target objects contained in the target video, the target object contained in the target video is determined from multiple pieces of information and an object list of the target object is formed. This improves the accuracy of determining the target object, describes the contained target objects through the object list, and effectively extracts the information of the target objects contained in the target video.
Optionally, searching the text information within the preset search range corresponding to the target first sub-information according to the second sub-information in the object information, and determining the target second sub-information corresponding to the target first sub-information, includes:
searching the text information within the preset search range corresponding to the target first sub-information according to the second sub-information in the object information;
in a case where multiple pieces of third sub-information matching the second sub-information are retrieved, determining the distance between each piece of third sub-information and the target first sub-information in the text information; and
determining, according to the distances, the target second sub-information corresponding to the target first sub-information.
Specifically, the distance can be understood as the number of characters between the two pieces of information. Taking the third sub-information closer to the target first sub-information as the target second sub-information increases the accuracy of determining the target second sub-information corresponding to the target first sub-information.
Taking a product as the target object, the target first sub-information is the product name "sneakers". In the context of "sneakers" (within a range of 20 characters forward or backward), two product brands (second sub-information) are retrieved: A1 and A2, where A1 is 2 characters away from "sneakers" and A2 is 10 characters away; A1 is therefore determined to be the brand corresponding to "sneakers", as in the sketch below.
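A sketch of this distance rule, assuming the character distances between each candidate brand and the product name have already been measured:

```python
def pick_nearest(candidates):
    """candidates: (brand, distance-in-characters) pairs; return the closest brand."""
    return min(candidates, key=lambda pair: pair[1])[0]


# e.g. pick_nearest([("A1", 2), ("A2", 10)]) -> "A1"
```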
Optionally, in addition to the above way of determining the target second sub-information, after multiple pieces of third sub-information matching the second sub-information are retrieved, the method further includes the following (sketched in code after the example below):
determining the number of times each kind of third sub-information is matched; and
determining, based on the number of times matched, the target second sub-information corresponding to the target first sub-information.
Specifically, the number of times matched can be understood as the number of times the same kind of third sub-information is retrieved. For example, in a voting manner: if five pieces of third sub-information matching the second sub-information are retrieved, of which three are A1 (i.e., A1 is matched three times) and two are A2 (i.e., A2 is matched twice), the more frequently matched third sub-information A1 is taken as the target second sub-information, which increases the accuracy of determining the target second sub-information corresponding to the target first sub-information.
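And the corresponding vote-based rule, again as an illustrative sketch:

```python
from collections import Counter


def pick_by_votes(hits):
    """hits: every retrieved third sub-information; the most frequent one wins."""
    return Counter(hits).most_common(1)[0][0]


# e.g. pick_by_votes(["A1", "A1", "A1", "A2", "A2"]) -> "A1"
```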
Optionally, determining the object list corresponding to the target video based on the target first sub-information and the target second sub-information includes:
determining, according to the mapping relationship of the target first sub-information and the target second sub-information in a preset information mapping table, the object list corresponding to the target object contained in the target video.
In practice, the preset information mapping table can be understood as an information mapping table containing the first sub-information and the second sub-information, as well as other object information. In specific implementations, the table may be provided by a third party, or obtained by crawling data (including first and second sub-information) from the web and then manually cleaning and annotating it.
On the basis of determining the target first sub-information and the target second sub-information, a mapping record (i.e., an object entry composing the object list) can be uniquely determined, and the multiple determined mapping records are then composed into the object list, so that the target objects contained in the target video can be quickly understood through the object list. In addition, the target objects contained in the target video can be retrieved through the information contained in the object list, which filters out the unimportant information in the target video and improves retrieval efficiency.
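A toy version of this mapping-table lookup; all entries are invented for illustration:

```python
# (brand, product) -> category; a stand-in for the preset information mapping table
MAPPING_TABLE = {
    ("A1", "sneakers"): "sportswear",
    ("A2", "shampoo"): "personal care",
}


def build_object_list(pairs):
    """Turn matched (brand, product) pairs into brand-category-product entries."""
    return [
        (brand, MAPPING_TABLE[(brand, product)], product)
        for brand, product in pairs
        if (brand, product) in MAPPING_TABLE  # skip pairs with no mapping record
    ]
```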
Further and optionally, after determining the target second sub-information corresponding to the target first sub-information, the method also includes:
marking, in the text information, the information from the target first sub-information to the target second sub-information, and determining the unmarked text information;
searching the unmarked text information according to the second sub-information, to determine the target second sub-information contained in the unmarked text information;
determining the preset processing range corresponding to the target second sub-information contained in the unmarked text information;
performing word segmentation on the unmarked text information within the preset processing range, and converting the words obtained by the word segmentation into first word vectors; and
comparing the first word vectors for similarity with second word vectors converted from the first sub-information, to determine the target first sub-information corresponding to the target second sub-information contained in the unmarked text information.
Specifically, marking the information from the target first sub-information to the target second sub-information in the text information and determining the unmarked text information can be understood as marking the part of the text information already matched to object information (i.e., the processed information and its context) so that it is skipped in subsequent matching, thereby determining the text information not yet matched to object information and matching that text information again. This increases the amount of matched information in the text information and, further, increases the comprehensiveness and accuracy of the object information contained in the object list.
Continuing the example above, the text information is: "Hello everyone, last week I bought a pair of sneakers at the A1 official flagship store." The target first sub-information is "sneakers" and the target second sub-information is "A1". The text from "A1" through "sneakers" is marked, so the unmarked text information is "Hello everyone, last week ... at the" (the text preceding the marked span).
Specifically, the second sub-information is searched for in the unmarked text information to determine the target second sub-information it contains, because the whole text information has already been searched according to the first sub-information, and the unmarked text information can be understood as text in which no first sub-information was retrieved. Therefore, the second pass searches according to the second sub-information to determine the target second sub-information contained in the unmarked text information: after the first-sub-information scan, the text information may still contain first sub-information that does not appear explicitly, so this part is searched again.
Further, word segmentation is performed on the unmarked text information within the preset processing range of the target second sub-information. Specifically, this can be understood as splitting the unmarked text information within the preset processing range into sentences, segmenting the sentences to obtain the word groups (i.e., tokens) composing them, and converting the tokens into first word vectors (i.e., word embeddings). Specifically, feature extraction is performed on the tokens by a conversion model such as BERT or TextCNN, converting the tokens into vector encodings; the preset processing range is similar to the preset search range above and is not elaborated here. The first sub-information in the preset object information is likewise converted into second word vectors; the specific implementation of converting the first sub-information into second word vectors is similar to that of the first word vectors and is not repeated here.
Still further, the first word vectors are compared for similarity with the second word vectors (since the first sub-information did not appear explicitly — otherwise it would have been labeled, i.e., marked — word vectors are extracted for similarity comparison). If the similarity result is greater than a similarity threshold, the first word vector is similar to the second word vector, i.e., the first sub-information is similar to a token in the unmarked text information, and that token is taken as the target first sub-information. If the similarity result is less than or equal to the threshold, the vectors are dissimilar, i.e., the first sub-information differs from the token, and no processing is needed.
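The embedding-similarity step can be sketched as below; `embed` is a hypothetical stand-in for a BERT/TextCNN encoder returning one vector per string, and the 0.8 threshold is an assumption rather than a value from this application.

```python
import numpy as np

SIM_THRESHOLD = 0.8  # assumed similarity threshold


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def implicit_product_match(tokens, product_names, embed):
    """Return (token, product-name) pairs whose embeddings exceed the threshold."""
    matches = []
    for token in tokens:
        token_vec = embed(token)  # first word vector
        for name in product_names:
            if cosine(token_vec, embed(name)) > SIM_THRESHOLD:  # vs second word vector
                matches.append((token, name))
    return matches
```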
Optionally, after determining the object list corresponding to the target object contained in the target video, the video processing method further includes:
receiving a query instruction for a target object;
matching the object information of the target object carried in the query instruction against the object information in the object list; and
in a case where the matching succeeds, displaying the target video and the object list as the query result corresponding to the query instruction.
In practice, after the video processing obtains the object list of the target objects contained in the target video, queries can be made by keyword (object information). If the retrieved object list contains the query keyword, the target video corresponding to the object list is displayed. That is, object-information matching quickly determines whether the target video contains the target object, and the target video is then displayed, improving the efficiency of querying for target videos containing the target object.
In practice, on the basis of displaying the target video as the query result, the object list can also be displayed so that the querying user can quickly understand the target objects contained in the target video.
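Finally, a minimal sketch of the query step, where `index` is a hypothetical store mapping each video id to its brand-category-product entries:

```python
def query_videos(keyword: str, index: dict) -> list:
    """Return (video_id, object_list) pairs whose object list mentions the keyword."""
    return [
        (video_id, entries)
        for video_id, entries in index.items()
        if any(keyword in field for entry in entries for field in entry)
    ]
```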
In summary, the video processing method provided by the present application includes: extracting at least two types of modal information from a received target video; extracting text information from the at least two types of modal information according to the extraction methods corresponding to them; and determining, by matching preset object information of a target object against the text information, the object list corresponding to the target object contained in the target video. Text information is thus extracted from the multi-modal information of the target video, and the target object contained in the target video and its corresponding object list are determined by information matching, which improves the accuracy of determining the target object in the target video and its corresponding object list, makes it possible to quickly understand the target objects contained in the target video, and facilitates tasks such as search, recommendation, video summarization, and review of the target video based on the object information.
The video processing method is further described below with reference to FIG. 6, taking the application of the video processing method provided by the present application in a product-video scenario as an example. FIG. 6 shows a processing flowchart of a video processing method applied to a product-video scenario according to an embodiment of the present application, which specifically includes the following steps:
Step 602: receive a product video (i.e., a target video) containing a target product.
Step 604: extract speech information (text), subtitle information (text), and object information (text) from the product video, and form the extracted speech information (text), subtitle information (text), and object information (text) into the text information.
Specifically, the object information refers to product information such as the product names of the target products contained in the product video.
Step 606: query the text information according to the brand information in the product information library, and determine the brands contained in the text information.
Step 608: perform word segmentation on the text information within the context range of a brand to obtain the segmented tokens, and convert the tokens into first word vectors (i.e., the context information encoding).
Step 610: convert the product names in the product information library into second word vectors (i.e., the product encoding).
It should be noted that the order of step 608 and step 610 is interchangeable.
Step 612: determine the products corresponding to the brands contained in the text information by comparing the first word vectors for similarity with the second word vectors, determine the categories the products belong to according to the brands and products, and obtain the brand-category-product list (i.e., the product list of the products contained in the product video).
In practice, recommending corresponding products according to video content is one application of video information extraction. At present, video product recommendation mainly relies on manual labeling, i.e., manually reviewing the products appearing in a video, or on single-source extraction, i.e., extracting the information of only one dimension of the video to obtain the names of products appearing in it. The first approach has high labor costs and low efficiency; the second has low fault tolerance and is prone to omissions and misjudgments. Therefore, how to accurately extract and mine product information from massive numbers of videos has become an application problem to be solved in video recommendation.
In summary, the video processing method provided by the present application includes: extracting three types of modal information from a received product video; extracting text information from the three types of modal information according to the extraction methods corresponding to them; and determining, by matching preset product information of a target product against the text information, the product list corresponding to the target product contained in the product video. Text information is thus extracted from the multi-modal information of the product video, and the target product contained in the product video and its corresponding product list are determined by information matching, which improves the accuracy of determining the target product in the product video and its corresponding product list, makes it possible to quickly understand the target products contained in the product video, and facilitates tasks such as search, recommendation, video summarization, and review of the product video based on the product information.
Corresponding to the above method embodiments, the present application also provides embodiments of a video processing apparatus. FIG. 7 shows a schematic structural diagram of a video processing apparatus according to an embodiment of the present application. As shown in FIG. 7, the apparatus includes:
an extraction module 702 configured to extract at least two types of modal information from a received target video;
a text extraction module 704 configured to extract text information from the at least two types of modal information according to the extraction methods corresponding to the at least two types of modal information; and
a matching module 706 configured to determine, by matching preset object information of a target object against the text information, the object list corresponding to the target object contained in the target video.
Optionally, the matching module 706 includes:
a first determination sub-module configured to search the text information according to first sub-information in the preset object information of the target object, and determine the target first sub-information contained in the text information;
a second determination sub-module configured to search the text information within a preset search range corresponding to the target first sub-information according to second sub-information in the object information, and determine the target second sub-information corresponding to the target first sub-information; and
a list determination sub-module configured to determine, based on the target first sub-information and the target second sub-information, the object list corresponding to the target object contained in the target video.
Optionally, the second determination sub-module is further configured to:
search the text information within the preset search range corresponding to the target first sub-information according to the second sub-information in the object information;
in a case where multiple pieces of third sub-information matching the second sub-information are retrieved, determine the distance between each piece of third sub-information and the target first sub-information in the text information; and
determine, according to the distances, the target second sub-information corresponding to the target first sub-information.
Optionally, the second determination sub-module is also configured to:
determine the number of times each kind of third sub-information is matched; and
determine, based on the number of times matched, the target second sub-information corresponding to the target first sub-information.
Optionally, determining the object list corresponding to the target video based on the target first sub-information and the target second sub-information includes:
determining, according to the mapping relationship of the target first sub-information and the target second sub-information in a preset information mapping table, the object list corresponding to the target object contained in the target video.
Optionally, the matching module 706 is also configured to:
mark, in the text information, the information from the target first sub-information to the target second sub-information, and determine the unmarked text information;
search the unmarked text information according to the second sub-information, and determine the target second sub-information contained in the unmarked text information;
determine the preset processing range corresponding to the target second sub-information contained in the unmarked text information;
perform word segmentation on the unmarked text information within the preset processing range, and convert the words obtained by the word segmentation into first word vectors; and
compare the first word vectors for similarity with second word vectors converted from the first sub-information, to determine the target first sub-information corresponding to the target second sub-information contained in the unmarked text information.
Optionally, the extraction module 702 is further configured to:
extract speech information from the received target video;
extract image information from the target video according to a preset extraction rule; and/or
extract image information from the target video according to a preset extraction rule, and extract the subtitle information contained in the image information.
Optionally, the text extraction module 704 includes:
a first extraction sub-module configured to extract corresponding first text information from the speech information according to the extraction method corresponding to the speech information;
a second extraction sub-module configured to extract corresponding second text information from the image information according to the extraction method corresponding to the image information; and/or
a generation sub-module configured to perform text processing on the subtitle information and generate third text information corresponding to the subtitle information;
wherein the first text information, the second text information, and/or the third text information form the text information.
Optionally, the first extraction sub-module is further configured to:
input the speech information into a speech recognition model for speech recognition, and obtain the initial text information contained in the speech information; and
adjust the initial text information based on a text smoothing model and a text correction model, and obtain the first text information corresponding to the speech information.
Optionally, the second extraction sub-module is further configured to:
input each image frame in the image information into an object recognition model, obtain the attribute information of the target objects contained in each image frame, and use the attribute information as the second text information.
Optionally, the generation sub-module is further configured to:
adjust the subtitle information based on a language processing model and a text processing model, to obtain the third text information corresponding to the subtitle information.
Optionally, the video processing apparatus further includes:
an instruction receiving module configured to receive a query instruction for a target object;
an information matching module configured to match the object information of the target object carried in the query instruction against the object information in the object list; and
a display module configured to display, in a case where the matching succeeds, the target video and the object list as the query result corresponding to the query instruction.
In summary, the video processing apparatus provided by the present application extracts at least two types of modal information from a received target video; extracts text information from the at least two types of modal information according to the extraction methods corresponding to them; and determines, by matching preset object information of a target object against the text information, the object list corresponding to the target object contained in the target video. Text information is thus extracted from the multi-modal information of the target video, and the target object contained in the target video and its corresponding object list are determined by information matching, which improves the accuracy of determining the target object in the target video and its corresponding object list, makes it possible to quickly understand the target objects contained in the target video, and facilitates tasks such as search, recommendation, video summarization, and review of the target video based on the object information.
The above is an illustrative solution of the video processing apparatus of this embodiment. It should be noted that the technical solution of the video processing apparatus and the technical solution of the above video processing method belong to the same concept; for details not described in the technical solution of the apparatus, refer to the description of the technical solution of the method.
FIG. 8 shows a structural block diagram of a computing device 800 according to an embodiment of the present specification. Components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820. The processor 820 is connected to the memory 810 through a bus 830, and a database 850 is used for saving data.
The computing device 800 also includes an access device 840 that enables the computing device 800 to communicate via one or more networks 860. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 840 may include one or more of any type of network interface, wired or wireless (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
In an embodiment of the present specification, the above components of the computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 8 is for the purpose of example only, rather than limiting the scope of the present specification. Those skilled in the art may add or replace other components as required.
The computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smartphone), a wearable computing device (e.g., a smart watch, smart glasses, etc.) or other type of mobile device, or a stationary computing device such as a desktop computer or a PC. The computing device 800 may also be a mobile or stationary server.
The processor 820 implements the steps of the video processing method by executing computer instructions.
The above is an illustrative solution of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above video processing method belong to the same concept; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the video processing method described above.
The above is an illustrative solution of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above video processing method belong to the same concept; for details not described in the technical solution of the storage medium, refer to the description of the technical solution of the method.
An embodiment of the present specification also provides a computer program product which, when executed in a computer, causes the computer to execute the steps of the video processing method described above.
The above is an illustrative solution of the computer program product of this embodiment. It should be noted that the technical solution of the computer program product and the technical solution of the above video processing method belong to the same concept; for details not described in the technical solution of the computer program product, refer to the description of the technical solution of the method.
Specific embodiments of the present application have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program product code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program product code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that, for brevity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by the present application.
In the above embodiments, the descriptions of the various embodiments each have their own emphasis; for parts not detailed in one embodiment, refer to the relevant descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to help explain the present application. The optional embodiments do not describe all the details exhaustively, nor do they limit the invention to only the specific implementations described. Obviously, many modifications and variations can be made according to the content of the present application. These embodiments were selected and specifically described in order to better explain the principles and practical applications of the present application, so that those skilled in the art can understand and use the present application well. The present application is limited only by the claims and their full scope and equivalents.

Claims (16)

  1. A video processing method, comprising:
    extracting at least two types of modal information from a received target video;
    extracting text information from the at least two types of modal information according to extraction methods corresponding to the at least two types of modal information; and
    determining, by matching preset object information of a target object against the text information, an object list corresponding to the target object contained in the target video.
  2. The video processing method according to claim 1, wherein the determining, by matching preset object information of a target object against the text information, an object list corresponding to the target object contained in the target video comprises:
    searching the text information according to first sub-information in the preset object information of the target object, and determining target first sub-information contained in the text information;
    searching the text information within a preset search range corresponding to the target first sub-information according to second sub-information in the object information, and determining target second sub-information corresponding to the target first sub-information; and
    determining, based on the target first sub-information and the target second sub-information, the object list corresponding to the target object contained in the target video.
  3. The video processing method according to claim 2, wherein the searching the text information within a preset search range corresponding to the target first sub-information according to second sub-information in the object information, and determining target second sub-information corresponding to the target first sub-information comprises:
    searching the text information within the preset search range corresponding to the target first sub-information according to the second sub-information in the object information;
    in a case where multiple pieces of third sub-information matching the second sub-information are retrieved, determining a distance between each piece of third sub-information and the target first sub-information in the text information; and
    determining, according to the distances, the target second sub-information corresponding to the target first sub-information.
  4. The video processing method according to claim 3, wherein after the retrieving of the multiple pieces of third sub-information matching the second sub-information, the method further comprises:
    determining a number of times each kind of third sub-information is matched; and
    determining, based on the number of times matched, the target second sub-information corresponding to the target first sub-information.
  5. The video processing method according to any one of claims 2-4, wherein the determining, based on the target first sub-information and the target second sub-information, the object list corresponding to the target video comprises:
    determining, according to a mapping relationship of the target first sub-information and the target second sub-information in a preset information mapping table, the object list corresponding to the target object contained in the target video.
  6. The video processing method according to any one of claims 2-5, wherein after the determining of the target second sub-information corresponding to the target first sub-information, the method further comprises:
    marking, in the text information, the information from the target first sub-information to the target second sub-information, and determining unmarked text information;
    searching the unmarked text information according to the second sub-information, and determining target second sub-information contained in the unmarked text information;
    determining a preset processing range corresponding to the target second sub-information contained in the unmarked text information;
    performing word segmentation on the unmarked text information within the preset processing range, and converting the words obtained by the word segmentation into first word vectors; and
    comparing the first word vectors for similarity with second word vectors converted from the first sub-information, to determine target first sub-information corresponding to the target second sub-information contained in the unmarked text information.
  7. The video processing method according to any one of claims 1-6, wherein the extracting at least two types of modal information from a received target video comprises:
    extracting speech information from the received target video;
    extracting image information from the target video according to a preset extraction rule; and/or
    extracting image information from the target video according to a preset extraction rule, and extracting subtitle information contained in the image information.
  8. The video processing method according to claim 7, wherein the extracting text information from the at least two types of modal information according to extraction methods corresponding to the at least two types of modal information comprises:
    extracting corresponding first text information from the speech information according to an extraction method corresponding to the speech information;
    extracting corresponding second text information from the image information according to an extraction method corresponding to the image information; and/or
    performing text processing on the subtitle information and generating third text information corresponding to the subtitle information;
    wherein the first text information, the second text information, and/or the third text information form the text information.
  9. The video processing method according to claim 8, wherein the extracting corresponding first text information from the speech information according to an extraction method corresponding to the speech information comprises:
    inputting the speech information into a speech recognition model for speech recognition, and obtaining initial text information contained in the speech information; and
    adjusting the initial text information based on a text smoothing model and a text correction model, and obtaining the first text information corresponding to the speech information.
  10. The video processing method according to claim 8, wherein the extracting corresponding second text information from the image information according to an extraction method corresponding to the image information comprises:
    inputting each image frame in the image information into an object recognition model, obtaining attribute information of target objects contained in each image frame, and using the attribute information as the second text information.
  11. The video processing method according to claim 8, wherein the performing text processing on the subtitle information and generating third text information corresponding to the subtitle information comprises:
    adjusting the subtitle information based on a language processing model and a text processing model, to obtain the third text information corresponding to the subtitle information.
  12. The video processing method according to any one of claims 1-11, further comprising:
    receiving a query instruction for a target object;
    matching object information of the target object carried in the query instruction against the object information in the object list; and
    in a case where the matching succeeds, displaying the target video and the object list as a query result corresponding to the query instruction.
  13. A video processing apparatus, comprising:
    an extraction module configured to extract at least two types of modal information from a received target video;
    a text extraction module configured to extract text information from the at least two types of modal information according to extraction methods corresponding to the at least two types of modal information; and
    a matching module configured to determine, by matching preset object information of a target object against the text information, an object list corresponding to the target object contained in the target video.
  14. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1-12 when executing the instructions.
  15. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1-12.
  16. A computer program product which, when executed in a computer, causes the computer to execute the steps of the method according to any one of claims 1-12.
PCT/CN2021/120390 2020-12-22 2021-09-24 Video processing method and apparatus WO2022134701A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21908702.0A EP4207772A4 (en) 2020-12-22 2021-09-24 VIDEO PROCESSING METHOD AND APPARATUS
US18/298,243 US20230245455A1 (en) 2020-12-22 2023-04-10 Video processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011529552.3 2020-12-22
CN202011529552.3A CN112738556B (zh) Video processing method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/298,243 Continuation US20230245455A1 (en) 2020-12-22 2023-04-10 Video processing

Publications (1)

Publication Number Publication Date
WO2022134701A1 true WO2022134701A1 (zh) 2022-06-30

Family

ID=75605698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120390 WO2022134701A1 (zh) 2020-12-22 2021-09-24 视频处理方法及装置

Country Status (4)

Country Link
US (1) US20230245455A1 (zh)
EP (1) EP4207772A4 (zh)
CN (1) CN112738556B (zh)
WO (1) WO2022134701A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134636A * 2022-08-29 2022-09-30 北京达佳互联信息技术有限公司 Information recommendation method and apparatus
CN115881295A * 2022-12-06 2023-03-31 首都医科大学附属北京天坛医院 Parkinson's symptom information detection method, apparatus, device, and computer-readable medium
CN115966061A * 2022-12-28 2023-04-14 上海帜讯信息技术股份有限公司 Disaster early-warning processing method, system, and apparatus based on 5G messages

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112738556B (zh) 2020-12-22 2023-03-31 上海幻电信息科技有限公司 Video processing method and apparatus
CN115022732B (zh) * 2022-05-25 2023-11-03 阿里巴巴(中国)有限公司 Video generation method, apparatus, device, and medium
CN116562270A (zh) * 2023-07-07 2023-08-08 天津亿科科技有限公司 Natural language processing system supporting multi-modal input, and method therefor
CN117573870B (zh) * 2023-11-20 2024-05-07 中国人民解放军国防科技大学 Text label extraction method, apparatus, device, and medium for multi-modal data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180055A (zh) * 2016-03-11 2017-09-19 阿里巴巴集团控股有限公司 Method and apparatus for displaying business objects
US10146751B1 (en) * 2014-12-31 2018-12-04 Guangsheng Zhang Methods for information extraction, search, and structured representation of text data
CN110147467A (zh) * 2019-04-11 2019-08-20 北京达佳互联信息技术有限公司 Text description generation method and apparatus, mobile terminal, and storage medium
CN110582025A (zh) * 2018-06-08 2019-12-17 北京百度网讯科技有限公司 Method and apparatus for processing video
CN112738556A (zh) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 Video processing method and apparatus

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8396870B2 (en) * 2009-06-25 2013-03-12 University Of Tennessee Research Foundation Method and apparatus for predicting object properties and events using similarity-based information retrieval and modeling
TW201222288A (en) * 2010-11-22 2012-06-01 Inst Information Industry Image retrieving system and method and computer program product thereof
US9699485B2 (en) * 2012-08-31 2017-07-04 Facebook, Inc. Sharing television and video programming through social networking
CN103902611A (zh) * 2012-12-28 2014-07-02 鸿富锦精密工业(深圳)有限公司 视频内容搜索系统及方法
US9576075B2 (en) * 2014-10-31 2017-02-21 International Business Machines Corporation Context aware query selection
CN106878632B (zh) * 2017-02-28 2020-07-10 北京知慧教育科技有限公司 一种视频数据的处理方法和装置
US20190080207A1 (en) * 2017-07-06 2019-03-14 Frenzy Labs, Inc. Deep neural network visual product recognition system
CN108833973B (zh) * 2018-06-28 2021-01-19 腾讯科技(深圳)有限公司 视频特征的提取方法、装置和计算机设备
CN110795597A (zh) * 2018-07-17 2020-02-14 上海智臻智能网络科技股份有限公司 视频关键字确定、视频检索方法及装置、存储介质、终端
CN111428088B (zh) * 2018-12-14 2022-12-13 腾讯科技(深圳)有限公司 视频分类方法、装置及服务器
US11176191B2 (en) * 2019-01-22 2021-11-16 Amazon Technologies, Inc. Search result image selection techniques
CN109905772B (zh) * 2019-03-12 2022-07-22 腾讯科技(深圳)有限公司 视频片段查询方法、装置、计算机设备及存储介质
CN110225387A (zh) * 2019-05-20 2019-09-10 北京奇艺世纪科技有限公司 一种信息搜索方法、装置及电子设备
CN110502661A (zh) * 2019-07-08 2019-11-26 天脉聚源(杭州)传媒科技有限公司 一种视频搜索方法、系统及存储介质
CN111241340B (zh) * 2020-01-17 2023-09-08 Oppo广东移动通信有限公司 视频标签确定方法、装置、终端及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10146751B1 (en) * 2014-12-31 2018-12-04 Guangsheng Zhang Methods for information extraction, search, and structured representation of text data
CN107180055A (zh) * 2016-03-11 2017-09-19 阿里巴巴集团控股有限公司 Method and apparatus for displaying business objects
CN110582025A (zh) * 2018-06-08 2019-12-17 北京百度网讯科技有限公司 Method and apparatus for processing video
CN110147467A (zh) * 2019-04-11 2019-08-20 北京达佳互联信息技术有限公司 Text description generation method and apparatus, mobile terminal, and storage medium
CN112738556A (zh) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 Video processing method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4207772A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134636A * 2022-08-29 2022-09-30 北京达佳互联信息技术有限公司 Information recommendation method and apparatus
CN115881295A * 2022-12-06 2023-03-31 首都医科大学附属北京天坛医院 Parkinson's symptom information detection method, apparatus, device, and computer-readable medium
CN115881295B (zh) * 2022-12-06 2024-01-23 首都医科大学附属北京天坛医院 Parkinson's symptom information detection method, apparatus, device, and computer-readable medium
CN115966061A * 2022-12-28 2023-04-14 上海帜讯信息技术股份有限公司 Disaster early-warning processing method, system, and apparatus based on 5G messages
CN115966061B (zh) * 2022-12-28 2023-10-24 上海帜讯信息技术股份有限公司 Disaster early-warning processing method, system, and apparatus based on 5G messages

Also Published As

Publication number Publication date
EP4207772A4 (en) 2024-02-21
CN112738556B (zh) 2023-03-31
CN112738556A (zh) 2021-04-30
EP4207772A1 (en) 2023-07-05
US20230245455A1 (en) 2023-08-03

Similar Documents

Publication Publication Date Title
WO2022134701A1 (zh) Video processing method and apparatus
US11409791B2 Joint heterogeneous language-vision embeddings for video tagging and search
JP6236075B2 (ja) Interactive method, interactive apparatus, and server
US20200327327A1 Providing a response in a session
WO2023065617A1 (zh) Cross-modal retrieval system and method based on a pre-trained model and recall ranking
CN110740389B (zh) Video positioning method and apparatus, computer-readable medium, and electronic device
CN110083729B (zh) Image search method and system
Doughty et al. Action modifiers: Learning from adverbs in instructional videos
JP5894149B2 (ja) Semantic enrichment using top-k processing
CN112100440B (zh) Video pushing method, device, and medium
WO2024001057A1 (zh) Video retrieval method based on attention-segment prompting
CN113806588B (zh) Method and apparatus for searching videos
CN113190709B (zh) Background music recommendation method and apparatus based on short-video key frames
CN113704507B (zh) Data processing method, computer device, and readable storage medium
CN111949806A (zh) Cross-media retrieval method based on a Resnet-Bert network model
WO2024046189A1 (zh) Text generation method and apparatus
CN115114395A (zh) Content retrieval and model training method and apparatus, electronic device, and storage medium
CN116034401A (zh) System and method for retrieving videos using natural language descriptions
WO2022134699A1 (zh) Video processing method and apparatus
CN111988668B (zh) Video recommendation method and apparatus, computer device, and storage medium
CN114510564A (zh) Video knowledge graph generation method and apparatus
WO2022241987A1 (zh) Image retrieval method and apparatus
Cho et al. Recognizing human–human interaction activities using visual and textual information
Koorathota et al. Editing like humans: a contextual, multimodal framework for automated video editing
Tapu et al. TV news retrieval based on story segmentation and concept association

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908702

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021908702

Country of ref document: EP

Effective date: 20230330

NENP Non-entry into the national phase

Ref country code: DE