WO2023235576A1 - Extraction of textual content from video of a communication session - Google Patents
Extraction of textual content from video of a communication session
- Publication number
- WO2023235576A1 (PCT/US2023/024304, US2023024304W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- title
- frames
- distinguishing
- text
- Prior art date
Links
- 238000004891 communication Methods 0.000 title claims abstract description 153
- 238000000605 extraction Methods 0.000 title abstract description 25
- 238000000034 method Methods 0.000 claims abstract description 152
- 238000012015 optical character recognition Methods 0.000 claims abstract description 69
- 238000004458 analytical method Methods 0.000 claims description 51
- 238000012545 processing Methods 0.000 claims description 33
- 238000001514 detection method Methods 0.000 claims description 32
- 238000001914 filtration Methods 0.000 claims description 28
- 238000013528 artificial neural network Methods 0.000 claims description 11
- 238000013473 artificial intelligence Methods 0.000 claims description 8
- 239000000284 extract Substances 0.000 abstract description 21
- 238000010586 diagram Methods 0.000 description 18
- 230000015654 memory Effects 0.000 description 13
- 238000013527 convolutional neural network Methods 0.000 description 11
- 230000006870 function Effects 0.000 description 11
- 238000012805 post-processing Methods 0.000 description 8
- 230000008569 process Effects 0.000 description 7
- 238000012549 training Methods 0.000 description 6
- 238000010191 image analysis Methods 0.000 description 5
- 230000002093 peripheral effect Effects 0.000 description 5
- 230000000007 visual effect Effects 0.000 description 5
- 238000004590 computer program Methods 0.000 description 4
- 238000010801 machine learning Methods 0.000 description 4
- 238000013459 approach Methods 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 238000013135 deep learning Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000009471 action Effects 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 239000000470 constituent Substances 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 239000011295 pitch Substances 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/02—Details
- H04L12/16—Arrangements for providing special services to substations
- H04L12/18—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
- H04L12/1813—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
- H04L12/1831—Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8146—Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
- H04N21/8153—Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics comprising still images, e.g. texture, background image
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
- H04N7/152—Multipoint control units therefor
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/567—Multimedia conference systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
Definitions
- the present invention relates generally to digital communication, and more particularly, to systems and methods for providing extraction of textual content from video of a communication session.
- FIG. 1A is a diagram illustrating an exemplary environment in which some embodiments may operate.
- FIG. 1B is a diagram illustrating an exemplary computer system that may execute instructions to perform some of the methods herein.
- FIG. 2A is a flow chart illustrating an exemplary method of extraction of textual content that may be performed in some embodiments.
- FIG. 2B is a flow chart illustrating an exemplary method of video frame type classification that may be performed in some embodiments.
- FIG. 2C is a flow chart illustrating an exemplary method of title detection for presented slides that may be performed in some embodiments.
- FIG. 2D is a flow chart illustrating an exemplary method of resolution-based extraction of textual content that may be performed in some embodiments.
- FIG. 3A is a diagram illustrating one example embodiment of a distinguishing frame containing text.
- FIG. 3B is a diagram illustrating one example embodiment of an extracted title and extracted textual content from a distinguishing frame containing text.
- FIG. 3C is a diagram illustrating one example embodiment of a distinguishing frame containing text.
- FIG. 3D is a diagram illustrating one example embodiment of an extracted title and extracted textual content from a distinguishing frame containing text.
- FIG. 3E is a diagram illustrating one example embodiment of a distinguishing frame containing text.
- FIG. 4 is a diagram illustrating example embodiments of frames of video content with differing classifications.
- FIG. 5 is a diagram illustrating an exemplary computer that may perform processing in some embodiments.
- a computer system may include a processor, a memory, and a non-transitory computer-readable medium.
- the memory and non-transitory medium may store instructions for performing methods and steps described herein.
- Digital communication tools and platforms have been essential in providing the ability for people and organizations to communicate and collaborate remotely, e.g., over the internet.
- Among such tools are video communication platforms allowing for remote video sessions between multiple participants.
- Such techniques are educational and useful, and can lead to drastically improved sales performance for a sales team.
- recordings of meetings simply include the content of the meeting, and the communications platforms which host the meetings do not provide the sorts of post-meeting, or potentially in-meeting, intelligence and analytics that such a sales team would find highly relevant and useful to their needs.
- the system receives video content of a communication session which includes a number of participants.
- the system then extracts frames from the video content, and classifies the frames of the video content containing text.
- the system identifies one or more distinguishing frames containing text. For each distinguishing frame containing text, the system detects a title within the frame, crops a title area with the title within the frame, and extracts, via optical character recognition (“OCR”), the title from the cropped title area of the frame.
- the system extracts, via OCR, textual content from the distinguishing frames containing text, and then transmits the extracted textual content and extracted titles to one or more client devices.
- the system receives video content of a communication session with a number of participants; extracts frames from the video content; classifies the frames of the video content based on image analysis; and transmits, to one or more client devices, the classification of the frames of the video content.
- the system receives video content of a communication session with a number of participants; extracts frames from the video content; classifies the frames of the video content; identifies one or more distinguishing frames containing a presentation slide; for each distinguishing frame containing a presentation slide, detects a title within the frame; and transmits, to one or more client devices, the titles for each of the distinguishing frames comprising a presentation slide.
- the system receives video content of a communication session which includes a number of participants.
- the system then extracts high-resolution versions and low-resolution versions of frames from the video content, and classifies the low-resolution frames of the video content based on identifying text within the low-resolution frames.
- the system identifies one or more low-resolution distinguishing frames containing text. For each low-resolution distinguishing frame containing text, the system detects a title within the frame, crops a title area with the title within the frame, and extracts, via optical character recognition (“OCR”), the title from the cropped title area of the high-resolution version of the frame.
- the system extracts, via OCR, textual content from the high-resolution versions of the low-resolution distinguishing frames containing text, and then transmits the extracted textual content and extracted titles to one or more client devices.
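The two-resolution pipeline above can be sketched as follows. This is a minimal illustration, not the patented implementation: frames, the classifier, and "OCR" are all stand-ins, with each frame represented as a dict carrying a low-resolution version (analyzed) and a high-resolution version (read), and text carried directly rather than recognized from pixels.

```python
# Illustrative stand-in for the two-resolution pipeline: classify and
# deduplicate on low-res frames, then "read" text from high-res frames.

def classify(low):
    # Stand-in classifier: any frame with text is treated as a slide frame.
    return "slide" if low["text"] else "black"

def extract_text(frames):
    results, prev = [], None
    for f in frames:
        low, high = f["low"], f["high"]
        if classify(low) != "slide":   # analysis runs on low-res versions
            continue
        if low == prev:                # skip non-distinguishing frames
            continue
        prev = low
        results.append({
            "title": high["text"][0],  # "OCR" of the title from the high-res frame
            "text": high["text"],      # "OCR" of the full high-res frame
        })
    return results
```

The key design point is that cheap analysis (classification, change detection) runs on the small frames, while the expensive OCR step touches only the high-resolution versions of the few frames that survive filtering.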
- FIG. 1A is a diagram illustrating an exemplary environment in which some embodiments may operate.
- a client device 150 is connected to a processing engine 102 and, optionally, a communication platform 140.
- the processing engine 102 is connected to the communication platform 140, and optionally connected to one or more repositories and/or databases, including, e.g., a video content repository 130, titles repository 132, and/or textual content repository 134.
- One or more of the databases may be combined or split into multiple databases.
- the user’s client device 150 in this environment may be a computer, and the communication platform 140 and processing engine 102 may be applications or software hosted on one or more computers which are communicatively coupled, whether via a remote server or locally.
- the exemplary environment 100 is illustrated with only one client device, one processing engine, and one communication platform, though in practice there may be more or fewer additional client devices, processing engines, and/or communication platforms.
- the client device(s), processing engine, and/or communication platform may be part of the same computer or device.
- the processing engine 102 may perform the exemplary method of FIG. 2 or other method herein and, as a result, provide extraction of textual content from video for a communication session. In some embodiments, this may be accomplished via communication with the client device, processing engine, communication platform, and/or other device(s) over a network between the device(s) and an application server or some other network server.
- the processing engine 102 is an application, browser extension, or other piece of software hosted on a computer or similar device, or is itself a computer or similar device configured to host an application, browser extension, or other piece of software to perform some of the methods and embodiments herein.
- the client device 150 is a device with a display configured to present information to a user of the device who is a participant of the video communication session. In some embodiments, the client device presents information in the form of a visual UI with multiple selectable UI elements or components. In some embodiments, the client device 150 is configured to send and receive signals and/or information to the processing engine 102 and/or communication platform 140. In some embodiments, the client device is a computing device capable of hosting and executing one or more applications or other programs capable of sending and/or receiving information. In some embodiments, the client device may be a computer desktop or laptop, mobile phone, virtual assistant, virtual reality or augmented reality device, wearable, or any other suitable device capable of sending and receiving information.
- the processing engine 102 and/or communication platform 140 may be hosted in whole or in part as an application or web service executed on the client device 150.
- one or more of the communication platform 140, processing engine 102, and client device 150 may be the same device.
- the user’s client device 150 is associated with a first user account within a communication platform, and one or more additional client device(s) may be associated with additional user account(s) within the communication platform.
- optional repositories can include a video content repository 130, title repository 132, and/or textual content repository 134.
- the optional repositories function to store and/or maintain, respectively, video content for the communication session; extracted titles from frames of the video content; and extracted textual content from frames of the video content.
- the optional database(s) may also store and/or maintain any other suitable information for the processing engine 102 or communication platform 140 to perform elements of the methods and systems herein.
- the optional database(s) can be queried by one or more components of system 100 (e.g., by the processing engine 102), and specific stored data in the database(s) can be retrieved.
- Communication platform 140 is a platform configured to facilitate meetings, presentations (e.g., video presentations) and/or any other communication between two or more parties, such as within, e.g., a video conference or virtual classroom.
- a video communication session within the communication platform 140 may be, e.g., one-to-many (e.g., a participant engaging in video communication with multiple attendees), one-to-one (e.g., two friends remotely communicating with one another by video), or many-to-many (e.g., multiple participants video conferencing with each other in a remote group setting).
- FIG. 1B is a diagram illustrating an exemplary computer system 150 with software modules that may execute some of the functionality described herein.
- the modules illustrated are components of the processing engine 102.
- Receiving module 152 functions to receive video content of a communication session which includes a number of participants.
- Frames module 154 functions to extract frames from the video content.
- Classifying module 156 functions to classify frames of the video content.
- Distinguishing module 158 functions to identify one or more distinguishing frames containing text. For each distinguishing frame containing text, the system detects a title within the frame and crops a title area with the title within the frame.
- Extracting module 160 functions to extract, via OCR, the title from the cropped title area of the frame.
- the system also functions to extract, via OCR, textual content from the distinguishing frames containing text.
- Transmitting module 162 functions to transmit the extracted textual content and extracted titles to one or more client devices.
- FIG. 2A is a flow chart illustrating an exemplary method that may be performed in some embodiments.
- the system receives video content of a communication session which includes a number of participants.
- a communication session may be, e.g., a remote video session, audio session, chat session, or any other suitable communication session between participants.
- the communication session can be hosted or maintained on a communication platform, which the system maintains a connection to in order to connect to the communication session.
- the system displays a user interface (“UI”) for each of the participants in the communication session.
- the UI can include one or more participant windows or participant elements corresponding to video feeds, audio feeds, chat messages, or other aspects of communication from participants to other participants within the communication session.
- the video content the system receives is any recorded video content that captures the communication session.
- the video content can include any content that is shown within the communication session, including, e.g., video feeds showing participants, presentation slides which are presented during the session, screens, desktops, or windows which are shared, annotations, or any other suitable content which can be shared during a communication session.
- the video content is composed of a multitude of frames.
- the system receives the video content from a client device which was used by a participant to connect to the communication session.
- the video content is generated by a client device, or the system itself, during and/or after the communication session.
- video content of a session may be recorded upon a permitted participant, such as a host of the session, selecting one or more “record” options from their user interface.
- the video content may be recorded automatically based on a user’s preferences.
- the system extracts frames from the video content.
- extracting frames from the video content includes extracting high-resolution versions of the frames and low-resolution versions of the frames.
- the different versions of the frames may be used for achieving speed and efficiency in the text extraction process. For example, low-resolution frames may be processed and analyzed, and high-resolution frames may be used for extraction of text.
- extracting frames involves the system generating a thumbnail for each frame, with the thumbnail being used as the frame for the purposes of this method.
- an asynchronous thumbnail extraction service may be queried, and may function to generate individual thumbnail frames, then downsize them (for example, the service may downsize the frame by 10 times).
- the thumbnail extraction service may further aggregate the individual thumbnail frames into tiles (for example, to a grid of 5x5 tiles).
- the resulting thumbnails may then be uploaded to an image server, where they can then be retrieved for further processing.
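The downsizing and tiling steps above can be sketched in a few lines. This is a simplified illustration, not the service itself: frames are plain 2D lists of pixel values, downsizing is naive pixel sampling rather than proper resampling, and the 10x factor and 5x5 grid mirror the examples given in the text.

```python
# Sketch of the thumbnail service steps: downsize each frame (here by
# keeping every 10th pixel) and group thumbnails into 5x5 tiles.

def downsize(frame, factor=10):
    """frame: 2D list of pixel values; keep every `factor`-th row and column."""
    return [row[::factor] for row in frame[::factor]]

def tile(thumbnails, grid=5):
    """Aggregate thumbnails into tiles of grid*grid thumbnails each."""
    per_tile = grid * grid
    return [thumbnails[i:i + per_tile] for i in range(0, len(thumbnails), per_tile)]
```

A real service would use area-averaging or similar resampling to avoid aliasing, but the bookkeeping (one thumbnail per frame, fixed-size tiles uploaded for later retrieval) is as shown.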
- the system classifies frames of the video content.
- a frame classifier may be used. By classifying video frames into a number of categories, e.g., 4 categories, consecutively-same frames of video can be grouped within a single segment.
- the categories may include black frames (i.e., empty or devoid of content), face frames (i.e., frames where faces of participants are shown via their respective video feeds), slide frames (i.e., frames in which presentation slides are being presented), and demo frames (i.e., frames where a demonstration of a product, technique, or similar is being presented). Face frames may be used to analyze the sentiment and/or engagement of participants.
- Slide and demo frames may be used to analyze, for example, the duration of product demonstrations in a sales meeting.
- Slide and demo frames which contain text may also be used for various natural language parsing projects after OCR is performed, among other things. Examples of such frame classifications are described below with respect to FIG. 4.
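The grouping of consecutively-same frames into segments mentioned above can be sketched directly. This is an illustrative helper, not part of the patent text: given one classification label per frame, it emits one segment per run of identical labels.

```python
from itertools import groupby

# Group consecutive frames with the same classification into segments,
# returning (label, first_frame_index, last_frame_index) per segment.

def segments(labels):
    out, i = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        out.append((label, i, i + n - 1))
        i += n
    return out
```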
- a neural network may be used to classify the frames of the video content.
- in particular, a convolutional neural network (“CNN”) may be used.
- Such convolutional neural networks have the advantage of being relatively high-accuracy and lightweight.
- different sizes of convolutional kernels may be used, e.g., 1x1, 3x3, or 5x5. Different receptive-field levels may be obtained using these differing sizes.
- 1x1 convolutions, 3x3 convolutions, 5x5 convolutions, and so on may be performed, thus increasing the depth and width of the network.
- a number of different classification types may be accommodated, such as, e.g., 4 classification types (for example, the 4 types of frames described above). The values can then be calculated for incoming frames.
- such a convolutional model can be trained prior to performing the classification.
- a dataset may be used, such as a dataset containing, for example, 100,000 frames which are each labeled manually for one or more parameters.
- the frames can be labeled by one of 4 frame types.
- the model can then be trained on this dataset to be able to predict classification types, given incoming frames of video content.
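The multi-kernel idea above can be illustrated with a bare-bones sketch. This is not the patented model: a real classifier would be built and trained in a deep learning framework, whereas this pure-Python version only shows the same input being convolved with 1x1, 3x3, and 5x5 kernels in parallel, with the resulting feature maps kept side by side. The specific kernels (identity and box averages) are arbitrary stand-ins for learned weights.

```python
# Minimal illustration of parallel multi-scale convolutions (1x1, 3x3, 5x5)
# over a single-channel image, with zero ("same") padding.

def conv2d_same(img, k):
    """Cross-correlate a 2D list `img` with an odd-sized square kernel `k`."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        s += img[yy][xx] * k[dy + r][dx + r]
            out[y][x] = s
    return out

def multi_scale_block(img):
    k1 = [[1.0]]                              # 1x1: pointwise (identity here)
    k3 = [[1 / 9.0] * 3 for _ in range(3)]    # 3x3: local average
    k5 = [[1 / 25.0] * 5 for _ in range(5)]   # 5x5: wider average
    return [conv2d_same(img, k) for k in (k1, k3, k5)]  # 3 feature maps
```

Stacking the three maps as channels is what widens the network at a given depth, letting later layers mix evidence gathered at several spatial scales.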
- the system identifies one or more distinguishing frames containing text.
- the identification process involves finding a distinguishing frame, or key frame, which indicates new or changed content in comparison to its previous neighboring frame.
- the distinguishing frame can be a new presentation slide, or a same presentation slide with new content. Since demos and slide presentations within communication sessions are mostly static in nature, any major change between two neighboring frames can be assumed to indicate a new slide for which text must be extracted. By only extracting text relating to distinguishing frames rather than all frames, computational speed and efficiency can drastically increase.
- multiple steps may be involved in identifying distinguishing frames.
- frames extracted from previous steps are used as input.
- one or more elements, such as a thumbnail of a participant’s video feed in the upper right corner, can be removed by padding the area with, for example, a black rectangle.
- the system can then invert the frame’s colors.
- the text within the frame may turn from black to white, for example.
- the system can calculate the difference in the image between two neighboring frames. After this subtraction is performed, the resulting background (i.e., the shared aspects between two neighboring frames) will be black. If two neighboring frames are the same or very similar, the resulting difference in the image will be black or primarily black.
- the system finds the distinguishing frames, defined as the frames with new content.
- the system may automatically identify such distinguishing frames by running a value (e.g., 0-255) summation along the x- and y-axes, obtaining values which can be compared against a predefined threshold.
- the system initially filters out frames which are classified as a black frame or a face frame during this process. This filtering is performed because such frame types do not typically contain meaningful or relevant text for purposes of textual extraction.
- the system filters out frames which do not contain text, or filters out frames which cannot be distinguished from neighboring frames based on the content of the frame, which can be determined in various ways, including, e.g., machine vision or assumptions about frame classification types.
- the system detects a title within the frame. In other words, the system detects that a title is present within the frame. At this step, the system does not yet extract the title from the frame, but rather verifies that there is title text present. Thus, the system must recognize which text is the title within a frame containing text.
- Title detection is an object detection problem, which involves the task of detecting instances of objects of a certain class within a given image.
- one-stage methods may be used which prioritize inference speed, such as, e.g., a You Only Look Once (“YOLO”) model.
- two-stage methods may be used which prioritize detection accuracy, such as, e.g., Faster R-CNN.
- a YOLO model approach to title detection is described further below.
- detecting the title within the frame includes a first step of dividing the frame into one or more grids of residual blocks. Residual blocks can be used to create grids in the particular image, such as, for example, 7x7 grids of residual blocks. Each of these grids acts as a central point, and a prediction is performed for each grid accordingly.
- detecting the title within the frame includes a second step of generating one or more segregated bounding boxes within the grids of residual blocks. Each of the central points for a particular prediction is considered for the creation of the bounding boxes. While the classification tasks work well for each grid, the bounding boxes must be segregated for each of the predictions that are made.
- detecting the title within the frame includes a third step of determining, via intersection over union (IOU) techniques, a top bounding box with the highest prediction confidence for the title from the segregated bounding boxes.
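The IOU-based selection of the top bounding box can be illustrated with a small helper. The (x1, y1, x2, y2) box format and the 0.5 overlap threshold are assumptions, not values specified in this document:

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def top_title_box(candidates, overlap=0.5):
    """Keep the highest-confidence box, suppressing overlapping
    lower-confidence boxes (simplified non-maximum suppression)."""
    kept = []
    for box, conf in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(iou(box, k) < overlap for k, _ in kept):
            kept.append((box, conf))
    return kept[0][0]  # top bounding box for the title
```

For example, among three candidate boxes with confidences 0.9, 0.5, and 0.8, the 0.9 box is kept and any box overlapping it is suppressed.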
- this model is trained on a dataset of frames, containing a manually labeled bounding box for each frame.
- the title detection is based on one or more title detection rules.
- one or more candidate titles are to be determined prior to determining the title, and one of the title detection rules includes determining that the number of candidate titles determined for the frame does not exceed a threshold number of candidate titles.
- one of the title detection rules includes determining that the font size for the title meets or exceeds a threshold ratio of font size relative to other text within the frame.
- one of the title detection rules includes determining that the position of the title within the frame matches with one or more prespecified title positions. Such prespecified title positions may include, for example, one or more of: center, left, and top title positions corresponding to areas of the frame.
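The title detection rules above can be sketched as a simple predicate. The specific rule values (candidate limit, font-size ratio, allowed positions) are assumptions, as the document leaves them unspecified:

```python
# Hedged sketch of the rule-based title checks; all threshold values
# below are illustrative assumptions.
MAX_CANDIDATES = 3
MIN_FONT_RATIO = 1.2
ALLOWED_POSITIONS = {"center", "left", "top"}

def passes_title_rules(candidates: list, body_font_size: float) -> bool:
    # Rule: the number of candidate titles must not exceed a threshold.
    if len(candidates) > MAX_CANDIDATES:
        return False
    for cand in candidates:
        # Rule: title font size meets or exceeds a ratio of other text.
        if cand["font_size"] / body_font_size < MIN_FONT_RATIO:
            return False
        # Rule: title position matches a prespecified title position.
        if cand["position"] not in ALLOWED_POSITIONS:
            return False
    return True
```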
- detecting the title within the frame comprises use of one or more artificial intelligence (“AI”) models, such as, for example, machine learning models, neural networks, machine vision models, or any other suitable AI models.
- the system determines a layout analysis of each distinguishing frame comprising text.
- determining the layout analysis involves classifying a plurality of areas of the frame into one or more of: text, title, table, image, and list areas.
- determining the layout analysis involves one or more deep neural network techniques.
- determining the layout analysis involves one or more image processing techniques.
- Layout analysis involves building a model or knowledge representation that holds some data involving placement of constituent elements within a frame.
- a pipeline for layout analysis may be achieved using a combination of deep neural networks and image processing techniques. In some embodiments, such an approach is capable of handling different types of frames, and of potentially handling complex backgrounds.
- one or more steps may involve deep learning based image, text, and icon detection; comparison of vision-based and deep learning-based approaches for text extraction; and grid identification as well as block rectification.
- the system crops a title area with the title within the frame.
- any suitable method of cropping an image to a specified area may be used.
- the system crops the title area to the detected bounding box from the previous step. In some embodiments, such a cropping may not be strictly or precisely limited to just the title, but may include other elements of the frame.
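Cropping the title area to the detected bounding box can be sketched with simple array slicing. The (x1, y1, x2, y2) box format is an assumption:

```python
import numpy as np

# Crop the title area to the detected bounding box (sketch).
def crop_title_area(frame: np.ndarray, box: tuple) -> np.ndarray:
    x1, y1, x2, y2 = box
    # The crop may include neighboring elements if the box is not tight,
    # consistent with the note that cropping is not strictly limited
    # to just the title.
    return frame[y1:y2, x1:x2]

frame = np.zeros((100, 200), dtype=np.uint8)  # a 100x200 grayscale frame
title_crop = crop_title_area(frame, (10, 5, 150, 30))
```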
- the system extracts, via OCR, the title from the cropped title area of the frame.
- OCR is a technology that is designed to recognize text within an image.
- an OCR model is used to extract the title text from the cropped title area.
- OCR-based text extraction may involve such techniques as, e.g., feature extraction, matrix matching, layout analysis, iterative OCR, lexicon-based OCR, near-neighbor analysis, binarization, character segmentation, normalization, or any other suitable techniques related to OCR.
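Binarization, one of the techniques listed above, can be sketched as a simple global threshold. The threshold value is an assumption; adaptive or Otsu thresholding are common refinements:

```python
import numpy as np

# Global-threshold binarization, a common OCR pre-processing step (sketch).
def binarize(gray: np.ndarray, threshold: int = 128) -> np.ndarray:
    # Pixels above the threshold become white (255), the rest black (0).
    return (gray > threshold).astype(np.uint8) * 255
```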
- the system extracts, via OCR, textual content from the distinguishing frames containing text. Once the titles have been extracted from particular distinguishing frames containing text, then the system can proceed to capture the textual content in full from such frames.
- OCR-based text extraction techniques may apply, depending on various embodiments.
- the system transmits the extracted titles and extracted textual content to one or more client devices.
- the system formats the extracted titles and textual content prior to transmitting them to the client devices.
- they are formatted into a structured data markup format, such as, e.g., JSON format.
- they are structured to be presented towards various usages, such as, for example, search results formatting, analytics data formatting, and more.
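Formatting the extracted results into structured JSON might be sketched as follows. The field names are assumptions, not a format specified by this document:

```python
import json

# Format extracted results as structured JSON prior to transmission
# (sketch; field names are illustrative assumptions).
def format_results(frames: list) -> str:
    payload = [
        {
            "timestamp": f["timestamp"],
            "title": f["title"],
            "textual_content": f["text"],
        }
        for f in frames
    ]
    return json.dumps(payload, indent=2)
```

The same structure could then be reshaped for search results formatting or analytics data formatting on the client side.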
- FIG. 2B is a flow chart illustrating an exemplary method of video frame type classification that may be performed in some embodiments.
- the system receives video content of a communication session comprising a plurality of participants, as described above with respect to FIG. 2A.
- at step 232, the system extracts frames from the video content, as described above with respect to FIG. 2A.
- the system classifies the frames of the video content based on image analysis, as described above with respect to FIG. 2A.
- the frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- classifying the frames of the video content can be performed using a CNN.
- the system further includes post-processing of the frames based on the classification of the frames. This may include, in some embodiments, determining a time between two neighboring frames that does not meet a length threshold; and removing noise between the two neighboring frames. This may be known as a “smoothing” post-processing step.
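The smoothing step can be sketched as relabeling any short run of frames whose classification differs from its neighbors. The run-length threshold is an assumption:

```python
# "Smoothing" post-processing sketch: a short run of frames whose label
# differs from its neighbors is treated as noise and relabeled.
def smooth_labels(labels: list, min_run: int = 3) -> list:
    smoothed = list(labels)
    i = 0
    while i < len(smoothed):
        j = i
        while j < len(smoothed) and smoothed[j] == smoothed[i]:
            j += 1
        # A short run sandwiched between neighbors is removed as noise.
        if (j - i) < min_run and i > 0 and j < len(smoothed):
            for k in range(i, j):
                smoothed[k] = smoothed[i - 1]
        i = j
    return smoothed
```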
- the system may determine one or more differences in the classification of the neighboring frames. For example, the frames of the video content may be showing face-to-face chatting between participants for some time, then switch to a slide presentation, then switch to a demonstration of a product.
- the system may segment the communication session into separate topic segments based on the determined differences in the classification in neighboring frames. Thus, one segment may be generated for the face-to-face chatting, another for the slide presentation, another for the demonstration of the product, and so on. In some embodiments, the system may present a visual indication of these differences, such as a chart or graph of the differences from beginning of the video content to end of the video content.
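Segmenting the session at classification changes can be sketched as follows. The (start, end, label) tuple representation is an assumption:

```python
# Segment a session wherever the frame classification changes between
# neighboring frames (sketch); indices are frame positions.
def segment_by_classification(labels: list) -> list:
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i - 1, labels[start]))
            start = i
    return segments
```

A run of face frames followed by slide frames and then demo frames would yield three topic segments, one per run.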
- the system transmits, to one or more client devices, the classification of the frames of the video content, as described above with respect to FIG. 2A.
- FIG. 2C is a flow chart illustrating an exemplary method of title detection for presented slides that may be performed in some embodiments.
- the system receives video content of a communication session containing a number of participants, as described above with respect to FIG. 2A.
- the system extracts frames from the video content, as described above with respect to FIG. 2A.
- at step 244, the system classifies the frames of the video content, as described above with respect to FIG. 2A.
- the system identifies one or more distinguishing frames containing a presentation slide, as described above with respect to FIG. 2A.
- at step 248, for each distinguishing frame comprising a presentation slide, the system detects a title within the frame, as described above with respect to FIG. 2A.
- the system transmits, to one or more client devices, the titles for each of the distinguishing frames containing a presentation slide.
- information about the bounding boxes of the titles is also transmitted.
- the information about the bounding boxes may be transmitted in the form of, e.g., the location of the bounding box within the frame, pixel locations or coordinates, dots-per-inch (DPI) dimensions, or a ratio pertaining to a relative location of the bounding box, such as, for example, two-thirds of the way in and one-third of the way down from the top left corner of the frame.
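Expressing a bounding box as a relative location within the frame could be sketched as below. The field names are assumptions:

```python
# Express a bounding box as a relative location within the frame
# (e.g., two-thirds of the way in, one-third of the way down); sketch.
def box_as_ratio(box: tuple, frame_width: int, frame_height: int) -> dict:
    x1, y1, x2, y2 = box
    return {
        "left": x1 / frame_width,
        "top": y1 / frame_height,
        "width": (x2 - x1) / frame_width,
        "height": (y2 - y1) / frame_height,
    }
```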
- FIG. 2D is a flow chart illustrating an exemplary method of resolution-based extraction of textual content that may be performed in some embodiments.
- the system receives video content of a communication session with a number of participants, as described above with respect to FIG. 2A.
- the system extracts high-resolution versions and low-resolution versions of frames from the video content, as described above with respect to FIG. 2A.
- the high-resolution versions and low-resolution versions may capture different resolutions of the same content. Computational speeds for extracting text differ when processing frames at different resolutions, and with low-resolution frames rather than high-resolution frames the difference can often be significant. However, while the task of deciding which frames should have OCR extraction conducted on them is suitable to be performed on low-resolution frames, OCR extraction itself is not suitable for low-resolution frames, as a significant amount of visual information will be lost, lowering the accuracy of OCR extraction too much to be useful. In some embodiments, with a hybrid high-resolution / low-resolution method, both resolutions can be leveraged in different ways to provide speed as well as accuracy, as described further below.
- the system classifies the low-resolution frames of the video content, as described above with respect to FIG. 2A.
- the classification step can be performed using low-resolution frames, resulting in an increase in speed but not a decrease in accuracy.
- the system identifies one or more low-resolution distinguishing frames containing text, as described above with respect to FIG. 2A.
- the identification of distinguishing frames can be performed using low-resolution frames, without a loss of accuracy.
- at step 268, for each low-resolution distinguishing frame containing text, the system detects a title within the frame, as described above with respect to FIG. 2A.
- at step 270, the system crops a title area with the title within the frame, as described above with respect to FIG. 2A.
- the system extracts, via optical character recognition (OCR), the title from the cropped title area of the high-resolution version of the frame, as described above with respect to FIG. 2A.
- a timestamp for the low-resolution distinguishing frame is used to locate the corresponding high-resolution distinguishing frame at the same timestamp.
- the OCR title extraction step is then performed on the high-resolution frame with identical content.
- the system extracts, via OCR, textual content from the high-resolution versions of the low-resolution distinguishing frames comprising text, as described above with respect to FIG. 2A.
- a timestamp for the low- resolution distinguishing frame is used to locate the corresponding high-resolution distinguishing frame at the same timestamp.
- the OCR textual content extraction step is then performed on the high-resolution frame with identical content.
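The hybrid-resolution pipeline can be sketched end to end: detection runs on low-resolution frames, and each distinguishing frame's timestamp locates the high-resolution counterpart on which OCR is performed. The frame dictionaries and the callable parameters are assumptions for illustration:

```python
# Hybrid-resolution sketch: decide on low-res frames, OCR on high-res.
def extract_text_hybrid(low_frames, high_frames_by_ts,
                        is_distinguishing, ocr):
    results = []
    prev = None
    for frame in low_frames:
        if prev is None or is_distinguishing(prev["image"], frame["image"]):
            # OCR is performed on the high-resolution frame that has
            # identical content at the same timestamp.
            hi = high_frames_by_ts[frame["timestamp"]]
            results.append((frame["timestamp"], ocr(hi["image"])))
        prev = frame
    return results
```

With a stand-in differencing function and a stand-in OCR callable, only the frames whose low-resolution content changed are OCR'd, each from its high-resolution twin.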
- FIG. 3A is a diagram illustrating one example embodiment of a distinguishing frame containing text.
- textual content may be extracted from one or more frames of the video content.
- a presentation slide is shown.
- a title area is present, with a title that reads, “Welcome to All Hands”.
- a date and company name are further provided below the title.
- FIG. 3B is a diagram illustrating one example embodiment of an extracted title and extracted textual content from a distinguishing frame containing text.
- the presentation slide illustrated in FIG. 3A has had its text extracted using one or more processes described above with respect to FIG. 2A.
- the extracted text includes an extracted title, a timestamp associated with the frame of the video content, and three separate pieces of textual content that have been extracted.
- FIG. 3C is a diagram illustrating one example embodiment of a distinguishing frame containing text.
- a title is identified as “Requests for:” and a bounding box is generated around the title.
- although a date is displayed in the top left corner, it is not recognized as a title.
- the date and/or the thumbnail of the video feed in the top right corner are replaced with padding in the form of a black rectangle.
- FIG. 3D is a diagram illustrating one example embodiment of an extracted title and extracted textual content from a distinguishing frame containing text.
- the frame illustrated in FIG. 3C has its title and textual content extracted, which is presented along with a timestamp for the frame.
- FIG. 3E is a diagram illustrating one example embodiment of a distinguishing frame containing text.
- the illustrated example shows a frame in which a title has been detected. Although there are potentially some complex aspects within the frame, the process still successfully detects a title, generates a bounding box around the title, and extracts the title via OCR.
- FIG. 4 is a diagram illustrating example embodiments of frames of video content with differing classifications.
- Frame 410 is a demo frame, wherein one person is sharing their screen or desktop. In this frame, the person may be demonstrating or introducing a product or technique, for example.
- Frame 412 is a slide frame, where one person is sharing slides, using software such as, e.g., Microsoft PowerPoint.
- Frame 414 is a face frame, wherein no participant is sharing a screen or slides, but rather two participants are shown on camera within their respective video feeds.
- Frame 416 is a black frame, wherein a person is present in the meeting, but has not enabled their video feed and is not sharing a screen or any slides.
- FIG. 5 is a diagram illustrating an exemplary computer that may perform processing in some embodiments.
- Exemplary computer 500 may perform operations consistent with some embodiments.
- the architecture of computer 500 is exemplary.
- Computers can be implemented in a variety of other ways. A wide variety of computers can be used in accordance with the embodiments herein.
- Processor 501 may perform computing functions such as running computer programs.
- the volatile memory 502 may provide temporary storage of data for the processor 501.
- RAM is one kind of volatile memory.
- Volatile memory typically requires power to maintain its stored information.
- Storage 503 provides computer storage for data, instructions, and/or arbitrary information. Non-volatile memory, which preserves data even when not powered and includes disks and flash memory, is an example of storage.
- Storage 503 may be organized as a file system, database, or in other ways. Data, instructions, and information may be loaded from storage 503 into volatile memory 502 for processing by the processor 501.
- the computer 500 may include peripherals 505.
- Peripherals 505 may include input peripherals such as a keyboard, mouse, trackball, video camera, microphone, and other input devices.
- Peripherals 505 may also include output devices such as a display.
- Peripherals 505 may include removable media devices such as CD-R and DVD-R recorders / players.
- Communications device 506 may connect the computer 500 to an external medium.
- communications device 506 may take the form of a network adapter that provides communications to a network.
- a computer 500 may also include a variety of other devices 504.
- the various components of the computer 500 may be connected by a connection medium such as a bus, crossbar, or network.
- Example 1 A method, comprising: receiving video content of a communication session comprising a plurality of participants; extracting frames from the video content; classifying the frames of the video content; identifying one or more distinguishing frames comprising text; for each distinguishing frame comprising text: detecting a title within the frame, cropping a title area with the title within the frame, and extracting, via optical character recognition (OCR), the title from the cropped title area of the frame; extracting, via OCR, textual content from the distinguishing frames comprising text; and transmitting, to one or more client devices, the extracted textual content and the extracted titles.
- Example 2 The method of example 1, wherein the frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- Example 3 The method of example 2, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which are classified as a black frame or a face frame.
- Example 4 The method of any of examples 1-3, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which do not contain text.
- Example 5 The method of any of examples 1-4, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which cannot be distinguished from neighboring frames based on the content of the frame.
- Example 6 The method of any of examples 1-5, wherein detecting the title within the frame comprises use of a You Only Look Once (YOLO) model.
- Example 7 The method of any of examples 1-6, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 8 The method of example 7, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 9 The method of example 8, wherein detecting the title within the frame further comprises: determining, via intersection of union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
- Example 10 The method of any of examples 1-9, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 11 The method of any of examples 1-10, wherein detecting the title within the frame comprises use of one or more artificial intelligence (AI) models.
- Example 12 The method of any of examples 1-11, further comprising: determining a layout analysis of each distinguishing frame comprising text.
- Example 13 The method of example 12, further comprising: classifying a plurality of areas of the frame into one or more of: text, title, table, image, and list areas.
- Example 14 The method of example 12, wherein determining a layout analysis of each distinguishing frame comprising text comprises one or more deep neural network techniques.
- Example 15 The method of example 12, wherein determining a layout analysis of each distinguishing frame comprising text comprises one or more image processing techniques.
- Example 16 A communication system comprising one or more processors configured to perform the operations of receiving video content of a communication session comprising a plurality of participants; extracting frames from the video content; classifying the frames of the video content; identifying one or more distinguishing frames comprising text; for each distinguishing frame comprising text: detecting a title within the frame, cropping a title area with the title within the frame, and extracting, via optical character recognition (OCR), the title from the cropped title area of the frame; extracting, via OCR, textual content from the distinguishing frames comprising text; and transmitting, to one or more client devices, the extracted textual content and the extracted titles.
- Example 17 The communication system of example 16, wherein the frames of the video content may be classified as one or more of a black frame, a face frame, a slide frame, and a demo frame.
- Example 18 The communication system of example 17, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which are classified as a black frame or a face frame.
- Example 19 The communication system of any of examples 16-18, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which do not contain text.
- Example 20 The communication system of any of examples 16-19, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which cannot be distinguished from neighboring frames based on the content of the frame.
- Example 21 The communication system of any of examples 16-20, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 22 The communication system of example 21, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 23 The communication system of example 22, wherein detecting the title within the frame further comprises: determining, via intersection of union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
- Example 24 The communication system of any of examples 16-23, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 25 The communication system of any of examples 16-24, wherein detecting the title within the frame comprises use of one or more artificial intelligence (AI) models.
- Example 26 The communication system of example 16, wherein the one or more processors are further configured to perform the operation of: determining a layout analysis of each distinguishing frame comprising text.
- Example 27 The communication system of example 26, further comprising: classifying a plurality of areas of the frame into one or more of: text, title, table, image, and list areas.
- Example 28 The communication system of example 26, wherein determining a layout analysis of each distinguishing frame comprising text comprises one or more deep neural network techniques.
- Example 29 The communication system of example 26, wherein determining a layout analysis of each distinguishing frame comprising text comprises one or more image processing techniques.
- Example 30 The communication system of example 16, wherein detecting the title within the frame comprises use of a You Only Look Once (YOLO) model.
- Example 31 A non-transitory computer-readable medium containing instructions comprising: instructions for receiving video content of a communication session comprising a plurality of participants; instructions for extracting frames from the video content; instructions for classifying the frames of the video content; instructions for identifying one or more distinguishing frames comprising text; for each distinguishing frame comprising text: instructions for detecting a title within the frame, instructions for cropping a title area with the title within the frame, and instructions for extracting, via optical character recognition (OCR), the title from the cropped title area of the frame; instructions for extracting, via OCR, textual content from the distinguishing frames comprising text; and instructions for transmitting, to one or more client devices, the extracted textual content and the extracted titles.
- Example 32 The non-transitory computer-readable medium of example 31, wherein the frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- Example 33 The non-transitory computer-readable medium of example 32, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which are classified as a black frame or a face frame.
- Example 34 The non-transitory computer-readable medium of any of examples 31-33, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which do not contain text.
- Example 35 The non-transitory computer-readable medium of any of examples 31-34, wherein identifying one or more distinguishing frames comprising text comprises: filtering out frames which cannot be distinguished from neighboring frames based on the content of the frame.
- Example 36 The non-transitory computer-readable medium of any of examples 31-35, wherein detecting the title within the frame comprises use of a You Only Look Once (YOLO) model.
- Example 37 The non-transitory computer-readable medium of any of examples 31-36, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 38 The non-transitory computer-readable medium of example 37, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 39 The non-transitory computer-readable medium of example 38, wherein detecting the title within the frame further comprises: determining, via intersection of union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
- Example 40 The non-transitory computer-readable medium of any of examples 31-39, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 41 The non-transitory computer-readable medium of any of examples 31-40, wherein detecting the title within the frame comprises use of one or more artificial intelligence (AI) models.
- Example 42 The non-transitory computer-readable medium of any of examples 31-41, further comprising: determining a layout analysis of each distinguishing frame comprising text.
- Example 43 The non-transitory computer-readable medium of example 42, further comprising: classifying a plurality of areas of the frame into one or more of: text, title, table, image, and list areas.
- Example 44 The non-transitory computer-readable medium of example 42, wherein determining a layout analysis of each distinguishing frame comprising text comprises one or more deep neural network techniques.
- Example 45 The non-transitory computer-readable medium of example 42, wherein determining a layout analysis of each distinguishing frame comprising text comprises one or more image processing techniques.
- Example 46 A method, comprising: receiving video content of a communication session comprising a plurality of participants; extracting frames from the video content; classifying the frames of the video content based on image analysis; and transmitting, to one or more client devices, the classification of the frames of the video content.
- Example 47 The method of example 46, wherein the frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- Example 48 The method of example 47, wherein the textual content of the face frame is used for one or more of: a sentiment analysis presentation, and an engagement analysis presentation.
- Example 49 The method of example 47, wherein the textual content of the product frame is used for one or more of: presentation of an analysis of the duration of a product demonstration within the communication session, and training data for one or more natural language parsing (NLP) tasks.
- Example 50 The method of example 47, wherein the textual content of the demo frame is used for one or more of: an analysis of the duration of a product demonstration within the communication session, and training data for one or more natural language parsing (NLP) tasks.
- Example 51 The method of any of examples 46-50, wherein classifying the frames of the video content is performed using a convolutional neural network (CNN).
- Example 52 The method of any of examples 46-51, further comprising: post-processing the frames based on the classification of the frames.
- Example 53 The method of example 52, wherein the post-processing comprises: determining a time between two neighboring frames that does not meet a length threshold; and removing noise between the two neighboring frames.
- Example 54 The method of any of examples 46-53, further comprising: determining one or more differences in the classification in neighboring frames.
- Example 55 The method of example 54, further comprising: segmenting the communication session into topic segments based on the determined differences in the classification in neighboring frames.
- Example 56 The method of example 54, further comprising: presenting, to the one or more client devices, a visual indication of the differences in the classification in neighboring frames throughout the video content of the communication session.
- Example 57 A communication system comprising one or more processors configured to perform the operations of: receiving video content of a communication session comprising a plurality of participants; extracting frames from the video content; classifying the frames of the video content based on image analysis; and transmitting, to one or more client devices, the classification of the frames of the video content.
- Example 58 The communication system of example 57, wherein the frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- Example 59 The communication system of example 58, wherein the textual content of the face frame is used for one or more of: a sentiment analysis presentation, and an engagement analysis presentation.
- Example 60 The communication system of example 58, wherein the textual content of the product frame is used for one or more of: presentation of an analysis of the duration of a product demonstration within the communication session, and training data for one or more natural language parsing (NLP) tasks.
- Example 61 The communication system of example 58, wherein the textual content of the demo frame is used for one or more of: an analysis of the duration of a product demonstration within the communication session, and training data for one or more natural language parsing (NLP) tasks.
- Example 62 The communication system of any of examples 57-61, wherein classifying the frames of the video content is performed using a convolutional neural network (CNN).
- Example 63 The communication system of any of examples 57-62, further comprising: post-processing the frames based on the classification of the frames.
- Example 64 The communication system of example 63, wherein the post-processing comprises: determining a time between two neighboring frames that does not meet a length threshold; and removing noise between the two neighboring frames.
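The noise-removal step in Example 64 — treating a run of frames whose duration falls below a length threshold as classification noise — can be sketched as follows. The merging strategy (absorbing a short run into the preceding run) is one plausible reading of the claim, not the claimed implementation:

```python
def smooth_classifications(frames, min_run_seconds=5.0):
    """Remove classification noise between neighboring frames.

    frames: list of (timestamp, label) pairs, sorted by timestamp.
    Runs of identical labels shorter than min_run_seconds are absorbed
    into the preceding run, i.e., relabeled to match their neighbor."""
    if not frames:
        return []
    # Group consecutive frames sharing a label into runs.
    runs = []
    for ts, label in frames:
        if runs and runs[-1][0] == label:
            runs[-1][1].append(ts)
        else:
            runs.append([label, [ts]])
    # Absorb runs shorter than the threshold into the previous run.
    smoothed = [runs[0]]
    for label, stamps in runs[1:]:
        if stamps[-1] - stamps[0] < min_run_seconds:
            smoothed[-1][1].extend(stamps)  # noise: inherit neighbor's label
        else:
            smoothed.append([label, stamps])
    return [(ts, label) for label, stamps in smoothed for ts in stamps]
```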
- Example 65 A non-transitory computer-readable medium containing instructions comprising: instructions for receiving video content of a communication session comprising a plurality of participants; instructions for extracting frames from the video content; instructions for classifying the frames of the video content based on image analysis; and instructions for transmitting, to one or more client devices, the classification of the frames of the video content.
- Example 66 The non-transitory computer-readable medium of example 65, wherein the frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- Example 67 The non-transitory computer-readable medium of example 65, wherein the textual content of the face frame is used for one or more of: a sentiment analysis presentation, and an engagement analysis presentation.
- Example 68 The non-transitory computer-readable medium of example 65, wherein the textual content of the product frame is used for one or more of: presentation of an analysis of the duration of a product demonstration within the communication session, and training data for one or more natural language parsing (NLP) tasks.
- Example 69 The non-transitory computer-readable medium of example 65, wherein the textual content of the demo frame is used for one or more of: an analysis of the duration of a product demonstration within the communication session, and training data for one or more natural language parsing (NLP) tasks.
- Example 70 The non-transitory computer-readable medium of any of examples 65-69, wherein classifying the frames of the video content is performed using a convolutional neural network (CNN).
- Example 71 The non-transitory computer-readable medium of any of examples 65-70, further comprising: post-processing the frames based on the classification of the frames.
- Example 72 The non-transitory computer-readable medium of example 71, wherein the post-processing comprises: determining a time between two neighboring frames that does not meet a length threshold; and removing noise between the two neighboring frames.
- Example 73 The non-transitory computer-readable medium of any of examples 65-72, further comprising: determining one or more differences in the classification in neighboring frames.
- Example 74 The non-transitory computer-readable medium of example 73, further comprising: segmenting the communication session into topic segments based on the determined differences in the classification in neighboring frames.
- Example 75 The non-transitory computer-readable medium of example 73, further comprising: presenting, to the one or more client devices, a visual indication of the differences in the classification in neighboring frames throughout the video content of the communication session.
- Example 76 A method, comprising: receiving video content of a communication session comprising a plurality of participants; extracting frames from the video content; classifying the frames of the video content; identifying one or more distinguishing frames comprising a presentation slide; for each distinguishing frame comprising a presentation slide, detecting a title within the frame; and transmitting, to one or more client devices, the titles for each of the distinguishing frames comprising a presentation slide.
- Example 77 The method of example 76, further comprising: extracting, via optical character recognition (OCR), the title for each distinguishing frame comprising a presentation slide.
- Example 78 The method of any of examples 76-77, wherein detecting the title within the frame comprises using one or more artificial intelligence (Al) models.
- Example 79 The method of any of examples 76-78, wherein detecting the title within the frame comprises a plurality of You Only Look Once (YOLO) techniques.
- Example 80 The method of any of examples 76-79, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 81 The method of example 80, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 82 The method of example 81, wherein detecting the title within the frame further comprises: determining, via intersection over union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
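Examples 80-82 describe a YOLO-style pipeline: divide the frame into grids, generate candidate bounding boxes, then use IOU to keep the highest-confidence title box. The final step amounts to non-maximum suppression; a generic sketch, with the (x1, y1, x2, y2) box format assumed for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def top_title_box(candidates, iou_threshold=0.5):
    """Non-maximum suppression over (box, confidence) candidates:
    keep boxes in descending confidence order, suppressing any box that
    overlaps an already-kept box beyond the IOU threshold, then return
    the top (highest-confidence) surviving box, or None."""
    kept = []
    for box, conf in sorted(candidates, key=lambda c: -c[1]):
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, conf))
    return kept[0] if kept else None
```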
- Example 83 The method of any of examples 76-82, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 84 The method of example 83, wherein one or more candidate titles are determined prior to determining the title, and wherein one of the title detection rules comprises determining that the number of candidate titles determined for the frame does not exceed a threshold number of candidate titles.
- Example 85 The method of example 83, wherein one of the title detection rules comprises determining that the font size for the title meets or exceeds a threshold ratio of font size relative to other text within the frame.
- Example 86 The method of example 83, wherein one of the title detection rules comprises determining that the position of the title within the frame matches with one or more prespecified title positions.
- Example 87 The method of example 86, wherein the prespecified title positions comprise one or more of: center, left, and top title positions corresponding to areas of the frame.
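The title detection rules of Examples 84-87 — a cap on the number of candidate titles, a minimum font-size ratio relative to other text, and a whitelist of title positions — could be combined as below. The candidate fields and default thresholds are illustrative assumptions, not values from the claims:

```python
def select_title(candidates, body_font_size, max_candidates=3,
                 min_font_ratio=1.2,
                 allowed_positions=("center", "left", "top")):
    """Apply the rule set to a list of candidate titles.

    candidates: list of dicts with 'text', 'font_size', and 'position'.
    body_font_size: representative font size of the frame's other text.
    Returns the first candidate passing all rules, or None."""
    if len(candidates) > max_candidates:
        return None  # too many candidates: likely no real title (Example 84)
    for cand in candidates:
        # Font-size ratio rule (Example 85).
        big_enough = cand["font_size"] >= min_font_ratio * body_font_size
        # Prespecified title position rule (Examples 86-87).
        placed_ok = cand["position"] in allowed_positions
        if big_enough and placed_ok:
            return cand["text"]
    return None
```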
- Example 88 The method of any of examples 76-87, wherein identifying the one or more distinguishing frames comprising a presentation slide comprises: filtering out frames of the video content which are classified as a black frame, face frame, or demo frame.
- Example 89 The method of example 76, wherein identifying one or more distinguishing frames comprising a presentation slide comprises: filtering out frames which cannot be distinguished from neighboring frames based on the content of the frame.
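Examples 88-89 combine two filters: dropping frames classified as black, face, or demo, and dropping frames indistinguishable from their neighbors. A sketch, where the per-frame `content` value (e.g., a perceptual hash of the frame image) is an assumption made for illustration:

```python
def distinguishing_slide_frames(classified_frames):
    """Keep only distinguishing slide frames.

    classified_frames: list of (timestamp, label, content) tuples, where
    'content' is some comparable fingerprint of the frame image.
    Frames classified black/face/demo are filtered out (Example 88);
    frames matching the previous kept frame's content are filtered out
    as indistinguishable neighbors (Example 89)."""
    kept = []
    for ts, label, content in classified_frames:
        if label in ("black", "face", "demo"):
            continue  # only slide frames survive classification filtering
        if kept and kept[-1][2] == content:
            continue  # indistinguishable from its neighbor: skip duplicate
        kept.append((ts, label, content))
    return kept
```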
- Example 90 The method of any of examples 76-89, further comprising: determining a layout analysis of each distinguishing frame comprising a presentation slide.
- Example 91 A communication system comprising one or more processors configured to perform the operations of: receiving video content of a communication session comprising a plurality of participants; extracting frames from the video content; classifying the frames of the video content; identifying one or more distinguishing frames comprising a presentation slide; for each distinguishing frame comprising a presentation slide, detecting a title within the frame; and transmitting, to one or more client devices, the titles for each of the distinguishing frames comprising a presentation slide.
- Example 92 The communication system of example 91, further comprising: extracting, via optical character recognition (OCR), the title for each distinguishing frame comprising a presentation slide.
- Example 93 The communication system of any of examples 91-92, wherein detecting the title within the frame comprises using one or more artificial intelligence (Al) models.
- Example 94 The communication system of any of examples 91-93, wherein detecting the title within the frame comprises a plurality of You Only Look Once (YOLO) techniques.
- Example 95 The communication system of any of examples 91-94, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 96 The communication system of example 95, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 97 The communication system of example 96, wherein detecting the title within the frame further comprises: determining, via intersection over union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
- Example 98 The communication system of any of examples 91-97, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 99 The communication system of example 98, wherein one or more candidate titles are determined prior to determining the title, and wherein one of the title detection rules comprises determining that the number of candidate titles determined for the frame does not exceed a threshold number of candidate titles.
- Example 100 The communication system of example 98, wherein one of the title detection rules comprises determining that the font size for the title meets or exceeds a threshold ratio of font size relative to other text within the frame.
- Example 101 The communication system of example 98, wherein one of the title detection rules comprises determining that the position of the title within the frame matches with one or more prespecified title positions.
- Example 102 The communication system of example 101, wherein the prespecified title positions comprise one or more of: center, left, and top title positions corresponding to areas of the frame.
- Example 103 The communication system of any of examples 91-102, wherein identifying the one or more distinguishing frames comprising a presentation slide comprises: filtering out frames of the video content which are classified as a black frame, face frame, or demo frame.
- Example 104 The communication system of any of examples 91-103, wherein identifying one or more distinguishing frames comprising a presentation slide comprises: filtering out frames which cannot be distinguished from neighboring frames based on the content of the frame.
- Example 105 The communication system of any of examples 91-104, wherein the one or more processors are further configured to perform the operation of: determining a layout analysis of each distinguishing frame comprising a presentation slide.
- Example 106 A non-transitory computer-readable medium comprising instructions comprising: instructions for receiving video content of a communication session comprising a plurality of participants; instructions for extracting frames from the video content; instructions for classifying the frames of the video content; instructions for identifying one or more distinguishing frames comprising a presentation slide; for each distinguishing frame comprising a presentation slide, instructions for detecting a title within the frame; and instructions for transmitting, to one or more client devices, the titles for each of the distinguishing frames comprising a presentation slide.
- Example 107 The non-transitory computer-readable medium of example 106, further comprising: extracting, via optical character recognition (OCR), the title for each distinguishing frame comprising a presentation slide.
- Example 108 The non-transitory computer-readable medium of any of examples 106-107, wherein detecting the title within the frame comprises using one or more artificial intelligence (Al) models.
- Example 109 The non-transitory computer-readable medium of any of examples 106-108, wherein detecting the title within the frame comprises a plurality of You Only Look Once (YOLO) techniques.
- Example 110 The non-transitory computer-readable medium of any of examples 106-109, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 111 The non-transitory computer-readable medium of example 110, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 112 The non-transitory computer-readable medium of example 111, wherein detecting the title within the frame further comprises: determining, via intersection over union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
- Example 113 The non-transitory computer-readable medium of any of examples 106-112, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 114 The non-transitory computer-readable medium of example 113, wherein one or more candidate titles are determined prior to determining the title, and wherein one of the title detection rules comprises determining that the number of candidate titles determined for the frame does not exceed a threshold number of candidate titles.
- Example 115 The non-transitory computer-readable medium of example 113, wherein one of the title detection rules comprises determining that the font size for the title meets or exceeds a threshold ratio of font size relative to other text within the frame.
- Example 116 The non-transitory computer-readable medium of example 113, wherein one of the title detection rules comprises determining that the position of the title within the frame matches with one or more prespecified title positions.
- Example 117 The non-transitory computer-readable medium of example 116, wherein the prespecified title positions comprise one or more of: center, left, and top title positions corresponding to areas of the frame.
- Example 118 The non-transitory computer-readable medium of any of examples 106-117, wherein identifying the one or more distinguishing frames comprising a presentation slide comprises: filtering out frames of the video content which are classified as a black frame, face frame, or demo frame.
- Example 119 The non-transitory computer-readable medium of any of examples 106-118, wherein identifying one or more distinguishing frames comprising a presentation slide comprises: filtering out frames which cannot be distinguished from neighboring frames based on the content of the frame.
- Example 120 The non-transitory computer-readable medium of any of examples 106-119, further comprising: determining a layout analysis of each distinguishing frame comprising a presentation slide.
- Example 121 A method comprising: receiving video content of a communication session comprising a plurality of participants; extracting high-resolution versions and low-resolution versions of frames from the video content; classifying the low-resolution frames of the video content; identifying one or more low-resolution distinguishing frames comprising text; for each low-resolution distinguishing frame comprising text: detecting a title within the frame, cropping a title area with the title within the frame, and extracting, via optical character recognition (OCR), the title from the cropped title area of the high-resolution version of the frame; extracting, via OCR, textual content from the high-resolution versions of the low-resolution distinguishing frames comprising text; and transmitting, to one or more client devices, the extracted textual content and the extracted titles.
- Example 122 The method of example 121, wherein extracting the title from the cropped title area of the high-resolution version of the frame comprises: identifying a timestamp corresponding to the low-resolution frame; locating the high-resolution version of the low-resolution frame via the timestamp; and extracting, via OCR, the title from the high- resolution version of the frame.
- Example 123 The method of any of examples 121-122, wherein extracting the textual content from the high-resolution versions of the low-resolution distinguishing frames comprising text comprises: identifying timestamps corresponding to each of the low-resolution distinguishing frames comprising text; locating the high-resolution versions of the low-resolution frames via the corresponding timestamps; and extracting, via OCR, the textual content from the high-resolution versions of the low-resolution distinguishing frames.
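Examples 121-123 pair each low-resolution frame with its high-resolution twin via a timestamp, then run OCR on the title area cropped from the high-resolution frame. A sketch of that lookup-and-crop step, with the frame represented as a list of pixel rows and `ocr` as an injected callable — both assumptions made for illustration:

```python
def extract_title_high_res(low_frame, high_res_index, ocr, scale=1.0):
    """Extract a title from the high-resolution twin of a low-res frame.

    low_frame: dict with 'timestamp' and 'title_box' (x1, y1, x2, y2)
        detected on the low-resolution frame.
    high_res_index: mapping from timestamp to the high-resolution frame
        (here modeled as a list of pixel rows).
    ocr: callable turning a cropped image into text.
    scale: ratio of high-resolution to low-resolution dimensions."""
    ts = low_frame["timestamp"]
    hi = high_res_index[ts]  # locate the high-res twin via its timestamp
    # Scale the low-res title box up to high-res pixel coordinates.
    x1, y1, x2, y2 = (int(v * scale) for v in low_frame["title_box"])
    title_crop = [row[x1:x2] for row in hi[y1:y2]]
    return ocr(title_crop)
```

The same timestamp lookup applies to whole-frame textual content extraction in Example 123, with the crop step omitted.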
- Example 124 The method of any of examples 121-123, wherein the low-resolution frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- Example 125 The method of example 124, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which are classified as a black frame or a face frame.
- Example 126 The method of any of examples 121-125, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which do not contain text.
- Example 127 The method of any of examples 121-126, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which cannot be distinguished from neighboring frames based on the content of the frame.
- Example 128 The method of any of examples 121-127, wherein detecting the title within the frame comprises a plurality of You Only Look Once (YOLO) techniques.
- Example 129 The method of any of examples 121-128, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 130 The method of example 129, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 131 The method of example 130, wherein detecting the title within the frame further comprises: determining, via intersection over union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
- Example 132 The method of any of examples 121-131, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 133 The method of any of examples 121-132, wherein detecting the title within the frame comprises one or more machine learning algorithms.
- Example 134 The method of any of examples 121-133, further comprising: determining a layout analysis of each low-resolution distinguishing frame comprising text.
- Example 135 The method of example 134, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises classifying a plurality of areas of the frame into one or more of: text, title, table, image, and list areas.
- Example 136 The method of example 134, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more deep neural network techniques.
- Example 137 The method of example 134, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more image processing techniques.
- Example 138 The method of example 134, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more image processing techniques.
- Example 139 A communication system comprising one or more processors configured to perform the operations of: receiving video content of a communication session comprising a plurality of participants; extracting high-resolution versions and low-resolution versions of frames from the video content; classifying the low-resolution frames of the video content; identifying one or more low-resolution distinguishing frames comprising text; for each low-resolution distinguishing frame comprising text: detecting a title within the frame, cropping a title area with the title within the frame, and extracting, via optical character recognition (OCR), the title from the cropped title area of the high-resolution version of the frame; extracting, via OCR, textual content from the high-resolution versions of the low-resolution distinguishing frames comprising text; and transmitting, to one or more client devices, the extracted textual content and the extracted titles.
- Example 140 The communication system of example 139, wherein extracting the title from the cropped title area of the high-resolution version of the frame comprises: identifying a timestamp corresponding to the low-resolution frame; locating the high-resolution version of the low-resolution frame via the timestamp; and extracting, via OCR, the title from the high-resolution version of the frame.
- Example 141 The communication system of any of examples 139-140, wherein extracting the textual content from the high-resolution versions of the low-resolution distinguishing frames comprising text comprises: identifying timestamps corresponding to each of the low-resolution distinguishing frames comprising text; locating the high-resolution versions of the low-resolution frames via the corresponding timestamps; and extracting, via OCR, the textual content from the high-resolution versions of the low-resolution distinguishing frames.
- Example 142 The communication system of any of examples 139-141, wherein the low-resolution frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- Example 143 The communication system of example 142, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which are classified as a black frame or a face frame.
- Example 144 The communication system of any of examples 139-143, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which do not contain text.
- Example 145 The communication system of any of examples 139-144, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which cannot be distinguished from neighboring frames based on the content of the frame.
- Example 146 The communication system of any of examples 139-145, wherein detecting the title within the frame comprises a plurality of You Only Look Once (YOLO) techniques.
- Example 147 The communication system of any of examples 139-146, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 148 The communication system of example 147, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 149 The communication system of example 148, wherein detecting the title within the frame further comprises: determining, via intersection over union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
- Example 150 The communication system of any of examples 139-149, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 151 The communication system of any of examples 139-150, wherein detecting the title within the frame comprises one or more machine learning algorithms.
- Example 152 The communication system of any of examples 139-151, further comprising: determining a layout analysis of each low-resolution distinguishing frame comprising text.
- Example 153 The communication system of example 152, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises classifying a plurality of areas of the frame into one or more of: text, title, table, image, and list areas.
- Example 154 The communication system of example 152, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more deep neural network techniques.
- Example 155 The communication system of example 152, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more image processing techniques.
- Example 156 The communication system of example 152, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more image processing techniques.
- Example 157 A non-transitory computer-readable medium containing instructions comprising: instructions for receiving video content of a communication session comprising a plurality of participants; instructions for extracting high-resolution versions and low-resolution versions of frames from the video content; instructions for classifying the low-resolution frames of the video content; instructions for identifying one or more low-resolution distinguishing frames comprising text; for each low-resolution distinguishing frame comprising text: instructions for detecting a title within the frame, instructions for cropping a title area with the title within the frame, and instructions for extracting, via optical character recognition (OCR), the title from the cropped title area of the high-resolution version of the frame; instructions for extracting, via OCR, textual content from the high-resolution versions of the low-resolution distinguishing frames comprising text; and instructions for transmitting, to one or more client devices, the extracted textual content and the extracted titles.
- Example 158 The non-transitory computer-readable medium of example 157, wherein extracting the title from the cropped title area of the high-resolution version of the frame comprises: identifying a timestamp corresponding to the low-resolution frame; locating the high-resolution version of the low-resolution frame via the timestamp; and extracting, via OCR, the title from the high-resolution version of the frame.
- Example 159 The non-transitory computer-readable medium of any of examples 157-158, wherein extracting the textual content from the high-resolution versions of the low-resolution distinguishing frames comprising text comprises: identifying timestamps corresponding to each of the low-resolution distinguishing frames comprising text; locating the high-resolution versions of the low-resolution frames via the corresponding timestamps; and extracting, via OCR, the textual content from the high-resolution versions of the low-resolution distinguishing frames.
- Example 160 The non-transitory computer-readable medium of any of examples 157-159, wherein the low-resolution frames of the video content may be classified as one or more of: a black frame, a face frame, a slide frame, and a demo frame.
- Example 161 The non-transitory computer-readable medium of example 160, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which are classified as a black frame or a face frame.
- Example 162 The non-transitory computer-readable medium of any of examples 157-161, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which do not contain text.
- Example 163 The non-transitory computer-readable medium of any of examples 157-162, wherein identifying one or more low-resolution distinguishing frames comprising text comprises: filtering out low-resolution frames which cannot be distinguished from neighboring frames based on the content of the frame.
- Example 164 The non-transitory computer-readable medium of any of examples 157-163, wherein detecting the title within the frame comprises a plurality of You Only Look Once (YOLO) techniques.
- Example 165 The non-transitory computer-readable medium of any of examples 157-164, wherein detecting the title within the frame comprises: dividing the frame into one or more grids of residual blocks.
- Example 166 The non-transitory computer-readable medium of example 165, wherein detecting the title within the frame further comprises: generating one or more segregated bounding boxes within the grids of residual blocks.
- Example 167 The non-transitory computer-readable medium of example 166, wherein detecting the title within the frame further comprises: determining, via intersection over union (IOU) techniques, a top bounding box with highest prediction confidence for the title from the segregated bounding boxes.
- Example 168 The non-transitory computer-readable medium of any of examples 157-167, wherein detecting the title within the frame is based on one or more title detection rules.
- Example 169 The non-transitory computer-readable medium of any of examples 157-168, wherein detecting the title within the frame comprises one or more machine learning algorithms.
- Example 170 The non-transitory computer-readable medium of any of examples 157-169, further comprising: determining a layout analysis of each low-resolution distinguishing frame comprising text.
- Example 171 The non-transitory computer-readable medium of example 170, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises classifying a plurality of areas of the frame into one or more of: text, title, table, image, and list areas.
- Example 172 The non-transitory computer-readable medium of example 170, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more deep neural network techniques.
- Example 173 The non-transitory computer-readable medium of example 170, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more image processing techniques.
- Example 174 The non-transitory computer-readable medium of example 170, wherein determining the layout analysis of each low-resolution distinguishing frame comprising text comprises one or more image processing techniques.
- The present disclosure also relates to an apparatus for performing the operations herein.
- This apparatus may be specially constructed for the intended purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer.
- A computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
- The present disclosure may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the present disclosure.
- A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer).
- A machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium such as a read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, etc.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Graphics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Methods and systems provide for extraction of textual content from video of a communication session. In one embodiment, the system receives video content of a communication session that includes a number of participants. The system then extracts frames from the video content and classifies the frames of the video content. The system identifies one or more distinguishing frames containing text. For each distinguishing frame containing text, the system detects a title within the frame, crops a title area containing the title within the frame, and extracts, via optical character recognition (“OCR”), the title from the cropped title area of the frame. The system extracts, via OCR, textual content from the distinguishing frames containing text, then transmits the extracted textual content and extracted titles to one or more client devices.
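The frame-selection step described in the abstract, identifying the "distinguishing" frames worth passing to OCR, can be sketched as a simple frame-differencing filter over extracted frames. All names, the flat-pixel frame model, and the threshold are illustrative assumptions rather than the claimed classification technique.

```python
# Sketch of distinguishing-frame selection: keep only frames whose
# content differs enough from the previously kept frame, so near-identical
# frames of the same slide are filtered out before OCR. Frames are
# modelled here as flat grayscale pixel lists; thresholds are illustrative.

def mean_abs_diff(a, b):
    """Average absolute per-pixel difference between two frames."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def distinguishing_frames(frames, threshold=10.0):
    """Return indices of frames that differ enough from the last kept frame."""
    kept = []
    last = None
    for i, frame in enumerate(frames):
        if last is None or mean_abs_diff(frame, last) > threshold:
            kept.append(i)
            last = frame
    return kept

slide_a = [200] * 64   # first slide
slide_a2 = [201] * 64  # same slide, slight compression noise -> filtered out
slide_b = [40] * 64    # new slide -> kept
print(distinguishing_frames([slide_a, slide_a2, slide_b]))  # [0, 2]
```

In the described system each kept frame would then go through title detection, title-area cropping, and OCR; this sketch covers only the deduplication that precedes those steps.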
Applications Claiming Priority (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/832,635 US20230394861A1 (en) | 2022-06-04 | 2022-06-04 | Extraction of textual content from video of a communication session |
US17/832,636 US20230394851A1 (en) | 2022-06-04 | 2022-06-04 | Video frame type classification for a communication session |
US17/832,637 US20230394827A1 (en) | 2022-06-04 | 2022-06-04 | Title detection for slides presented in a communication session |
US17/832,640 US20230394858A1 (en) | 2022-06-04 | 2022-06-04 | Resolution-based extraction of textual content from video of a communication session |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023235576A1 (fr) | 2023-12-07 |
Family
ID=87036215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/024304 WO2023235576A1 (fr) | 2022-06-04 | 2023-06-02 | Extraction de contenu textuel à partir d'une vidéo d'une session de communication |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023235576A1 (fr) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015073501A2 (fr) * | 2013-11-15 | 2015-05-21 | Citrix Systems, Inc. | Generation of electronic summaries of online meetings |
WO2021051024A1 (fr) * | 2019-09-11 | 2021-03-18 | Educational Vision Technologies, Inc. | Editable note-taking resource with optional overlay |
WO2022031283A1 (fr) * | 2020-08-05 | 2022-02-10 | Hewlett-Packard Development Company, L.P. | Video stream content |
US20220051011A1 (en) * | 2020-03-25 | 2022-02-17 | Verizon Media Inc. | Systems and methods for deep learning based approach for content extraction |
Non-Patent Citations (1)
Title |
---|
HONGLIN LI ET AL: "Hierarchical Segmentation of Presentation Videos through Visual and Text Analysis", SIGNAL PROCESSING AND INFORMATION TECHNOLOGY, 2006 IEEE INTERNATIONAL SYMPOSIUM ON, IEEE, PI, 1 August 2006 (2006-08-01), pages 314 - 319, XP031002446, ISBN: 978-0-7803-9753-8 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10936915B2 (en) | Machine learning artificial intelligence system for identifying vehicles | |
Jaton | We get the algorithms of our ground truths: Designing referential databases in digital image processing | |
CN106686339B (zh) | Electronic meeting intelligence | |
CN107680019B (zh) | Examination scheme implementation method, apparatus, device, and storage medium | |
US10073861B2 (en) | Story albums | |
US10276213B2 (en) | Automatic and intelligent video sorting | |
US10963700B2 (en) | Character recognition | |
CN104063683A (zh) | Facial-recognition-based expression input method and apparatus | |
KR102002024B1 (ko) | Object labeling processing method and object management server | |
US11832023B2 (en) | Virtual background template configuration for video communications | |
CN113761253A (zh) | Video tag determination method, apparatus, device, and storage medium | |
TW201539210A (zh) | Personal information management service system | |
Oza et al. | Insurance claim processing using RPA along with chatbot | |
CN112699758A (zh) | Sign language translation method and apparatus based on dynamic gesture recognition, computer device, and storage medium | |
CN111274447A (zh) | Video-based target expression generation method, apparatus, medium, and electronic device | |
CN114817754B (zh) | VR learning system | |
US20230394851A1 (en) | Video frame type classification for a communication session | |
US20230394861A1 (en) | Extraction of textual content from video of a communication session | |
US20230394827A1 (en) | Title detection for slides presented in a communication session | |
US20230394858A1 (en) | Resolution-based extraction of textual content from video of a communication session | |
WO2023235576A1 (fr) | Extraction of textual content from video of a communication session | |
CN108140173A (zh) | Classifying attachments parsed from communications | |
CN116401828A (zh) | Key event visual display method based on data features | |
CN108287817B (zh) | Information processing method and device | |
CN113761281B (zh) | Virtual resource processing method, apparatus, medium, and electronic device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23735516 Country of ref document: EP Kind code of ref document: A1 |