WO2020224362A1 - Video segmentation method and device - Google Patents

Video segmentation method and device

Info

Publication number
WO2020224362A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice information
point
video
segment
presentation
Prior art date
Application number
PCT/CN2020/083397
Other languages
English (en)
Chinese (zh)
Inventor
苏芸 (Su Yun)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2020224362A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439: Processing of audio elementary streams
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845: Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • H04N21/85: Assembly of content; Generation of multimedia applications
    • H04N21/854: Content authoring
    • H04N21/8549: Creating video summaries, e.g. movie trailer

Definitions

  • This application relates to the field of information technology, and more specifically, to a video segmentation method and a video segmentation device.
  • A complete video can be divided into multiple segments. In this way, the user can jump directly to the segment of interest.
  • A common video segmentation method is to segment the video based on text information in the video.
  • The text information in the above video may be subtitles in the video, or text obtained by performing voice recognition on the video.
  • In other words, the basis for segmenting a video currently comes from the video itself.
  • Video segmentation based on text information in the video needs to obtain all the text information of the video.
  • The video stream of a live video is generated in real time, so all text information of the video can be obtained only after the live broadcast ends. The above method therefore cannot segment a live video in real time.
  • In addition, the above method segments the video only according to the text information of the video, so the determined segmentation point is not necessarily a suitable segmentation point.
  • The present application provides a video segmentation method and a video segmentation device, which can improve the accuracy of video segmentation.
  • In a first aspect, an embodiment of the present application provides a video segmentation method, including: a video segmentation device acquires text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed; the video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information; and the video segmentation device segments the video to be processed according to the segmentation point.
  • The above technical solution can combine information other than the content of the to-be-processed video to segment the to-be-processed video, thereby improving the accuracy of segmentation.
  • In a possible implementation, that the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining a switching point of the presentation, where the presentation presents different content before and after the switching point; determining at least one pause point according to the voice information; and determining the segment point according to the switching point and the at least one pause point.
  • Switching of the presentation often means that the content of the speaker's speech has changed. Therefore, the above technical solution takes the change of the presentation into account when dividing the to-be-processed video into different segments, and can determine the segmentation point of the to-be-processed video reasonably and quickly.
  • When determining the segmentation point of the video to be processed, the above technical solution only needs the switching point of the presentation and the pause points near the switching point. Therefore, the above technical solution can segment the video without obtaining the complete video file. In other words, the video to be processed can be segmented in real time, so the above technical solution can be applied to segmentation of live video.
  • In a possible implementation, determining the segment point according to the switching point and the at least one pause point includes: when the switching point is the same as one of the at least one pause point, determining that the switching point is the segment point; when the switching point is different from every one of the at least one pause point, determining that the pause point closest to the switching point among the at least one pause point is the segment point.
  • In a possible implementation, determining the switching point of the presentation includes: determining that the moment at which a switching signal instructing to switch the content of the presentation is acquired is the switching point.
  • In a possible implementation, the text information further includes the content description information.
  • Before the video segmentation device determines the segmentation point of the to-be-processed video based on the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is less than or equal to the first preset duration and greater than the second preset duration.
  • In a possible implementation, that the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining the segmentation point of the video to be processed according to the voice information, keywords of the content description information, and pause points in the voice information.
  • The content description information is information input by the user in advance to describe the video to be processed.
  • The content description information can usually include some key information about the video to be processed, such as keywords and key content. Therefore, the key content described in different segments of the video to be processed can be determined more accurately based on the content description information, so that the video can be segmented more accurately.
  • In a possible implementation, the voice information includes a first voice information segment and a second voice information segment, where the second voice information segment precedes the first voice information segment.
  • Determining the segment point of the to-be-processed video includes: determining a first segment point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the segment points of the to-be-processed video include the first segment point.
  • In the above technical solution, the location of the segmentation point can be determined based only on the keywords of the content description information and the voice information in two adjacent voice information segments.
  • The division into voice information segments can be performed with a fixed window length and step size. Therefore, during video playback, the video that has already been played can be divided into voice information segments. In this way, the video can be segmented without obtaining the complete video file. In other words, the video to be processed can be segmented in real time by the above technical solution, which can therefore be applied to segmentation of live video.
  • In a possible implementation, determining the first segment point includes: determining the similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information; determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and determining the first segment point according to the pause points in the voice information.
  • In a possible implementation, the pause points in the voice information include pause points in the first voice information segment or pause points adjacent to the first voice information segment.
  • Determining the first segment point includes: determining the first segment point according to at least one of the number of pause points in the first voice information segment, the number of pause points adjacent to the first voice information segment, the pause durations, and the words adjacent to the pause points.
  • In a possible implementation, there are K pause points in the first voice information segment, or K pause points adjacent to the first voice information segment.
  • Determining the first segment point then includes: when K is equal to 1, determining that this pause point is the segment point; when K is a positive integer greater than or equal to 2 and the K words adjacent to the K pause points include one preset word, determining that the pause point adjacent to the preset word is the segment point; when K is a positive integer greater than or equal to 2 and the K words include at least two preset words, determining that the pause point with the longest pause duration among the pause points adjacent to the preset words is the segment point; and when K is a positive integer greater than or equal to 2 and the K words do not include any preset word, determining that the pause point with the longest pause duration among the K pause points is the segment point.
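  • A minimal Python sketch of this selection rule follows. The PausePoint fields and the preset word list are illustrative assumptions, and the final branch (K greater than or equal to 2 with no preset word adjacent to any pause point) is an assumed completion of the truncated rule above, choosing the longest pause.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PausePoint:
    time: float          # position of the pause in the video, in seconds
    duration: float      # length of the speaker's pause, in seconds
    adjacent_word: str   # word spoken next to the pause

# Illustrative stand-in for the preset words mentioned above.
PRESET_WORDS = {"next", "now", "summary", "finally"}

def pick_segment_point(pauses: List[PausePoint]) -> Optional[float]:
    """Apply the K-case rule described above to K candidate pauses."""
    if not pauses:
        return None
    if len(pauses) == 1:                      # K == 1
        return pauses[0].time
    flagged = [p for p in pauses if p.adjacent_word in PRESET_WORDS]
    if len(flagged) == 1:                     # exactly one preset word
        return flagged[0].time
    # Two or more preset words: longest pause among them wins;
    # no preset word at all (assumed case): longest pause overall.
    candidates = flagged if len(flagged) >= 2 else pauses
    return max(candidates, key=lambda p: p.duration).time
```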
  • In a possible implementation, the text information further includes the presentation.
  • Before the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is greater than the first preset duration; or determining that the presentation duration of the current page of the presentation is less than or equal to the second preset duration.
  • In a possible implementation, the method further includes: the video segmentation device determines a summary of the segment according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of a target text, where the target text includes at least one of the presentation and the content description information.
  • The user can use the summary to quickly find the desired location when reviewing the video.
  • The foregoing technical solution takes information other than the video to be processed into account in the process of determining the summary. This can improve both the accuracy of the determined summary and the speed of determining it.
  • In a possible implementation, that the video segmentation device determines the summary of the segment based on the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text includes: determining a third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text; and determining the summary of the segment according to the third keyword vector.
  • In a possible implementation, that the video segmentation device determines the summary of the segment according to the third keyword vector includes: determining a reference text according to the target text and the segmented voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determining J keyword vectors according to the keywords of the segmented voice information, the keywords of the target text, and each sentence of the J sentences; and determining the summary of the segment according to the third keyword vector and the J keyword vectors.
  • In a possible implementation, determining the reference text according to the target text and the segmented voice information includes: if the target text includes redundant sentences, deleting the redundant sentences from the target text to obtain a revised target text, and combining the revised target text with the segmented voice information to obtain the reference text; if the target text does not include redundant sentences, combining the target text with the segmented voice information to obtain the reference text.
  • In a possible implementation, determining the summary of the segment according to the third keyword vector and the J keyword vectors includes: determining J distances according to the third keyword vector and the J keyword vectors, where the j-th distance among the J distances is determined according to the third keyword vector and the j-th keyword vector among the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determining the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determining the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
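  • A sketch of this summary selection in Python, assuming whitespace-tokenized text, the frequency-based keyword vectors described later in this document, and Euclidean distance; all function names are illustrative.

```python
import math
from typing import List, Sequence

def keyword_vector(text: str, keywords: Sequence[str]) -> List[float]:
    # Frequency of each keyword in the text, in the given keyword order.
    words = text.lower().split()
    total = max(len(words), 1)
    return [words.count(k) / total for k in keywords]

def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def summarize(segment_text: str, reference_sentences: List[str],
              keywords: Sequence[str], r: int) -> List[str]:
    """Pick the R reference-text sentences whose keyword vectors are
    closest to the segment's own (third) keyword vector."""
    third = keyword_vector(segment_text, keywords)
    scored = [(euclidean(third, keyword_vector(s, keywords)), s)
              for s in reference_sentences]
    scored.sort(key=lambda pair: pair[0])
    return [sentence for _, sentence in scored[:r]]
```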
  • In a possible implementation, the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is the voice information of the real-time video stream from the start time of the real-time video stream, or from the previous segment point, to the current moment.
  • In a second aspect, an embodiment of the present application provides a video segmentation device, and the device includes units for executing the method in the first aspect or any possible implementation of the first aspect.
  • the video segmentation apparatus of the second aspect may be a computer device, or may be a component (such as a chip or a circuit, etc.) that can be used in a computer device.
  • In a third aspect, an embodiment of the present application provides a storage medium, and the storage medium stores instructions for implementing the method in the first aspect or any one of the possible implementations of the first aspect.
  • In a fourth aspect, the embodiments of the present application provide a computer program product containing instructions. When the computer program product is run on a computer, the computer can execute the method described in the first aspect or any one of the possible implementations of the first aspect.
  • Fig. 1 is a schematic diagram of a system to which the video segmentation method provided by the embodiments of the present application can be applied;
  • Fig. 2 is a schematic diagram of another system to which the video segmentation method provided by the embodiments of the present application can be applied;
  • Fig. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
  • Fig. 4 is a schematic diagram of a video conference process provided according to an embodiment of the present application.
  • Fig. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
  • Fig. 6 is a schematic flowchart of a method for video segmentation according to an embodiment of the present application.
  • Fig. 7 is a structural block diagram of a video segmentation device according to an embodiment of the present application.
  • Fig. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • "At least one" refers to one or more, and "multiple" refers to two or more.
  • "And/or" describes an association relationship between associated objects and indicates that three relationships are possible: for example, "A and/or B" can mean that A exists alone, that both A and B exist, or that B exists alone, where A and B can be singular or plural.
  • The character "/" generally indicates an "or" relationship between the associated objects.
  • "At least one of the following items" or a similar expression refers to any combination of these items, including a single item or any combination of multiple items.
  • For example, "at least one of a, b, or c" may represent: a, b, c, a and b, a and c, b and c, or a, b, and c, where a, b, and c may be single or multiple.
  • Words such as "first" and "second" do not limit the number or the execution order.
  • Computer-readable media may include, but are not limited to: magnetic storage devices (for example, hard disks, floppy disks, or tapes), optical discs (for example, compact discs (CD) or digital versatile discs (DVD)), smart cards, and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, or key drives).
  • The various storage media described herein may represent one or more devices and/or other machine-readable media for storing information.
  • The term "machine-readable medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
  • Fig. 1 is a schematic diagram of a system to which the video segmentation method provided by this application can be applied.
  • Fig. 1 shows a video conference system, which includes a conference control server 101, a conference terminal 111, a conference terminal 112, and a conference terminal 113.
  • the conference terminal 111, the conference terminal 112, and the conference terminal 113 can establish a conference through the conference control server 101.
  • A video conference usually involves at least two conference sites. Each conference site can access the conference control server through a conference terminal.
  • the conference terminal may be a device used to access a video conference.
  • the conference terminal can be used to receive conference data and present the conference content on the display device according to the conference data.
  • the conference terminal may include a host and a display device.
  • the host can receive conference data through a communication interface, generate a video signal according to the received conference data, and output the video signal to the display device in a wired or wireless manner.
  • the display device presents the content of the meeting according to the received video signal.
  • the display device may be built in the host.
  • the conference terminal may be an electronic device with a built-in display device, such as a notebook computer, a tablet computer, or a smart phone.
  • the display device may be a display device externally placed on the host.
  • the host may be a computer host, and the display device may be a display, a television, or a projector.
  • the display device used for presenting conference content may also be a display device external to the host.
  • the host may be a notebook computer, and the display device may be a monitor, a television or a projector externally connected to the notebook computer.
  • a video conference may include a main venue and at least one branch venue.
  • The conference terminal in the main conference site (for example, the conference terminal 111) can upload the collected media stream of the main conference site to the conference control server 101.
  • the conference control server 101 may generate conference data according to the received media stream, and send the conference data to the conference terminals (for example, the conference terminal 112 and the conference terminal 113) in the branch venue.
  • the conference terminal 112 and the conference terminal 113 may present the conference content on the display device according to the received conference data.
  • the conference terminal in each conference site can upload the collected media stream to the conference control server 101.
  • Assume that the conference terminal 111 is used to access the video conference in conference site 1, the conference terminal 112 is used to access the video conference in conference site 2, and the conference terminal 113 is used to access the video conference in conference site 3.
  • The conference terminal 111 can upload the collected media stream of conference site 1 to the conference control server 101.
  • the conference control server 101 can generate conference data 1 according to the media stream of the conference site 1, and send the conference data 1 to the conference terminal 112 and the conference terminal 113.
  • the conference terminal 112 and the conference terminal 113 may present the conference content on the display device according to the received conference data 1.
  • Similarly, the conference terminal 112 can upload the collected media stream of conference site 2 to the conference control server 101, which generates conference data 2 according to that media stream and sends it to the conference terminal 111 and the conference terminal 113; the conference terminal 111 and the conference terminal 113 then present the conference content on the display device according to the received conference data 2.
  • The conference terminal 113 can likewise upload the collected media stream of conference site 3 to the conference control server 101, which generates conference data 3 and sends it to the conference terminal 111 and the conference terminal 112; these terminals present the conference content on the display device according to the received conference data 3.
  • the media stream may be an audio stream.
  • the media stream may be a video stream.
  • the media device responsible for collecting media streams may be built in the conference terminal (for example, a camera and microphone in the conference terminal), or may be externally connected to the conference terminal, which is not limited in this embodiment of the application.
  • the speakers of the conference use presentations during the speech.
  • the media stream may be an audio stream of the speaker.
  • the presentation used by the speaker during the speaking process can be uploaded to the conference control server 101 through an auxiliary stream (also called a data stream or a computer screen stream).
  • the conference control server 101 generates conference data based on the received audio stream and auxiliary stream.
  • the conference data may include the received audio stream and auxiliary stream.
  • the conference data may include a processed audio stream obtained after processing the received audio stream and the auxiliary stream.
  • Processing the received audio stream can be a transcoding operation; for example, the bit rate of the audio stream can be reduced so as to reduce the amount of data required to transmit the audio stream to other conference terminals.
  • the conference data may include the received audio stream, an audio stream with a different bit rate from the received audio stream, and the auxiliary stream.
  • The conference terminal can select a suitable audio stream according to the network conditions and/or the way of accessing the conference. For example, if the network condition of the conference terminal is good or Wi-Fi is used to access the conference, an audio stream with a higher bit rate can be selected, so that a clearer sound can be heard.
  • the conference data may include subtitles corresponding to the speaker's speech in addition to at least one bit rate audio stream and auxiliary stream.
  • The subtitles can be generated by voice-to-text conversion based on voice recognition technology, or they can be recorded manually, or they can be generated by voice-to-text conversion combined with manual modification.
  • the media stream may be a video stream of the speaker during the speech.
  • the media stream can include both the voice information and picture information of the speaker during the speech.
  • the media stream uploaded to the conference control server 101 is the video stream.
  • the speaker uses a presentation during the speech and uses an output device (such as a projector, a television, etc.) to show the presentation.
  • the screen information in the media stream includes the presentation displayed by the speaker. Therefore, the video stream uploaded to the conference control server 101 includes the presentation.
  • the conference control server 101 can directly determine the conference data according to the video stream.
  • the presentation used by the speaker during the speech can be uploaded to the conference control server 101 in the form of auxiliary streams.
  • the conference control server 101 may generate conference data according to the collected video stream and the auxiliary stream.
  • the conference data may include collected video streams and auxiliary streams.
  • The conference data may include a processed video stream obtained after processing the collected video stream, together with the auxiliary stream. Processing the collected video stream can be a transcoding operation; for example, the resolution of the video stream can be reduced so as to reduce the amount of data required to transmit the video stream to other conference terminals.
  • the conference data may include a collected video stream, a video stream with a resolution different from the collected video stream, and the auxiliary stream.
  • the conference terminal can select an appropriate video stream according to the network conditions and/or the way to access the conference. For example, if the network of the conference terminal is good or Wi-Fi is used to access the conference, a video stream with a higher resolution can be selected so that the audience can see a clearer picture. For another example, if the network condition of the conference terminal is poor, a video stream with a lower resolution can be selected, which can reduce the interruption of the conference live broadcast caused by the bad network condition. For another example, if the conference terminal uses a mobile network to access the conference, a video stream with a lower resolution can be selected, which can reduce data consumption.
  • the conference data may also include subtitles corresponding to the speaker's speech.
  • The subtitles can be generated by voice-to-text conversion based on voice recognition technology, or they can be recorded manually, or they can be generated by voice-to-text conversion combined with manual modification.
  • Fig. 2 is a schematic diagram of another system to which the video segmentation method provided by this application can be applied.
  • Fig. 2 shows a distance education system, which includes a course server 201, a main device 211, a client device 212, and a client device 213.
  • the main device 211 can upload the collected media stream to the course server 201.
  • The course server 201 can generate course data according to the media stream and send the course data to the client device 212 and the client device 213, and the client device 212 and the client device 213 can present the course content on the display device according to the received course data.
  • the main device 211 may be a notebook computer or a desktop computer.
  • the client device 212 and the client device 213 may be notebook computers, desktop computers, tablet computers, smart phones, and so on.
  • the teacher in charge of the lecture uses the presentation during the lecture.
  • the media stream may be an audio stream of the teacher's lecture.
  • the presentation used by the teacher during the lecture can be uploaded to the course server 201 through auxiliary streams.
  • the course server 201 generates course data according to the received audio stream and auxiliary stream.
  • the media stream may be a video stream of the teacher during the lecture.
  • the media stream can include both the audio information and the picture information of the teacher during the lecture.
  • the media stream uploaded to the course server 201 is the video stream.
  • the teacher uses a presentation during the lecture and uses an output device (such as a projector, a television, etc.) to show the presentation.
  • the picture information in the media stream includes the presentation presented by the teacher. Therefore, the presentation is included in the video stream uploaded to the course server 201.
  • the course server 201 can directly determine the course data according to the video stream.
  • the presentation used by the teacher during the lecture can be uploaded to the course server 201 by way of auxiliary streams.
  • the course server 201 can generate course data according to the collected video stream and the auxiliary stream.
  • The specific content of the course data is similar to the specific content of the conference data, and the details are not repeated here for brevity.
  • Fig. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
  • The method shown in Fig. 3 can be executed by a video segmentation device.
  • The video segmentation device may be a computer device that can implement the method provided in the embodiments of this application, such as a personal computer, a notebook computer, a tablet computer, or a server; or it may be internal hardware of such a computer device, for example a graphics card or a graphics processing unit (GPU); or it may be a dedicated device for implementing the method provided in the embodiments of the present application.
  • The video segmentation device may be the conference control server 101 in the system shown in Fig. 1, or a piece of hardware in the conference control server 101.
  • The video segmentation device may be a conference terminal uploading a media stream in the system shown in Fig. 1, or a piece of hardware in that conference terminal.
  • The video segmentation device may be the main device 211 in the system shown in Fig. 2, or a piece of hardware provided in the main device 211.
  • The video segmentation device may be the course server 201 in the system shown in Fig. 2, or a piece of hardware in the course server 201.
  • 301: The video segmentation device acquires text information of a to-be-processed video and voice information of the to-be-processed video, where the text information includes at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video.
  • the presentation refers to the presentation presented by the speaker of the conference during the speech.
  • The embodiments of the present application do not limit the file format of the presentation: any document displayed on the display device during the speaker's speech can serve as the presentation.
  • For example, the presentation may be in ppt or pptx format, in PDF format, or in word or txt format.
  • The content description information is information describing the content of the speech that is uploaded by the speaker or the host of the meeting before the meeting starts.
  • the content description information includes an outline, abstract, and/or key information of the speaker's speech content in the video conference.
  • the content description information may include keywords of the speaker's speech content.
  • the content description information may include a summary of the content of the speaker's speech.
  • the content of the speaker's speech may include multiple parts, and the content description information may include the subject, abstract, and/or keywords of each part of the multiple parts.
  • The voice information may include text obtained by performing voice-to-text conversion on the speaker's speech.
  • The embodiments of the present application do not limit the specific implementation of the voice-to-text conversion, as long as the recognized voice can be converted into the corresponding text.
  • The voice information may also include at least one pause point obtained by performing voice recognition on the speaker's speech. A pause point represents a natural pause of the speaker in the process of speaking.
  • 302: The video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information.
  • the text information may include at least one of the presentation and the content description information.
  • The text information can fall into the following three cases:
  • Case 1: The text information only includes the presentation.
  • Case 2: The text information only includes the content description information.
  • Case 3: The text information includes both the presentation and the content description information.
  • In some cases, the speaker may only show the presentation during the speech without uploading content description information in advance; this corresponds to case 1. In other cases, the speaker may only upload the content description information in advance without showing a presentation during the speech; this corresponds to case 2. In still other cases, the speaker may both show the presentation during the speech and upload the content description information in advance; this corresponds to case 3.
  • In case 1, the video segmentation device may determine the segmentation point of the video to be processed according to the presentation.
  • In case 2, the video segmentation device may determine the segmentation point of the video to be processed according to the content description information.
  • In case 3, the video segmentation device may determine the segmentation point of the video to be processed according to either the presentation or the content description information.
  • Specifically, the video segmentation device may determine the presentation duration of the current page of the presentation, and decide, based on that duration, whether to determine the segmentation point of the video to be processed according to the presentation or according to the content description information.
  • The video segmentation device may determine the segmentation point of the to-be-processed video based on the content description information and the voice information when the presentation duration of the current page of the presentation is greater than the first preset duration. In this way, it is possible to avoid a segment of the video being too long because the speaker shows the same content for a long time.
  • the first preset duration can be set as required.
  • the first preset duration may be 10 minutes.
  • the first preset duration may be 15 minutes.
  • The video segmentation device may determine the segmentation point of the to-be-processed video based on the content description information and the voice information when the presentation duration of the current page of the presentation is less than or equal to the second preset duration.
  • the second preset duration can be set as needed.
  • the second preset duration may be 20 seconds.
  • the second preset duration may be 10 seconds.
  • the first preset duration is greater than the second preset duration.
  • The video segmentation device may determine the segmentation point of the video to be processed according to the presentation and the voice information in the case where the presentation duration of the current page of the presentation is greater than the second preset duration and less than or equal to the first preset duration.
  • In some embodiments, only the first preset duration may be set. If the presentation duration of the current page of the presentation is greater than the first preset duration, the segmentation point of the video to be processed is determined according to the content description information and the voice information; otherwise, it may be determined according to the presentation and the voice information. The presentation duration of the current page of the presentation is the length of time the presentation stays on the current page.
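  • As a sketch, this page-duration rule can be expressed in a few lines of Python; the two threshold values below are the example durations mentioned earlier (10 minutes and 20 seconds) and are illustrative assumptions, not values fixed by the method.

```python
# Illustrative thresholds; the method only requires first > second.
FIRST_PRESET_DURATION = 10 * 60   # e.g. 10 minutes, in seconds
SECOND_PRESET_DURATION = 20       # e.g. 20 seconds

def segmentation_basis(page_display_seconds: float) -> str:
    """Decide which text information to combine with the voice
    information when determining the segmentation point."""
    if page_display_seconds > FIRST_PRESET_DURATION:
        # Page shown too long: fall back to the content description.
        return "content_description"
    if page_display_seconds <= SECOND_PRESET_DURATION:
        # Page flipped too quickly: also use the content description.
        return "content_description"
    # Normal dwell time: segment according to the presentation itself.
    return "presentation"
```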
  • In some embodiments, the start moment of the presentation duration of the current page of the presentation is the moment when the presentation is switched to the current page, and the end moment is the moment when the presentation is switched from the current page to another page.
  • For example, assume the presentation is switched to page n at time T1. The video segmentation device may start timing from time T1. If the timed duration exceeds the first preset duration and the presentation has not yet been switched to page n+1, the video segmentation device may determine the segmentation point of the to-be-processed video based on the content description information and the voice information. If the presentation is switched to page n+1 at time T2 (T2 greater than T1), and the duration from time T1 to time T2 is less than or equal to the second preset duration, the video segmentation device may likewise determine the segmentation point of the to-be-processed video according to the content description information and the voice information.
  • If the duration from time T1 to time T2 is greater than the second preset duration and less than or equal to the first preset duration, the video segmentation device may determine the segmentation point of the to-be-processed video based on the presentation and the voice information; more specifically, according to page n of the presentation and the voice information.
  • In other embodiments, the start moment of the presentation duration of the current page of the presentation may be the previous segment point, and the end moment is the moment when the presentation is switched from the current page to another page.
  • Here, n is a positive integer greater than or equal to 1.
  • For example, assume the video segmentation device determines, according to the content description information and the voice information, that time T4 is a segmentation point of the video to be processed.
  • The video segmentation device may then start timing from time T4. If the timed duration exceeds the first preset duration and the presentation has not been switched to page n+1, the video segmentation device may determine the segmentation point of the to-be-processed video based on the content description information and the voice information.
  • Otherwise, the video segmentation device can determine the segmentation point of the video to be processed according to the presentation and the voice information; more specifically, according to page n of the presentation and the voice information.
  • In the case where the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point of the to-be-processed video based on the presentation and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation device may refer only to the presentation and the voice information (that is, not use the content description information) to determine the segmentation point of the video to be processed.
  • Alternatively, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information.
  • In other words, the video segmentation device may refer only to the content description information and the voice information (that is, not use the presentation) to determine the segmentation point of the video to be processed.
  • That the video segmentation device determines the segmentation point of the video to be processed according to the presentation and the voice information may include: the video segmentation device determines the switching point of the presentation, where the presentation presents different content before and after the switching point; the video segmentation device determines at least one pause point based on the voice information; and the video segmentation device determines the segment point based on the switching point and the at least one pause point.
  • the switching point of the presentation refers to the moment when the presentation is switched.
  • the switching of the presentation can refer to the page turning of the presentation. For example, switch from page 1 to page 2.
  • the switching of the presentation can also mean that the content of the presentation changes without turning pages.
  • For example, the speaker may first show only a part (for example, the upper half) of a certain page of the presentation, and then scroll to the remaining part (for example, the lower half) of the page.
  • Although the presentation does not turn a page in this case, the content presented has changed.
  • the video segmentation device may obtain a switching signal for instructing to switch the content of the presentation.
  • the video segmentation device may determine that the moment when the switching signal is acquired is the switching point.
  • the video segmentation device may obtain the content of the presentation.
  • Alternatively, the video segmentation device may determine the switching point according to changes in the content of the presentation. For example, the video segmentation device may determine that the first moment is the switching point when the content of the presentation presented at the first moment of the video to be processed is different from the content presented at the second moment.
  • In some embodiments, the first moment and the second moment are adjacent moments, and the first moment is before the second moment.
  • In other embodiments, the first moment is before the second moment and the interval between them is less than a preset time length. In other words, in this case, the video segmentation device can detect at regular intervals whether the content presented by the presentation has changed.
  • the video segmentation device may determine the switching point in combination with the acquired switching signal for instructing to switch the content of the presentation and the content presented by the presentation.
  • For example, assume the video segmentation device acquires the switching signal at time T1.
  • The video segmentation device may obtain the content presented by the presentation in F1 frames before time T1 and in F2 frames after time T1, where F1 and F2 are positive integers greater than or equal to 1.
  • F1 and F2 may take small values; for example, both may be equal to 2, which reduces the amount of calculation. If the content presented in two consecutive frames among these F1 + F2 frames differs, the moment of the frame where the content changes may be determined to be the switching point.
  • The video segmentation device may determine whether the content presented by the presentation at different moments (or in different frames) is the same in the following manner: the video segmentation device counts the number P of positions at which the change in pixel value between the two moments (or frames) exceeds a preset change value. If P is greater than a first preset threshold P1, the video segmentation device determines that the content presented by the presentation has changed.
  • The change in a pixel value can be determined by calculating the absolute value of the difference between pixel grayscale values.
  • Alternatively, the change in a pixel value can be determined by calculating the sum of the absolute values of the differences over the three color channels.
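  • A sketch of this frame comparison in Python with NumPy, assuming 8-bit frames; the change value and the threshold P1 are illustrative placeholders.

```python
import numpy as np

def content_changed(frame_a: np.ndarray, frame_b: np.ndarray,
                    change_value: int = 30, p1: int = 10_000) -> bool:
    """Count the positions P where the pixel value change between two
    frames exceeds `change_value`; report a switch when P > P1."""
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    if diff.ndim == 3:
        # Color frames: sum of absolute differences over the channels.
        diff = diff.sum(axis=2)
    p = int((diff > change_value).sum())
    return p > p1
```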
  • In some embodiments, the video segmentation device may determine keywords based on the subsequent presentation content. For example, suppose the video segmentation device determines that, between the presentation at time T1 and the presentation at time T2 (time T2 being later than time T1), the number of positions whose pixel value change exceeds the preset change value is greater than a second threshold P2 but less than P1. In this case, the video segmentation device may determine keywords according to the presentation at time T2.
  • the voice information may also include at least one pause point.
  • The at least one pause point used to determine the segment point may be all the pause points from a start time to the current time. If the segment point determined in step 302 is the first segment point of the to-be-processed video, the start time is the start time of the to-be-processed video. If the segment point determined in step 302 is the k-th segment point of the to-be-processed video (k being a positive integer greater than or equal to 2), the start time is the time of the (k-1)-th segment point.
  • Alternatively, the video segmentation device may determine the pause points within a time range around the moment of the switching point. For example, if the switching point is at time T1, the video segmentation device can determine the pause points from time T1 - t to time T1 + t.
  • The video segmentation device determines that the switching point is the segment point when the switching point is the same as one of the at least one pause point. When the switching point is not the same as any of the at least one pause point, the video segmentation device determines that the pause point closest to the switching point among the at least one pause point is the segment point.
  • The distance between a pause point and the switching point refers to the time difference between them. For example, assume the switching point is at time T1 and one of the at least one pause point is at time T2; the distance is then the difference t between T2 and T1.
  • If t is smaller than the distance from the switching point to every other pause point, the pause point at time T2 is the segment point. If two of the at least one pause point are at the same distance from the switching point, and that distance is less than the distance from the switching point to every other pause point, then either of those two pause points can be determined to be the segment point.
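  • A minimal sketch of this rule, assuming the switching point and pause points are given as timestamps in seconds and at least one pause point exists.

```python
from typing import Sequence

def segment_point(switch_time: float, pause_times: Sequence[float]) -> float:
    """The switching point itself if it coincides with a pause point,
    otherwise the pause point closest in time (ties broken arbitrarily,
    as noted above)."""
    if switch_time in pause_times:
        return switch_time
    return min(pause_times, key=lambda t: abs(t - switch_time))
```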
  • That the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information may include: the video segmentation device determines the segment point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information.
  • the to-be-processed video may be divided into multiple voice information segments.
  • the first voice information segment and the second voice information segment are two consecutive voice information segments among the multiple voice information segments.
  • the first voice information segment follows the second voice information segment.
  • The video segmentation device can determine the first segment point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information. The first segment point is one of at least one segment point included in the to-be-processed video.
  • Specifically, the video segmentation device can intercept text fields from the voice information with a window length W and a step size S.
  • The video segmentation device thereby cuts out at least one text field of length W, and each text field of length W is one voice information segment.
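  • A sketch of this windowing, assuming the transcript is a list of words and that W and S are counted in words (the text does not fix the unit).

```python
from typing import List

def voice_segments(transcript: List[str], w: int, s: int) -> List[str]:
    """Cut text fields of window length W from the transcript with
    step size S; each field is one voice information segment."""
    last_start = max(len(transcript) - w, 0)
    return [" ".join(transcript[start:start + w])
            for start in range(0, last_start + 1, s)]
```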
  • The video segmentation device can determine whether the first voice information segment is similar to the second voice information segment. If the first voice information segment is not similar to the second voice information segment, it can be determined that a segment point of the to-be-processed video is near the first voice information segment. If the two segments are similar, the device continues by determining whether a third voice information segment, adjacent to the first voice information segment and located after it, is similar to the first voice information segment.
  • A similarity degree can be used as the criterion for measuring whether the first voice information segment is similar to the second voice information segment. If the similarity between the first voice information segment and the second voice information segment is greater than or equal to a similarity threshold, the two segments can be considered similar; if the similarity is less than the similarity threshold, they can be considered not similar.
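  • The consecutive-segment comparison described above might be sketched as follows; the similarity function is left abstract here and could be derived from the keyword-vector distance described below.

```python
from typing import Callable, List

def candidate_segment_indices(segments: List[str],
                              similarity: Callable[[str, str], float],
                              threshold: float) -> List[int]:
    """Whenever a segment is not similar to its predecessor, a segment
    point of the video lies near that segment."""
    return [i for i in range(1, len(segments))
            if similarity(segments[i - 1], segments[i]) < threshold]
```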
  • The video segmentation device may determine the similarity between the first voice information segment and the second voice information segment based on the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information.
  • The video segmentation device can determine the keywords of the first voice information segment. Assume that the number of keywords determined from the first voice information segment is N, the number of keywords determined from the content description information is M, and there are no duplicate keywords between the M keywords and the N keywords.
  • Specifically, the video segmentation device can determine keywords according to the following method:
  • Step 1: According to a preset stop word list, or according to the part of speech of each word in the text, remove words that do not carry actual meaning, such as "the", "this", and "then". Stop words are manually specified characters or words rather than automatically generated ones; they do not carry actual meaning and are filtered out before or after natural language data is processed. The set of such stop words can be called a stop word list.
  • Step 2: Count the frequency of each remaining word in the text. The frequency of each word in the text can be determined according to the following formula:
  • TF(n) = N(n) / All_N
  • where TF(n) represents the frequency in the text of the n-th word among the words remaining after step 1, N(n) represents the number of times the n-th word appears, and All_N represents the total number of remaining words.
  • Step 3: Determine at least one word with the highest frequency as a keyword of the text.
  • If the text is the content description information, the M words with the highest frequency are the keywords of the content description information, where M is a positive integer greater than or equal to 1.
  • If the text is the first voice information segment, the N words with the highest frequency are the keywords of the first voice information segment, where N is a positive integer greater than or equal to 1. If one or more of these N words are the same as keywords of the content description information, the repeated words are deleted from the N words and the next most frequent words are selected as keywords of the first voice information segment.
  • For example, assume N equals 2, M equals 1, and the keyword of the content description information is "student". If the word with the highest frequency in the first voice information segment is "student", the device continues to the word with the second-highest frequency. If that word is "school", "school" is determined to be a keyword of the first voice information segment, and the device continues to the word with the third-highest frequency. If that word is "course", "course" is determined to be the other keyword of the first voice information segment.
  • The same applies if the text is the second voice information segment: the N most frequent words are its keywords, and any words that duplicate keywords of the content description information are deleted and replaced by the next most frequent words.
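  • Steps 1 to 3, together with the de-duplication against the keywords of the content description information, might look as follows in Python; the stop word list is an illustrative stand-in for a configured one.

```python
from collections import Counter
from typing import FrozenSet, List

STOP_WORDS = {"the", "this", "then", "a", "of"}  # illustrative list

def top_keywords(text: str, n: int,
                 exclude: FrozenSet[str] = frozenset()) -> List[str]:
    """Drop stop words, rank the rest by frequency TF(n) = N(n)/All_N,
    and keep the n most frequent words that are not in `exclude`."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    ranked = [w for w, _ in Counter(words).most_common()]
    return [w for w in ranked if w not in exclude][:n]

# Mirroring the example above: first the content description keywords,
# then segment keywords that skip any duplicates (e.g. "student").
# desc_keys = top_keywords(description_text, m)
# seg_keys = top_keywords(segment_text, n, frozenset(desc_keys))
```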
  • the video segmentation device may determine the first keyword vector based on the keywords of the first voice information segment, the keywords of the content description information, and the content of the first voice information segment. Specifically, the video segmentation device may determine the frequencies, in the content of the first voice information segment, of the keywords of the first voice information segment and the keywords of the content description information; these frequencies form the first keyword vector.
  • the content of a voice information segment refers to all the words included in that segment. For example, suppose the keyword of the content description information is "student", and the keywords of the first voice information segment are "course" and "school". Assuming that the frequencies of the above three keywords in the first voice information segment are 0.1, 0.2, and 0.3, respectively, the first keyword vector is (0.3, 0.2, 0.1).
  • the video segmentation device may also determine the second keyword vector based on the keywords of the second voice information segment, the keywords of the content description information, and the content of the second voice information segment. Specifically, the video segmentation device may determine the frequencies, in the content of the second voice information segment, of the keywords of the second voice information segment and the keywords of the content description information; these frequencies form the second keyword vector. For example, suppose the keyword of the content description information is "student", and the keywords of the second voice information segment are "breakfast" and "nutrition". Assuming that the frequencies of the above three keywords appearing in the second voice information segment are 0.3, 0.25, and 0.05, respectively, the second keyword vector is (0.3, 0.25, 0.05).
  • the video segmentation device determines the distance based on the first keyword vector and the second keyword vector. If the distance is greater than a preset distance, the similarity between the first voice information segment and the second voice information segment can be considered less than the similarity threshold. In this case, the video segmentation device determines the segment point according to the first voice information segment.
  • the video segmentation device may determine a distance based on the first keyword vector and the second keyword vector in the following manner:
  • Step 1: Expand the first keyword vector into a first vector and the second keyword vector into a second vector, where the keywords corresponding to the first vector and to the second vector both consist of the union of the keywords corresponding to the first keyword vector and the keywords corresponding to the second keyword vector.
  • For example, the first keyword vector is (0.3, 0.2, 0.1) with corresponding keywords "school", "course", and "student", and the second keyword vector is (0.3, 0.25, 0.05) with corresponding keywords "student", "breakfast", and "nutrition".
  • After expansion, the first vector is (0.3, 0.1, 0, 0.2, 0) and the second vector is (0, 0.3, 0.25, 0, 0.05), where the keywords corresponding to both vectors are "school", "student", "breakfast", "course", and "nutrition".
  • Step 2 Calculate the distance between the first vector and the second vector.
  • the distance between the first vector and the second vector is the distance determined according to the first keyword vector and the second keyword vector.
  • the distance between the first vector and the second vector may be a Euclidean distance. Because the two voice information segments may share few keywords, many components of the expanded vectors are zero, and if a cosine distance were used, many terms in the calculation would be zero. Therefore, selecting the Euclidean distance as the distance between the first vector and the second vector may be more appropriate.
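  • A minimal sketch of the two steps above, reusing the example vectors; the keyword order of the merged set below differs from the order in the example, which does not change the distance:

    import math

    def expand(vec_a, keys_a, vec_b, keys_b):
        # Step 1: expand both keyword vectors onto the union of their keywords,
        # filling 0 for a keyword that does not occur in the other segment
        merged = list(dict.fromkeys(keys_a + keys_b))
        fa, fb = dict(zip(keys_a, vec_a)), dict(zip(keys_b, vec_b))
        return merged, [fa.get(k, 0.0) for k in merged], [fb.get(k, 0.0) for k in merged]

    def euclidean(a, b):
        # Step 2: the distance between the first vector and the second vector
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    merged, v1, v2 = expand([0.3, 0.2, 0.1], ["school", "course", "student"],
                            [0.3, 0.25, 0.05], ["student", "breakfast", "nutrition"])
    d = euclidean(v1, v2)  # compared against the preset distance threshold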
  • the distance between the first vector and the second vector may be a cosine distance.
  • the first keyword vector and the second keyword vector may also be term frequency-inverse document frequency, binary term frequency, etc.
  • Determining the distance between the first keyword vector and the second keyword vector may be determining the n-norm distance between them (n is a positive integer greater than or equal to 1), or determining the relative entropy distance between the first keyword vector and the second keyword vector.
  • In one implementation, the first vector and the second vector are binarized.
  • For example, the first vector after binarization is (1, 1, 0, 1, 0) and the second vector after binarization is (0, 1, 1, 0, 1); the 1-norm distance between them is then calculated to obtain the repetition degree of the keywords of the first voice information segment and the keywords of the second voice information segment.
  • the repetition of keywords can be considered a special form of distance.
  • the degree of repetition of the keywords can be used to determine whether the first voice information segment is similar to the second voice information segment.
  • If the repetition degree of the keywords is greater than or equal to a preset repetition degree, it can be considered that the first voice information segment is similar to the second voice information segment; if the repetition degree of the keywords is less than the preset repetition degree, it can be considered that the first voice information segment and the second voice information segment are not similar. It can be seen that in this case, the preset repetition degree can be regarded as the similarity threshold.
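  • A sketch of this variant, assuming the expanded vectors from the earlier example; the text does not fully fix how the 1-norm distance maps to a repetition degree, so the shared-keyword count below is one possible reading:

    def keyword_repetition(v1, v2):
        # binarize the expanded vectors and take the 1-norm distance;
        # e.g. (1,1,0,1,0) vs (0,1,1,0,1) gives a distance of 4
        b1 = [1 if x > 0 else 0 for x in v1]
        b2 = [1 if x > 0 else 0 for x in v2]
        d = sum(abs(x - y) for x, y in zip(b1, b2))
        # assumption: the number of shared keywords recovered from the distance
        shared = (sum(b1) + sum(b2) - d) // 2
        return d, shared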
  • Keywords may also be extracted according to the term frequency-inverse document frequency.
  • the word frequency can be determined based on formula 1.1.
  • the inverse document frequency can be determined according to the following formula:
  • IDF(n) = log(Num_Doc / (Doc(n) + 1)) (formula 1.2)
  • where IDF(n) represents the inverse document frequency of the nth word, Num_Doc represents the total number of documents in the corpus, and Doc(n) represents the number of documents in the corpus that contain the nth word.
  • Term frequency-inverse document frequency can be determined according to the following formula:
  • TF-IDF(n) = TF(n) × IDF(n)
  • where TF-IDF(n) represents the term frequency-inverse document frequency of the nth word. If keywords are determined based on term frequency-inverse document frequency, the first keyword vector is composed of the TF-IDF values of the keywords.
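  • A sketch of formulas 1.1 and 1.2 combined; the corpus is a hypothetical list of word lists, and the natural logarithm is an assumption (the text does not fix the log base):

    import math

    def tf(word, doc_words):
        # formula 1.1: TF(n) = N(n) / All_N
        return doc_words.count(word) / len(doc_words)

    def idf(word, corpus):
        # formula 1.2: IDF(n) = log(Num_Doc / (Doc(n) + 1))
        containing = sum(1 for doc in corpus if word in doc)
        return math.log(len(corpus) / (containing + 1))

    def tf_idf(word, doc_words, corpus):
        # TF-IDF(n) = TF(n) * IDF(n)
        return tf(word, doc_words) * idf(word, corpus)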
  • Keywords may also be extracted based on a word-graph-based text ranking (TextRank) method. If keywords are determined according to word-graph-based TextRank, the first keyword vector may be composed of the weights of the words.
  • the video segmentation device may determine the segmentation point according to the first voice information segment.
  • the video segmentation device may first determine whether a pause point is included in the first voice information segment. If the first voice information segment includes one pause point, it can be determined that the pause point is the segment point. If the first voice information segment includes multiple pause points, it can be determined whether the word after each of the multiple pause points is a preset word.
  • the preset words include conjunctions with segmentation meaning, such as "next", "below", "the next point", and so on.
  • the word after a pause point refers to the word that follows and is adjacent to the pause point. If the word after only one of the multiple pause points is a preset word, it can be determined that this pause point is the segment point.
  • If the words after at least two of the multiple pause points are preset words, it can be determined that the pause point with the longest pause duration among these at least two pause points is the segment point. If none of the words after the multiple pause points is a preset word, it can be determined that the pause point with the longest pause duration among the multiple pause points is the segment point. If the first voice information segment does not include a pause point, the segment point may be determined according to a pause point adjacent to the first voice information segment. It can be understood that there may be two pause points adjacent to the first voice information segment: one located before the first voice information segment and one located after it. The video segmentation device may determine the segment point according to the distances between these two pause points and the first voice information segment.
  • If the pause point is before the first voice information segment, the distance between the pause point and the first voice information segment may be the number of words or the time difference between the pause point and the start position of the first voice information segment. If the pause point is after the first voice information segment, the distance between the pause point and the first voice information segment may be the number of words or the time difference between the pause point and the end position of the first voice information segment.
  • The pause point located before and adjacent to the first voice information segment is referred to as the front pause point, and the distance between the front pause point and the first voice information segment is referred to as distance 1; the pause point located after and adjacent to the first voice information segment is referred to as the post pause point, and the distance between the post pause point and the first voice information segment is referred to as distance 2. If distance 1 is less than distance 2, the front pause point can be determined to be the segment point; if distance 1 is greater than distance 2, the post pause point can be determined to be the segment point.
  • Here, word 1 refers to the word after the front pause point, and word 2 refers to the word after the post pause point.
  • the pause point is a natural pause of the speaker, and therefore has a certain duration.
  • the intermediate moment of the pause point may be determined as the segment point.
  • the end time of the pause point may be determined as the segment point.
  • the starting moment of the pause point may be determined as the segment point.
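  • As an illustration only, the pause-point selection above can be sketched as follows; the (start, end) pause times, the PRESET_WORDS set, and the word_after helper are all hypothetical:

    PRESET_WORDS = {"next", "below"}  # hypothetical segmentation conjunctions

    def choose_pause(pauses, word_after):
        # pauses: list of (start_time, end_time) pause points in the segment;
        # word_after(p): the word adjacent to and after pause p (assumed helper)
        if len(pauses) == 1:
            return pauses[0]
        flagged = [p for p in pauses if word_after(p) in PRESET_WORDS]
        if len(flagged) == 1:
            return flagged[0]           # only one pause followed by a preset word
        candidates = flagged or pauses  # several flagged, or none flagged
        return max(candidates, key=lambda p: p[1] - p[0])  # longest pause wins

    # the segment point may then be, e.g., the middle moment of the chosen pause:
    # start, end = choose_pause(...); segment_point = (start + end) / 2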
  • the video segmentation device segments the to-be-processed video according to the segmentation point.
  • If the segment point is the first segment point of the video to be processed, the start time of the segment is the start time of the video to be processed and the end time of the segment is the segment point. If the segment point is the k-th segment point of the video to be processed (k is a positive integer greater than or equal to 2), the start time of the segment is the (k-1)-th segment point and the end time of the segment is the k-th segment point.
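  • Pairing consecutive segment points in this way can be sketched as follows; the trailing interval up to the video end is an assumption, since the text describes segments only up to the last segment point:

    def segment_intervals(video_start, video_end, segment_points):
        # first segment: video start -> first segment point;
        # k-th segment (k >= 2): (k-1)-th segment point -> k-th segment point
        bounds = [video_start] + sorted(segment_points) + [video_end]
        return list(zip(bounds[:-1], bounds[1:]))

    # segment_intervals(0.0, 300.0, [60.0, 180.0])
    # -> [(0.0, 60.0), (60.0, 180.0), (180.0, 300.0)]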
  • the video segmentation device can also determine the summary of the segment.
  • the video segmentation device may determine a summary of the segment according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text.
  • the target text includes at least one of the presentation and the content description information.
  • the video segmentation device may first determine a third keyword vector, and then determine the segment summary according to the third keyword vector.
  • the video segmentation device can determine the third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text, where the content of the segmented voice information refers to all the sentences that make up the voice information of the segment.
  • If the text information only includes the presentation, the target text includes the presentation; if the text information only includes the content description information, the target text includes the content description information; if the text information includes both the presentation and the content description information, the target text includes the presentation and the content description information.
  • The implementation manner in which the video segmentation device determines the keywords of the segmented voice information and the keywords of the target text is similar to the implementation manner in which the video segmentation device determines the keywords of the first voice information segment.
  • the video segmentation device can determine the keywords of the target text according to the later presentation. For example, suppose the video segmentation device determines that, between the presentation at time T1 and the presentation at time T2 (time T2 being later than time T1), the number of pixels at the same position whose value change exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation device may determine the keywords of the target text based on the presentation at time T2.
  • Assume that the number of keywords determined from the presentation is L, the number of keywords determined from the content description information is M, and the number of keywords determined from the segmented voice information is Q; there are no duplicate keywords among the L keywords, the M keywords, and the Q keywords.
  • Specifically, the video segmentation device may first determine M keywords from the content description information, and then determine the L most frequent words in the presentation. If one or more of the L words also belong to the M keywords, those words are deleted from the L words, and the device continues to take the next most frequent word from the presentation until the L determined keywords have no intersection with the M keywords. After that, the video segmentation device determines Q words from the segmented voice information. If one or more of the Q words belong to the M keywords or the L keywords, those words are deleted from the Q words, and the device continues to take the next most frequent word from the segmented voice information until the Q determined keywords do not overlap with the L keywords or the M keywords.
  • The third keyword vector consists of the frequencies, in the segmented voice information, of the Q keywords, the L keywords, and the M keywords. It is understandable that if the target text does not include the content description information, the value of M is 0; if the target text does not include the presentation, the value of L is 0.
  • the video segmentation device may determine the summary of the segment according to the determined third keyword vector.
  • Specifically, the video segmentation device may determine a reference text according to the target text and the content of the segmented voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determine J keyword vectors according to the keywords of the segmented voice information, the keywords of the target text, and each of the J sentences; and determine the summary of the segment according to the third keyword vector and the J keyword vectors.
  • the j-th keyword vector in the J keyword vectors is the frequency of occurrence of the keywords of the segmented voice information and the keywords of the target text in the j-th sentence.
  • If the target text includes redundant sentences, the redundant sentences are deleted, and the target text from which the redundant sentences have been deleted is combined with the content of the segmented voice information to obtain the reference text; if the target text does not include redundant sentences, the target text is directly combined with the content of the segmented voice information to obtain the reference text.
  • one or more sentences in the presentation may also appear in the content description information.
  • In this case, the video segmentation device may delete the one or more sentences in the presentation that are the same as sentences in the content description information, and then merge the presentation with the redundant sentences deleted, the content description information, and the content of the segmented voice information to obtain the reference text.
  • If the target text does not include redundant sentences, for example, no sentence in the presentation appears in the content description information, or the target text includes only one of the presentation and the content description information, then the target text can be directly combined with the content of the segmented voice information to obtain the reference text.
  • The video segmentation device determines the summary of the segment according to the third keyword vector and the J keyword vectors as follows: the video segmentation device determines J distances according to the third keyword vector and the J keyword vectors, where the j-th of the J distances is determined according to the third keyword vector and the j-th of the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determines the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determines the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
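  • A sketch of this summary step under the Euclidean-distance choice discussed next; the keyword vectors are assumed to follow the frequency scheme above, and restoring the original sentence order in the summary is an assumption:

    import math

    def summarize(third_vec, sentence_vecs, sentences, r):
        # j-th distance: Euclidean distance between the third keyword vector
        # and the j-th keyword vector (one vector per reference-text sentence)
        dists = [(math.sqrt(sum((a - b) ** 2 for a, b in zip(third_vec, v))), j)
                 for j, v in enumerate(sentence_vecs)]
        picked = sorted(j for _, j in sorted(dists)[:r])  # R shortest distances
        return [sentences[j] for j in picked]             # summary sentences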
  • The specific implementation in which the video segmentation device determines the j-th distance according to the third keyword vector and the j-th keyword vector is similar to the implementation in which it determines the distance according to the first keyword vector and the second keyword vector, with one difference: the j-th distance determined according to the third keyword vector and the j-th keyword vector is a Euclidean distance, while the distance determined according to the first keyword vector and the second keyword vector may be a Euclidean distance or a cosine distance.
  • The reason why the j-th distance determined from the third keyword vector and the j-th keyword vector cannot be a cosine distance is that computing a cosine distance normalizes the j-th keyword vector, but the modulus length of the j-th keyword vector precisely reflects the overall frequency of the keywords in sentence j, so it should not be normalized.
  • In the above description, the vectors are all composed of the frequencies (that is, the word frequencies) of the keywords in the specific text.
  • Alternatively, the aforementioned vectors may also be determined according to word vectors (word to vector, word2vec).
  • For example, the first keyword vector can be determined by the following steps: use word2vec to determine the word vector of each keyword; add the word vectors of all keywords and average them to obtain the first keyword vector.
  • the second keyword vector and the first keyword vector are determined in a similar manner, so there is no need to repeat them here.
  • The third keyword vector can be determined by the following steps: use word2vec to determine the word vector of each keyword; determine the word frequency of each keyword; and compute the weighted average of the word vectors of all keywords according to their word frequencies to obtain the third keyword vector.
  • The j-th keyword vector can be determined by the following steps: segment the j-th sentence into words and remove stop words; use word2vec to determine the word vector of each remaining word; and add all the word vectors and average them to obtain the j-th keyword vector.
  • In this case, the distance between the third keyword vector and the j-th keyword vector may be a cosine distance.
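  • A sketch of this word2vec-based alternative; wv can be any word-to-embedding mapping (e.g. a pretrained gensim KeyedVectors object — loading it is outside this sketch), and skipping out-of-vocabulary words is an assumption:

    import numpy as np

    def mean_vector(keywords, wv):
        # first/second keyword vectors: plain average of the word vectors
        return np.mean([wv[w] for w in keywords if w in wv], axis=0)

    def weighted_mean_vector(keywords, freqs, wv):
        # third keyword vector: word vectors weighted-averaged by word frequency
        vecs = np.array([wv[w] for w in keywords if w in wv])
        wts = np.array([freqs[w] for w in keywords if w in wv])
        return (vecs * wts[:, None]).sum(axis=0) / wts.sum()

    def cosine_distance(a, b):
        # with word2vec-based vectors, the cosine distance may be used
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))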
  • Fig. 4 is a schematic diagram of a conference process provided according to an embodiment of the present application.
  • the conference terminal 1 transmits audio and video stream 1 to the conference control server.
  • the conference terminal 2 transmits the audio and video stream 2 to the conference control server.
  • the conference terminal 3 transmits the audio and video stream 3 to the conference control server.
  • the conference control server determines the main conference site.
  • the main conference site determined by the conference control server is the conference site where the conference terminal 1 is located.
  • the conference control server sends the conference data to the conference terminal 2 and the conference terminal 3.
  • the conference terminal 2 and the conference terminal 3 store conference data.
  • the conference control server may also send conference data to the conference terminal 1, and the conference terminal 1 may also store the conference data.
  • the conference control server segments the audio and video stream 1 in real time (that is, determines the segment point) and extracts a summary of each segment.
  • the conference control server sends the segment point and summary to the conference terminal 2 and the conference terminal 3. In this way, the conference terminal 2 and the conference terminal 3 can independently select the review point to play the review video. Of course, in some implementations, the conference control server may also send the segment point and summary to the conference terminal 1.
  • Fig. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
  • the video segmentation device determines whether the meeting reservation includes text related to the meeting content. In other words, the video segmentation device can determine whether the to-be-processed video includes content description information. If the result of the determination is yes (that is, the video to be processed includes content description information), step 502 is executed; if the result of the determination is no (that is, the video to be processed does not include content description information), then step 503 is executed.
  • the video segmentation device extracts keywords related to the content of the conference. In other words, the video segmentation device determines the keywords of the content description information.
  • After step 502 is executed, step 503 may be executed.
  • the video segmentation device determines whether there is a screen displaying a presentation in the to-be-processed video. In other words, the video segmentation device can determine whether the to-be-processed video includes a presentation displayed on a screen. If the determination result is yes (that is, the to-be-processed video includes a presentation), step 504 is executed. If the determination result is no (that is, the to-be-processed video does not include a presentation), step 505 is executed.
  • In step 504, the video segmentation device determines the position of the screen displaying the presentation. After the video segmentation device determines the position of the screen, step 506 may be executed.
  • the video segmentation device determines whether there is a presentation transmitted through an auxiliary stream.
  • the conference speaker may not display the presentation on the screen, but upload the presentation to the conference control server through the auxiliary stream.
  • the conference terminal in the other conference site can obtain the presentation used by the conference speaker in the speech process according to the auxiliary stream. If the determination result is yes (that is, there is a presentation transmitted through the auxiliary stream), step 506 is executed. If the result of the determination is no (that is, there is no presentation transmitted through the auxiliary stream), the segmentation point of the to-be-processed video can be determined according to the voice information.
  • the video segmentation device determines whether the duration from the previous segment point to the current moment exceeds the first preset duration. If the video segmentation device determines that the duration from the previous segment point to the current moment is greater than the first preset duration (that is, the determination result is yes), step 507 is executed. If the video segmentation device determines that the duration from the previous segment point to the current moment is not greater than the first preset duration, step 508 is executed. It can be understood that, if the segment point determined by the video segmentation device is the first segment point, the previous segment point refers to the start time of the video to be processed. For ease of description, the length of time from the previous segment point to the current moment can be referred to as the presentation time length.
  • the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information.
  • the video segmentation device determines the segmentation point of the video to be processed according to the presentation and voice information.
  • For the specific implementation in which the video segmentation device determines the segmentation point of the to-be-processed video based on the presentation and the voice information, reference may be made to the embodiment shown in FIG. 3, which is not repeated here.
  • the video segmentation apparatus may perform step 509 and step 510 after determining the segmentation point of the video to be processed.
  • the video segmentation device determines segmented voice information and keywords of the segmented voice information, where the segmented voice information is the voice information between the previous segment point of the segment point and the segment point. It is understandable that if the segment point is the first segment point of the video to be processed, the segment voice information is the voice information from the start time of the video to be processed to the segment point.
  • the video segmentation device determines a segmented summary according to the segmented voice information, keywords of the segmented voice information, and keywords of the target text.
  • For the specific implementation of steps 509 and 510, reference may be made to the embodiment shown in FIG. 3, and details are not repeated here.
  • In the process of segmenting the video and extracting the summary, the video segmentation device can first determine whether there is a presentation displayed on a screen in the video to be processed, then determine whether text related to the meeting content is included in the meeting reservation, and finally determine whether a presentation is transmitted through the auxiliary stream.
  • Alternatively, the video segmentation device may first determine whether there is a presentation transmitted through the auxiliary stream, then determine whether text related to the meeting content is included in the meeting reservation, and finally determine whether a presentation is displayed on a screen in the video to be processed.
  • How the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information will be described below in conjunction with FIG. 6.
  • how the video segmentation apparatus determines the segmentation point of the to-be-processed video according to the voice information can also refer to FIG. 6.
  • Fig. 6 is a schematic flowchart of a method for video segmentation according to an embodiment of the present application.
  • the video segmentation device continuously intercepts voice information segments on the voice information with the window length W and the step size S.
  • the video segmentation device extracts keywords for each voice information segment. Specifically, the video segmentation device extracts N keywords from each voice information segment.
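  • A sketch of this windowed interception, treating the recognized voice information as a word sequence; whether W and S are counted in words or in seconds is an assumption, since the text does not fix the units:

    def intercept_segments(words, w, s):
        # slide a window of length W over the recognized word sequence with
        # step size S; each slice is one voice information segment
        last = max(len(words) - w, 0)
        return [words[i:i + w] for i in range(0, last + 1, s)]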
  • If the video segmentation device has extracted the keywords of the content description information, step 603 can be performed after step 602; if the video segmentation device has not extracted the keywords of the content description information, step 604 can be performed after step 602.
  • That the video segmentation device has extracted keywords of the content description information means that the video segmentation device has determined that the video to be processed includes the content description information. In this case, the segmentation point determined by the video segmentation device is determined based on the content description information and the voice information.
  • the video segmentation device has not extracted the keywords of the content description information, which means that the video segmentation device determines that the to-be-processed video does not include the content description information. In this case, the segmentation point determined by the video segmentation device is determined based on the voice information.
  • the video segmentation device determines the word frequency vector C_i, in the i-th voice information segment, of the keywords of the i-th voice information segment and the keywords of the content description information.
  • the method for determining the keyword in the i-th voice information segment refer to the embodiment shown in FIG. 3. Specifically, reference may be made to the method for determining the keyword of the first voice information segment in the embodiment shown in FIG. 3, which will not be repeated here.
  • For the method for determining the keywords of the content description information, reference may be made to the embodiment shown in FIG. 3, and it is not repeated here.
  • The implementation in which the video segmentation device determines the word frequency vector, in the i-th voice information segment, of the keywords of the i-th voice information segment and the keywords of the content description information is similar to the implementation of determining the keyword vectors described above, so there is no need to go into details here.
  • Alternatively, the video segmentation device determines the word frequency vector C_i, in the i-th voice information segment, of the keywords of the i-th voice information segment.
  • The method of determining this word frequency vector is similar to the method of determining the word frequency vector C_i of the keywords of the i-th voice information segment and the keywords of the content description information in the i-th voice information segment, so there is no need to repeat it here.
  • the video segmentation apparatus may perform step 605 and step 606 in sequence.
  • the video segmentation device determines the distance between C_i and C_(i-1).
  • C_(i-1) is the word frequency vector, determined by the video segmentation device, of the keywords of the (i-1)-th voice information segment (or the keywords of the (i-1)-th voice information segment and the keywords of the content description information) in the (i-1)-th voice information segment; the (i-1)-th voice information segment is the voice information segment before the i-th voice information segment.
  • If the distance between C_i and C_(i-1) is greater than the preset distance, the video segmentation device determines that the segment point is located in or adjacent to the i-th voice information segment, and the segment point can be determined according to a pause point.
  • If the distance is not greater than the preset distance, the word frequency vector of the next voice information segment can be determined and compared with the word frequency vector of the i-th voice information segment, and so on.
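  • Putting these steps together, a minimal streaming sketch (reusing the euclidean helper from the distance sketch above); assuming vectorize returns word frequency vectors over a shared keyword set so that consecutive vectors are comparable, and preset_distance is a hypothetical threshold:

    def scan_for_segment(segments, vectorize, preset_distance):
        # segments: voice information segments from the windowing step;
        # vectorize(seg) -> word frequency vector C_i for one segment
        prev = vectorize(segments[0])
        hits = []
        for i in range(1, len(segments)):
            cur = vectorize(segments[i])
            if euclidean(prev, cur) > preset_distance:
                hits.append(i)  # a segment point lies in or near segment i
            prev = cur
        return hits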
  • Fig. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • the video segmentation device 700 includes an acquisition unit 701 and a processing unit 702.
  • the acquiring unit 701 is configured to acquire text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed.
  • the processing unit 702 is configured to determine the segment point of the video to be processed according to the text information and the voice information.
  • the processing unit 702 is further configured to segment the to-be-processed video according to the segmentation point.
  • the specific functions and beneficial effects of the acquiring unit 701 and the processing unit 702 can be referred to the methods shown in FIG. 3 to FIG. 6, which will not be repeated here.
  • Fig. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • the video segmentation device 800 shown in FIG. 8 includes a processor 801, a memory 802, and a transceiver 803.
  • the processor 801, the memory 802, and the transceiver 803 communicate with each other through internal connection paths, and transfer control and/or data signals.
  • the method disclosed in the foregoing embodiment of the present application may be applied to the processor 801 or implemented by the processor 801.
  • the processor 801 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 801 or instructions in the form of software.
  • the aforementioned processor 801 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory or electrically erasable programmable memory, a register, or another storage medium mature in the field.
  • the storage medium is located in the memory 802, and the processor 801 reads the instructions in the memory 802 and completes the steps of the above method in combination with its hardware.
  • the memory 802 may store instructions for executing the method executed by the video segmentation apparatus in the methods shown in FIGS. 3 to 6.
  • the processor 801 can execute the instructions stored in the memory 802 in combination with other hardware (such as the transceiver 803) to complete the steps of the video segmentation device in the method shown in FIG. 3 to FIG. 6.
  • For the specific working process and beneficial effects, reference may be made to the descriptions in the embodiments shown in FIG. 3 to FIG. 6.
  • An embodiment of the present application also provides a chip, which includes a transceiver unit and a processing unit.
  • the transceiver unit may be an input/output circuit or a communication interface
  • the processing unit is a processor or microprocessor or integrated circuit integrated on the chip.
  • the chip can execute the method of the video segmentation device in the above method embodiment.
  • the embodiment of the present application also provides a computer-readable storage medium on which an instruction is stored, and when the instruction is executed, the method of the video segmentation device in the foregoing method embodiment is executed.
  • the embodiment of the present application also provides a computer program product containing instructions that, when executed, execute the method of the video segmentation device in the foregoing method embodiment.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present invention relates to a video segmentation method and device. In the method, a video segmentation device segments a video to be processed according to content description information for describing content of the video to be processed and/or a presentation presented in the video to be processed and/or voice information of the video to be processed, the content description information and the presentation being uploaded in advance. The technical solution can segment the video to be processed with reference to information other than the content of the video to be processed, which increases segmentation accuracy.
PCT/CN2020/083397 2019-05-07 2020-04-05 Procédé et dispositif de segmentation de vidéo WO2020224362A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910376477.2 2019-05-07
CN201910376477.2A CN111918145B (zh) 2019-05-07 2019-05-07 视频分段方法和视频分段装置

Publications (1)

Publication Number Publication Date
WO2020224362A1 true WO2020224362A1 (fr) 2020-11-12

Family

ID=73051391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083397 WO2020224362A1 (fr) 2019-05-07 2020-04-05 Procédé et dispositif de segmentation de vidéo

Country Status (2)

Country Link
CN (1) CN111918145B (fr)
WO (1) WO2020224362A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114051154A (zh) * 2021-11-05 2022-02-15 新华智云科技有限公司 一种新闻视频拆条方法和系统
CN114173191A (zh) * 2021-12-09 2022-03-11 上海开放大学 一种基于人工智能的多语言答疑方法和系统
CN114245229A (zh) * 2022-01-29 2022-03-25 北京百度网讯科技有限公司 一种短视频制作方法、装置、设备以及存储介质
CN114363695A (zh) * 2021-11-11 2022-04-15 腾讯科技(深圳)有限公司 视频处理方法、装置、计算机设备和存储介质
CN115209233A (zh) * 2022-06-25 2022-10-18 平安银行股份有限公司 视频播放方法以及相关装置、设备
CN118012979A (zh) * 2024-04-10 2024-05-10 济南宝林信息技术有限公司 一种普通外科手术智能采集存储系统


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187642A1 (en) * 2002-03-29 2003-10-02 International Business Machines Corporation System and method for the automatic discovery of salient segments in speech transcripts
CN102547139A (zh) * 2010-12-30 2012-07-04 北京新岸线网络技术有限公司 一种新闻视频节目切分方法、新闻视频编目方法及系统
US20130028574A1 (en) * 2011-07-29 2013-01-31 Xerox Corporation Systems and methods for enriching audio/video recordings
WO2013097101A1 (fr) * 2011-12-28 2013-07-04 华为技术有限公司 Procédé et dispositif d'analyse d'un fichier vidéo
CN104519401A (zh) * 2013-09-30 2015-04-15 华为技术有限公司 视频分割点获得方法及设备
CN104540044A (zh) * 2014-12-30 2015-04-22 北京奇艺世纪科技有限公司 一种视频分段方法及装置
CN106982344A (zh) * 2016-01-15 2017-07-25 阿里巴巴集团控股有限公司 视频信息处理方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU YINGYING, ZHOU DONGRU: "Vision, Speech and Text for Video Segmentation", COMPUTER ENGINEERING AND APPLICATIONS, no. 3, 1 February 2001 (2001-02-01), pages 85 - 87, XP055752287, ISSN: 1002-8331 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114051154A (zh) * 2021-11-05 2022-02-15 新华智云科技有限公司 一种新闻视频拆条方法和系统
CN114363695A (zh) * 2021-11-11 2022-04-15 腾讯科技(深圳)有限公司 视频处理方法、装置、计算机设备和存储介质
CN114363695B (zh) * 2021-11-11 2023-06-13 腾讯科技(深圳)有限公司 视频处理方法、装置、计算机设备和存储介质
CN114173191A (zh) * 2021-12-09 2022-03-11 上海开放大学 一种基于人工智能的多语言答疑方法和系统
CN114173191B (zh) * 2021-12-09 2024-03-19 上海开放大学 一种基于人工智能的多语言答疑方法和系统
CN114245229A (zh) * 2022-01-29 2022-03-25 北京百度网讯科技有限公司 一种短视频制作方法、装置、设备以及存储介质
CN114245229B (zh) * 2022-01-29 2024-02-06 北京百度网讯科技有限公司 一种短视频制作方法、装置、设备以及存储介质
CN115209233A (zh) * 2022-06-25 2022-10-18 平安银行股份有限公司 视频播放方法以及相关装置、设备
CN115209233B (zh) * 2022-06-25 2023-08-25 平安银行股份有限公司 视频播放方法以及相关装置、设备
CN118012979A (zh) * 2024-04-10 2024-05-10 济南宝林信息技术有限公司 一种普通外科手术智能采集存储系统

Also Published As

Publication number Publication date
CN111918145B (zh) 2022-09-09
CN111918145A (zh) 2020-11-10

Similar Documents

Publication Publication Date Title
WO2020224362A1 (fr) Procédé et dispositif de segmentation de vidéo
US10497382B2 (en) Associating faces with voices for speaker diarization within videos
US10037313B2 (en) Automatic smoothed captioning of non-speech sounds from audio
WO2021109678A1 (fr) Procédé et appareil de génération de vidéo, dispositif électronique et support de stockage
US9672827B1 (en) Real-time conversation model generation
US20170371496A1 (en) Rapidly skimmable presentations of web meeting recordings
WO2023011094A1 (fr) Procédé et appareil de montage vidéo, dispositif électronique et support de stockage
US11281707B2 (en) System, summarization apparatus, summarization system, and method of controlling summarization apparatus, for acquiring summary information
JP6339529B2 (ja) 会議支援システム、及び会議支援方法
CN111050201A (zh) 数据处理方法、装置、电子设备及存储介质
US20190199939A1 (en) Suggestion of visual effects based on detected sound patterns
US9563704B1 (en) Methods, systems, and media for presenting suggestions of related media content
CN105590627A (zh) 图像显示装置、用于驱动图像显示装置的方法和计算机可读记录介质
JP2016046705A (ja) 会議録編集装置、その方法とプログラム、会議録再生装置、および会議システム
US20200403816A1 (en) Utilizing volume-based speaker attribution to associate meeting attendees with digital meeting content
US20200151208A1 (en) Time code to byte indexer for partial object retrieval
JP6690442B2 (ja) プレゼンテーション支援装置、プレゼンテーション支援システム、プレゼンテーション支援方法及びプレゼンテーション支援プログラム
CN113301382B (zh) 视频处理方法、设备、介质及程序产品
US10123090B2 (en) Visually representing speech and motion
WO2023142590A1 (fr) Procédé et appareil de génération de vidéo en langue des signes, dispositif informatique et support de stockage
KR102226427B1 (ko) 호칭 결정 장치, 이를 포함하는 대화 서비스 제공 시스템, 호칭 결정을 위한 단말 장치 및 호칭 결정 방법
US11128927B2 (en) Content providing server, content providing terminal, and content providing method
EP4322090A1 (fr) Dispositif de traitement d'informations et procédé de traitement d'informations
CN116088675A (zh) 虚拟形象交互方法及相关装置、设备、系统和介质
KR20160055511A (ko) 리듬을 이용하여 동영상을 검색하는 장치, 방법 및 시스템

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20802337

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20802337

Country of ref document: EP

Kind code of ref document: A1