CN111918145B - Video segmentation method and video segmentation device


Info

Publication number
CN111918145B
Authority
CN
China
Prior art keywords
video
point
information
segment
segmentation
Prior art date
Legal status
Active
Application number
CN201910376477.2A
Other languages
Chinese (zh)
Other versions
CN111918145A (en)
Inventor
苏芸
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910376477.2A
Priority to PCT/CN2020/083397 (published as WO2020224362A1)
Publication of CN111918145A
Application granted
Publication of CN111918145B


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8549 Creating video summaries, e.g. movie trailer

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a video segmentation method and a video segmentation apparatus. The method includes: a video segmentation apparatus segments a video to be processed according to voice information of the video and at least one of content description information, uploaded in advance, that describes the content of the video, and a presentation demonstrated in the video. Because this solution can draw on information beyond the content of the video itself when segmenting, the accuracy of segmentation can be improved.

Description

Video segmentation method and video segmentation device
Technical Field
The present application relates to the field of information technology, and more particularly, to a video segmentation method and a video segmentation apparatus.
Background
To make a video easier to browse, a complete video may be divided into multiple segments, so that a user can jump directly to a segment of interest.
At present, a common video segmentation method segments a video based on text information in the video. The text information may be subtitles in the video or text obtained by performing speech recognition on the video. In other words, the basis for segmentation is currently derived entirely from the video itself. Moreover, such text-based segmentation needs the complete text information of the video. Because the video stream of a live broadcast is generated in real time, the complete text information is available only after the broadcast ends, so this method cannot segment live video in real time. In addition, because the method relies only on the text information of the video, the segmentation points it determines are not necessarily suitable ones.
Disclosure of Invention
The application provides a video segmentation method and a video segmentation device, which can improve the accuracy of video segmentation.
In a first aspect, an embodiment of the present application provides a video segmentation method, including: a video segmentation apparatus acquires text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video and content description information of the video; the video segmentation apparatus determines segmentation points of the video according to the text information and the voice information; and the video segmentation apparatus segments the video according to the segmentation points. Because this solution combines information other than the content of the video itself, the accuracy of segmentation can be improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, in a case that the text information includes the presentation, the video segmentation apparatus determining a segmentation point of the video to be processed according to the text information and the voice information includes: determining a switching point of the presentation, where the content of the presentation differs before and after the switching point; determining at least one pause point according to the voice information; and determining the segmentation point according to the switching point and the at least one pause point. A switch of the presentation often means that the speaker has moved on to new content. Dividing the video at presentation changes therefore yields reasonable segmentation points quickly. Moreover, determining a segmentation point requires only the switching point of the presentation and the pause points near it, not the complete video file. In other words, this solution can segment the video to be processed in real time and can therefore be applied to live video.
With reference to the first aspect, in a possible implementation manner of the first aspect, determining the segmentation point according to the switching point and the at least one pause point includes: determining the switching point as the segmentation point if the switching point coincides with one of the at least one pause point; and, if no pause point coincides with the switching point, determining the pause point closest to the switching point as the segmentation point.
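A minimal sketch of this rule (not part of the patent text; the function name and the representation of points as timestamps in seconds are assumptions):

```python
def choose_segmentation_point(switch_t: float, pause_points: list[float]) -> float:
    """Pick the segmentation point for a presentation switch at time switch_t.

    A pause point identical to switch_t has distance 0 and is therefore
    selected; otherwise the pause point closest in time wins (ties are
    broken arbitrarily here).
    """
    if not pause_points:
        return switch_t  # assumption: fall back to the switch point itself
    return min(pause_points, key=lambda p: abs(p - switch_t))
```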
With reference to the first aspect, in a possible implementation manner of the first aspect, determining the switching point of the presentation includes: determining, as the switching point, the moment at which a switching signal instructing the content of the presentation to switch is acquired.
With reference to the first aspect, in a possible implementation manner of the first aspect, the text information further includes the content description information, and before the video segmentation apparatus determines the segmentation point of the video to be processed according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
With reference to the first aspect, in a possible implementation manner of the first aspect, in a case that the text information includes the content description information, the video segmentation apparatus determining a segmentation point of the video to be processed according to the text information and the voice information includes: determining the segmentation point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information. The content description information is information describing the video to be processed that is input in advance by a user; it typically contains key information about the video, such as keywords and important content. Therefore, the key content of different parts of the video can be determined more accurately based on the content description information, so the video can be segmented more accurately.
With reference to the first aspect, in a possible implementation manner of the first aspect, determining a segmentation point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information includes: determining a first segmentation point according to a first voice information segment, a second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the segmentation points of the video to be processed include the first segmentation point. With this solution, the position of a segmentation point can be determined based only on the keywords of the content description information and the voice information of two adjacent video segments. The video segments can be divided with a fixed duration and step size, so video segments can be produced while the video is still playing. In this way, the video can be segmented without acquiring the complete video file; in other words, the video to be processed can be segmented in real time, so this solution can be applied to live video.
With reference to the first aspect, in a possible implementation manner of the first aspect, determining the first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information includes: determining the similarity of the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information; determining that the similarity of the two segments is smaller than a similarity threshold; and determining the first segmentation point according to the pause points in the voice information.
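The patent does not fix a specific similarity measure at this level of description. A plausible sketch, assuming bag-of-words keyword vectors with boosted weights for content-description keywords and cosine similarity:

```python
import math
from collections import Counter

def keyword_vector(words: list[str], description_keywords: set[str],
                   boost: float = 2.0) -> Counter:
    """Bag-of-words vector over a speech segment; words that also appear in
    the content description keywords are weighted more heavily (the boost
    factor is an assumption)."""
    vec = Counter()
    for w in words:
        vec[w] += boost if w in description_keywords else 1.0
    return vec

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# A segmentation point is looked for between two adjacent speech segments
# only when their similarity falls below the threshold:
#   if cosine_similarity(v1, v2) < SIMILARITY_THRESHOLD: find a pause point.
```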
With reference to the first aspect, in a possible implementation manner of the first aspect, determining the first segmentation point according to the pause points in the voice information includes: determining the first segmentation point according to at least one of the number of pause points in (or adjacent to) the first voice information segment, the pause durations, and the words adjacent to the pause points.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first voice information segment includes K pause points, or K pause points are adjacent to the first voice information segment. Determining the first segmentation point according to at least one of the number of pause points, the pause durations, and the words adjacent to the pause points includes: when K is equal to 1, determining that pause point as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words adjacent to the K pause points include one preset word, determining the pause point adjacent to the preset word as the segmentation point; when K is greater than or equal to 2 and the K words include at least two preset words, determining, among the pause points adjacent to those preset words, the pause point with the longest pause duration as the segmentation point; and when K is greater than or equal to 2 and none of the K words is a preset word, determining the pause point with the longest pause duration among the K pause points as the segmentation point.
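A minimal sketch of this selection rule (the data layout and names are assumptions; only the decision logic comes from the text above):

```python
from dataclasses import dataclass

@dataclass
class PausePoint:
    time: float          # position of the pause in the video, in seconds
    duration: float      # how long the speaker paused
    adjacent_word: str   # the word spoken next to the pause

def pick_pause_point(pauses: list[PausePoint], preset_words: set[str]) -> PausePoint:
    """Apply the K-pause-point rule: a lone pause wins; otherwise prefer
    pauses adjacent to preset words, breaking ties by pause duration."""
    if len(pauses) == 1:                          # K == 1
        return pauses[0]
    flagged = [p for p in pauses if p.adjacent_word in preset_words]
    if len(flagged) == 1:                         # exactly one preset word
        return flagged[0]
    if len(flagged) >= 2:                         # several: longest pause among them
        return max(flagged, key=lambda p: p.duration)
    return max(pauses, key=lambda p: p.duration)  # none: longest pause overall
```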
With reference to the first aspect, in a possible implementation manner of the first aspect, the text information further includes the presentation, and before the video segmentation apparatus determines the segmentation point of the video to be processed according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is greater than a first preset duration; or determining that the presentation duration of the current page of the presentation is less than or equal to a second preset duration. This avoids unsuitable segments caused by a presentation that stays unchanged for a long time or that changes very quickly.
With reference to the first aspect, in a possible implementation manner of the first aspect, the method further includes: the video segmentation apparatus determines a summary of a segment based on the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of a target text, where the target text includes at least one of the presentation and the content description information. With this summary, a user reviewing the video can quickly locate the position to replay. Moreover, because information other than the video itself is considered when building the summary, both the accuracy and the speed of summary generation can be improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, the video segmentation apparatus determining the summary of the segment according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of the target text includes: determining a third keyword vector according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of the target text; and determining the summary of the segment according to the third keyword vector.
With reference to the first aspect, in a possible implementation manner of the first aspect, the video segmentation apparatus determining the summary of the segment according to the third keyword vector includes: determining a reference text according to the target text and the segment's voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determining J keyword vectors according to the keywords of the segment's voice information, the keywords of the target text, and each of the J sentences; and determining the summary of the segment according to the third keyword vector and the J keyword vectors.
With reference to the first aspect, in a possible implementation manner of the first aspect, determining the reference text according to the target text and the segment's voice information includes: if the target text includes redundant sentences, deleting the redundant sentences to obtain a corrected target text, and combining the corrected target text with the segment's voice information to obtain the reference text; if the target text includes no redundant sentences, combining the target text with the segment's voice information to obtain the reference text.
With reference to the first aspect, in a possible implementation manner of the first aspect, determining the summary of the segment according to the third keyword vector and the J keyword vectors includes: determining J distances according to the third keyword vector and the J keyword vectors, where the jth of the J distances is determined from the third keyword vector and the jth keyword vector, and j is a positive integer greater than or equal to 1 and less than or equal to J; determining the R shortest of the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determining the summary of the segment, where the summary includes the sentences corresponding to the R distances.
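A sketch of this selection step, assuming dense keyword vectors and Euclidean distance (the text fixes no particular metric):

```python
import math

def pick_summary_sentences(segment_vec: list[float],
                           sentence_vecs: list[list[float]],
                           sentences: list[str], r: int) -> list[str]:
    """Choose the R sentences whose keyword vectors lie closest to the
    segment's keyword vector."""
    def dist(a: list[float], b: list[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(range(len(sentences)),
                    key=lambda j: dist(segment_vec, sentence_vecs[j]))
    chosen = sorted(ranked[:r])  # restore original sentence order in the summary
    return [sentences[j] for j in chosen]
```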
With reference to the first aspect, in a possible implementation manner of the first aspect, the video to be processed is a real-time video stream, and the voice information of the video to be processed is the voice information of the real-time video stream from its start time, or from the last segmentation point, to the current time. This enables real-time segmentation: the entire content of the video does not have to be acquired before segmenting, so live video can be segmented as it is broadcast.
In a second aspect, an embodiment of the present application provides a video segmentation apparatus that includes means for performing the first aspect or any one of the possible implementations of the first aspect.
Alternatively, the video segmentation apparatus of the second aspect may be a computer device, or may be a component (e.g., a chip or a circuit, etc.) usable with a computer device.
In a third aspect, an embodiment of the present application provides a storage medium, where the storage medium stores instructions for implementing the method described in the first aspect or any one of the possible implementation manners of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the method according to the first aspect or any one of the possible implementation manners of the first aspect.
Drawings
FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by an embodiment of the present application may be applied;
FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by an embodiment of the present application may be applied;
FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a video conference flow provided according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
FIG. 7 is a block diagram of a video segmentation apparatus provided according to an embodiment of the present application;
FIG. 8 is a block diagram of a video segmentation apparatus provided according to an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
In this application, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a alone, A and B together, and B alone, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a. b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c can be single or multiple. In addition, in the embodiments of the present application, the words "first", "second", and the like do not limit the number and the execution order.
It is noted that, in the present application, words such as "exemplary" or "for example" are used to mean exemplary, illustrative, or descriptive. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
Various aspects or features of the present application may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer-readable media may include, but are not limited to: magnetic storage devices (e.g., hard disk, floppy disk, or magnetic strips), optical disks (e.g., Compact Disk (CD), Digital Versatile Disk (DVD), or the like), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), card, stick, or key drive, or the like). In addition, various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, without being limited to, wireless channels and various other media capable of storing, containing, and/or carrying instruction(s) and/or data.
FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by the present application can be applied. FIG. 1 shows a video conference system including a conference control server 101, a conference terminal 111, a conference terminal 112, and a conference terminal 113. The conference terminal 111, the conference terminal 112, and the conference terminal 113 can establish a conference through the conference control server 101.
A video conference typically includes at least two conference sites. Each site accesses the conference control server through one conference terminal. The conference terminal is a device for accessing the video conference; it may be configured to receive conference data and present the conference content on a display device according to that data. The conference terminal may include a host and a display device. The host can receive conference data through a communication interface, generate a video signal from the received data, and output the video signal to the display device in a wired or wireless manner; the display device presents the conference content according to the received video signal. Alternatively, in some embodiments, the display device may be built into the host. For example, the conference terminal may be an electronic device with a built-in display, such as a notebook computer, a tablet computer, or a smartphone. Alternatively, in other embodiments, the display device may be external to the host. For example, the host may be a desktop computer and the display device a monitor, a television, or a projector. Even a host with a built-in display may use an external display device to present the conference content; for example, the host may be a notebook computer connected to an external monitor, television, or projector.
In some cases, a video conference may include a main site and at least one branch site. In this case, a conference terminal in the main site (for example, the conference terminal 111) may upload the media stream collected at the main site to the conference control server 101. The conference control server 101 may generate conference data from the received media stream and transmit the conference data to the conference terminals in the branch sites (e.g., the conference terminal 112 and the conference terminal 113). The conference terminal 112 and the conference terminal 113 can then present the conference content on their display devices according to the received conference data.
In other cases, the sites in a video conference may have no main/branch distinction. The conference terminal at each site may upload its collected media stream to the conference control server 101. For example, assume that the conference terminal 111 accesses the video conference from site 1, the conference terminal 112 from site 2, and the conference terminal 113 from site 3. The conference terminal 111 may upload the media stream collected at site 1 to the conference control server 101; the conference control server 101 may generate conference data 1 from this media stream and send it to the conference terminal 112 and the conference terminal 113, which present the conference content on their display devices according to conference data 1. Similarly, the conference terminal 112 may upload the media stream collected at site 2, from which the conference control server 101 generates conference data 2 for the conference terminal 111 and the conference terminal 113; and the conference terminal 113 may upload the media stream collected at site 3, from which the conference control server 101 generates conference data 3 for the conference terminal 111 and the conference terminal 112.
Optionally, in some embodiments, the media stream may be an audio stream. Optionally, in other embodiments, the media stream may be a video stream. The media device responsible for collecting the media stream may be built in the conference terminal (for example, a camera and a microphone in the conference terminal), or may be external to the conference terminal, which is not limited in this embodiment of the application.
Optionally, in some embodiments, the presenter of the conference uses a presentation while speaking. In this case, the media stream may be the audio stream of the speaker's speech. The presentation used by the speaker may be uploaded to the conference control server 101 via an auxiliary stream (also referred to as a data stream or computer screen stream). The conference control server 101 generates conference data from the received audio stream and auxiliary stream. Optionally, the conference data may include the received audio stream and the auxiliary stream as-is. Optionally, the conference data may instead include a processed audio stream, obtained by processing the received audio stream, together with the auxiliary stream; the processing may be a transcoding operation, for example reducing the bitrate of the audio stream in order to reduce the amount of data needed to transmit it to other conference terminals. Optionally, the conference data may include the received audio stream, an audio stream at a different bitrate, and the auxiliary stream. A conference terminal may then select an appropriate audio stream based on network conditions and/or the way it accesses the conference. For example, a terminal with good network conditions, or one accessing the conference over Wi-Fi, can select the higher-bitrate audio stream for clearer sound; a terminal with poor network conditions can select a lower-bitrate stream to reduce interruptions of the live conference; and a terminal accessing the conference over a mobile network can select a lower-bitrate stream to reduce traffic consumption. Optionally, in addition to the audio stream(s) and the auxiliary stream, the conference data may include subtitles corresponding to the speaker's speech. The subtitles can be generated by speech-to-text conversion based on speech recognition, by manually transcribing the speech, or by manually correcting the output of speech-to-text conversion.
Optionally, in other embodiments, the media stream may be a video stream of the speaker, i.e., it may include both the sound and the picture of the speaker while speaking. Accordingly, the media stream uploaded to the conference control server 101 is this video stream. In some cases, the speaker uses a presentation and shows it with an output device (e.g., a projector or a television); the picture then includes the presentation, so the presentation is contained in the video stream uploaded to the conference control server 101, and the conference control server 101 may determine the conference data directly from the video stream. In other cases, the presentation may be uploaded to the conference control server 101 via an auxiliary stream, and the conference control server 101 may generate the conference data from the captured video stream and the auxiliary stream. Optionally, the conference data may include the captured video stream and the auxiliary stream as-is; or a processed video stream obtained by transcoding the captured stream (for example reducing its resolution to reduce the amount of data to transmit), together with the auxiliary stream; or the captured video stream, a video stream at a different resolution, and the auxiliary stream. A conference terminal can then select an appropriate video stream according to network conditions and/or the way it accesses the conference: a higher resolution for good networks or Wi-Fi access gives a clearer picture, while a lower resolution for poor networks or mobile-network access reduces interruptions and traffic consumption, respectively. Optionally, in addition to the video stream(s) and the auxiliary stream, the conference data may include subtitles corresponding to the speaker's speech, generated by speech-to-text conversion, by manual transcription, or by manually correcting the output of speech-to-text conversion.
FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by the present application may be applied. FIG. 2 shows a distance education system including a course server 201, a main device 211, a client device 212, and a client device 213.
The main device 211 may upload the collected media stream to the course server 201. The course server 201 can generate course data from the media stream and transmit it to the client device 212 and the client device 213, which can present the course content on their display devices based on the received course data.
The main device 211 may be a laptop or desktop computer. The client devices 212 and 213 may be laptops, desktops, tablets, smartphones, and the like.
Optionally, in some embodiments, the teacher giving the lecture uses a presentation. In this case, the media stream may be the audio stream of the lecture. The presentation used by the teacher may be uploaded to the course server 201 via the auxiliary stream. The course server 201 generates course data from the received audio stream and auxiliary stream.
Alternatively, in other embodiments, the media stream may be a video stream of the teacher during the lecture; in other words, it may include both the sound and the picture of the teacher. Accordingly, the media stream uploaded to the course server 201 is this video stream. In some cases, the teacher uses a presentation and shows it with an output device (e.g., a projector or a television); the picture then includes the presentation, so the presentation is contained in the video stream uploaded to the course server 201, and the course server 201 can determine the course data directly from the video stream. In other cases, the presentation may be uploaded to the course server 201 via an auxiliary stream, and the course server 201 can generate the course data from the captured video stream and the auxiliary stream.
The specific content of the course data is similar to that of the conference data and is not described again for brevity.
FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application. The method illustrated in FIG. 3 may be performed by a video segmentation apparatus. The video segmentation apparatus may be a computer device capable of implementing the method provided by the embodiments of the present application, such as a personal computer, a notebook computer, a tablet computer, or a server; it may be hardware arranged inside a computer device, such as a video card or a Graphics Processing Unit (GPU); or it may be a dedicated apparatus for implementing the method. For example, in some embodiments, the video segmentation apparatus may be the conference control server 101 in the system shown in FIG. 1, or a piece of hardware inside the conference control server 101. In other embodiments, it may be the conference terminal that uploads the media stream in the system shown in FIG. 1, or a piece of hardware in that conference terminal. In other embodiments, it may be the main device 211 in the system shown in FIG. 2, or a piece of hardware inside the main device 211. In still other embodiments, it may be the course server 201 in the system shown in FIG. 2, or a piece of hardware in the course server 201.
For convenience of description, it is assumed that the method shown in FIG. 3 is applied to the system shown in FIG. 1.
301, a video segmentation apparatus obtains text information of a video to be processed and voice information of the video to be processed, wherein the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed.
The presentation is the one used by a speaker of the conference while speaking. The file format of the presentation is not limited in this embodiment; any presentation displayed on the display device during the speech qualifies. For example, the presentation may be in ppt or pptx format, in PDF format, or in word or txt format.
The content description information is information for describing the content of the utterance, which is uploaded by a speaker of the conference or a moderator of the conference before the conference is started. Optionally, in some embodiments, the content description information includes synopsis, abstract and/or key information of the content of the speaker in the video conference. For example, the content description information may include a keyword of the speech content of the speaker. As another example, the content description information may include a summary of the speaking content of the speaker. As another example, the speaking content of the speaker may include a plurality of portions, and the content description information may include a subject, a summary and/or keywords of each of the plurality of portions.
The voice information may include the text obtained by performing speech-to-text conversion on the speaker's utterances. The specific implementation of the speech-to-text conversion is not limited in this embodiment, as long as the recognized speech can be converted into corresponding text. The voice information may also include at least one pause point obtained by performing speech recognition on the speaker's utterances; a pause point represents a natural pause of the speaker while speaking.
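One possible in-memory shape for this voice information (the field names are illustrative assumptions, not the patent's):

```python
from dataclasses import dataclass, field

@dataclass
class SpeechInfo:
    text: str                                                 # transcript from speech-to-text conversion
    pause_points: list[float] = field(default_factory=list)   # times of natural pauses, in seconds
```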
302, the video segmenting device determines the segmentation point of the video to be processed according to the text information and the voice information.
As described above, the text information may include at least one of the presentation and the content description information. In other words, the text information may have the following three cases:
case 1: only the presentation is included in the text information;
case 2: only the content description information is included in the text information;
case 3: the text information includes the presentation and the content description information.
In other words, in some cases the speaker only presents the presentation during the talk, without uploading content description information in advance; case 1 then occurs. In other cases, the speaker only uploads content description information in advance, without presenting a presentation; case 2 then occurs. In still other cases, the speaker both presents a presentation during the talk and uploads content description information in advance; case 3 then occurs.
For case 1 above, the video segmentation apparatus may determine the segmentation points of the video to be processed according to the presentation.
For the above case 2, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information.
Alternatively, in some embodiments, for case 3 above, the video segmentation apparatus may determine the segmentation point of the video to be processed according to one of the presentation and the content description information. In other words, in a case where the presentation and the content description information are included in the text information, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the presentation or the content description information.
Alternatively, in some embodiments, in a case where the text information includes both the presentation and the content description information, the video segmentation apparatus may determine the presentation duration of the current page of the presentation and, based on that duration, decide whether to determine the segmentation point of the video to be processed according to the presentation or according to the content description information.
Optionally, in some embodiments, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information when the presentation duration of the current page of the presentation is greater than a first preset duration. In this way, a video segment that is too long because the speaker presents the same content for a long time can be avoided. The first preset duration can be set as needed; for example, it may be 10 minutes or 15 minutes.
Optionally, in some embodiments, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information when the presentation duration of the current page of the presentation is less than or equal to a second preset duration. In this way, a video segment that is too short because the speaker frequently switches the displayed content can be avoided. Like the first preset duration, the second preset duration can be set as needed; for example, it may be 20 seconds or 10 seconds.
The first preset duration is longer than the second preset duration.
Optionally, in some embodiments, the video segmenting device may determine the segmentation point of the video to be processed according to the presentation and the voice information when the presentation duration of the current page of the presentation is greater than the second preset duration and less than or equal to the first preset duration.
Optionally, in other embodiments, only the first preset duration may be set. If the presentation duration of the current page exceeds the first preset duration, the segmentation point of the video to be processed is determined according to the content description information and the voice information; otherwise, it can be determined according to the presentation and the voice information. The presentation duration of the current page of the presentation is how long the presentation stays on the current page.
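Summarizing the duration rules above as a sketch (the function name and string labels are illustrative assumptions; the default thresholds are merely the example values from the text):

```python
def choose_segmentation_basis(page_duration_s: float,
                              first_preset_s: float = 600.0,
                              second_preset_s: float = 20.0) -> str:
    """Decide which information to segment on, per the duration rules above."""
    if page_duration_s > first_preset_s:       # page unchanged for too long
        return "content description + speech"
    if page_duration_s <= second_preset_s:     # page switched too quickly
        return "content description + speech"
    return "presentation + speech"             # duration in (second, first]
```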
Optionally, in some embodiments, the starting time of the presentation duration of the current page of the presentation is the time when the presentation is switched to the current page, and the ending time of the presentation duration of the current page of the presentation is the time when the presentation is switched from the current page to another page.
For example, if the presentation switches to the nth page (n is a positive integer greater than or equal to 1) at time T1, the video segmentation apparatus can start timing from T1. If the timed duration exceeds the first preset duration and the presentation has still not switched to the (n+1)th page, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the voice information. If the presentation switches to the (n+1)th page at time T2 (T2 later than T1) and the duration from T1 to T2 is less than or equal to the second preset duration, the video segmentation apparatus may determine the segmentation point according to the content description information and the voice information. If the duration from T1 to T2 is less than or equal to the first preset duration and greater than the second preset duration, the video segmentation apparatus may determine the segmentation point according to the presentation and the voice information; more specifically, according to the nth page of the presentation and the voice information.
Optionally, in other embodiments, the starting time of the presentation time duration of the current page of the presentation may be a last segmentation point, and the ending time of the presentation time duration of the current page of the presentation is a time when the presentation is switched from the current page to other pages.
For example, assume the presentation switches to the nth page (n is a positive integer greater than or equal to 1) at time T3 and stays on the nth page for longer than the first preset duration. In this case, the video segmentation apparatus determines, based on the content description information and the voice information, a segmentation point at time T4, and may start timing from T4. If the timed duration exceeds the first preset duration and the presentation has still not switched to the (n+1)th page, the video segmentation apparatus may determine the next segmentation point according to the content description information and the voice information. If the presentation switches to the (n+1)th page at time T5 (T5 later than T4) and the duration from T4 to T5 is not greater than the first preset duration and greater than the second preset duration, the video segmentation apparatus may determine the segmentation point according to the presentation and the voice information; more specifically, according to the nth page of the presentation and the voice information.
Alternatively, in other embodiments, in the case that the text information includes the presentation and the content description information, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the presentation and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation apparatus can determine the segmentation point of the video to be processed with reference to only the presentation and the voice information (i.e., without using the content description information).
Alternatively, in other embodiments, in the case that the text information includes the presentation and the content description information, the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information and the speech information. In other words, even if text information includes both the presentation and the content description information, the video segmentation apparatus can determine the segmentation points of the video to be processed with reference to only the content description information and the speech information (i.e., without using the presentation).
The video segmentation apparatus determining the segmentation point of the video to be processed according to the presentation and the voice information may include: the video segmentation apparatus determines a switching point of the presentation, where the content of the presentation differs before and after the switching point; the video segmentation apparatus determines at least one pause point based on the voice information; and the video segmentation apparatus determines the segmentation point based on the switching point and the at least one pause point.
The switching point of the presentation is the moment at which the presentation switches. A switch may be a page turn of the presentation, for example from page 1 to page 2. A switch may also be a change of the displayed content without a page turn: for example, where the presentation is a text document, the speaker may first show only part of a page (e.g., the top half) and then scroll to the rest of the page (e.g., the bottom half). Although no page is turned, the displayed content of the presentation has changed.
Alternatively, in some embodiments, the video segmentation apparatus may acquire a switch signal for instructing to switch the content of the presentation. In this case, the video segmentation apparatus may determine that the timing at which the switching signal is acquired is the switching point.
Optionally, in some embodiments, the video segmentation apparatus may retrieve the content of the presentation. In this case, the video segmentation apparatus may determine the switching point according to a change in the content of the presentation. For example, the video segmentation apparatus may determine that the first time is the switching point when it is determined that the content of the presentation presented by the video to be processed at the first time is different from the content of the presentation presented at the second time. Optionally, in some embodiments, the first time and the second time are adjacent times, and the first time is before the second time. Optionally, in other embodiments, the first time is before the second time and the first time is separated from the second time by less than a preset time. In other words, in this case, the video segmentation apparatus may detect whether the content of the presentation has changed at intervals.
Alternatively, in some embodiments, the video segmentation apparatus may determine the switching point by combining the switching signal, which indicates that the content of the presentation switches, with the content actually presented. For example, the video segmentation apparatus acquires the switching signal at time T1. It can then examine the content presented in the F1 frames before T1 and in the F2 frames after T1, where F1 and F2 are positive integers greater than or equal to 1. Optionally, F1 and F2 may take small values, e.g., both equal to 2, which reduces the amount of computation. If two consecutive frames among these F1 + F2 frames present different content, the time of the frame at which the presentation content changes can be determined as the switching point. For example, suppose F1 and F2 are both 2; if the contents of the 2nd and the 3rd of the four frames differ, the time of the 2nd frame may be determined as the switching point. Using the switching signal together with the content actually presented avoids inaccurate switching points caused by the switching signal and the presentation picture being out of sync.
Optionally, in some embodiments, the video segmentation apparatus may determine whether the content of the presentation presented at different times (or in different frames) is the same as follows: the video segmentation apparatus counts the number P of co-located pixels whose pixel-value change exceeds a preset change value; if P is greater than a first preset threshold P1, the video segmentation apparatus determines that the presented content has changed. Alternatively, in some embodiments, the change of a pixel value may be measured as the absolute value of the difference of the pixel gray values. Alternatively, in other embodiments, it may be measured as the sum of the absolute values of the differences over the three color channels.
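By way of a minimal illustrative sketch (not part of the claimed method), the grayscale variant of this comparison could be implemented as follows, assuming the two frames are NumPy arrays of equal shape and that the function name and threshold values are placeholders:

    import numpy as np

    def presentation_changed(frame_a, frame_b, change_value=30, p1=5000):
        # Count co-located pixels whose absolute gray-value difference
        # exceeds the preset change value, then compare against P1.
        diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
        p = int((diff > change_value).sum())
        return p > p1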
Optionally, in some embodiments, if P is greater than a second preset threshold P2 (P2 being less than P1), the video segmentation apparatus may determine the keywords from the later presentation content. For example, the video segmentation apparatus determines that between the presentation at time T1 and the presentation at time T2 (time T2 being later than time T1), the number of co-located pixels whose change exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation apparatus may determine the keywords based on the presentation at time T2.
As described above, the voice information may include at least one pause point. Optionally, in some embodiments, the at least one pause point used for determining the segmentation point may be all pause points from a start time to the current time. If the segmentation point determined in step 302 is the first segmentation point of the video to be processed, the start time is the start time of the video to be processed. If the segmentation point determined in step 302 is the kth segmentation point of the video to be processed (k being a positive integer greater than or equal to 2), the start time is the time of the (k-1)th segmentation point. Optionally, in other embodiments, the video segmentation apparatus may instead consider only the pause points within a time range containing the switching point. For example, if the switching point is at time T1, the video segmentation apparatus may consider the pause points between time T1 - t and time T1 + t.
The video segmentation apparatus determines the switching point to be the segmentation point if the switching point coincides with one of the at least one pause point. If the switching point coincides with none of the at least one pause point, the video segmentation apparatus determines the pause point closest to the switching point as the segmentation point, where the distance between a pause point and the switching point is the time difference between them. For example, assume the switching point is at time T1 and one of the at least one pause point is at time T2, the difference between T2 and T1 being t. If every other pause point differs from T1 by more than t, the pause point at time T2 is the segmentation point. If two of the pause points are equally distant from the switching point and closer to it than all other pause points, either of those two pause points may be determined as the segmentation point.
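A possible sketch of this selection rule, with times expressed in seconds and all names hypothetical:

    def choose_segmentation_point(switch_point, pause_points):
        # If the switching point coincides with a pause point, use it directly.
        if switch_point in pause_points:
            return switch_point
        # Otherwise take the pause point with the smallest time difference;
        # on a tie, either candidate is acceptable (min keeps the first seen).
        return min(pause_points, key=lambda p: abs(p - switch_point))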
Where the text information includes the content description information, the video segmentation apparatus determining the segmentation point of the video to be processed according to the content description information and the voice information may include: the video segmentation apparatus determines the segmentation point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information.
Optionally, in some possible implementations, the video to be processed may be divided into a plurality of voice information segments. The first and second pieces of speech information are two consecutive pieces of speech information of the plurality of pieces of speech information. The first segment of speech information follows the second segment of speech information. The video segmentation apparatus may determine a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information, and a stop point in the voice information, where the first segmentation point is one of at least one segmentation point included in the video to be processed.
The video segmentation apparatus may intercept text segments from the speech information using a window of length W and a step size S, obtaining at least one text segment of length W. Each text segment of length W is one speech information segment.
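A sliding-window sketch, assuming the speech information has already been recognized into a list of words and that W and S are measured in words (the application does not fix the unit):

    def speech_information_segments(words, window_w, step_s):
        # Intercept overlapping text segments of length W with step S.
        segments = []
        for start in range(0, max(len(words) - window_w + 1, 1), step_s):
            segments.append(words[start:start + window_w])
        return segments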
The video segmentation apparatus may determine whether the first segment of voice information is similar to the second segment of voice information. If the first segment of speech information is not similar to the second segment of speech information, a segmentation point of the video to be processed may be determined to be near the first segment of speech information. If the second speech information segment is similar to the first speech information segment, then a determination is made as to whether a third speech information segment adjacent to and following the first speech information segment is similar to the first speech information segment.
The similarity may be used as a criterion for measuring whether the first piece of speech information is similar to the second piece of speech information. If the similarity between the first voice message segment and the second voice message segment is greater than or equal to a similarity threshold, the first voice message segment is considered to be similar to the second voice message segment; if the similarity between the first speech information segment and the second speech information segment is smaller than the similarity threshold, the first speech information segment and the second speech information segment are considered to be dissimilar.
Optionally, in some possible implementations, the video segmenting device may determine the similarity between the first voice message segment and the second voice message segment according to the keyword of the first voice message segment, the keyword of the second voice message segment, the content of the first voice message segment, the content of the second voice message segment, and the keyword of the content description information.
The video segmentation means may determine keywords of the first speech information segment. Assuming that the number of keywords determined from the first voice message segment is N, the number of keywords determined from the content description information is M, and there are no duplicate keywords in the M keywords and the N keywords.
The video segmentation apparatus may determine the keyword according to the following manner:
Step 1: remove words that do not carry actual meaning, such as 'the' and 'then', according to a preset stop-word list or according to the part of speech of each word in the text. Stop words are manually curated (not automatically generated) words or phrases that carry no actual meaning and can be filtered out before or after natural-language data are processed. A set of stop words is referred to as a stop-word list.
Step 2: count the frequency with which each of the remaining words occurs in the text. The frequency of each word may be determined according to the following formula:
TF(n) = N(n) / All_N (Equation 1.1)
where TF(n) represents the frequency in the text of the nth of the words remaining after step 1, N(n) represents the number of times the nth word occurs, and All_N represents the total number of remaining words.
Step 3: determine the one or more words with the highest frequency of occurrence as the keywords of the text.
For example, if the text is the content description information, the M words with the highest frequency of occurrence may be determined as its keywords, where M is a positive integer greater than or equal to 1. If the text is the first speech information segment, the N words with the highest frequency of occurrence may be determined as its keywords, where N is a positive integer greater than or equal to 1. If one or more of those N words are the same as keywords of the content description information, the duplicated words are removed from the N words and the next most frequent words are selected instead as keywords of the first speech information segment. For example, assume N equals 2, M equals 1, and the keyword of the content description information is 'student'. If the most frequent word in the first speech information segment is 'student', the next most frequent word is considered instead. If the second most frequent word is 'school', 'school' is determined to be a keyword of the first speech information segment and the third most frequent word is considered next. If the third most frequent word is 'lesson', 'lesson' is determined to be the other keyword of the first speech information segment. The second speech information segment is handled in the same way: its N most frequent words are determined as its keywords, and any of them that duplicate keywords of the content description information are removed and replaced by the next most frequent words.
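The three steps above might be sketched as follows; the stop-word list is a placeholder, and the exclusion set implements the removal of words that duplicate keywords already chosen from the content description information:

    from collections import Counter

    STOP_WORDS = {"the", "then", "a", "of", "and"}  # illustrative only

    def top_keywords(words, n, exclude=()):
        # Step 1: drop stop words; step 2: count frequencies;
        # step 3: take the n most frequent words not yet used elsewhere.
        counts = Counter(w for w in words if w not in STOP_WORDS)
        keywords = []
        for word, _ in counts.most_common():
            if word not in exclude:
                keywords.append(word)
            if len(keywords) == n:
                break
        return keywords

    # e.g. M keywords from the content description information first, then
    # N non-overlapping keywords from the first speech information segment:
    # m_keys = top_keywords(description_words, m)
    # n_keys = top_keywords(segment_words, n, exclude=set(m_keys))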
Optionally, in some embodiments, the video segmentation apparatus may determine a first keyword vector from the keywords of the first speech information segment, the keywords of the content description information, and the content of the first speech information segment. Specifically, the first keyword vector consists of the frequencies with which the keywords of the first speech information segment and the keywords of the content description information occur in the content of the first speech information segment, where the content of a speech information segment means all the words it comprises. For example, assume the keyword of the content description information is 'student' and the keywords of the first speech information segment are 'course' and 'school'. If the three keywords occur in the first speech information segment with frequencies 0.1, 0.2 and 0.3, respectively, the first keyword vector is (0.3, 0.2, 0.1).
Similarly, the video segmentation apparatus may determine a second keyword vector from the keywords of the second speech information segment, the keywords of the content description information, and the content of the second speech information segment. Specifically, the second keyword vector consists of the frequencies with which the keywords of the second speech information segment and the keywords of the content description information occur in the content of the second speech information segment. For example, assume the keyword of the content description information is 'student' and the keywords of the second speech information segment are 'breakfast' and 'nutrition'. If the three keywords occur in the second speech information segment with frequencies 0.3, 0.25 and 0.05, respectively, the second keyword vector is (0.3, 0.25, 0.05).
The video segmentation device determines a distance according to the first keyword vector and the second keyword vector, and if the distance is greater than a preset distance, the similarity between the first voice message segment and the second voice message segment is considered to be smaller than a similarity threshold. In this case, the video segmentation means determines the segmentation point from the first speech information segment.
The video segmentation means may determine a distance based on the first keyword vector and the second keyword vector according to the following manner:
Step 1: expand the first keyword vector into a first vector and the second keyword vector into a second vector, where the first vector and the second vector correspond to the same set of keywords, namely the keywords of the first speech information segment, the keywords of the second speech information segment, and the keywords of the content description information, with no keyword repeated.
For example, assume that the first keyword vector is (0.3,0.2,0.1), the corresponding keywords are "school", "course", and "student", assume that the second keyword vector is (0.3,0.25,0.05), and the corresponding keywords are "student", "breakfast", and "nutrition". In this case, the first vector is (0.3,0.1,0,0.2,0), the corresponding keywords are "school", "student", "breakfast", "course", "nutrition", the second vector is (0,0.3,0.25,0,0.05), and the corresponding keywords are "school", "student", "breakfast", "course", "nutrition".
Step 2: calculate the distance between the first vector and the second vector. This distance is the distance determined from the first keyword vector and the second keyword vector.
Optionally, in some embodiments, the distance between the first vector and the second vector may be a Euclidean distance. Because two adjacent speech information segments may share very few keywords, a cosine-distance computation would involve many zero values; the Euclidean distance may therefore be the more appropriate choice for the distance between the first vector and the second vector.
Optionally, in other embodiments, the distance between the first vector and the second vector may be a cosine distance.
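Putting the expansion and the distance computation together, a sketch under the same assumptions (word lists as input, names hypothetical):

    import math
    from collections import Counter

    def keyword_vector(keywords, segment_words):
        # Frequency of each coordinate keyword within the segment.
        counts = Counter(segment_words)
        total = len(segment_words)
        return [counts[k] / total for k in keywords]

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # all_keys is the deduplicated union of both segments' keywords and
    # the content description keywords, e.g.
    # all_keys = ["school", "student", "breakfast", "course", "nutrition"]
    # first  = keyword_vector(all_keys, first_segment_words)
    # second = keyword_vector(all_keys, second_segment_words)
    # dissimilar = euclidean(first, second) > preset_distance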
In addition to using the word frequency vectors of two adjacent speech information segments to determine whether the two speech information segments are similar, other ways of determining whether the two speech information segments are similar may be used.
For example, the first keyword vector and the second keyword vector may instead be built from word frequency-inverse document frequency (TF-IDF) values, binarized word frequencies, and the like. The distance between the first keyword vector and the second keyword vector may be an n-norm distance (n being a positive integer greater than or equal to 1) or a relative-entropy distance between the two vectors.
Taking again the first vector (i.e., (0.3, 0.1, 0, 0.2, 0)) and the second vector (i.e., (0, 0.3, 0.25, 0, 0.05)) as an example, the two vectors may be binarized: the binarized first vector is (1, 1, 0, 1, 0) and the binarized second vector is (0, 1, 1, 0, 1). The 1-norm distance between the binarized vectors then yields the degree of repetition between the keywords of the first speech information segment and the keywords of the second speech information segment. The repetition degree can be regarded as a special form of distance and used to determine whether the two segments are similar: if the repetition degree is greater than or equal to a preset repetition degree, the first speech information segment and the second speech information segment can be considered similar; if it is smaller than the preset repetition degree, they are considered dissimilar. In this case, the preset repetition degree plays the role of the similarity threshold.
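One reading of this binarized 1-norm computation, sketched below, derives the number of shared keywords from the 1-norm distance (an assumption; the application does not spell out the final arithmetic):

    def keyword_repetition(vec_a, vec_b):
        # Binarize the expanded vectors, then recover the count of
        # keywords present in both segments from the 1-norm distance.
        bin_a = [1 if x > 0 else 0 for x in vec_a]
        bin_b = [1 if x > 0 else 0 for x in vec_b]
        l1 = sum(abs(a - b) for a, b in zip(bin_a, bin_b))
        either = sum(a | b for a, b in zip(bin_a, bin_b))
        return either - l1  # keywords shared by both segments

For the example vectors, the 1-norm distance is 4 and five keywords occur in at least one segment, giving a repetition degree of 1, which matches the single shared keyword 'student'.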
Optionally, in other embodiments, the keywords may also be extracted according to the word frequency-inverse document frequency. The word frequency may be determined based on Equation 1.1. The inverse document frequency may be determined according to the following formula:
IDF(n) = log(Num_Doc / (Doc(n) + 1)) (Equation 1.2)
where IDF(n) represents the inverse document frequency of the nth word, Num_Doc represents the total number of documents in the corpus, and Doc(n) represents the number of documents in the corpus that contain the nth word.
The word frequency-inverse document frequency can be determined according to the following formula:
TF-IDF(n) = TF(n) × IDF(n) (Equation 1.3)
where TF-IDF(n) represents the word frequency-inverse document frequency of the nth word. If the keywords are determined according to the word frequency-inverse document frequency, the first keyword vector is composed of the TF-IDF values of the keywords.
When determining keywords according to word frequency-inverse document frequency, it may not be necessary to remove meaningless words first.
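A compact sketch of Equations 1.1 to 1.3, where the corpus is assumed to be available as a list of word lists and the function name is a placeholder:

    import math
    from collections import Counter

    def tf_idf(words, corpus):
        # TF(n) = N(n) / All_N                  (Equation 1.1)
        # IDF(n) = log(Num_Doc / (Doc(n) + 1))  (Equation 1.2)
        # TF-IDF(n) = TF(n) * IDF(n)            (Equation 1.3)
        counts = Counter(words)
        all_n = len(words)
        num_doc = len(corpus)
        scores = {}
        for word, n in counts.items():
            doc_n = sum(1 for doc in corpus if word in doc)
            scores[word] = (n / all_n) * math.log(num_doc / (doc_n + 1))
        return scores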
Optionally, in other embodiments, the keywords may also be extracted with the word-graph-based text ranking (TextRank) method. If the keywords are determined according to TextRank, the first keyword vector may be composed of the weight values of the words.
In the case where the first segment of speech information and the second segment of speech information are not similar, the video segmentation means may determine the segmentation point based on the first segment of speech information.
The video segmentation apparatus may first determine whether the first speech information segment includes a pause point. If it includes exactly one pause point, that pause point can be determined as the segmentation point. If it includes a plurality of pause points, the apparatus may determine whether the word following each pause point is a preset word, where the word following a pause point means the word immediately after it, and the preset words are conjunctions with a segmenting meaning, such as 'next', 'below', and 'following'. If the word after only one of the pause points is a preset word, that pause point can be determined as the segmentation point. If the words after at least two of the pause points are preset words, the pause point with the longest pause duration among those pause points can be determined as the segmentation point. If none of the words after the pause points is a preset word, the pause point with the longest pause duration among all of them can be determined as the segmentation point.

If the first speech information segment includes no pause point, the segmentation point can be determined from the pause points adjacent to the first speech information segment. It will be appreciated that there may be two such pause points, one before and one after the segment, and the video segmentation apparatus may choose between them by their distance to the segment. If a pause point precedes the segment, its distance to the segment may be the number of words or the time difference between the pause point and the start position of the segment; if it follows the segment, its distance may be the number of words or the time difference between the pause point and the end position of the segment. For convenience of description, the adjacent pause point before the segment is called the front pause point and its distance to the segment is called distance 1; the adjacent pause point after the segment is called the rear pause point and its distance to the segment is called distance 2. If distance 1 is smaller than distance 2, the front pause point can be determined as the segmentation point; if distance 1 is greater than distance 2, the rear pause point can be determined as the segmentation point.
If distance 1 equals distance 2, the word after the front pause point (hereinafter, word 1) and the word after the rear pause point (hereinafter, word 2) can be examined: if word 1 is a preset word and word 2 is not, the front pause point is determined as the segmentation point; if word 1 is not a preset word and word 2 is, the rear pause point is determined as the segmentation point; if both or neither of word 1 and word 2 are preset words, whichever of the front and rear pause points has the longer pause duration can be determined as the segmentation point.
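The rules for the case where the first speech information segment contains one or more pause points might be condensed as follows (preset-word list illustrative; durations in seconds; names hypothetical):

    PRESET_WORDS = {"next", "below", "following"}  # conjunctions with a segmenting meaning

    def pick_pause_point(pauses, words_after, durations):
        # pauses[i] is a candidate pause point inside the segment,
        # words_after[i] the word immediately following it,
        # durations[i] its pause duration.
        if len(pauses) == 1:
            return pauses[0]
        preset = [i for i, w in enumerate(words_after) if w in PRESET_WORDS]
        if len(preset) == 1:            # exactly one pause followed by a preset word
            return pauses[preset[0]]
        candidates = preset if preset else range(len(pauses))
        return pauses[max(candidates, key=lambda i: durations[i])]  # longest pause wins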
As described above, a pause point is a natural pause of a speaker. Thus, the pause point is of a certain duration. Alternatively, in some embodiments, if the stop point is determined to be a segment point, the middle time of the stop point may be determined to be the segment point. Alternatively, in other embodiments, if the stop point is determined to be a segment point, the end time of the stop point may be determined to be the segment point. Alternatively, in other embodiments, if the stop point is determined to be a segment point, the start time of the stop point may be determined to be the segment point.
303, the video segmenting device segments the video to be processed according to the segmentation point.
If the segmentation point is the first segmentation point of the video to be processed, the segment starts at the start time of the video to be processed and ends at the segmentation point. If the segmentation point is the kth segmentation point of the video to be processed (k being a positive integer greater than or equal to 2), the segment starts at the (k-1)th segmentation point and ends at the segmentation point.
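Accordingly, an ordered list of segmentation points induces the segments themselves; a trivial sketch (names hypothetical):

    def segments_from_points(points, video_start, video_end):
        # Each segment runs from the previous segmentation point
        # (or the video start) to the next one (or the video end).
        bounds = [video_start] + list(points) + [video_end]
        return list(zip(bounds[:-1], bounds[1:]))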
After determining the segment, the video segmentation apparatus may also determine a summary of the segment.
304, the video segmentation apparatus determines a summary of the segment based on the content of the segmented speech information, the keywords of the segmented speech information, and the keywords of a target text, where the target text includes at least one of the presentation and the content description information.
Alternatively, in some embodiments, the video segmentation apparatus may determine a third keyword vector, and then determine the summary of the segment according to the third keyword vector.
The video segmentation apparatus may determine a third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text, wherein the content of the segmented voice information refers to all sentences constituting the segmented voice information.
It is to be understood that, if only the presentation is included in the text information, the target text includes the presentation; if the text information only comprises the content description information, the target text comprises the content description information; if the text information includes the presentation and the content description information, the target text includes the presentation and the content description information.
The manner in which the video segmentation apparatus determines the keywords of the segmented speech information and the keywords of the target text is similar to the manner in which it determines the keywords of the first speech information segment.
Optionally, in some embodiments, if the video segmentation apparatus finds that the number P of co-located pixels of the presentation whose change between different times (or different frames) exceeds the preset change value is greater than the second preset threshold P2 (P2 being less than P1), the video segmentation apparatus may determine the keywords of the target text from the later presentation content. For example, the video segmentation apparatus determines that between the presentation at time T1 and the presentation at time T2 (time T2 being later than time T1), the number of co-located pixels whose change exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation apparatus may determine the keywords of the target text based on the presentation at time T2.
For example, assume that the number of keywords determined from the presentation is L, the number of keywords determined from the content description information is M, the number of keywords determined from the segmented speech information is Q, and there are no duplicate keywords among the L keywords, the M keywords, and the Q keywords.
Specifically, the video segmentation apparatus may first determine M keywords from the content description information and then the L most frequent words in the presentation. If one or more of those L words also belong to the M keywords, they are removed from the L words and the next most frequent words in the presentation are taken instead, until the determined L keywords and the M keywords are disjoint. The video segmentation apparatus then determines Q words from the segmented speech information in the same way: any of the Q words that belong to the M keywords or the L keywords are removed and replaced by the next most frequent words from the segmented speech information, until the determined Q keywords are disjoint from both the L keywords and the M keywords.
The third keyword vector includes the Q keywords, the L keywords, and the frequency of occurrence of the M keywords in the segmented speech information. It is understood that if the content description information is not included in the target text, the value of M is 0; if the presentation is not included in the target text, then L has a value of 0.
The video segmentation means may determine a summary of the segment based on the determined third keyword vector.
Specifically, the video segmentation apparatus may determine a reference text according to the target text and the content of the segmented speech information, the reference text including J sentences, J being a positive integer greater than or equal to 1; determine J keyword vectors according to the keywords of the segmented speech information, the keywords of the target text, and each of the J sentences; and determine the summary of the segment according to the third keyword vector and the J keyword vectors. The jth of the J keyword vectors consists of the frequencies of occurrence, in the jth sentence, of the keywords of the segmented speech information and the keywords of the target text.
If the target text includes redundant sentences, the redundant sentences are deleted to obtain a corrected target text, and the corrected target text is combined with the content of the segmented speech information to obtain the reference text; if the target text includes no redundant sentence, the target text itself is combined with the content of the segmented speech information to obtain the reference text. In other words, where the target text includes both the presentation and the content description information, one or more sentences of the presentation may also appear in the content description information. In that case, the sentences of the presentation that duplicate the content description information are deleted, and the presentation with redundant sentences removed, the content description information, and the content of the segmented speech information are combined to obtain the reference text. If the target text includes no redundant sentence (for example, no sentence of the presentation appears in the content description information, or the target text includes only one of the presentation and the content description information), the target text can be combined directly with the content of the segmented speech information to obtain the reference text.
The video segmentation apparatus determining the summary of the segment according to the third keyword vector and the J keyword vectors includes: the video segmentation apparatus determines J distances according to the third keyword vector and the J keyword vectors, where the jth of the J distances is determined from the third keyword vector and the jth keyword vector, j being a positive integer greater than or equal to 1 and less than or equal to J; determines the R shortest of the J distances, R being a positive integer greater than or equal to 1 and less than J; and determines the summary of the segment, the summary including the sentences corresponding to those R distances. The manner of determining the jth distance from the third keyword vector and the jth keyword vector is similar to the manner of determining the distance from the first keyword vector and the second keyword vector, with one difference: the jth distance must be a Euclidean distance, whereas the distance determined from the first and second keyword vectors may be a Euclidean or a cosine distance. The jth distance cannot be a cosine distance because computing a cosine distance normalizes the jth keyword vector, yet the modulus of the jth keyword vector reflects precisely the overall frequency of the keywords in sentence j and therefore must not be normalized away.
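A sketch of this selection, reusing the euclidean helper from the earlier sketch and assuming the J sentence vectors have been built against the same keyword coordinates as the third keyword vector:

    def extract_summary(third_vector, sentence_vectors, sentences, r):
        # Euclidean distance of every sentence vector to the third
        # keyword vector; keep the R closest sentences, in original order.
        dists = [euclidean(third_vector, v) for v in sentence_vectors]
        chosen = sorted(sorted(range(len(dists)), key=lambda j: dists[j])[:r])
        return [sentences[j] for j in chosen]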
The vectors described above (e.g., the first keyword vector, the second keyword vector, the third keyword vector, and the jth keyword vector) are all frequencies of occurrence of keywords in a particular text (i.e., word frequencies). In other embodiments, the vectors may instead be built from word embeddings determined by word-to-vector (word2vec). For example, the first keyword vector may be determined by computing the word2vec embedding of each keyword and averaging the embeddings of all keywords. The second keyword vector is determined in the same way, and the description is not repeated. As another example, the third keyword vector may be determined by computing the word2vec embedding of each keyword, determining the word frequency of each keyword, and taking the frequency-weighted average of the embeddings of all keywords. As another example, the jth keyword vector may be determined by segmenting the jth sentence into words, removing stop words, computing the word2vec embedding of each remaining word, and averaging all the embeddings. Where the keyword vectors are determined based on word2vec, the distance between the third keyword vector and the jth keyword vector may be a cosine distance.
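The word2vec variants could be sketched as below, with the embeddings assumed to be available as a plain word-to-array mapping (e.g., from a separately trained word2vec model; all names hypothetical):

    import numpy as np

    def averaged_keyword_vector(keywords, embeddings, weights=None):
        # Plain average of the keyword embeddings, or a word-frequency
        # weighted average when weights are supplied (third-vector case).
        vecs = np.stack([embeddings[k] for k in keywords])
        if weights is None:
            return vecs.mean(axis=0)
        w = np.asarray([weights[k] for k in keywords], dtype=float)
        return (vecs * w[:, None]).sum(axis=0) / w.sum()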
Fig. 4 is a schematic diagram of a conference flow provided according to an embodiment of the present application.
401, the conference terminal 1 transmits audio and video stream 1 to the conference control server.
402, the conference terminal 2 transmits audio/video stream 2 to the conference control server.
403, the conference terminal 3 transmits audio/video stream 3 to the conference control server.
404, the conference control server determines the main conference site.
It is assumed that the main conference place determined by the conference control server is the conference place where the conference terminal 1 is located.
405, the conference control server transmits the conference data to the conference terminal 2 and the conference terminal 3.
406, the conference terminal 2 and the conference terminal 3 store conference data.
Optionally, in some embodiments, the conference control server may also send the conference data to the conference terminal 1, and the conference terminal 1 may also store the conference data.
407, the conference control server segments audio/video stream 1 in real time (i.e., determines segmentation points) and extracts a summary of each segment.
408, the conference control server transmits the segmentation points and the summaries to conference terminal 2 and conference terminal 3. In this way, conference terminal 2 and conference terminal 3 can autonomously select segments to review when playing back the video on demand. Of course, in some implementations, the conference control server may also send the segmentation points and the summaries to conference terminal 1.
Fig. 5 is a schematic flow chart of a video segmentation method provided according to an embodiment of the present application.
501, the video segmentation apparatus determines whether the conference reservation includes text related to the conference content. In other words, the video segmentation apparatus may determine whether the video to be processed includes content description information. If the determination result is yes (i.e., the video to be processed includes the content description information), step 502 is executed; if the determination result is no (i.e., the video to be processed does not include the content description information), step 503 is executed.
502, the video segmentation apparatus extracts keywords of the text related to the conference content. In other words, the video segmentation apparatus determines the keywords of the content description information.
After determining the keywords of the content description information, step 503 may be performed.
503, the video segmentation apparatus determines whether a presentation is presented on a screen in the video to be processed. In other words, the video segmentation apparatus may determine whether the video to be processed includes a presentation presented through a screen. If the determination result is yes (i.e., the video to be processed includes such a presentation), step 504 is executed. If the determination result is no (i.e., the video to be processed does not include such a presentation), step 505 is executed.

504, the video segmentation apparatus determines the location of the screen on which the presentation is presented. After determining the location of the screen, the video segmentation apparatus may execute step 506.
505, the video segmentation apparatus determines whether there is a presentation transmitted through the auxiliary stream. In other words, in some possible implementations, the conference speaker may not present the presentation through a screen but instead upload it to the conference control server through the auxiliary stream, from which conference terminals at other conference sites can obtain the presentation used during the talk. If the determination result is yes (i.e., a presentation is transmitted through the auxiliary stream), step 506 is executed. If the determination result is no (i.e., no presentation is transmitted through the auxiliary stream), the segmentation point of the video to be processed may be determined according to the voice information alone.
506, the video segmentation apparatus determines whether the duration from the last segmentation point to the current time exceeds a first preset duration. If the video segmentation apparatus determines that this duration is greater than the first preset duration (i.e., the determination result is yes), step 507 is executed. If the video segmentation apparatus determines that this duration is not greater than the first preset duration, step 508 is executed. It is to be understood that, if the segmentation point being determined is the first segmentation point, the last segmentation point refers to the start time of the video to be processed. For convenience of description, the duration from the last segmentation point to the current time may be referred to as the presentation duration.
507, the video segmentation device determines the segmentation points of the video to be processed according to the content description information and the voice information.
508, the video segmentation apparatus determines the segmentation point of the video to be processed according to the presentation and the voice information. For the specific implementation, refer to the embodiment shown in fig. 3; details are not repeated here.
The video segmentation apparatus may perform step 509 and step 510 after determining the segmentation points of the video to be processed.
509, the video segmentation apparatus determines segmented speech information and keywords of the segmented speech information, where the segmented speech information is the speech information between the segmentation point and the segmentation point immediately preceding it. It can be understood that if the segmentation point is the first segmentation point of the video to be processed, the segmented speech information is the speech information between the start time of the video to be processed and the segmentation point.
510, the video segmentation apparatus determines the summary of the segment based on the segmented speech information, the keywords of the segmented speech information, and the keywords of the target text. The specific implementation of steps 509 and 510 can refer to the embodiment shown in fig. 3 and is therefore not repeated here.
It is understood that in other possible implementations, the video segmentation apparatus may instead first determine whether a presentation is presented on a screen in the video to be processed, then determine whether the conference reservation contains text related to the conference content, and finally determine whether a presentation is transmitted through the auxiliary stream. In still other implementations, the video segmentation apparatus may first determine whether a presentation is transmitted through the auxiliary stream, then determine whether the conference reservation contains text related to the conference content, and finally determine whether a presentation is displayed on a screen in the video to be processed.
How the video segmentation apparatus determines the segmentation points of the video to be processed according to the content description information and the speech information will be described below with reference to fig. 6. In addition, how the video segmentation apparatus determines the implementation manner of the segmentation point of the video to be processed according to the voice information can also refer to fig. 6.
Fig. 6 is a schematic flow chart of a method for video segmentation provided according to an embodiment of the present application.
601, the video segmentation apparatus continuously intercepts speech information segments from the speech information with a window length W and a step size S.
602, the video segmentation apparatus extracts the keywords of each speech information segment. Specifically, the video segmentation apparatus extracts N keywords from each speech information segment.
If the video segmentation apparatus has extracted the keyword of the content description information, step 603 may be performed after step 602; if the video segmentation apparatus has not extracted the keywords of the content description information, step 604 may be executed after step 602. The video segmentation apparatus having extracted the keyword of the content description information means that the video segmentation apparatus determines that the video to be processed includes the content description information. In this case, the segmentation point determined by the video segmentation means is determined based on the content description information and the speech information. That the video segmentation apparatus has not extracted the keyword of the content description information means that the video segmentation apparatus determines that the video to be processed does not include the content description information. In this case, the segmentation point determined by the video segmentation means is determined based on the speech information.
603, the video segmentation apparatus determines the word-frequency vector C_i of the keywords of the ith speech information segment and the keywords of the content description information within the ith speech information segment.
The method for determining the keyword in the ith speech information segment can be seen in the embodiment shown in fig. 3. Specifically, reference may be made to the method for determining the keyword of the first speech information segment in the embodiment shown in fig. 3, which is not described herein again. The method for determining the keywords of the content description information may be referred to as the embodiment shown in fig. 3, and thus, the details are not repeated herein. The implementation manner of determining the word frequency vector of the keyword in the ith speech information segment and the keyword of the content description information in the ith speech information segment by the video segmentation apparatus may refer to the determination manner of determining the first keyword vector in the embodiment shown in fig. 3, which is not described herein again.
604, the video segmentation apparatus determines the word-frequency vector C_i of the keywords of the ith speech information segment within the ith speech information segment. This vector is determined similarly to the vector of step 603, except that no keywords of the content description information are involved; the description is therefore not repeated here.
The video segmentation apparatus may perform step 605 and step 606 in sequence after performing step 603 or step 604.
605, the video segmentation apparatus determines the distance between C_i and C_(i-1), where C_(i-1) is the word-frequency vector, within the (i-1)th speech information segment, of the keywords of the (i-1)th speech information segment (or of those keywords together with the keywords of the content description information). The (i-1)th speech information segment is the speech information segment preceding the ith speech information segment.
606, if the distance between C_i and C_(i-1) is greater than the preset distance, it can be determined that a segmentation point is located before or after the ith speech information segment. In that case, the video segmentation apparatus may determine the segmentation point from the pause points; the specific implementation can refer to the embodiment shown in fig. 3 and is not repeated here.
If the distance between C_i and C_(i-1) is less than or equal to the preset distance, it can be considered that no segmentation point lies in the ith or (i-1)th speech information segment. In this case, the word-frequency vector of the next speech information segment is determined and compared with that of the ith speech information segment.
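Steps 601 to 606 can be condensed into one loop; the sketch below reuses top_keywords, keyword_vector and euclidean from the earlier sketches and flags the index of every speech information segment near which a segmentation point should be sought:

    def boundary_segments(segments, desc_keywords, n, preset_distance):
        keys = [top_keywords(s, n, exclude=set(desc_keywords)) for s in segments]
        boundaries = []
        for i in range(1, len(segments)):
            # Expand both vectors over the deduplicated keyword union (step 1 above).
            union = list(dict.fromkeys(keys[i - 1] + keys[i] + list(desc_keywords)))
            c_prev = keyword_vector(union, segments[i - 1])
            c_i = keyword_vector(union, segments[i])
            if euclidean(c_prev, c_i) > preset_distance:  # steps 605-606
                boundaries.append(i)
        return boundaries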
Fig. 7 is a block diagram of a video segmentation apparatus provided according to an embodiment of the present application. As shown in fig. 7, the video segmentation apparatus 700 includes an acquisition unit 701 and a processing unit 702.
An obtaining unit 701, configured to obtain text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed.
A processing unit 702, configured to determine a segmentation point of the video to be processed according to the text information and the speech information.
The processing unit 702 is further configured to segment the video to be processed according to the segmentation point.
Specific functions and advantageous effects of the obtaining unit 701 and the processing unit 702 can refer to the methods shown in fig. 3 to fig. 6, and are not described herein again.
Fig. 8 is a block diagram of a video segmentation apparatus provided according to an embodiment of the present application. The video segmentation apparatus 800 shown in fig. 8 includes: a processor 801, a memory 802, and a transceiver 803.
The processor 801, memory 802 and transceiver 803 communicate with each other, control and/or data signals, via internal connection paths.
The method disclosed in the embodiments of the present application may be applied to the processor 801 or implemented by the processor 801. The processor 801 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 801. The processor 801 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a Random Access Memory (RAM), a flash memory, a read-only memory (ROM), a programmable ROM, an electrically erasable programmable memory, a register, or other storage media that are well known in the art. The storage medium is located in a memory 802, and the processor 801 reads instructions in the memory 802, and combines hardware thereof to complete the steps of the method.
Optionally, in some embodiments, the memory 802 may store instructions for performing the methods performed by the video segmentation apparatus as in the methods shown in fig. 3-6. The processor 801 may execute the instructions stored in the memory 802 to perform the steps of the video segmentation apparatus in the method shown in fig. 3 to 6 in combination with other hardware (e.g., the transceiver 803), and the specific working process and beneficial effects can be referred to the description in the embodiment shown in fig. 3 to 6.
The embodiment of the application further provides a chip, which comprises a transceiving unit and a processing unit. The transceiver unit can be an input/output circuit and a communication interface; the processing unit is a processor or a microprocessor or an integrated circuit integrated on the chip. The chip can execute the method of the video segmentation device in the method embodiment.
Embodiments of the present application further provide a computer-readable storage medium, on which instructions are stored, and when executed, the instructions perform the method of the video segmentation apparatus in the above method embodiments.
Embodiments of the present application further provide a computer program product containing instructions, which when executed, perform the method of the video segmentation apparatus in the above method embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

1. A method of video segmentation, comprising:
the video segmentation device acquires text information of a video to be processed and voice information of the video to be processed, wherein the text information comprises at least one of a presentation in the video to be processed and content description information of the video to be processed, and the content description information is information for describing speaking content;
the video segmentation device determines segmentation points of the video to be processed according to the text information and the voice information;
and the video segmenting device segments the video to be processed according to the segmenting points.
2. The method of claim 1, wherein in the case that the text information includes the presentation, the video segmentation apparatus determining a segmentation point of the video to be processed according to the text information and the voice information comprises:
determining a switching point of the presentation, wherein the contents of the presentation before and after the switching point are different;
determining at least one stop point according to the voice information;
determining the segment point according to the switching point and the at least one pause point.
3. The method of claim 2, wherein the determining the segmentation point from the switching point and the at least one suspension point comprises:
determining the switching point as the segment point if it is determined that the switching point is the same as one of the at least one pause point;
and under the condition that any stopping point of the at least one stopping point is determined to be different from the switching point, determining one stopping point which is closest to the switching point in the at least one stopping point as the segmentation point.
4. The method of claim 2 or 3, wherein the determining a switch point for the presentation comprises: and determining the moment when the switching signal for instructing the switching of the content of the presentation is acquired as the switching point.
5. The method according to claim 2 or 3, wherein the text information further includes the content description information, and before the video segmentation apparatus determines the segmentation point of the video to be processed according to the text information and the speech information, the method further comprises:
and determining that the presentation time length of the current page of the presentation is less than or equal to a first preset time length and greater than a second preset time length.
6. The method of claim 1, wherein, in the case that the text information includes the content description information, the determining, by the video segmentation device, a segmentation point of the video to be processed according to the text information and the voice information comprises:
determining the segmentation point of the video to be processed according to the voice information, keywords of the content description information, and pause points in the voice information.
7. The method of claim 6, wherein the voice information includes a first voice information segment and a second voice information segment, the second voice information segment being the voice information segment that precedes and is adjacent to the first voice information segment, and
the determining the segmentation point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information comprises:
determining a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, wherein the segmentation point of the video to be processed comprises the first segmentation point.
8. The method of claim 7, wherein the determining a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information comprises:
determining the similarity between the first voice information segment and the second voice information segment according to keywords of the first voice information segment, keywords of the second voice information segment, content of the first voice information segment, content of the second voice information segment, and the keywords of the content description information;
determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and
determining the first segmentation point according to a pause point in the voice information.
9. The method of claim 8, wherein the pause point in the voice information comprises a pause point within the first voice information segment or a pause point adjacent to the first voice information segment, and the determining the first segmentation point according to the pause point in the voice information comprises:
determining the first segmentation point according to at least one of the number of pause points within the first voice information segment, the number of pause points adjacent to the first voice information segment, a pause duration, and words adjacent to a pause point.
10. The method of any one of claims 6 to 9, wherein the text information further comprises the presentation, and before the video segmentation device determines the segmentation point of the video to be processed according to the text information and the voice information, the method further comprises:
determining that the presentation duration of the current page of the presentation is greater than a first preset duration; or
determining that the presentation duration of the current page of the presentation is less than or equal to a second preset duration.
11. The method of any one of claims 1 to 3 and 6 to 9, further comprising: determining, by the video segmentation device, a summary of a segment according to the content of the voice information of the segment, the keywords of the voice information of the segment, and keywords of a target text, wherein the target text includes at least one of the presentation and the content description information.
12. The method of any one of claims 1 to 3 and 6 to 9, wherein the video to be processed is a real-time video stream, and the voice information of the video to be processed is the voice information of the real-time video stream from the start moment, or the last segmentation point, of the real-time video stream to the current moment.
13. A video segmentation apparatus, comprising:
an acquisition unit, configured to acquire text information of a video to be processed and voice information of the video to be processed, wherein the text information comprises at least one of a presentation in the video to be processed and content description information of the video to be processed, the content description information being information that describes the speech content; and
a processing unit, configured to determine segmentation points of the video to be processed according to the text information and the voice information,
wherein the processing unit is further configured to segment the video to be processed according to the segmentation points.
14. The video segmentation apparatus of claim 13, wherein the processing unit is specifically configured to, in the case that the text information includes the presentation:
determine a switching point of the presentation according to the text information and the voice information, wherein the content of the presentation differs before and after the switching point;
determine at least one pause point according to the voice information; and
determine the segmentation point according to the switching point and the at least one pause point.
15. The video segmentation apparatus of claim 14, wherein the processing unit is specifically configured to:
determine the switching point as the segmentation point in the case that the switching point is determined to be the same as one of the at least one pause point; and
determine, in the case that every pause point of the at least one pause point is determined to be different from the switching point, the pause point that is closest to the switching point among the at least one pause point as the segmentation point.
16. The video segmentation apparatus of claim 14 or 15, wherein the processing unit is specifically configured to determine, as the switching point, the moment at which a switching signal instructing to switch the content of the presentation is acquired.
17. The video segmentation apparatus of claim 14 or 15, wherein, in the case that the text information further includes the content description information, the processing unit is further configured to determine, before determining the segmentation point of the video to be processed according to the text information and the voice information, that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
18. The video segmentation apparatus of claim 13, wherein, in the case that the text information includes the content description information, the processing unit is specifically configured to determine the segmentation point of the video to be processed according to the voice information, keywords of the content description information, and pause points in the voice information.
19. The video segmentation apparatus of claim 18, wherein the voice information includes a first voice information segment and a second voice information segment, the second voice information segment being the voice information segment that precedes and is adjacent to the first voice information segment, and
the processing unit is specifically configured to determine a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, wherein the segmentation point of the video to be processed comprises the first segmentation point.
20. The video segmentation apparatus of claim 19, wherein the processing unit is specifically configured to:
determine the similarity between the first voice information segment and the second voice information segment according to keywords of the first voice information segment, keywords of the second voice information segment, content of the first voice information segment, content of the second voice information segment, and the keywords of the content description information;
determine that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and
determine the first segmentation point according to a pause point in the voice information.
21. The video segmentation apparatus of claim 20, wherein the pause point in the voice information comprises a pause point within the first voice information segment or a pause point adjacent to the first voice information segment, and the processing unit is specifically configured to determine the first segmentation point according to at least one of the number of pause points within the first voice information segment, the number of pause points adjacent to the first voice information segment, a pause duration, and words adjacent to a pause point.
22. The video segmentation apparatus of any one of claims 18 to 21, wherein, in the case that the text information further includes the presentation, the processing unit is further configured to determine, before determining the segmentation point of the video to be processed according to the text information and the voice information, that the presentation duration of the current page of the presentation is greater than a first preset duration, or that the presentation duration of the current page of the presentation is less than or equal to a second preset duration.
23. The video segmentation apparatus of any one of claims 13 to 15 and 18 to 21, wherein the processing unit is further configured to determine a summary of a segment according to the content of the voice information of the segment, the keywords of the voice information of the segment, and keywords of a target text, wherein the target text comprises at least one of the presentation and the content description information.
24. The video segmentation apparatus of any one of claims 13 to 15 and 18 to 21, wherein the video to be processed is a real-time video stream, and the voice information of the video to be processed is the voice information of the real-time video stream from the start moment, or the last segmentation point, of the real-time video stream to the current moment.
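
Illustrative note (not part of the claims): the rule of claims 2 to 4 can be made concrete as follows. The switching point of the presentation is used as the segmentation point when it coincides with a detected pause point; otherwise the pause point nearest the switching point is chosen. A minimal Python sketch, assuming pause times are given in seconds and sorted ascending; all names are illustrative, and the fallback when no pause exists is an assumption the claims do not specify:

    from bisect import bisect_left

    def segmentation_point(switch_time, pause_times):
        """Claims 2-3: use the switch itself if it equals a pause point,
        otherwise the pause point closest to the switch."""
        if not pause_times:
            return switch_time  # assumption: fall back to the switch itself
        i = bisect_left(pause_times, switch_time)
        # Only the pauses immediately before and after the switch can be closest.
        candidates = pause_times[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda p: abs(p - switch_time))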
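
Claims 5 and 10, read together, imply a mode selection driven by how long the current page of the presentation stays on screen, with the first preset duration larger than the second. A minimal sketch of that branching, with assumed names and return labels:

    def choose_segmentation_mode(page_duration, first_preset, second_preset):
        """Claims 5 and 10 imply first_preset > second_preset."""
        if second_preset < page_duration <= first_preset:
            return "switching_point"      # claim 5: segment on presentation switches
        return "content_description"      # claim 10: fall back to the speech route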
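
Claim 8 compares adjacent voice information segments using their keywords, their content, and the keywords of the content description information, but fixes no formula. One plausible stand-in is a keyword-weighted Jaccard similarity, sketched below; the weighting factor and the threshold value are assumptions:

    def segment_similarity(words_a, words_b, topic_keywords):
        """Weighted Jaccard over two adjacent segments; words that also
        appear among the content-description keywords count double."""
        weight = lambda w: 2.0 if w in topic_keywords else 1.0
        a, b = set(words_a), set(words_b)
        union = a | b
        if not union:
            return 1.0
        return sum(weight(w) for w in a & b) / sum(weight(w) for w in union)

    def is_topic_boundary(words_a, words_b, topic_keywords, threshold=0.3):
        # Claim 8: a segmentation point is sought where the similarity
        # drops below the similarity threshold.
        return segment_similarity(words_a, words_b, topic_keywords) < threshold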
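
Claim 9 then selects the first segmentation point from the candidate pause points by their number, their duration, and the words adjacent to them; the claim names these signals but not how to combine them. The scoring below is therefore only one illustrative weighting, and the filler-word list is invented for the example:

    FILLER_WORDS = {"um", "uh", "so", "and"}  # illustrative, not from the patent

    def pick_first_segmentation_point(pauses):
        """pauses: (time_s, duration_s, preceding_word) candidates taken from
        within or adjacent to the first voice information segment."""
        if not pauses:
            return None
        def score(p):
            _, duration, word = p
            # Longer pauses rank higher; a pause not preceded by a filler
            # word gets a small bonus.
            return duration + (0.5 if word.lower() not in FILLER_WORDS else 0.0)
        return max(pauses, key=score)[0]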
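
For the per-segment summary of claims 11 and 23, one simple extractive reading is to score each transcribed sentence of the segment by how many segment keywords and target-text keywords it contains, and keep the top-scoring sentences. The scoring rule is an assumption; the claims only name the inputs:

    def segment_summary(sentences, speech_keywords, text_keywords, top_n=2):
        """sentences: transcribed sentences of one segment, in order."""
        kws = {k.lower() for k in speech_keywords} | {k.lower() for k in text_keywords}
        hits = lambda s: sum(w in kws for w in s.lower().split())
        return " ".join(sorted(sentences, key=hits, reverse=True)[:top_n])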
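
Claims 12 and 24 apply the method to a live stream: the voice information under consideration always runs from the stream start, or from the most recent segmentation point, up to the current moment. A minimal bookkeeping sketch of that growing window, with assumed names:

    class LiveSegmenter:
        def __init__(self):
            self.window_start = 0.0   # stream start, per claim 12
            self.cut_points = []

        def on_segmentation_point(self, t):
            # Commit a cut and restart the voice window at the cut.
            self.cut_points.append(t)
            self.window_start = t

        def current_window(self, now):
            return (self.window_start, now)
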
CN201910376477.2A 2019-05-07 2019-05-07 Video segmentation method and video segmentation device Active CN111918145B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910376477.2A CN111918145B (en) 2019-05-07 2019-05-07 Video segmentation method and video segmentation device
PCT/CN2020/083397 WO2020224362A1 (en) 2019-05-07 2020-04-05 Video segmentation method and video segmentation device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910376477.2A CN111918145B (en) 2019-05-07 2019-05-07 Video segmentation method and video segmentation device

Publications (2)

Publication Number Publication Date
CN111918145A CN111918145A (en) 2020-11-10
CN111918145B (en) 2022-09-09

Family

ID=73051391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910376477.2A Active CN111918145B (en) 2019-05-07 2019-05-07 Video segmentation method and video segmentation device

Country Status (2)

Country Link
CN (1) CN111918145B (en)
WO (1) WO2020224362A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114051154A (en) * 2021-11-05 2022-02-15 新华智云科技有限公司 News video strip splitting method and system
CN114363695B (en) * 2021-11-11 2023-06-13 腾讯科技(深圳)有限公司 Video processing method, device, computer equipment and storage medium
CN114173191B (en) * 2021-12-09 2024-03-19 上海开放大学 Multi-language answering method and system based on artificial intelligence
CN114245229B (en) * 2022-01-29 2024-02-06 北京百度网讯科技有限公司 Short video production method, device, equipment and storage medium
CN115209233B (en) * 2022-06-25 2023-08-25 平安银行股份有限公司 Video playing method, related device and equipment
CN118012979B (en) * 2024-04-10 2024-06-14 济南宝林信息技术有限公司 Intelligent acquisition and storage system for common surgical operation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6928407B2 (en) * 2002-03-29 2005-08-09 International Business Machines Corporation System and method for the automatic discovery of salient segments in speech transcripts
US8693842B2 (en) * 2011-07-29 2014-04-08 Xerox Corporation Systems and methods for enriching audio/video recordings
CN104519401B (en) * 2013-09-30 2018-04-17 贺锦伟 Video segmentation point preparation method and equipment
CN104540044B (en) * 2014-12-30 2017-10-24 北京奇艺世纪科技有限公司 A kind of video segmentation method and device
CN106982344B (en) * 2016-01-15 2020-02-21 阿里巴巴集团控股有限公司 Video information processing method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102547139A (en) * 2010-12-30 2012-07-04 北京新岸线网络技术有限公司 Method for splitting news video program, and method and system for cataloging news videos
WO2013097101A1 (en) * 2011-12-28 2013-07-04 华为技术有限公司 Method and device for analysing video file

Also Published As

Publication number Publication date
CN111918145A (en) 2020-11-10
WO2020224362A1 (en) 2020-11-12

Similar Documents

Publication Publication Date Title
CN111918145B (en) Video segmentation method and video segmentation device
US11625920B2 (en) Method for labeling performance segment, video playing method, apparatus and system
CN109218629B (en) Video generation method, storage medium and device
US20170201478A1 (en) Systems and methods for manipulating and/or concatenating videos
CN111050201B (en) Data processing method and device, electronic equipment and storage medium
US20170278525A1 (en) Automatic smoothed captioning of non-speech sounds from audio
JP2009076970A (en) Summary content generation device and computer program
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
CN108073572B (en) Information processing method and device, simultaneous interpretation system
US11871084B2 (en) Systems and methods for displaying subjects of a video portion of content
US20110235859A1 (en) Signal processor
US20200151208A1 (en) Time code to byte indexer for partial object retrieval
JP6690442B2 (en) Presentation support device, presentation support system, presentation support method, and presentation support program
JP2010039877A (en) Apparatus and program for generating digest content
US11128927B2 (en) Content providing server, content providing terminal, and content providing method
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
KR102049688B1 (en) User-customized contents providing system using AI
US11736744B2 (en) Classifying segments of media content using closed captioning
US9402047B2 (en) Method and apparatus for image display
CN115665508A (en) Video abstract generation method and device, electronic equipment and storage medium
RU2654126C2 (en) Method and device for highly efficient compression of large-volume multimedia information based on the criteria of its value for storing in data storage systems
US20160275967A1 (en) Presentation support apparatus and method
CN110727854B (en) Data processing method and device, electronic equipment and computer readable storage medium
US20210089577A1 (en) Systems and methods for displaying subjects of a portion of content and displaying autocomplete suggestions for a search related to a subject of the content
US20210089781A1 (en) Systems and methods for displaying subjects of a video portion of content and displaying autocomplete suggestions for a search related to a subject of the video portion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant