WO2020224362A1 - Video segmentation method and video segmentation device - Google Patents


Info

Publication number
WO2020224362A1
WO2020224362A1 (PCT/CN2020/083397)
Authority
WO
WIPO (PCT)
Prior art keywords
voice information
point
video
segment
presentation
Prior art date
Application number
PCT/CN2020/083397
Other languages
French (fr)
Chinese (zh)
Inventor
苏芸 (Su Yun)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2020224362A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 - Processing of audio elementary streams
    • H04N21/4394 - Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 - Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 - Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • H04N21/85 - Assembly of content; Generation of multimedia applications
    • H04N21/854 - Content authoring
    • H04N21/8549 - Creating video summaries, e.g. movie trailer

Definitions

  • This application relates to the field of information technology, and more specifically, to a video segmentation method and a video segmentation device.
  • a complete video can be divided into multiple segments. In this way, the user can directly watch the segment of interest.
  • a common video segmentation method is to segment the video based on the text information in the video.
  • the text information in the above video may be subtitles in the video, or text obtained by performing voice recognition on the video.
  • the basis for segmenting a video currently comes from the video itself.
  • the current video segmentation based on text information in the video needs to obtain all text information of the video.
  • the video stream of a live video is generated in real time, so all text information of the video can be obtained only after the live broadcast ends. The above method therefore cannot segment a live video in real time.
  • the above method segments the video according to its text information only, so the determined segmentation point is not necessarily a suitable one.
  • the present application provides a video segmentation method and a video segmentation device, which can improve the accuracy of video segmentation.
  • an embodiment of the present application provides a video segmentation method, including: a video segmentation device acquires text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed; the video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information; and the video segmentation device segments the video to be processed according to the segmentation point.
  • the above technical solutions can combine information other than the content of the to-be-processed video to segment the to-be-processed video, thereby improving the accuracy of segmentation.
  • the video segmentation device determining the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining a switching point of the presentation, where the presentation presents different content before and after the switching point; determining at least one pause point according to the voice information; and determining the segmentation point according to the switching point and the at least one pause point.
  • a switch of the presentation often means that the content of the speaker's speech has changed. The above technical solution therefore divides the to-be-processed video into different segments by considering presentation changes, and can determine the segmentation point of the to-be-processed video quickly and reasonably.
  • when determining the segmentation point of the video to be processed, the above technical solution only needs the switching point of the presentation and the pause points near the switching point. The video can therefore be segmented without first obtaining the complete video file; in other words, the video to be processed can be segmented in real time. The above technical solution can thus be applied to segmentation of live video.
  • the determining the segmentation point according to the switching point and the at least one pause point includes: when the switching point is the same as one of the at least one pause point, determining the switching point as the segmentation point; when the switching point differs from every one of the at least one pause point, determining the pause point closest to the switching point among the at least one pause point as the segmentation point.
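  • the two-branch rule above can be sketched in a few lines. This is an illustrative sketch only; the function name and the use of plain second-valued timestamps are our own assumptions, not part of the application.

```python
def choose_segment_point(switch_point, pause_points):
    """Pick the segment point for one presentation switch.

    switch_point: timestamp (seconds) at which the slide changes.
    pause_points: timestamps (seconds) of pauses detected in the speech.
    Rule from the text: if the switch coincides with a pause, the switch
    point itself is the segment point; otherwise use the pause point
    closest to the switch point.
    """
    if switch_point in pause_points:
        return switch_point
    return min(pause_points, key=lambda p: abs(p - switch_point))
```

  • for example, with a slide change at 10.0 s and pauses at 7.0 s and 11.5 s, the pause at 11.5 s is chosen because it is nearer to the switch.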
  • the determining the switching point of the presentation includes: determining, as the switching point, the moment at which a switching signal instructing to switch the content of the presentation is acquired.
  • the text information further includes the content description information
  • before the video segmentation device determines the segmentation point of the to-be-processed video based on the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
  • the video segmentation device determining the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining the segmentation point of the video to be processed according to the voice information, keywords of the content description information, and pause points in the voice information.
  • the content description information is information input by the user in advance to describe the video to be processed.
  • the content description information can usually include some key information in the video to be processed, such as keywords, key content, etc. Therefore, the key content described in different segments of the video to be processed can be determined more accurately based on the content description information, so that the video to be processed can be segmented more accurately.
  • the voice information includes a first voice information segment and a second voice information segment, where the second voice information segment precedes the first voice information segment, and determining the segmentation point of the to-be-processed video includes: determining a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the segmentation point of the to-be-processed video includes the first segmentation point.
  • the location of the segmentation point can be determined based only on the keywords of the content description information and the voice information of two adjacent video segments.
  • the division into video clips can be carried out with a fixed time window and step size. Therefore, during playback, the portion of the video already played can be divided into clips. In this way, the video can be segmented without first obtaining the complete video file; in other words, the video to be processed can be segmented in real time. The above technical solution can thus be applied to segmentation of live video.
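  • division by a fixed window and step size, as described above, can be sketched as follows (illustrative only; the function name and the half-open clip convention are assumptions):

```python
def split_into_clips(total_seconds, window, step):
    """Divide the already-played portion of a stream into (start, end)
    clips using a fixed window length and a fixed step size.

    With step < window the clips overlap; with step == window they tile
    the timeline.  The final clip is truncated at total_seconds.
    """
    clips = []
    start = 0
    while start < total_seconds:
        clips.append((start, min(start + window, total_seconds)))
        start += step
    return clips
```

  • for example, 10 s of played video with a 4 s window and 4 s step yields the clips (0, 4), (4, 8), (8, 10), which can be produced while the live stream is still running.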
  • determining the first segmentation point includes: determining the similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information; determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and determining the first segmentation point according to the pause points in the voice information.
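  • one plausible reading of "determine the similarity according to keywords" is a bag-of-words comparison. The sketch below uses cosine similarity over a fixed keyword vocabulary; the choice of cosine similarity is our assumption, as the application does not fix a particular similarity measure.

```python
from collections import Counter
import math

def keyword_vector(words, vocabulary):
    """Bag-of-words counts of `words` over a fixed keyword vocabulary,
    e.g. the keywords taken from the content description information."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if
    either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

  • two adjacent voice segments whose similarity falls below the threshold would then be treated as belonging to different topics, and a segmentation point is sought among the nearby pause points.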
  • the pause point in the voice information includes a pause point in the first voice information segment or a pause point adjacent to the first voice information segment
  • determining the first segmentation point includes: determining the first segmentation point according to at least one of the number of pause points in the first voice information segment, the number of pause points adjacent to the first voice information segment, the pause duration, and the words adjacent to the pause points.
  • the first voice information segment includes K pause points, or K pause points are adjacent to the first voice information segment. Determining the first segmentation point according to at least one of the number of pause points, the pause duration, and the words adjacent to the pause points includes: when K is equal to 1, determining that pause point as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words adjacent to the K pause points include exactly one preset word, determining the pause point adjacent to that preset word as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words include at least two preset words, determining, among the pause points adjacent to those preset words, the pause point with the longest pause duration as the segmentation point; and when K is a positive integer greater than or equal to 2 and the K words include no preset word, determining, among the K pause points, the pause point with the longest pause duration as the segmentation point.
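  • the branches above can be sketched as follows. Note that the source text is truncated for the case where none of the K adjacent words is a preset word, so the longest-pause fallback in the last branch is our assumption, as are all names and the dictionary layout.

```python
def pick_pause(pauses, preset_words):
    """Choose a segment point from K candidate pauses.

    `pauses` is a list of dicts {"t": time, "dur": pause duration,
    "word": the word adjacent to the pause}; `preset_words` is the set
    of preset words.  Branches: K == 1 -> that pause; exactly one
    preset word -> its pause; two or more preset words -> the longest
    pause among them; no preset word -> the longest pause overall
    (assumed fallback, truncated in the source text).
    """
    if len(pauses) == 1:
        return pauses[0]["t"]
    hits = [p for p in pauses if p["word"] in preset_words]
    if len(hits) == 1:
        return hits[0]["t"]
    candidates = hits if len(hits) >= 2 else pauses
    return max(candidates, key=lambda p: p["dur"])["t"]
```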
  • the text information further includes the presentation
  • before the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is greater than the first preset duration; or determining that the presentation duration of the current page of the presentation is less than or equal to the second preset duration.
  • the method further includes: the video segmentation device determining the summary of the segment according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of a target text, where the target text includes at least one of the presentation and the content description information.
  • the user can use the summary to quickly determine the desired location when reviewing the video.
  • the foregoing technical solution takes information other than the to-be-processed video into account when determining the summary, which can improve both the accuracy of the determined summary and the speed at which it is determined.
  • the video segmentation device determining the summary of the segment based on the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text includes: determining a third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text; and determining the summary of the segment according to the third keyword vector.
  • the video segmentation device determining the summary of the segment according to the third keyword vector includes: determining a reference text according to the target text and the segmented voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determining J keyword vectors according to the keywords of the segmented voice information, the keywords of the target text, and each of the J sentences; and determining the summary of the segment according to the third keyword vector and the J keyword vectors.
  • determining the reference text according to the target text and the segmented voice information includes: when the target text includes redundant sentences, deleting the redundant sentences from the target text to obtain a revised target text, and combining the revised target text with the segmented voice information to obtain the reference text; when the target text includes no redundant sentences, combining the target text with the segmented voice information to obtain the reference text.
  • determining the summary of the segment according to the third keyword vector and the J keyword vectors includes: determining J distances according to the third keyword vector and the J keyword vectors, where the j-th distance among the J distances is determined according to the third keyword vector and the j-th keyword vector among the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determining the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determining the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
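  • the nearest-R-sentences selection can be sketched as follows. Euclidean distance is our assumption here; the application says only "distance" without fixing a metric, and all names are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length keyword vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_summary(segment_vector, sentences, sentence_vectors, r):
    """Return the R sentences of the reference text whose keyword
    vectors are nearest to the segment-level ("third") keyword vector."""
    ranked = sorted(range(len(sentences)),
                    key=lambda j: euclidean(segment_vector, sentence_vectors[j]))
    return [sentences[j] for j in ranked[:r]]
```

  • with J = 3 candidate sentences and R = 1, the single sentence whose vector lies closest to the segment vector becomes the summary.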
  • the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is the voice information of the real-time video stream from the start time of the real-time video stream, or from the previous segmentation point, to the current moment.
  • an embodiment of the present application provides a video segmentation device, and the device includes a unit for executing the first aspect or any possible implementation manner of the first aspect.
  • the video segmentation apparatus of the second aspect may be a computer device, or may be a component (such as a chip or a circuit, etc.) that can be used in a computer device.
  • an embodiment of the present application provides a storage medium, and the storage medium stores instructions for implementing the first aspect or any one of the possible implementation manners of the first aspect.
  • the embodiments of the present application provide a computer program product containing instructions. When the computer program product runs on a computer, the computer executes the method described in the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by the embodiments of the present application can be applied;
  • FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by the embodiments of the present application can be applied;
  • FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a video conference process provided according to an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
  • FIG. 6 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
  • FIG. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application;
  • FIG. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • “At least one” refers to one or more, and “multiple” refers to two or more.
  • “And/or” describes an association relationship between associated objects and indicates that three relationships are possible. For example, A and/or B can mean: A alone, both A and B, or B alone, where A and B may be singular or plural.
  • the character “/” generally indicates that the associated objects are in an "or” relationship.
  • “At least one of the following” or similar expressions refers to any combination of the listed items, including a single item or any combination of multiple items.
  • At least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be single or multiple.
  • words such as “first” and “second” do not limit the number and execution order.
  • computer-readable media may include, but are not limited to: magnetic storage devices (for example, hard disks, floppy disks, or tapes), optical discs (for example, compact discs (CD) or digital versatile discs (DVD)), smart cards, and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, or key drives).
  • various storage media described herein may represent one or more devices and/or other machine-readable media for storing information.
  • machine-readable medium may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
  • FIG. 1 is a schematic diagram of a system that can apply the video segmentation method provided by this application.
  • FIG. 1 shows a video conference system, which includes a conference control server 101, a conference terminal 111, a conference terminal 112, and a conference terminal 113.
  • the conference terminal 111, the conference terminal 112, and the conference terminal 113 can establish a conference through the conference control server 101.
  • a video conference usually involves at least two conference sites. Each conference site can access the conference control server through a conference terminal.
  • the conference terminal may be a device used to access a video conference.
  • the conference terminal can be used to receive conference data and present the conference content on the display device according to the conference data.
  • the conference terminal may include a host and a display device.
  • the host can receive conference data through a communication interface, generate a video signal according to the received conference data, and output the video signal to the display device in a wired or wireless manner.
  • the display device presents the content of the meeting according to the received video signal.
  • the display device may be built in the host.
  • the conference terminal may be an electronic device with a built-in display device, such as a notebook computer, a tablet computer, or a smart phone.
  • the display device may be a display device externally placed on the host.
  • the host may be a computer host, and the display device may be a display, a television, or a projector.
  • the display device used for presenting conference content may also be a display device external to the host.
  • the host may be a notebook computer, and the display device may be a monitor, a television or a projector externally connected to the notebook computer.
  • a video conference may include a main venue and at least one branch venue.
  • the conference terminal in the main conference site (for example, the conference terminal 111) can upload the collected media stream of the main conference site to the conference control server 101.
  • the conference control server 101 may generate conference data according to the received media stream, and send the conference data to the conference terminals (for example, the conference terminal 112 and the conference terminal 113) in the branch venue.
  • the conference terminal 112 and the conference terminal 113 may present the conference content on the display device according to the received conference data.
  • the conference terminal in each conference site can upload the collected media stream to the conference control server 101.
  • the conference terminal 111 is the conference terminal used to access the video conference in the conference room 1
  • the conference terminal 112 is the conference terminal used to access the video conference in the conference room 2
  • the conference terminal 113 is the conference terminal used to access the video conference in the conference room 3.
  • the conference terminal 111 can upload the collected media stream of the conference site 1 to the conference control server 101.
  • the conference control server 101 can generate conference data 1 according to the media stream of the conference site 1, and send the conference data 1 to the conference terminal 112 and the conference terminal 113.
  • the conference terminal 112 and the conference terminal 113 may present the conference content on the display device according to the received conference data 1.
  • the conference terminal 112 can also upload the collected media stream of the conference site 2 to the conference control server; the conference control server 101 can generate conference data 2 according to the media stream of the conference site 2 and send the conference data 2 to the conference terminal 111 and the conference terminal 113, which can then present the conference content on their display devices according to the received conference data 2.
  • similarly, the conference terminal 113 can upload the collected media stream of the conference site 3 to the conference control server; the conference control server 101 can generate conference data 3 according to the media stream of the conference site 3 and send the conference data 3 to the conference terminal 111 and the conference terminal 112, which can then present the conference content on their display devices according to the received conference data 3.
  • the media stream may be an audio stream.
  • the media stream may be a video stream.
  • the media device responsible for collecting media streams may be built in the conference terminal (for example, a camera and microphone in the conference terminal), or may be externally connected to the conference terminal, which is not limited in this embodiment of the application.
  • the speakers of the conference use presentations during the speech.
  • the media stream may be an audio stream of the speaker.
  • the presentation used by the speaker during the speaking process can be uploaded to the conference control server 101 through an auxiliary stream (also called a data stream or a computer screen stream).
  • the conference control server 101 generates conference data based on the received audio stream and auxiliary stream.
  • the conference data may include the received audio stream and auxiliary stream.
  • the conference data may include a processed audio stream obtained after processing the received audio stream and the auxiliary stream.
  • Processing the received audio stream can be a transcoding operation on the received audio stream, for example, the bit rate of the audio stream can be reduced, so as to reduce the amount of data required to transmit the audio stream to other conference terminals.
  • the conference data may include the received audio stream, an audio stream with a different bit rate from the received audio stream, and the auxiliary stream.
  • the conference terminal can select a suitable audio stream according to the network condition and/or the way of accessing the conference. For example, if the network condition of the conference terminal is good or the conference is accessed over Wi-Fi, an audio stream with a higher bit rate can be selected so that a clearer sound can be heard.
  • in addition to audio streams at one or more bit rates and the auxiliary stream, the conference data may include subtitles corresponding to the speaker's speech.
  • the subtitles can be generated by speech-to-text conversion based on speech recognition technology, recorded manually, or generated by speech-to-text conversion combined with manual correction.
  • the media stream may be a video stream of the speaker during the speech.
  • the media stream can include both the voice information and picture information of the speaker during the speech.
  • the media stream uploaded to the conference control server 101 is the video stream.
  • the speaker uses a presentation during the speech and uses an output device (such as a projector, a television, etc.) to show the presentation.
  • the screen information in the media stream includes the presentation displayed by the speaker. Therefore, the video stream uploaded to the conference control server 101 includes the presentation.
  • the conference control server 101 can directly determine the conference data according to the video stream.
  • the presentation used by the speaker during the speech can be uploaded to the conference control server 101 in the form of auxiliary streams.
  • the conference control server 101 may generate conference data according to the collected video stream and the auxiliary stream.
  • the conference data may include collected video streams and auxiliary streams.
  • the conference data may include a processed video obtained after processing the collected video stream and the auxiliary stream. Processing the captured video stream can be a transcoding operation on the captured video stream, for example, the resolution of the video stream can be reduced, so as to reduce the amount of data required to transmit the video stream to other conference terminals.
  • the conference data may include a collected video stream, a video stream with a resolution different from the collected video stream, and the auxiliary stream.
  • the conference terminal can select an appropriate video stream according to the network conditions and/or the way to access the conference. For example, if the network of the conference terminal is good or Wi-Fi is used to access the conference, a video stream with a higher resolution can be selected so that the audience can see a clearer picture. For another example, if the network condition of the conference terminal is poor, a video stream with a lower resolution can be selected, which can reduce the interruption of the conference live broadcast caused by the bad network condition. For another example, if the conference terminal uses a mobile network to access the conference, a video stream with a lower resolution can be selected, which can reduce data consumption.
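  • the stream-selection heuristic described above can be sketched as follows. The labels, resolutions, and the two boolean network hints are illustrative assumptions, not part of the application.

```python
def pick_stream(streams, network_good, on_mobile):
    """Choose a video stream by resolution given coarse network hints.

    `streams` maps a label to vertical resolution, e.g.
    {"hd": 1080, "sd": 480}.  A good network without a mobile
    connection -> highest resolution for a clearer picture; a poor
    network or a mobile connection -> lowest resolution, to limit
    playback interruptions and data consumption.
    """
    if network_good and not on_mobile:
        return max(streams, key=streams.get)
    return min(streams, key=streams.get)
```

  • for example, a terminal on good Wi-Fi would pick the 1080-line stream, while the same terminal on a mobile network would fall back to the 480-line stream.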
  • the conference data may also include subtitles corresponding to the speaker's speech.
  • the subtitles can be generated by voice-to-text conversion based on voice recognition technology, or they can be manually recorded by the speakers, or they can be generated based on voice-text conversion combined with manual modification. of.
  • FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by this application can be applied.
  • FIG. 2 shows a distance education system, which includes a course server 201, a main device 211, a client device 212, and a client device 213.
  • the main device 211 can upload the collected media stream to the course server 201.
  • the course server 201 can generate course data according to the media stream and send the course data to the client device 212 and the client device 213, and the client device 212 and the client device 213 can present the course content on the display device according to the received course data.
  • the main device 211 may be a notebook computer or a desktop computer.
  • the client device 212 and the client device 213 may be notebook computers, desktop computers, tablet computers, smart phones, and so on.
  • the teacher in charge of the lecture uses the presentation during the lecture.
  • the media stream may be an audio stream of the teacher's lecture.
  • the presentation used by the teacher during the lecture can be uploaded to the course server 201 through auxiliary streams.
  • the course server 201 generates course data according to the received audio stream and auxiliary stream.
  • the media stream may be a video stream of the teacher during the lecture.
  • the media stream can include both the audio information and the picture information of the teacher during the lecture.
  • the media stream uploaded to the course server 201 is the video stream.
  • the teacher uses a presentation during the lecture and uses an output device (such as a projector, a television, etc.) to show the presentation.
  • the picture information in the media stream includes the presentation presented by the teacher. Therefore, the presentation is included in the video stream uploaded to the course server 201.
  • the course server 201 can directly determine the course data according to the video stream.
  • the presentation used by the teacher during the lecture can be uploaded to the course server 201 by way of auxiliary streams.
  • the course server 201 can generate course data according to the collected video stream and the auxiliary stream.
  • the specific content of the course data is similar to the specific content of the conference data, and the details are not repeated here for brevity.
  • Fig. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
  • the method shown in FIG. 3 can be executed by a video segmentation device.
  • the video segmentation device may be a computer device that can implement the method provided in the embodiments of this application, such as a personal computer, a notebook computer, a tablet computer, or a server; it may also be internal hardware of such a computer device, such as a graphics card or a graphics processing unit (GPU), or a dedicated device for implementing the method provided in the embodiments of this application.
  • the video segmentation device may be the conference control server 101 in the system shown in FIG. 1.
  • the video segmentation device may be a conference terminal uploading a media stream in the system shown in FIG. 1 or a piece of hardware in the conference terminal.
  • the video segmentation apparatus may be the main device 211 in the system shown in FIG. 2 or a piece of hardware provided in the main device 211.
  • the video segmentation device may be the course server 201 or a piece of hardware in the course server 201 in the embodiment shown in FIG. 2.
  • the video segmentation device acquires text information of a to-be-processed video and voice information of the to-be-processed video, where the text information includes at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video.
  • the presentation refers to the presentation presented by the speaker of the conference during the speech.
  • the embodiment of the present application does not limit the file format of the presentation; any file displayed on the display device during the speaker's speech may serve as the presentation.
  • the presentation may be in ppt format or pptx format.
  • the presentation can be a PDF format.
  • the presentation may also be in word format or txt format.
  • the content description information is information, uploaded by the speaker or the host of the meeting before the meeting starts, that describes the content of the speech.
  • the content description information includes an outline, abstract, and/or key information of the speaker's speech content in the video conference.
  • the content description information may include keywords of the speaker's speech content.
  • the content description information may include a summary of the content of the speaker's speech.
  • the content of the speaker's speech may include multiple parts, and the content description information may include the subject, abstract, and/or keywords of each part of the multiple parts.
  • the voice information may include the text obtained by performing voice-to-text conversion on the speaker's speech.
  • the embodiment of the present application does not limit the specific implementation of the voice-text conversion, as long as the recognized voice can be converted into the corresponding text.
  • the voice information may also include at least one pause point obtained by performing voice recognition on the speaker's speech. The pause point represents the natural pause of the speaker in the process of speaking.
  • the video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information.
  • the text information may include at least one of the presentation and the content description information.
  • the text information can have the following three situations:
  • Case 1 The text information only includes the presentation
  • Case 2 The text information only includes the content description information
  • Case 3 The text information includes both the presentation and the content description information.
  • in some cases, the speaker may only show the presentation during the speech without uploading the content description information in advance; therefore, the above case 1 may occur. In other cases, the speaker may only upload the content description information in advance without showing the presentation during the speech; therefore, the above case 2 may occur. In still other cases, the speaker may both show the presentation during the speech and upload the content description information in advance; therefore, the above case 3 may occur.
  • the video segmentation device may determine the segmentation point of the video to be processed according to the presentation.
  • the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information.
  • the video segmentation device may determine the segmentation point of the video to be processed according to one of the presentation and the content description information.
  • the video segmentation device can determine the segmentation point of the video to be processed according to the presentation or the content description information.
  • the video segmentation device may determine the presentation duration of the current page of the presentation, and based on that presentation duration, determine whether to determine the segmentation point of the video to be processed according to the presentation or according to the content description information.
  • the video segmentation device may determine the segmentation point of the video to be processed based on the content description information and the voice information when the presentation duration of the current page of the presentation is longer than the first preset duration. In this way, it is possible to avoid a video segment being too long because the speaker shows the same content for a long time.
  • the first preset duration can be set as required.
  • the first preset duration may be 10 minutes.
  • the first preset duration may be 15 minutes.
  • the video segmentation device may determine the segmentation point of the video to be processed according to the content description information and the voice information when the presentation duration of the current page of the presentation is less than or equal to the second preset duration.
  • the second preset duration can be set as needed.
  • the second preset duration may be 20 seconds.
  • the second preset duration may be 10 seconds.
  • the first preset duration is greater than the second preset duration.
  • the video segmentation device may, in the case where the presentation duration of the current page of the presentation is greater than the second preset duration and less than or equal to the first preset duration, according to the presentation The document and the voice information determine the segmentation point of the video to be processed.
  • only the first preset duration may be set. If the presentation duration of the presentation on the current page is greater than the first preset duration, the segmentation point of the video to be processed is determined according to the content description information and the voice information. If the presentation duration of the current page of the presentation is not greater than the first preset duration, the segmentation point of the video to be processed may be determined according to the presentation and the voice information. The presentation duration of the current page of the presentation is the duration of the presentation staying on the current page.
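The duration-based choice described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name and the example threshold values (15 minutes and 20 seconds, taken from the examples in this application) are hypothetical.

```python
def choose_basis(page_duration_s, first_preset=15 * 60, second_preset=20):
    """Pick which information drives segmentation for the current page."""
    if page_duration_s > first_preset:
        # The speaker lingered too long on one page: segment using the
        # content description information and the voice information.
        return "content_description"
    if page_duration_s <= second_preset:
        # The page was flipped through too quickly: also fall back to the
        # content description information and the voice information.
        return "content_description"
    # Normal dwell time: segment using the presentation and voice information.
    return "presentation"
```

With only the first preset duration configured, the second branch would simply be dropped.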
  • the start moment of the presentation duration of the current page of the presentation is the moment when the presentation is switched to the current page
  • the end moment of the presentation duration of the current page of the presentation is the moment when the presentation is switched from the current page to another page.
  • the video segmentation device may start timing from time T1. If the timing duration exceeds the first preset duration and the presentation has not been switched to page n+1, the video segmentation device may determine the segmentation point of the to-be-processed video based on the content description information and the voice information. If the presentation is switched to page n+1 at time T2 (T2 is greater than T1), and the duration from time T1 to time T2 is less than or equal to the second preset duration, the video segmentation device can determine the segmentation point of the video to be processed according to the content description information and the voice information.
  • the video segmentation device may determine the segmentation point of the to-be-processed video based on the presentation and the voice information. More specifically, the video segmentation device can determine the segmentation point of the video to be processed according to the nth page of the presentation and the voice information.
  • the start time of the presentation duration of the current page of the presentation may be the previous segment point
  • the end time of the presentation duration of the current page of the presentation is the moment when the presentation is switched from the current page to another page.
  • n is a positive integer equal to or greater than 1
  • the video segmentation device determines, according to the content description information and the voice information, that time T4 is a segmentation point of the video to be processed.
  • the video segmentation device may start timing from time T4. If the timing duration exceeds the first preset duration and the presentation has not been switched to page n+1, the video segmentation device may determine the segmentation point of the to-be-processed video based on the content description information and the voice information.
  • the video segmentation device can determine the segmentation point of the video to be processed according to the presentation and the voice information. More specifically, the video segmentation device can determine the segmentation point of the video to be processed according to the presentation on the nth page and the voice information.
  • the video segmentation device may determine the segmentation point of the to-be-processed video based on the presentation and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation device may refer only to the presentation and the voice information (that is, the content description information is not used) to determine the segmentation point of the video to be processed.
  • the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information.
  • the video segmentation device may refer only to the content description information and the voice information (that is, the presentation is not used) to determine the segmentation point of the video to be processed.
  • the video segmentation device determining the segmentation point of the video to be processed according to the presentation and the voice information may include: the video segmentation device determines the switching point of the presentation, where the content of the presentation differs before and after the switching point; the video segmentation device determines at least one pause point based on the voice information; and the video segmentation device determines the segmentation point based on the switching point and the at least one pause point.
  • the switching point of the presentation refers to the moment when the presentation is switched.
  • the switching of the presentation can refer to the page turning of the presentation. For example, switch from page 1 to page 2.
  • the switching of the presentation can also mean that the content of the presentation changes without turning pages.
  • the speaker may only show a part (for example, the upper half) of a certain page of the presentation, and then scroll to the remaining part (for example, the lower half) of the page.
  • although the presentation does not turn a page at this time, the content in the presentation has changed.
  • the video segmentation device may obtain a switching signal for instructing to switch the content of the presentation.
  • the video segmentation device may determine that the moment when the switching signal is acquired is the switching point.
  • the video segmentation device may obtain the content of the presentation.
  • the video segmentation device may determine the switching point according to the change of the content of the presentation. For example, the video segmentation device may determine that the first moment is the switching point when it determines that the content of the presentation presented at the first moment of the video to be processed is different from the content of the presentation presented at the second moment.
  • the first moment and the second moment are adjacent moments, and the first moment is before the second moment.
  • the first time is before the second time and the interval between the first time and the second time is less than a preset time length. In other words, in this case, the video segmentation device can detect whether the content presented by the presentation has changed every period of time.
  • the video segmentation device may determine the switching point in combination with the acquired switching signal for instructing to switch the content of the presentation and the content presented by the presentation.
  • the video segmentation device acquires the switching signal at time T1.
  • the video segmentation device may obtain the content presented by the presentation in the F1 frames before time T1 and in the F2 frames after time T1, where F1 and F2 are positive integers greater than or equal to 1. F1 and F2 may take smaller values; for example, F1 and F2 may both be equal to 2, which can reduce the amount of calculation. If the content presented by the presentation differs between two consecutive frames among the F1 and F2 frames, the moment of the frame where the content changes may be determined as the switching point.
  • the video segmentation device may determine whether the content presented by the presentation at different moments (or in different frames) is the same in the following manner: the video segmentation device counts the number P of positions at which the change of the pixel value at the same position in the presentation at different moments (or in different frames) exceeds a preset change value. If P is greater than a first preset threshold P1, the video segmentation device determines that the content presented by the presentation has changed.
  • the change in the pixel value can be determined by calculating the absolute value of the difference between the pixel gray values.
  • the change of the pixel value can be determined by calculating the sum of the absolute values of the differences in the three color channels.
  • the video segmentation device may determine keywords based on subsequent presentations. For example, the video segmentation device determines that, between the presentation at time T1 and the presentation at time T2 (time T2 is later than time T1), the number of positions at which the pixel value change at the same position exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation device may determine keywords according to the presentation at time T2.
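The pixel-change comparison above can be sketched as follows. This is a minimal illustration under stated assumptions: frames are modeled as flat sequences of grayscale values of equal length, and the change threshold and P1 are hypothetical parameters.

```python
def content_changed(frame_a, frame_b, change_thresh, p1):
    """Return True when the number P of co-located pixels whose grayscale
    values differ by more than change_thresh exceeds the threshold P1."""
    p = sum(1 for a, b in zip(frame_a, frame_b) if abs(a - b) > change_thresh)
    return p > p1
```

For color frames, the absolute grayscale difference could be replaced by the sum of absolute differences over the three color channels, as the text also suggests.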
  • the voice information may also include at least one pause point.
  • the at least one pause point used to determine the segment point may be all the pause points from the start time to the current time. If the segment point determined in step 302 is the first segment point of the to-be-processed video, the start time is the start time of the to-be-processed video. If the segment point determined in step 302 is the kth segment point of the to-be-processed video (k is a positive integer greater than or equal to 2), the start time is the moment at which the (k-1)th segment point is located.
  • the video segmentation device may also determine a pause point within a time range determined according to the moment at which the switching point is located, where the switching point falls within this time range. For example, if the switching point is at time T1, the video segmentation device can determine the pause point from time T1-t to time T1+t.
  • the video segmentation device determines that the switching point is the segmentation point when it determines that the switching point is the same as one of the at least one pause point. When it determines that the switching point is not the same as any one of the at least one pause point, the video segmentation device determines that the pause point closest to the switching point among the at least one pause point is the segmentation point.
  • the distance between a pause point and the switching point refers to the time difference between the pause point and the switching point. For example, assuming that the switching point is at time T1 and one pause point of the at least one pause point is at time T2, the distance is the difference t between T2 and T1.
  • the pause point at time T2 is the segment point. If the distances from two of the at least one pause point to the switching point are equal and less than the distances from the switching point to the other pause points, then either of the two pause points can be determined as the segment point.
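The alignment of the switching point with the nearest pause point can be sketched as follows; times are modeled as plain numbers, and a tie between two equally close pause points is resolved arbitrarily, since either candidate is acceptable per the text.

```python
def pick_segment_point(switch_t, pause_points):
    """Return the switching point itself when it coincides with a pause
    point; otherwise return the pause point closest in time to it."""
    if switch_t in pause_points:
        return switch_t
    return min(pause_points, key=lambda p: abs(p - switch_t))
```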
  • the video segmentation device determining the segmentation point of the video to be processed according to the content description information and the voice information may include: the video segmentation device determines the segmentation point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information.
  • the to-be-processed video may be divided into multiple voice information segments.
  • the first voice information segment and the second voice information segment are two consecutive voice information segments among the multiple voice information segments.
  • the first voice information segment follows the second voice information segment.
  • the video segmentation device can determine a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the first segmentation point is one of at least one segmentation point included in the to-be-processed video.
  • the video segmentation device can intercept text fields from the voice information using a window of length W and a step size S.
  • the video segmentation device can cut out at least one text field of length W.
  • Each text field of length W is a piece of voice information.
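The window-based interception above can be sketched as follows. This is a minimal sketch assuming W and S are measured in words; the text does not fix the unit, so this is an assumption.

```python
def text_windows(words, w, s):
    """Slice the recognized transcript into voice information segments
    using a window of length w words advanced by a step of s words.
    Transcripts shorter than w yield a single truncated window."""
    return [words[i:i + w] for i in range(0, max(len(words) - w + 1, 1), s)]
```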
  • the video segmentation device can determine whether the first voice information segment is similar to the second voice information segment. If the first voice information segment is not similar to the second voice information segment, it can be determined that a segment point of the to-be-processed video is near the first voice information segment. If the second voice information segment is similar to the first voice information segment, the device continues to determine whether a third voice information segment, which is adjacent to the first voice information segment and located after it, is similar to the first voice information segment.
  • the similarity degree can be used as a criterion for measuring whether the first voice information segment is similar to the second voice information segment. If the similarity between the first voice information segment and the second voice information segment is greater than or equal to a similarity threshold, the two segments can be considered similar; if the similarity is less than the similarity threshold, they can be considered not similar.
  • the video segmentation device may determine the similarity between the first voice information segment and the second voice information segment based on the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information.
  • the video segmentation device can determine the keywords of the first voice information segment. Assuming that the number of keywords determined from the first voice information segment is N, the number of keywords determined from the content description information is M, and there are no duplicate keywords among the M keywords and the N keywords.
  • the video segmentation device can determine keywords according to the following methods:
  • Step 1 According to a preset stop word list or according to the part of speech of each word in the text, remove words that do not carry actual meaning, such as "this", "then", and similar filler words. Stop words are manually specified characters or words rather than automatically generated ones; they carry no actual meaning and are filtered out before or after natural language data is processed. A set of stop words can be called a stop word list.
  • Step 2 Count the frequency of each of the remaining words in the text.
  • the frequency of each word in the text can be determined according to the following formula:
  • TF(n) = N(n) / All_N (formula 1.1)
  • where TF(n) represents the frequency in the text of the nth word among the words remaining after step 1, N(n) represents the number of times the nth word appears, and All_N represents the total number of remaining words.
  • Step 3 Determine at least one word with the highest frequency as a keyword of the text.
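Steps 1 to 3 can be sketched as follows. The stop-word list here is a toy example, and ranking by raw count is equivalent to ranking by the frequency TF(n) of formula 1.1 because the denominator All_N is shared by all words.

```python
from collections import Counter

STOP_WORDS = {"the", "this", "then", "to", "a"}  # toy stop-word list

def top_keywords(words, k):
    """Step 1: drop stop words; step 2: count occurrences; step 3: return
    the k most frequent remaining words as keywords of the text."""
    remaining = [w for w in words if w not in STOP_WORDS]
    counts = Counter(remaining)
    return [w for w, _ in counts.most_common(k)]
```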
  • the text is content description information
  • the M words with the highest appearance frequency are keywords of the content description information, where M is a positive integer greater than or equal to 1.
  • the text is the first voice information segment
  • the N words with the highest occurrence frequency are keywords of the first voice information segment
  • N is a positive integer greater than or equal to 1.
  • if it is determined that one or more of the N words of the first voice information segment are the same as keywords of the content description information, the repeated words are deleted from the N words, and subsequent words are selected as keywords of the first voice information segment. For example, assuming that N is equal to 2 and M is equal to 1, the keywords of the content description information include "student".
  • assuming the word with the highest frequency in the first voice information segment is "student", the device continues to determine the word with the second highest frequency. If the word with the second highest frequency is "school", it can be determined that "school" is a keyword of the first voice information segment, and the device continues to determine the word with the third highest frequency. Assuming that the word with the third highest frequency is "course", it can be determined that "course" is another keyword of the first voice information segment. If the text is the second voice information segment, it can be determined that the N words with the highest frequency are keywords of the second voice information segment, where N is a positive integer greater than or equal to 1. If one or more of the N words of the second voice information segment are the same as keywords of the content description information, the repeated words are deleted from the N words, and subsequent words are selected as keywords of the second voice information segment.
  • the video segmentation device may determine the first keyword vector based on the keywords of the first voice information segment, the keywords of the content description information, and the content of the first voice information segment. Specifically, the video segmentation device may determine the frequencies of the keywords of the first voice information segment and the keywords of the content description information in the content of the first voice information segment; these frequencies constitute the first keyword vector.
  • the content of the voice information fragment refers to all the words included in the voice information fragment. For example, suppose that the keyword of the content description information is "student", and the keywords of the first voice information fragment are "course" and "school”. Assuming that the frequencies of the above three keywords in the first voice information segment are 0.1, 0.2, and 0.3, respectively, the first keyword vector is (0.3, 0.2, 0.1).
  • the video segmentation device may also determine the second keyword vector based on the keywords of the second voice information segment, the keywords of the content description information, and the content of the second voice information segment. Specifically, the video segmentation device may determine the frequencies of the keywords of the second voice information segment and the keywords of the content description information in the content of the second voice information segment; these frequencies constitute the second keyword vector. For example, suppose that the keyword of the content description information is "student", and the keywords of the second voice information segment are "breakfast" and "nutrition". Assuming that the frequencies of the above three keywords appearing in the second voice information segment are 0.3, 0.25, and 0.05, respectively, the second keyword vector is (0.3, 0.25, 0.05).
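The keyword-vector construction can be sketched as follows; the word lists and keyword order are illustrative, not fixed by this application.

```python
def keyword_vector(keywords, segment_words):
    """Frequency of each keyword within the segment's word list, in the
    given keyword order; these frequencies form the keyword vector."""
    total = len(segment_words)
    return [segment_words.count(k) / total for k in keywords]
```

For instance, a ten-word segment containing "school" three times, "course" twice, and "student" once yields the vector (0.3, 0.2, 0.1) for the keywords ("school", "course", "student"), matching the example above.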
  • the video segmentation device determines the distance based on the first keyword vector and the second keyword vector. If the distance is greater than a preset distance, the similarity between the first voice information segment and the second voice information segment can be considered less than the similarity threshold. In this case, the video segmentation device determines the segmentation point according to the first voice information segment.
  • the video segmentation device may determine a distance based on the first keyword vector and the second keyword vector in the following manner:
  • Step 1 Expand the first keyword vector into a first vector, and expand the second keyword vector into a second vector, wherein the keywords corresponding to the first vector and the keywords corresponding to the second vector each include all the keywords corresponding to the first keyword vector and the second keyword vector
  • the first keyword vector is (0.3,0.2,0.1)
  • the corresponding keywords are “school”, “course” and “student”
  • the second keyword vector is (0.3,0.25,0.05)
  • the corresponding keywords are “student”, “breakfast” and "nutrition”.
  • the first vector is (0.3,0.1,0,0.2,0)
  • the corresponding keywords are "school”, “student”, “breakfast”, “course”, “nutrition”
  • the second vector is (0,0.3,0.25,0,0.05)
  • the corresponding keywords are "school”, “student”, “breakfast”, “course”, and “nutrition”.
  • Step 2 Calculate the distance between the first vector and the second vector.
  • the distance between the first vector and the second vector is the distance determined according to the first keyword vector and the second keyword vector.
  • the distance between the first vector and the second vector may be the Euclidean distance. Because the two voice information segments may share few identical keywords, using the cosine distance could yield many zero values in the calculation. Therefore, the Euclidean distance may be a more appropriate choice for the distance between the first vector and the second vector.
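Steps 1 and 2 can be sketched together as follows: both keyword vectors are expanded onto the union of their keywords (zeros filled in for absent keywords), and the Euclidean distance is taken, as the text prefers over the cosine distance. Sorting the union alphabetically is an arbitrary but consistent ordering choice.

```python
import math

def expanded_distance(vec_a, keys_a, vec_b, keys_b):
    """Expand two keyword vectors onto the union of their keywords and
    return the Euclidean distance between the expanded vectors."""
    union = sorted(set(keys_a) | set(keys_b))
    fa = dict(zip(keys_a, vec_a))
    fb = dict(zip(keys_b, vec_b))
    first = [fa.get(k, 0.0) for k in union]
    second = [fb.get(k, 0.0) for k in union]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(first, second)))
```

Using the document's example vectors (0.3, 0.2, 0.1) for ("school", "course", "student") and (0.3, 0.25, 0.05) for ("student", "breakfast", "nutrition"), the squared distance works out to 0.235.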
  • the distance between the first vector and the second vector may be a cosine distance.
  • the first keyword vector and the second keyword vector may also be term frequency-inverse document frequency, binary term frequency, etc.
  • Determining the distance between the first keyword vector and the second keyword vector may be determining the n-norm distance of the first keyword vector and the second keyword vector (n is a positive integer greater than or equal to 1), or determining the relative entropy distance between the first keyword vector and the second keyword vector.
  • the first vector and the second vector are binarized.
  • the first vector after the binarization process is (1,1,0,1,0)
  • the second vector after the binarization process is (0,1,1,0,1).
  • the 1-norm distance is calculated to obtain the repetition degree of the keywords of the first voice information segment and the keywords of the second voice information segment.
  • the repetition of keywords can be considered a special form of distance.
  • the degree of repetition of the keywords can be used to determine whether the first voice information segment is similar to the second voice information segment.
  • If the repetition degree of the keywords is greater than or equal to a preset repetition degree, it can be considered that the first voice information segment is similar to the second voice information segment; if the repetition degree of the keywords is less than the preset repetition degree, it can be considered that the first voice information segment and the second voice information segment are not similar. It can be seen that in this case, the preset degree of repetition can be regarded as the similarity threshold.
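The binarized 1-norm computation can be sketched as follows. Interpreting a smaller 1-norm distance as a higher keyword repetition degree is an assumption consistent with the text, which treats the repetition degree as a special form of distance.

```python
def keyword_l1(keys_a, keys_b):
    """Binarize both keyword sets over their union (1 if present, 0 if
    absent) and return the 1-norm distance: the number of keywords that
    appear in exactly one of the two segments."""
    union = set(keys_a) | set(keys_b)
    return sum(abs(int(k in keys_a) - int(k in keys_b)) for k in union)
```

For the binarized example vectors (1,1,0,1,0) and (0,1,1,0,1) above, four positions differ, giving a 1-norm distance of 4 with only one shared keyword ("student").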
  • the extraction of keywords may also be determined according to the term frequency-inverse document frequency.
  • the word frequency can be determined based on formula 1.1.
  • the inverse document frequency can be determined according to the following formula:
  • IDF(n) = log(Num_Doc/(Doc(n)+1)) (formula 1.2)
  • IDF(n) represents the inverse document frequency of the nth word
  • Num_Doc represents the total number of documents in the corpus
  • Doc(n) represents the number of documents containing the nth word in the corpus.
  • Term frequency-inverse document frequency can be determined according to the following formula:
  • TF-IDF(n) = TF(n) × IDF(n)
  • where TF-IDF(n) represents the term frequency-inverse document frequency of the nth word. If the keywords are determined based on the term frequency-inverse document frequency, the first keyword vector is composed of the term frequency-inverse document frequencies of the keywords.
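Formulas 1.1 and 1.2 combine into the TF-IDF weight; a minimal sketch follows, where the documents and corpus are toy examples and words are pre-tokenized lists.

```python
import math

def tf_idf(word, doc_words, corpus_docs):
    """TF-IDF of a word: TF(n) = N(n)/All_N (formula 1.1) multiplied by
    IDF(n) = log(Num_Doc/(Doc(n)+1)) (formula 1.2)."""
    tf = doc_words.count(word) / len(doc_words)
    doc_n = sum(1 for d in corpus_docs if word in d)  # Doc(n)
    idf = math.log(len(corpus_docs) / (doc_n + 1))
    return tf * idf
```

The +1 in the IDF denominator matches formula 1.2 and avoids division by zero for words absent from the corpus.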
  • the extraction of keywords may also be based on a text ranking (TextRank) method of word maps. If the keyword is determined according to the TextRank based on the word graph, the first keyword vector may be composed of the weight of the word.
  • the video segmentation device may determine the segmentation point according to the first voice information segment.
  • the video segmentation device may first determine whether a pause point is included in the first voice information segment. If the first voice information segment includes one pause point, it can be determined that the pause point is the segment point. If the first voice information segment includes multiple pause points, it can be determined whether the word after each of the multiple pause points is a preset word.
  • the preset words include conjunctions with segmentation meaning, such as "next", "below", "next point", and so on.
  • the word after the pause refers to the word adjacent to the pause after the pause. If only one word after one of the multiple pause points is a preset word, it can be determined that the pause point is a segmentation point.
  • If the word following at least two of the multiple pause points is a preset word, it can be determined that the pause point with the longest pause duration among those at least two pause points is the segment point. If none of the words following the multiple pause points is a preset word, it can be determined that the pause point with the longest pause duration among the multiple pause points is the segment point. If the first voice information segment does not include a pause point, the segment point may be determined according to the pause points adjacent to the first voice information segment. It can be understood that there may be two pause points adjacent to the first voice information segment, one located before the first voice information segment and the other located after it. The video segmentation device may determine the segmentation point according to the distances between these two pause points and the first voice information segment.
  • the distance between the pause point and the first voice information segment may be the number of words or the time difference between the pause point and the start position of the first voice information segment. If the pause point is after the first voice information segment, the distance between the pause point and the first voice information segment may be the number of words or the time difference between the pause point and the end position of the first voice information segment.
  • the pause point located before the first voice information segment and adjacent to it is referred to as the pre-pause point, and the distance between the pre-pause point and the first voice information segment is referred to as distance 1;
  • the pause point located after the first voice information segment and adjacent to it is called the post-pause point, and the distance between the post-pause point and the first voice information segment is called distance 2. If distance 1 is less than distance 2, the pre-pause point can be determined to be the segment point; if distance 1 is greater than distance 2, the post-pause point can be determined to be the segment point.
  • word 1: the word after the pre-pause point
  • word 2: the word after the post-pause point
  • the pause point is a natural pause of the speaker; therefore, a pause point spans a certain duration.
  • the intermediate moment of the pause point may be determined as the segment point.
  • the end time of the pause point may be determined as the segment point.
  • the starting moment of the pause point may be determined as the segment point.
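The pause-point rules above (a single pause point inside the segment wins outright; exactly one pause point followed by a preset word wins; otherwise the longest pause among the candidates wins; with no pause point in the segment, the nearer adjacent pause point wins) can be sketched as below. The dictionary layout of a pause point is an assumption for illustration, not part of the patent:

```python
def choose_pause_point(pauses_in_segment, preset_words,
                       pre_pause=None, post_pause=None,
                       dist_pre=None, dist_post=None):
    """Pick the segmentation pause point following the rules in the text.

    Each pause point is a dict (illustrative layout):
      {"t": time, "duration": pause length, "next_word": word right after it}.
    """
    if pauses_in_segment:
        if len(pauses_in_segment) == 1:
            # a single pause point inside the segment is the segment point
            return pauses_in_segment[0]
        flagged = [p for p in pauses_in_segment if p["next_word"] in preset_words]
        if len(flagged) == 1:
            # exactly one pause point is followed by a preset word
            return flagged[0]
        # two or more flagged pause points (or none): longest pause wins
        candidates = flagged if flagged else pauses_in_segment
        return max(candidates, key=lambda p: p["duration"])
    # no pause point inside the segment: take the nearer adjacent pause point
    return pre_pause if dist_pre < dist_post else post_pause
```

The returned pause point can then be mapped to its start, middle, or end moment as described above.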
  • the video segmentation device segments the to-be-processed video according to the segmentation point.
  • If the segment point is the first segment point, the start time of the segment is the start time of the video to be processed, and the end time of the segment is the segment point. If the segment point is the k-th segment point of the video to be processed (k is a positive integer greater than or equal to 2), then the start time of the segment is the (k-1)-th segment point, and the end time of the segment is the k-th segment point.
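The mapping from segment points to segments described above can be sketched as follows. How the final segment (from the last segment point to the end of the video) is closed is not specified in this passage, so the sketch only emits the segments bounded by segment points:

```python
def build_segments(video_start, segment_points):
    """Turn an ordered list of segment points into (start, end) intervals.

    The first segment starts at the video start and ends at the first segment
    point; segment k (k >= 2) starts at segment point k-1 and ends at
    segment point k, as described in the text.
    """
    segments = []
    start = video_start
    for point in segment_points:
        segments.append((start, point))
        start = point
    return segments
```

For example, segment points at 10 s, 25 s, and 40 s yield the intervals (0, 10), (10, 25), and (25, 40).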
  • the video segmentation device can also determine the summary of the segment.
  • the video segmentation device may determine a summary of the segment according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text.
  • the target text includes at least one of the presentation and the content description information.
  • the video segmentation device may first determine a third keyword vector, and then determine the segment summary according to the third keyword vector.
  • the video segmentation device can determine the third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text, where the content of the segmented voice information refers to all the sentences that compose the voice information of the segment.
  • If the text information only includes the presentation, the target text includes the presentation; if the text information only includes the content description information, the target text includes the content description information; if the text information includes both the presentation and the content description information, the target text includes the presentation and the content description information.
  • The manner in which the video segmentation device determines the keywords of the segmented voice information, the manner in which it determines the keywords of the target text, and the manner in which it determines the keywords of the first voice information segment are similar.
  • the video segmentation device can determine the keywords of the target text according to the later presentation. For example, the video segmentation device determines that, between the presentation at time T1 and the presentation at time T2 (time T2 being later than time T1), the number of pixels at the same position whose value change exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation device may determine the keywords of the target text based on the presentation at time T2.
  • the number of keywords determined from the presentation is L
  • the number of keywords determined from the content description information is M
  • the number of keywords determined from the segmented voice information is Q
  • there are no duplicate keywords among the L keywords, the M keywords, and the Q keywords.
  • the video segmentation device may first determine M keywords from the content description information, and then determine the L words that appear most frequently in the presentation. If one or more of the L words also belong to the M keywords, those words are deleted from the L words, and the next most frequent word is then determined from the presentation, until the determined L keywords have no intersection with the M keywords. After that, the video segmentation device determines Q words from the segmented voice information. If one or more of the Q words belong to the M keywords or the L keywords, those words are deleted from the Q words, and the next most frequent word is then determined from the segmented voice information, until the determined Q keywords overlap with neither the L keywords nor the M keywords.
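The selection procedure above — M keywords from the content description information first, then L from the presentation excluding those, then Q from the segmented voice information excluding both — can be sketched as follows, using raw word frequency as the ranking criterion (one of the options the document allows). The function names are illustrative:

```python
from collections import Counter

def top_keywords(words, k, exclude=frozenset()):
    """Pick the k most frequent words that were not already chosen elsewhere."""
    counts = Counter(w for w in words if w not in exclude)
    return [w for w, _ in counts.most_common(k)]

def select_disjoint_keywords(description_words, presentation_words, speech_words, m, l, q):
    """M, L, Q keyword sets with no duplicates across the three sources."""
    m_kw = top_keywords(description_words, m)
    l_kw = top_keywords(presentation_words, l, exclude=set(m_kw))
    q_kw = top_keywords(speech_words, q, exclude=set(m_kw) | set(l_kw))
    return m_kw, l_kw, q_kw
```

Excluding earlier picks before counting is equivalent to the "delete and take the next most frequent word" loop in the text.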
  • the third keyword vector includes the frequencies of the Q keywords, the L keywords, and the M keywords in the segmented voice information. It is understandable that if the target text does not include the content description information, the value of M is 0; if the target text does not include the presentation, the value of L is 0.
  • the video segmentation device may determine the summary of the segment according to the determined third keyword vector.
  • the video segmentation device may determine the reference text according to the target text and the content of the segmented voice information, where the reference text includes J sentences, and J is a positive integer greater than or equal to 1; determine J keyword vectors according to the keywords of the segmented voice information, the keywords of the target text, and each sentence in the J sentences; and determine the summary of the segment according to the third keyword vector and the J keyword vectors.
  • the j-th keyword vector in the J keyword vectors is the frequency of occurrence of the keywords of the segmented voice information and the keywords of the target text in the j-th sentence.
  • If the target text includes redundant sentences, the redundant sentences are deleted before the target text is combined with the content of the segmented voice information to obtain the reference text; if the target text does not include redundant sentences, the target text is directly combined with the content of the segmented voice information to obtain the reference text.
  • one or more sentences in the presentation may also appear in the content description information.
  • In this case, the video segmentation device may delete the one or more sentences in the presentation that are the same as those in the content description information, and then merge the presentation with the redundant sentences deleted, the content description information, and the content of the segmented voice information to obtain the reference text.
  • If the target text does not include redundant sentences, for example, no sentence in the presentation appears in the content description information, or the target text includes only one of the presentation and the content description information, then the target text can be directly combined with the content of the segmented voice information to obtain the reference text.
  • the video segmentation device determines the summary of the segment according to the third keyword vector and the J keyword vectors, including: the video segmentation device determines J distances according to the third keyword vector and the J keyword vectors, where the j-th distance in the J distances is determined according to the third keyword vector and the j-th keyword vector in the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determines the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determines the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
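The summary-extraction step above can be sketched as follows: compute the Euclidean distance from the third keyword vector to each of the J sentence vectors, and keep the sentences with the R shortest distances. Returning the chosen sentences in their original order is a presentational choice, not something the document specifies:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def extract_summary(third_vector, sentence_vectors, sentences, r):
    """Return the R sentences whose keyword vectors are closest to the segment vector."""
    distances = [(euclidean(third_vector, v), j) for j, v in enumerate(sentence_vectors)]
    closest = sorted(distances)[:r]
    # keep the selected sentences in their original order for readability
    return [sentences[j] for j in sorted(j for _, j in closest)]
```

A sentence whose vector equals the third keyword vector has distance 0 and is always selected first.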
  • The specific implementation manner in which the video segmentation device determines the j-th distance according to the third keyword vector and the j-th keyword vector is similar to the implementation in which it determines the distance according to the first keyword vector and the second keyword vector. The difference is: the j-th distance determined according to the third keyword vector and the j-th keyword vector is the Euclidean distance, whereas the distance determined according to the first keyword vector and the second keyword vector may be the Euclidean distance or the cosine distance.
  • The reason why the j-th distance determined by the third keyword vector and the j-th keyword vector cannot be the cosine distance is that computing the cosine distance normalizes the j-th keyword vector, but the modulus length of the j-th keyword vector reflects the overall frequency of the keywords in the j-th sentence, so it should not be normalized away.
  • In the above, the vectors are all composed of the frequencies (that is, the word frequencies) of the keywords in a specific text.
  • the aforementioned vectors may also be determined according to word vectors (word to vector, word2vec).
  • the first keyword vector can be determined by the following steps: use word2vec to determine the word vector of each keyword; add the word vectors of all keywords and average them to obtain the first keyword vector.
  • the second keyword vector and the first keyword vector are determined in a similar manner, so there is no need to repeat them here.
  • the third keyword vector can be determined by the following steps: use word2vec to determine the word vector of each keyword; determine the word frequency of each keyword; compute a frequency-weighted average of the word vectors of all keywords to obtain the third keyword vector.
  • the j-th keyword vector can be determined by the following steps: segment the j-th sentence into words and remove stop words; use word2vec to determine the word vector of each remaining word; add all the word vectors and average them to obtain the j-th keyword vector.
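The plain-average construction (for the first, second, and j-th keyword vectors) and the frequency-weighted average (for the third keyword vector) described above can be sketched as follows; the toy embedding table stands in for a real word2vec model:

```python
def average_vector(words, embeddings):
    """Plain mean of the word vectors of the given words."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def weighted_average_vector(words, freqs, embeddings):
    """Word-frequency-weighted mean of word vectors (third keyword vector)."""
    dim = len(next(iter(embeddings.values())))
    total = sum(freqs.get(w, 0) for w in words if w in embeddings)
    if total == 0:
        return [0.0] * dim
    out = [0.0] * dim
    for w in words:
        if w in embeddings:
            f = freqs.get(w, 0)
            for d in range(dim):
                out[d] += f * embeddings[w][d]
    return [x / total for x in out]
```

With a real model, `embeddings` would be replaced by lookups into a trained word2vec vocabulary.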
  • In this case, the distance between the third keyword vector and the j-th keyword vector may be the cosine distance.
  • Fig. 4 is a schematic diagram of a conference process provided according to an embodiment of the present application.
  • the conference terminal 1 transmits audio and video stream 1 to the conference control server.
  • the conference terminal 2 transmits the audio and video stream 2 to the conference control server.
  • the conference terminal 3 transmits the audio and video stream 3 to the conference control server.
  • the conference control server determines the main conference site.
  • the main conference site determined by the conference control server is the conference site where the conference terminal 1 is located.
  • the conference control server sends the conference data to the conference terminal 2 and the conference terminal 3.
  • the conference terminal 2 and the conference terminal 3 store conference data.
  • the conference control server may also send conference data to the conference terminal 1, and the conference terminal 1 may also store the conference data.
  • the conference control server segments the audio and video stream 1 in real time (that is, determines the segment point) and extracts a summary of each segment.
  • the conference control server sends the segment point and summary to the conference terminal 2 and the conference terminal 3. In this way, the conference terminal 2 and the conference terminal 3 can independently select the review point to play the review video. Of course, in some implementations, the conference control server may also send the segment point and summary to the conference terminal 1.
  • Fig. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
  • the video segmentation device determines whether the meeting reservation includes text related to the meeting content. In other words, the video segmentation device can determine whether the to-be-processed video includes content description information. If the result of the determination is yes (that is, the video to be processed includes content description information), step 502 is executed; if the result of the determination is no (that is, the video to be processed does not include content description information), then step 503 is executed.
  • the video segmentation device extracts keywords related to the content of the conference. In other words, the video segmentation device determines the keywords of the content description information.
  • After step 502 is executed, step 503 may be executed.
  • the video segmentation device determines whether there is a presentation displayed on a screen in the to-be-processed video. In other words, the video segmentation device can determine whether the to-be-processed video includes a presentation that is displayed on a screen. If the determination result is yes (that is, the to-be-processed video includes a presentation), step 504 is executed. If the determination result is no (that is, the to-be-processed video does not include a presentation), step 505 is executed. In step 504, the video segmentation device determines the position of the screen for displaying the presentation. After the video segmentation device determines the position of the screen, step 506 may be executed.
  • the video segmentation device determines whether there is a presentation transmitted through an auxiliary stream.
  • the conference speaker may not display the presentation on the screen, but upload the presentation to the conference control server through the auxiliary stream.
  • the conference terminal in the other conference site can obtain the presentation used by the conference speaker in the speech process according to the auxiliary stream. If the determination result is yes (that is, there is a presentation transmitted through the auxiliary stream), step 506 is executed. If the result of the determination is no (that is, there is no presentation transmitted through the auxiliary stream), the segmentation point of the to-be-processed video can be determined according to the voice information.
  • the video segmentation device determines whether the duration from the previous segment point to the current moment exceeds the first preset duration. If the video segmentation device determines that the duration from the previous segment point to the current moment is greater than the first preset duration (that is, the determination result is yes), step 507 is executed. If the video segmentation device determines that the duration from the previous segment point to the current moment is not greater than the first preset duration, step 508 is executed. It can be understood that, if the segment point determined by the video segmentation device is the first segment point, the previous segment point refers to the start time of the video to be processed. For ease of description, the length of time from the previous segment point to the current moment can be referred to as the presentation duration.
  • the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information.
  • the video segmentation device determines the segmentation point of the video to be processed according to the presentation and voice information.
  • For the specific implementation manner in which the video segmentation device determines the segmentation point of the to-be-processed video based on the presentation and the voice information, reference may be made to the embodiment shown in FIG. 3; it is not repeated here.
  • the video segmentation apparatus may perform step 509 and step 510 after determining the segmentation point of the video to be processed.
  • the video segmentation device determines segmented voice information and keywords of the segmented voice information, where the segmented voice information is the voice information between the previous segment point of the segment point and the segment point. It is understandable that if the segment point is the first segment point of the video to be processed, the segment voice information is the voice information from the start time of the video to be processed to the segment point.
  • the video segmentation device determines a segmented summary according to the segmented voice information, keywords of the segmented voice information, and keywords of the target text.
  • For steps 509 and 510, reference may be made to the embodiment shown in FIG. 3; details are not repeated here.
  • During the process of segmenting the video and extracting the summary, the video segmentation device can first determine whether there is a presentation displayed on the screen in the video to be processed, then determine whether text related to the meeting content is included in the meeting reservation, and finally determine whether a presentation is transmitted through the auxiliary stream.
  • Alternatively, the video segmentation device may first determine whether there is a presentation transmitted through the auxiliary stream, then determine whether text related to the meeting content is included in the meeting reservation, and finally determine whether there is a presentation displayed on the screen in the video to be processed.
  • How the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information will be described below in conjunction with FIG. 6.
  • how the video segmentation apparatus determines the segmentation point of the to-be-processed video according to the voice information can also refer to FIG. 6.
  • Fig. 6 is a schematic flowchart of a method for video segmentation according to an embodiment of the present application.
  • the video segmentation device continuously intercepts voice information segments on the voice information with the window length W and the step size S.
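Step 601 — intercepting voice information segments with window length W and step size S — can be sketched over a token sequence as follows. Whether the last, shorter window is kept is not specified here, so the sketch keeps it:

```python
def sliding_segments(tokens, window, step):
    """Intercept voice-information segments of length `window`, advancing by `step`.

    Consecutive segments overlap when step < window, which is what lets the
    keyword vectors of adjacent segments be compared for topic drift.
    """
    segments = []
    i = 0
    while i < len(tokens):
        segments.append(tokens[i:i + window])
        if i + window >= len(tokens):
            break  # this window already reaches the end of the stream
        i += step
    return segments
```

With 7 tokens, W = 4, and S = 2, the windows start at positions 0, 2, and 4, and the last one is truncated to 3 tokens.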
  • the video segmentation device extracts keywords for each voice information segment. Specifically, the video segmentation device extracts N keywords from each voice information segment.
  • If the video segmentation device has extracted the keywords of the content description information, step 603 can be performed after step 602; if the video segmentation device has not extracted the keywords of the content description information, step 604 can be performed after step 602.
  • That the video segmentation device has extracted the keywords of the content description information means that the video segmentation device has determined that the video to be processed includes the content description information. In this case, the segmentation point determined by the video segmentation device is determined based on the content description information and the voice information.
  • That the video segmentation device has not extracted the keywords of the content description information means that the video segmentation device has determined that the to-be-processed video does not include the content description information. In this case, the segmentation point determined by the video segmentation device is determined based on the voice information.
  • the video segmentation device determines the keyword in the i-th voice information segment and the word frequency vector C_i of the keyword of the content description information in the i-th voice information segment.
  • the method for determining the keyword in the i-th voice information segment refer to the embodiment shown in FIG. 3. Specifically, reference may be made to the method for determining the keyword of the first voice information segment in the embodiment shown in FIG. 3, which will not be repeated here.
  • For the method for determining the keywords of the content description information, reference may be made to the embodiment shown in FIG. 3; it is not repeated here.
  • The implementation manner in which the video segmentation device determines the word frequency vector, in the i-th voice information segment, of the keywords of the i-th voice information segment and the keywords of the content description information is similar to how the keyword vectors above are determined; there is no need to go into details here.
  • the video segmentation device determines the word frequency vector C_i of the keyword in the i-th voice information segment in the i-th voice information segment.
  • The method of determining the word frequency vector, in the i-th voice information segment, of the keywords of the i-th voice information segment is similar to the method of determining the word frequency vector C_i of the keywords of the i-th voice information segment and the keywords of the content description information in the i-th voice information segment, so there is no need to repeat it here.
  • the video segmentation apparatus may perform step 605 and step 606 in sequence.
  • the video segmentation device determines the distance between C_i and C_(i-1).
  • C_(i-1) is the word frequency vector, determined by the video segmentation device, of the keywords of the (i-1)-th voice information segment (or the keywords of the (i-1)-th voice information segment and the keywords of the content description information) in the (i-1)-th voice information segment; the (i-1)-th voice information segment is the voice information segment before the i-th voice information segment.
  • If the video segmentation device determines that a segment point is located before or after the i-th voice information segment, the segment point can be determined according to the pause point.
  • Otherwise, the distance between the word frequency vector of the next voice information segment and the word frequency vector of the i-th voice information segment can be determined, and the process continues.
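The comparison in steps 605 and 606 — measuring the distance between consecutive word-frequency vectors C_(i-1) and C_i and treating a large distance as a sign that a segment point lies nearby — can be sketched with the cosine distance (the earlier text allows either the Euclidean or the cosine distance for comparing keyword vectors). The threshold value is a hypothetical parameter:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (nu * nv)

def topic_changed(c_prev, c_curr, threshold):
    """Consecutive window vectors far apart suggest a topic change,
    so a segment point should then be located at a nearby pause point."""
    return cosine_distance(c_prev, c_curr) > threshold
```

Identical vectors have distance 0; orthogonal word-frequency vectors (no shared keywords) have distance 1.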
  • Fig. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • the video segmentation device 700 includes an acquisition unit 701 and a processing unit 702.
  • the acquiring unit 701 is configured to acquire text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed.
  • the processing unit 702 is configured to determine the segment point of the video to be processed according to the text information and the voice information.
  • the processing unit 702 is further configured to segment the to-be-processed video according to the segmentation point.
  • the specific functions and beneficial effects of the acquiring unit 701 and the processing unit 702 can be referred to the methods shown in FIG. 3 to FIG. 6, which will not be repeated here.
  • Fig. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • the video segmentation device 800 shown in FIG. 8 includes a processor 801, a memory 802, and a transceiver 803.
  • the processor 801, the memory 802, and the transceiver 803 communicate with each other through internal connection paths, and transfer control and/or data signals.
  • the method disclosed in the foregoing embodiment of the present application may be applied to the processor 801 or implemented by the processor 801.
  • the processor 801 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 801 or instructions in the form of software.
  • the aforementioned processor 801 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium mature in the field, such as random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory 802, and the processor 801 reads the instructions in the memory 802 and completes the steps of the above method in combination with its hardware.
  • the memory 802 may store instructions for executing the method executed by the video segmentation apparatus in the methods shown in FIGS. 3 to 6.
  • the processor 801 can execute the instructions stored in the memory 802 in combination with other hardware (such as the transceiver 803) to complete the steps of the video segmentation device in the method shown in FIG. 3 to FIG. 6.
  • For the specific working process and beneficial effects, reference may be made to the descriptions in the embodiments shown in FIG. 3 to FIG. 6.
  • An embodiment of the present application also provides a chip, which includes a transceiver unit and a processing unit.
  • the transceiver unit may be an input/output circuit or a communication interface
  • the processing unit is a processor or microprocessor or integrated circuit integrated on the chip.
  • the chip can execute the method of the video segmentation device in the above method embodiment.
  • the embodiment of the present application also provides a computer-readable storage medium on which an instruction is stored, and when the instruction is executed, the method of the video segmentation device in the foregoing method embodiment is executed.
  • the embodiment of the present application also provides a computer program product containing instructions that, when executed, execute the method of the video segmentation device in the foregoing method embodiment.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present application.
  • The aforementioned storage media include: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present application provides a video segmentation method and a video segmentation device. The method comprises: a video segmentation device segments a video to be processed according to at least one of content description information for describing content of the video to be processed and a presentation presented in the video to be processed as well as voice information of the video to be processed, wherein the content description information and the presentation are uploaded in advance. The technical solution can segment the video to be processed with reference to information except the content of the video to be processed, thereby improving the accuracy of segmentation.

Description

Video segmentation method and video segmentation device

Technical field

This application relates to the field of information technology, and more specifically, to a video segmentation method and a video segmentation device.

Background
To make videos convenient to watch, a complete video can be divided into multiple segments. In this way, a user can jump directly to a segment of interest.
At present, a common video segmentation method segments a video based on the text information in the video. The text information may be subtitles in the video, or text obtained by performing speech recognition on the video. In other words, the basis for segmenting a video currently comes entirely from the video itself. In addition, segmentation based on the text information of a video requires obtaining all of the video's text information. The video stream of a live video is generated in real time, so all of its text information is available only after the live broadcast ends. The above method therefore cannot segment a live video in real time. Moreover, because the above method segments a video only according to its text information, the determined segmentation points are not necessarily suitable segmentation points.
Summary of the invention

The present application provides a video segmentation method and a video segmentation device, which can improve the accuracy of video segmentation.
In a first aspect, an embodiment of the present application provides a video segmentation method, including: a video segmentation device acquires text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation presented in the video to be processed and content description information of the video to be processed; the video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information; and the video segmentation device segments the video to be processed according to the segmentation point. The above technical solution can segment the to-be-processed video with reference to information other than the content of the video itself, thereby improving the accuracy of segmentation.
With reference to the first aspect, in a possible implementation of the first aspect, in a case where the text information includes the presentation, determining, by the video segmentation device, the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining a switching point of the presentation, where the presentation presents different content before and after the switching point; determining at least one pause point according to the voice information; and determining the segmentation point according to the switching point and the at least one pause point. A switch of the presentation often means that the content of the speaker's speech has changed. Therefore, by taking presentation changes into account, the above technical solution divides the to-be-processed video into different segments and can quickly and reasonably determine its segmentation points. In addition, determining the segmentation point only requires the switching point of the presentation and the pause points near that switching point, so the video can be segmented without obtaining the complete video file. In other words, the to-be-processed video can be segmented in real time, and the above technical solution can therefore be applied to segmentation of live video.
With reference to the first aspect, in a possible implementation of the first aspect, determining the segmentation point according to the switching point and the at least one pause point includes: when the switching point is the same as one of the at least one pause point, determining the switching point as the segmentation point; and when none of the at least one pause point is the same as the switching point, determining the pause point that is closest to the switching point among the at least one pause point as the segmentation point.
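The rule above (use the switching point itself if it coincides with a pause point, otherwise fall back to the nearest pause point) can be sketched in a few lines. This is an illustrative sketch, not the claimed implementation; the function name and the representation of points as timestamps in seconds are assumptions.

```python
def choose_segment_point(switch_point, pause_points):
    """Pick a segmentation point from a presentation switching point and a
    list of voice pause points (all timestamps in seconds).

    If the switching point coincides with a pause point, it is used directly;
    otherwise the pause point nearest to the switching point is chosen."""
    if switch_point in pause_points:
        return switch_point
    # No coincident pause point: take the one with minimal time distance.
    return min(pause_points, key=lambda p: abs(p - switch_point))
```

For example, a slide switch at 10.0 s with pauses at 8.0 s and 11.2 s would yield 11.2 s as the segmentation point.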
With reference to the first aspect, in a possible implementation of the first aspect, determining the switching point of the presentation includes: determining, as the switching point, the moment at which a switching signal for instructing to switch the content of the presentation is acquired.
With reference to the first aspect, in a possible implementation of the first aspect, the text information further includes the content description information, and before the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
With reference to the first aspect, in a possible implementation of the first aspect, in a case where the text information includes the content description information, determining, by the video segmentation device, the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining the segmentation point of the to-be-processed video according to the voice information, keywords of the content description information, and pause points in the voice information. The content description information is information input by a user in advance to describe the to-be-processed video, and usually includes some key information about the video, such as keywords and key content. Therefore, the key content described in different segments of the to-be-processed video can be determined more accurately based on the content description information, so that the video can be segmented more accurately.
With reference to the first aspect, in a possible implementation of the first aspect, the voice information includes a first voice information segment and a second voice information segment, where the second voice information segment is the voice information segment that precedes and is adjacent to the first voice information segment, and determining the segmentation point of the to-be-processed video according to the voice information, the keywords of the content description information, and the pause points in the voice information includes: determining a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the segmentation points of the to-be-processed video include the first segmentation point. In addition, when determining a segmentation point, the above technical solution only needs the keywords of the content description information and the voice information of two adjacent video clips. The division into video clips can be performed with a fixed duration and step size, so the already-played portion of a video can be divided into clips while the video is still playing. In this way, the video can be segmented without obtaining the complete video file. In other words, the to-be-processed video can be segmented in real time, and the above technical solution can therefore be applied to segmentation of live video.
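The fixed-duration, fixed-step division into clips mentioned above can be sketched as a simple sliding window over the elapsed audio. This is a hypothetical illustration; the 60-second window and 30-second step are example values not taken from the application.

```python
def voice_windows(duration_s, win_s=60.0, step_s=30.0):
    """Divide the elapsed audio (duration_s seconds so far) into fixed-length
    windows advanced by a fixed step, so that adjacent voice-information
    segments can be compared while the stream is still playing.

    Returns a list of (start, end) pairs in seconds."""
    t, out = 0.0, []
    while t + win_s <= duration_s:
        out.append((t, t + win_s))
        t += step_s
    return out
```

Because each window depends only on audio already received, this division works on a live stream without the complete file.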
With reference to the first aspect, in a possible implementation of the first aspect, determining the first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information includes: determining a similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information; determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and determining the first segmentation point according to the pause points in the voice information.
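One plausible way to realize the similarity test above is cosine similarity between keyword-weighted bag-of-words vectors of the two adjacent segments, with words that also appear in the content description keywords weighted more heavily. This is a sketch under those assumptions; the application does not prescribe a specific similarity measure, and the 2.0 weight is invented for illustration.

```python
from collections import Counter
import math

def keyword_similarity(seg_a_words, seg_b_words, description_keywords):
    """Cosine similarity between two adjacent voice-information segments,
    given their words as lists; words present in the content description
    keywords are up-weighted. Returns a value in [0, 1]."""
    def vec(words):
        counts = Counter(words)
        return {w: n * (2.0 if w in description_keywords else 1.0)
                for w, n in counts.items()}
    va, vb = vec(seg_a_words), vec(seg_b_words)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A similarity below the chosen threshold would indicate a topic change between the two segments, triggering the search for a pause point to use as the segmentation point.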
With reference to the first aspect, in a possible implementation of the first aspect, the pause points in the voice information include pause points within the first voice information segment or pause points adjacent to the first voice information segment, and determining the first segmentation point according to the pause points in the voice information includes: determining the first segmentation point according to at least one of the number of pause points within the first voice information segment, the number of pause points adjacent to the first voice information segment, pause durations, and the words adjacent to the pause points.
With reference to the first aspect, in a possible implementation of the first aspect, there are K pause points within the first voice information segment, or K pause points adjacent to the first voice information segment. Determining the first segmentation point according to at least one of the number of pause points within the first voice information segment, the number of pause points adjacent to the first voice information segment, the pause durations, and the words adjacent to the pause points includes: when K is equal to 1, determining that pause point as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words adjacent to the K pause points include one preset word, determining the pause point adjacent to that preset word as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words include at least two of the preset words, determining, as the segmentation point, the pause point with the longest pause duration among the at least two pause points adjacent to those preset words; and when K is a positive integer greater than or equal to 2 and the K words do not include the preset word, determining the pause point with the longest pause duration among the K pause points as the segmentation point.
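The four-branch rule above maps directly onto a small selection function. The sketch below is illustrative only; representing each pause point as a (time, duration, adjacent_word) tuple is an assumption, not part of the application.

```python
def pick_pause_point(pauses, preset_words):
    """Select one pause point following the K-pause-point rule:
    - K == 1: use the only pause point;
    - exactly one pause is adjacent to a preset word: use it;
    - several pauses are adjacent to preset words: use the longest of those;
    - no pause is adjacent to a preset word: use the longest pause overall.

    `pauses` is a list of (time, duration, adjacent_word) tuples."""
    if len(pauses) == 1:
        return pauses[0][0]
    near_preset = [p for p in pauses if p[2] in preset_words]
    if len(near_preset) == 1:
        return near_preset[0][0]
    candidates = near_preset if near_preset else pauses
    return max(candidates, key=lambda p: p[1])[0]
```

Here `preset_words` would hold discourse markers (for example, transition words) that make a pause a natural topic boundary; which words are preset is left open by the application.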
With reference to the first aspect, in a possible implementation of the first aspect, the text information further includes the presentation, and before the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is greater than a first preset duration; or determining that the presentation duration of the current page of the presentation is less than or equal to a second preset duration. The above technical solution can avoid unsuitable segmentation caused by a presentation that remains unchanged for a long time or that changes very rapidly.
With reference to the first aspect, in a possible implementation of the first aspect, the method further includes: determining, by the video segmentation device, a summary of a segment according to the content of the segment's voice information, the keywords of the segment's voice information, and keywords of a target text, where the target text includes at least one of the presentation and the content description information. Based on the above technical solution, a user reviewing the video can use the summary to quickly locate the position to be reviewed. In addition, the above technical solution takes information beyond the to-be-processed video itself into account when determining the summary, which can improve both the accuracy of the determined summary and the speed of determining it.
With reference to the first aspect, in a possible implementation of the first aspect, determining, by the video segmentation device, the summary of the segment according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text includes: determining a third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text; and determining the summary of the segment according to the third keyword vector.
With reference to the first aspect, in a possible implementation of the first aspect, determining, by the video segmentation device, the summary of the segment according to the third keyword vector includes: determining a reference text according to the target text and the segmented voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determining J keyword vectors according to the keywords of the segmented voice information, the keywords of the target text, and each of the J sentences; and determining the summary of the segment according to the third keyword vector and the J keyword vectors.
With reference to the first aspect, in a possible implementation of the first aspect, determining the reference text according to the target text and the segmented voice information includes: in a case where the target text includes redundant sentences, deleting the redundant sentences from the target text to obtain a revised target text, and merging the revised target text with the segmented voice information to obtain the reference text; and in a case where the target text does not include redundant sentences, merging the target text with the segmented voice information to obtain the reference text.
With reference to the first aspect, in a possible implementation of the first aspect, determining the summary of the segment according to the third keyword vector and the J keyword vectors includes: determining J distances according to the third keyword vector and the J keyword vectors, where the j-th distance among the J distances is determined according to the third keyword vector and the j-th keyword vector among the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determining the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determining the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
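The selection of the R sentences nearest to the segment-level keyword vector can be sketched as below. This is an illustrative sketch only: the application does not fix the distance metric, so Euclidean distance and plain float-list vectors are assumptions.

```python
import math

def pick_summary_sentences(segment_vec, sentence_vecs, r):
    """Return the indices (in document order) of the r sentences whose
    keyword vectors are closest to the segment-level keyword vector.

    segment_vec:   the third keyword vector, a list of floats.
    sentence_vecs: the J per-sentence keyword vectors, same dimension."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(range(len(sentence_vecs)),
                    key=lambda j: dist(segment_vec, sentence_vecs[j]))
    # Keep the r nearest sentences, then restore document order for the summary.
    return sorted(ranked[:r])
```

The summary of the segment would then be the concatenation of the selected sentences.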
With reference to the first aspect, in a possible implementation of the first aspect, the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is the voice information of the real-time video stream from the start of the stream, or from the previous segmentation point, up to the current moment. The above technical solution can realize real-time segmentation of a video; that is, segmenting the video does not require obtaining its entire content. Therefore, the above technical solution can realize real-time segmentation of live video.
In a second aspect, an embodiment of the present application provides a video segmentation device, and the device includes units for executing the first aspect or any possible implementation of the first aspect.
Optionally, the video segmentation device of the second aspect may be a computer device, or may be a component (such as a chip or a circuit) usable in a computer device.
In a third aspect, an embodiment of the present application provides a storage medium that stores instructions for implementing the method described in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in the first aspect or any possible implementation of the first aspect.
Description of the drawings
FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by the embodiments of the present application can be applied;

FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by the embodiments of the present application can be applied;

FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a video conference process provided according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application;

FIG. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
Detailed description of embodiments
The technical solutions in this application will be described below in conjunction with the drawings.
In this application, "at least one" refers to one or more, and "multiple" refers to two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of a single item or multiple items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, in the embodiments of the present application, words such as "first" and "second" do not limit quantity or execution order.
It should be noted that in this application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as being more preferable or advantageous than other embodiments or designs. Rather, words such as "exemplary" or "for example" are intended to present related concepts in a specific manner.
Various aspects or features of this application can be implemented as methods, devices, or articles of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used in this application encompasses a computer program accessible from any computer-readable device, carrier, or medium. For example, computer-readable media may include, but are not limited to: magnetic storage devices (for example, hard disks, floppy disks, or tapes), optical discs (for example, compact discs (CD), digital versatile discs (DVD), etc.), and smart cards and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, or key drives). In addition, the various storage media described herein may represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by this application can be applied. FIG. 1 shows a video conference system that includes a conference control server 101, a conference terminal 111, a conference terminal 112, and a conference terminal 113. The conference terminals 111, 112, and 113 can establish a conference through the conference control server 101.
A video conference usually includes at least two conference sites. Each conference site can access the conference control server through a conference terminal. The conference terminal may be a device used to access a video conference; it can be used to receive conference data and present the conference content on a display device according to that data. The conference terminal may include a host and a display device. The host can receive conference data through a communication interface, generate a video signal according to the received conference data, and output the video signal to the display device in a wired or wireless manner. The display device presents the conference content according to the received video signal. Optionally, in some embodiments, the display device may be built into the host; for example, the conference terminal may be an electronic device with a built-in display, such as a notebook computer, a tablet computer, or a smartphone. Optionally, in other embodiments, the display device may be external to the host; for example, the host may be a desktop computer, and the display device may be a monitor, a television, or a projector. Moreover, even if the host has a built-in display, the display device used for presenting conference content may still be external to the host; for example, the host may be a notebook computer, and the display device a monitor, a television, or a projector connected to it.
In some cases, a video conference may include a main conference site and at least one branch site. In this case, the conference terminal in the main site (for example, the conference terminal 111) can upload the collected media stream of the main site to the conference control server 101. The conference control server 101 can generate conference data according to the received media stream and send it to the conference terminals in the branch sites (for example, the conference terminals 112 and 113), which can then present the conference content on their display devices according to the received conference data.
In other cases, there may be no distinction between primary and secondary among the at least two sites in a video conference; the conference terminal in each site can upload its collected media stream to the conference control server 101. For example, suppose the conference terminal 111 accesses the video conference from conference site 1, the conference terminal 112 from conference site 2, and the conference terminal 113 from conference site 3. The conference terminal 111 can upload the collected media stream of site 1 to the conference control server 101, which can generate conference data 1 according to that stream and send it to the conference terminals 112 and 113; these terminals can then present the conference content on their display devices according to the received conference data 1. Similarly, the conference terminal 112 can upload the collected media stream of site 2, from which the conference control server 101 generates conference data 2 and sends it to the conference terminals 111 and 113 for presentation on their display devices; and the conference terminal 113 can upload the collected media stream of site 3, from which the conference control server 101 generates conference data 3 and sends it to the conference terminals 111 and 112 for presentation on their display devices.
Optionally, in some embodiments, the media stream may be an audio stream. Optionally, in other embodiments, the media stream may be a video stream. The media device responsible for capturing the media stream may be built into the conference terminal (for example, a camera and a microphone inside the terminal) or externally connected to it; the embodiments of this application do not limit this.
Optionally, in some embodiments, the speaker of the conference uses a presentation during the speech. In this case, the media stream may be the audio stream of the speaker's speech. The presentation used by the speaker can be uploaded to the conference control server 101 through an auxiliary stream (also called a data stream or a computer-screen stream). The conference control server 101 generates conference data based on the received audio stream and auxiliary stream. Optionally, in some possible implementations, the conference data may include the received audio stream and the auxiliary stream. Optionally, in other possible implementations, the conference data may include a processed audio stream, obtained by processing the received audio stream, together with the auxiliary stream. Processing the received audio stream may be a transcoding operation, for example reducing the bit rate of the audio stream so as to reduce the amount of data required to transmit it to other conference terminals. Optionally, in still other possible implementations, the conference data may include the received audio stream, an audio stream with a bit rate different from that of the received audio stream, and the auxiliary stream. In this way, a conference terminal can select a suitable audio stream according to its network condition and/or the way it accesses the conference. For example, if the terminal's network condition is good, or it accesses the conference over Wi-Fi, it can select an audio stream with a higher bit rate so that clearer sound can be heard. If the terminal's network condition is poor, it can select an audio stream with a lower bit rate, which reduces interruptions of the live conference caused by the poor network condition. If the terminal accesses the conference over a mobile network, it can select a lower-bit-rate audio stream to reduce data consumption. Optionally, in still other possible implementations, in addition to the audio stream of at least one bit rate and the auxiliary stream, the conference data may also include subtitles corresponding to the speaker's speech. The subtitles may be generated by speech-to-text conversion based on speech recognition technology, may be a manual transcript of the speaker's speech, or may be generated by speech-to-text conversion combined with manual correction.
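The stream-selection behavior described above can be sketched as a small function. This is an illustrative assumption, not part of the patent text: the function name, the bit-rate values, and the idea of keying on an access-type string are all hypothetical.

```python
# Hypothetical sketch of selecting an audio stream by network condition and
# access type, as described above. Names and thresholds are assumptions.
def select_audio_stream(streams, access_type, network_good):
    """Pick a bit rate (kbit/s) from the available streams.

    streams      -- available audio bit rates, e.g. [24, 64, 128]
    access_type  -- "wifi" or "mobile" (hypothetical labels)
    network_good -- True if the terminal's network condition is good
    """
    if access_type == "wifi" and network_good:
        return max(streams)   # good link: clearer sound at a higher bit rate
    return min(streams)       # poor link or mobile data: save bandwidth

print(select_audio_stream([24, 64, 128], "wifi", True))    # → 128
print(select_audio_stream([24, 64, 128], "mobile", True))  # → 24
```

The same selection logic applies to video resolution in the video-stream embodiments below, with resolution taking the place of bit rate.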
Optionally, in other embodiments, the media stream may be a video stream of the speaker during the speech. In other words, the media stream may contain both the voice information and the picture information of the speaker during the speech. Correspondingly, the media stream uploaded to the conference control server 101 is this video stream. In some cases, suppose the speaker uses a presentation during the speech and shows it with an output device (for example a projector or a television). The picture information in the media stream then includes the presentation shown by the speaker, so the video stream uploaded to the conference control server 101 includes the presentation. In this case, the conference control server 101 can determine the conference data directly from the video stream. In other cases, the presentation used by the speaker during the speech may be uploaded to the conference control server 101 through an auxiliary stream, and the conference control server 101 may generate the conference data from the captured video stream and the auxiliary stream. Optionally, in some possible implementations, the conference data may include the captured video stream and the auxiliary stream. Optionally, in other possible implementations, the conference data may include a processed video stream, obtained by processing the captured video stream, together with the auxiliary stream. Processing the captured video stream may be a transcoding operation, for example reducing the resolution of the video stream so as to reduce the amount of data required to transmit it to other conference terminals. Optionally, in still other possible implementations, the conference data may include the captured video stream, a video stream with a resolution different from that of the captured video stream, and the auxiliary stream. In this way, a conference terminal can select a suitable video stream according to its network condition and/or the way it accesses the conference. For example, if the terminal's network condition is good, or it accesses the conference over Wi-Fi, it can select a higher-resolution video stream so that the audience sees a clearer picture. If the terminal's network condition is poor, it can select a lower-resolution video stream, which reduces interruptions of the live conference caused by the poor network condition. If the terminal accesses the conference over a mobile network, it can select a lower-resolution video stream to reduce data consumption. Optionally, in still other possible implementations, in addition to the video stream of at least one resolution and the auxiliary stream, the conference data may also include subtitles corresponding to the speaker's speech. The subtitles may be generated by speech-to-text conversion based on speech recognition technology, may be a manual transcript of the speaker's speech, or may be generated by speech-to-text conversion combined with manual correction.
FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by this application can be applied. FIG. 2 shows a distance education system, which includes a course server 201, a main device 211, a client device 212, and a client device 213.
The main device 211 can upload the captured media stream to the course server 201. The course server 201 can generate course data from the media stream and send the course data to the client devices 212 and 213, which can present the course content on their display devices according to the received course data.
The main device 211 may be a notebook computer or a desktop computer. The client devices 212 and 213 may be notebook computers, desktop computers, tablet computers, smartphones, and so on.
Optionally, in some embodiments, the teacher giving the lecture uses a presentation during the lecture. In this case, the media stream may be the audio stream of the teacher's lecture. The presentation used by the teacher during the lecture can be uploaded to the course server 201 through an auxiliary stream. The course server 201 generates course data based on the received audio stream and auxiliary stream.
Optionally, in other embodiments, the media stream may be a video stream of the teacher during the lecture. In other words, the media stream may contain both the voice information and the picture information of the teacher during the lecture. Correspondingly, the media stream uploaded to the course server 201 is this video stream. In some cases, suppose the teacher uses a presentation during the lecture and shows it with an output device (for example a projector or a television). The picture information in the media stream then includes the presentation shown by the teacher, so the video stream uploaded to the course server 201 includes the presentation. In this case, the course server 201 can determine the course data directly from the video stream. In other cases, the presentation used by the teacher during the lecture may be uploaded to the course server 201 through an auxiliary stream, and the course server 201 can generate the course data from the captured video stream and the auxiliary stream.
The specific content of the course data is similar to that of the conference data and, for brevity, is not described again here.
FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of this application. The method shown in FIG. 3 can be executed by a video segmentation device. The video segmentation device may be a computer device capable of implementing the method provided by the embodiments of this application, such as a personal computer, a notebook computer, a tablet computer, or a server; it may be hardware inside such a computer device, for example a graphics card or a graphics processing unit (GPU); or it may be a dedicated device for implementing the method provided by the embodiments of this application. For example, in some embodiments the video segmentation device may be the conference control server 101 in the system shown in FIG. 1, or a piece of hardware inside the conference control server 101. In other embodiments, it may be the conference terminal that uploads the media stream in the system shown in FIG. 1, or a piece of hardware in that conference terminal. In still other embodiments, it may be the main device 211 in the system shown in FIG. 2, or a piece of hardware inside the main device 211. In still other embodiments, it may be the course server 201 in the embodiment shown in FIG. 2, or a piece of hardware in the course server 201.
For ease of description, it is assumed below that the method shown in FIG. 3 is applied to the system shown in FIG. 1.
301: The video segmentation device obtains text information of a to-be-processed video and voice information of the to-be-processed video, where the text information includes at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video.
The presentation refers to the document presented by the speaker of the conference during the speech. The embodiments of this application do not limit the file format of the presentation: any document shown through a display device during the speaker's speech may be the presentation. For example, the presentation may be in the ppt or pptx format. As another example, it may be in the PDF format. As yet another example, it may be in the word or txt format.
The content description information is information describing the speech content, uploaded by the speaker or the host of the conference before the conference begins. Optionally, in some embodiments, the content description information includes an outline, an abstract, and/or key information of the speaker's speech content in the video conference. For example, the content description information may include keywords of the speaker's speech content. As another example, it may include an abstract of the speech content. As yet another example, the speech content may consist of multiple parts, and the content description information may include the topic, abstract, and/or keywords of each of those parts.
The voice information may include text obtained by performing speech-to-text conversion on the speaker's speech. The embodiments of this application do not limit the specific implementation of speech-to-text conversion, as long as the recognized speech can be converted into the corresponding text. The voice information may also include at least one pause point obtained by performing speech recognition on the speaker's speech. A pause point represents a natural pause of the speaker in the course of speaking.
302: The video segmentation device determines a segmentation point of the to-be-processed video according to the text information and the voice information.
As described above, the text information may include at least one of the presentation and the content description information. In other words, the text information may fall into one of the following three cases:
Case 1: the text information includes only the presentation;
Case 2: the text information includes only the content description information;
Case 3: the text information includes both the presentation and the content description information.
In other words, in some cases the speaker may only show the presentation during the speech, without uploading the content description information in advance; case 1 may therefore occur. In other cases the speaker may only upload the content description information in advance, without showing a presentation during the speech; case 2 may therefore occur. In still other cases the speaker may both show the presentation during the speech and upload the content description information in advance; case 3 may therefore occur.
For case 1 above, the video segmentation device may determine the segmentation point of the to-be-processed video according to the presentation.
For case 2 above, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information.
Optionally, in some embodiments, for case 3 above, the video segmentation device may determine the segmentation point of the to-be-processed video according to one of the presentation and the content description information. In other words, when the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point according to either the presentation or the content description information.
Optionally, in some embodiments, when the text information includes both the presentation and the content description information, the video segmentation device may determine the presentation duration of the current page of the presentation and, according to that duration, decide whether to determine the segmentation point of the to-be-processed video according to the presentation or according to the content description information.
Optionally, in some embodiments, when the presentation duration of the current page of the presentation is greater than a first preset duration, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information. This avoids a video segment being too long because the speaker presents the same content for a long time. The first preset duration can be set as required; for example, it may be 10 minutes, or 15 minutes.
Optionally, in some embodiments, when the presentation duration of the current page of the presentation is less than or equal to a second preset duration, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information. This avoids a video segment being too short because the speaker frequently switches the displayed content of the presentation. Like the first preset duration, the second preset duration can be set as required; for example, it may be 20 seconds, or 10 seconds.
The first preset duration is greater than the second preset duration.
Optionally, in some embodiments, when the presentation duration of the current page of the presentation is greater than the second preset duration and less than or equal to the first preset duration, the video segmentation device may determine the segmentation point of the to-be-processed video according to the presentation and the voice information.
Optionally, in other embodiments, only the first preset duration may be set. If the presentation duration of the current page is greater than the first preset duration, the segmentation point of the to-be-processed video is determined according to the content description information and the voice information; if it is not greater than the first preset duration, the segmentation point may be determined according to the presentation and the voice information. The presentation duration of the current page of the presentation is the length of time for which the presentation stays on the current page.
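The two-threshold decision described above can be summarized in a few lines. This is a minimal illustrative sketch, not the patent's implementation: the function name and the concrete values of the two preset durations are assumptions.

```python
# Sketch of choosing which information to combine with the voice information,
# per the two preset durations above. Values are assumed for illustration.
T_FIRST = 10 * 60   # first preset duration, e.g. 10 minutes (in seconds)
T_SECOND = 20       # second preset duration, e.g. 20 seconds

def choose_source(page_duration, have_description):
    """Return the segmentation source for the current presentation page."""
    too_long = page_duration > T_FIRST      # speaker lingered on one page
    too_short = page_duration <= T_SECOND   # speaker flipped pages quickly
    if (too_long or too_short) and have_description:
        return "content_description"
    return "presentation"

print(choose_source(15 * 60, True))  # → content_description (page too long)
print(choose_source(5 * 60, True))   # → presentation (between thresholds)
print(choose_source(10, True))       # → content_description (page too short)
```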
Optionally, in some embodiments, the start moment of the presentation duration of the current page is the moment at which the presentation switches to the current page, and the end moment is the moment at which the presentation switches from the current page to another page.
For example, if the presentation switches to page n (n being a positive integer greater than or equal to 1) at moment T1, the video segmentation device may start timing from T1. If the timed duration exceeds the first preset duration and the presentation has not yet switched to page n+1, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information. If the presentation switches to page n+1 at moment T2 (T2 being later than T1), and the duration from T1 to T2 is less than or equal to the second preset duration, the video segmentation device may determine the segmentation point according to the content description information and the voice information. If the duration from T1 to T2 is less than or equal to the first preset duration and greater than the second preset duration, the video segmentation device may determine the segmentation point according to the presentation and the voice information; more specifically, according to page n of the presentation and the voice information.
Optionally, in other embodiments, the start moment of the presentation duration of the current page may be the previous segmentation point, and the end moment is the moment at which the presentation switches from the current page to another page.
For example, suppose the presentation switches to page n (n being a positive integer greater than or equal to 1) at moment T3, and the presentation stays on page n for longer than the first preset duration. In this case, the video segmentation device determines, according to the content description information and the voice information, that one segmentation point of the to-be-processed video is moment T4. The video segmentation device may then start timing from T4. If the timed duration exceeds the first preset duration and the presentation has not yet switched to page n+1, the video segmentation device may determine the next segmentation point according to the content description information and the voice information. If the presentation switches to page n+1 at moment T5 (T5 being later than T4), and the duration from T4 to T5 is not greater than the first preset duration and is greater than the second preset duration, the video segmentation device may determine the segmentation point according to the presentation and the voice information; more specifically, according to page n of the presentation and the voice information.
Optionally, in other embodiments, when the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point of the to-be-processed video according to the presentation and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point with reference only to the presentation and the voice information (that is, without using the content description information).
Optionally, in still other embodiments, when the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point with reference only to the content description information and the voice information (that is, without using the presentation).
Determining, by the video segmentation device, the segmentation point of the to-be-processed video according to the presentation and the voice information may include: determining, by the video segmentation device, a switching point of the presentation, where the content presented before the switching point differs from the content presented after it; determining, by the video segmentation device, at least one pause point according to the voice information; and determining, by the video segmentation device, the segmentation point according to the switching point and the at least one pause point.
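The patent states that the segmentation point is determined from the switching point and at least one pause point, without fixing the combination rule at this step. One plausible rule, stated here purely as an assumption for illustration, is to snap the switching point to the nearest pause point, so that a segment boundary never cuts the speaker mid-sentence:

```python
# Hypothetical combination rule (an assumption, not the patent's method):
# move the segmentation point to the pause point nearest the switching point.
def snap_to_pause(switch_point, pause_points):
    """Return the pause point (timestamp in seconds) closest to switch_point."""
    return min(pause_points, key=lambda p: abs(p - switch_point))

pauses = [12.0, 58.5, 61.2, 119.9]   # pause points from speech recognition
print(snap_to_pause(60.0, pauses))   # → 61.2
```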
The switching point of the presentation is the moment at which the presentation switches. A switch of the presentation may mean that the presentation turns a page, for example from page 1 to page 2. It may also mean that the content of the presentation changes without a page turn. For example, when the presentation is a text document, the speaker may show only part of a page (for example the upper half) and then scroll to the remaining part of the page (for example the lower half). Although no page is turned, the content shown in the presentation has changed.
Optionally, in some embodiments, the video segmentation device may obtain a switching signal used to indicate switching of the content of the presentation. In this case, the video segmentation device may determine that the moment at which the switching signal is obtained is the switching point.
Optionally, in some embodiments, the video segmentation device may obtain the content of the presentation. In this case, the video segmentation device may determine the switching point according to changes in the content of the presentation. For example, when the video segmentation device determines that the content of the presentation shown in the to-be-processed video at a first moment differs from the content shown at a second moment, it may determine that the first moment is the switching point. Optionally, in some embodiments, the first moment and the second moment are adjacent moments, with the first moment before the second. Optionally, in other embodiments, the first moment is before the second moment and the interval between them is less than a preset duration; in other words, in this case the video segmentation device may check at regular intervals whether the content presented by the presentation has changed.
Optionally, in some embodiments, the video segmentation device may determine the switching point by combining the obtained switching signal, which indicates switching of the content of the presentation, with the content actually presented. For example, suppose the video segmentation device obtains the switching signal at moment T1. The video segmentation device may then obtain the content presented in the F1 frames before T1 and in the F2 frames after T1, where F1 and F2 are positive integers greater than or equal to 1. Optionally, in some embodiments, F1 and F2 may take small values, for example both equal to 2, which reduces the amount of computation. If two consecutive frames among these F1 + F2 frames present different content, the moment of the frame at which the presented content changes may be determined as the switching point. For example, suppose F1 and F2 are both 2: if the second and third of the four frames present different content, the moment of the second frame may be determined as the switching point. Determining the switching point from both the switching signal and the presented content avoids an inaccurate switching point caused by the switching signal and the picture switch of the presentation being out of sync.
Optionally, in some embodiments, the video segmentation device may determine whether the content shown by the presentation at two different moments (or in two different frames) is the same in the following manner: the device counts the number P of positions at which the change in pixel value between the two moments (or frames) exceeds a preset change value; if P is greater than a first preset threshold P1, the device determines that the content shown by the presentation has changed. Optionally, in some embodiments, the change in pixel value may be determined by computing the absolute difference of the pixel gray values. Optionally, in other embodiments, the change in pixel value may be determined by computing the sum of the absolute differences over the three color channels.
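As a minimal sketch of this comparison, using the three-color-channel variant, the count P might be computed as follows. The concrete values of the preset change value and of the threshold P1 are illustrative assumptions, not taken from the embodiments:

```python
def content_changed(frame_a, frame_b, delta, p1):
    """Return True if two presentation frames show different content.

    frame_a, frame_b: equal-size 2-D grids of (R, G, B) tuples.
    delta: preset per-position change value.
    p1: first preset threshold P1 on the number of changed positions.
    """
    p = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for (r1, g1, b1), (r2, g2, b2) in zip(row_a, row_b):
            # change at this position: sum of absolute channel differences
            change = abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
            if change > delta:
                p += 1
    return p > p1
```

The gray-value variant mentioned above would differ only in computing `change` as the absolute difference of two gray values.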
Optionally, in some embodiments, if P is greater than a second preset threshold P2 (P2 being less than P1), the video segmentation device may determine keywords from the later presentation content. For example, suppose the device determines that the number of positions at which the pixel value change between the presentation at time T1 and the presentation at time T2 (T2 being later than T1) exceeds the preset change value is greater than P2 but less than P1. In this case, the device may determine keywords from the presentation content at time T2.
As mentioned above, the voice information may further include at least one pause point. Optionally, in some embodiments, the at least one pause point used to determine the segmentation point may be all pause points from a start moment to the current moment. If the segmentation point determined in step 302 is the first segmentation point of the to-be-processed video, the start moment is the start moment of the to-be-processed video. If the segmentation point determined in step 302 is the k-th segmentation point of the to-be-processed video (k being a positive integer greater than or equal to 2), the start moment is the moment of the (k-1)-th segmentation point. Optionally, in other embodiments, the video segmentation device may instead determine the pause points within a time range containing the switching point, according to the moment of the switching point. For example, if the switching point is at time T1, the device may determine the pause points between time T1 - t and time T1 + t.
When the video segmentation device determines that the switching point coincides with one of the at least one pause point, it determines that the switching point is the segmentation point. When it determines that the switching point coincides with none of the at least one pause point, it determines that the pause point closest to the switching point is the segmentation point. The distance between a pause point and the switching point is the time difference between them. For example, suppose the switching point is at time T1, one of the pause points is at time T2, and the difference between T2 and T1 is t. If the difference between every other pause point and T1 is greater than t, the pause point at T2 is the segmentation point. If two of the pause points are at the same distance from the switching point, and that distance is smaller than the distance from any other pause point to the switching point, either of the two pause points may be determined as the segmentation point.
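The selection rule above can be sketched as a one-line minimum over the candidate pause points. Breaking a tie by taking the earlier pause point is an arbitrary assumption here; the embodiments allow either of two equidistant pause points:

```python
def pick_segment_point(switch_t, pause_points):
    """Pick the segmentation point given a switching time and pause times.

    switch_t: time of the switching point (seconds).
    pause_points: non-empty list of pause-point times (seconds).
    Returns the pause point equal to, or closest in time to, the switching
    point; ties are broken toward the earlier pause point.
    """
    return min(pause_points, key=lambda p: (abs(p - switch_t), p))
```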
The video segmentation device determining the segmentation point of the to-be-processed video according to the content description information and the voice information may include: the device determining the segmentation point according to the voice information, the keywords of the content description information, and the pause points in the voice information.
Optionally, in some possible implementations, the to-be-processed video may be divided into multiple voice information segments. A first voice information segment and a second voice information segment are two consecutive segments among them, the first voice information segment following the second voice information segment. The device may determine a first segmentation point, which is one of the at least one segmentation point of the to-be-processed video, according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information.
The video segmentation device may extract text fields from the voice information using a window of length W and a step size S. The device may thereby cut out at least one text field of length W; each text field of length W is one voice information segment.
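A minimal sketch of this windowing, treating the transcript as a plain string: the embodiments leave the unit of W and S open (characters versus words), so windowing over characters here is an assumption:

```python
def window_segments(text, w, s):
    """Cut a transcript into overlapping text fields of length w, step s.

    text: the transcribed voice information as a string.
    Returns the list of voice information segments (the last window may
    run past the end and is truncated by slicing).
    """
    return [text[i:i + w] for i in range(0, max(len(text) - w, 0) + 1, s)]
```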
The video segmentation device may determine whether the first voice information segment is similar to the second voice information segment. If the first voice information segment is not similar to the second voice information segment, the device may determine that a segmentation point of the to-be-processed video lies near the first voice information segment. If the second voice information segment is similar to the first voice information segment, the device continues by determining whether a third voice information segment, adjacent to and following the first voice information segment, is similar to the first voice information segment.
Similarity may serve as the criterion for whether the first voice information segment and the second voice information segment are similar. If the similarity between the two segments is greater than or equal to a similarity threshold, they may be considered similar; if the similarity is less than the threshold, they may be considered not similar.
Optionally, in some possible implementations, the video segmentation device may determine the similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information.
The video segmentation device may determine the keywords of the first voice information segment. Suppose the number of keywords determined from the first voice information segment is N and the number of keywords determined from the content description information is M; there are no duplicates between the M keywords and the N keywords.
The video segmentation device may determine keywords in the following manner:
Step 1: According to a preset stop-word list, or according to the part of speech of each word in the text, remove words that carry no actual meaning, such as "的" ("of"), "这个" ("this"), and "然后" ("then"). Stop words are manually entered characters or words, not automatically generated ones; they carry no actual meaning and are filtered out before or after natural-language data is processed. A set of stop words may be called a stop-word list.
Step 2: Count the frequency with which each remaining word appears in the text. The frequency of each word may be determined according to the following formula:
TF(n) = N(n) / All_N,    Formula 1.1
where TF(n) is the frequency in the text of the n-th remaining word after step 1, N(n) is the number of times the n-th word appears, and All_N is the total number of remaining words.
Step 3: Determine the word or words with the highest frequency as the keywords of the text.
For example, if the text is the content description information, the M most frequent words may be determined as its keywords, M being a positive integer greater than or equal to 1. If the text is the first voice information segment, the N most frequent words may be determined as its keywords, N being a positive integer greater than or equal to 1. If one or more of those N words coincide with keywords of the content description information, the duplicates are removed from the N words and the next most frequent words are selected instead as keywords of the first voice information segment. For example, suppose N equals 2, M equals 1, and the keywords of the content description information include "student". If the most frequent word in the first voice information segment is "student", the device continues with the second most frequent word. If that word is "school", "school" may be determined as one keyword of the first voice information segment, and the device continues with the third most frequent word. If that word is "course", "course" may be determined as the other keyword of the first voice information segment. If the text is the second voice information segment, the N most frequent words may likewise be determined as its keywords; if one or more of them coincide with keywords of the content description information, the duplicates are removed and the next most frequent words are selected instead as keywords of the second voice information segment.
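Steps 1 through 3 can be sketched as follows. The stop-word list here is a tiny illustrative stand-in for a real preset list, and `exclude` carries the keywords already taken from another text so that, as in the example above, duplicates are skipped in favor of the next most frequent word:

```python
from collections import Counter

STOP_WORDS = {"的", "这个", "然后", "the", "a", "then"}  # illustrative only

def top_keywords(words, k, exclude=()):
    """Return (keywords, tf) for the k most frequent non-stop words.

    Step 1: drop stop words.  Step 2: compute TF(n) = N(n) / All_N per
    Formula 1.1.  Step 3: take the highest-frequency words, skipping any
    word in `exclude` (keywords already chosen from another text).
    """
    kept = [w for w in words if w not in STOP_WORDS]
    total = len(kept)
    counts = Counter(kept)
    ranked = [w for w, _ in counts.most_common() if w not in exclude]
    keywords = ranked[:k]
    tf = {w: counts[w] / total for w in keywords}  # TF per Formula 1.1
    return keywords, tf
```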
Optionally, in some embodiments, the video segmentation device may determine a first keyword vector according to the keywords of the first voice information segment, the keywords of the content description information, and the content of the first voice information segment. Specifically, the device may determine the frequency with which the keywords of the first voice information segment and the keywords of the content description information appear in the content of the first voice information segment; those frequencies form the first keyword vector. The content of a voice information segment means all the words included in the segment. For example, suppose the keyword of the content description information is "student" and the keywords of the first voice information segment are "course" and "school". If these three keywords appear in the first voice information segment with frequencies 0.1, 0.2, and 0.3 respectively, the first keyword vector is (0.3, 0.2, 0.1).
Similarly, the device may determine a second keyword vector according to the keywords of the second voice information segment, the keywords of the content description information, and the content of the second voice information segment. Specifically, the device may determine the frequency with which the keywords of the second voice information segment and the keywords of the content description information appear in the content of the second voice information segment; those frequencies form the second keyword vector. For example, suppose the keyword of the content description information is "student" and the keywords of the second voice information segment are "breakfast" and "nutrition". If these three keywords appear in the second voice information segment with frequencies 0.3, 0.25, and 0.05 respectively, the second keyword vector is (0.3, 0.25, 0.05).
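Building such a keyword vector is a direct frequency count over the segment's words, for example:

```python
def keyword_vector(keywords, segment_words):
    """Frequency of each keyword within a voice information segment.

    keywords: the segment's own keywords plus the content description
    keywords, in the order the vector components should follow.
    segment_words: all words of the segment (its "content").
    """
    total = len(segment_words)
    return [segment_words.count(k) / total for k in keywords]
```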
If the distance that the video segmentation device determines from the first keyword vector and the second keyword vector is greater than a preset distance, the similarity between the first voice information segment and the second voice information segment may be considered less than the similarity threshold. In this case, the device determines the segmentation point according to the first voice information segment.
The video segmentation device may determine a distance from the first keyword vector and the second keyword vector in the following manner:
Step 1: Expand the first keyword vector into a first vector and the second keyword vector into a second vector, where the keywords corresponding to the first vector and the keywords corresponding to the second vector comprise the keywords of the first voice information segment, the keywords of the second voice information segment, and the keywords of the content description information, with no duplicates among the keywords corresponding to either vector.
For example, suppose the first keyword vector is (0.3, 0.2, 0.1) with corresponding keywords "school", "course", and "student", and the second keyword vector is (0.3, 0.25, 0.05) with corresponding keywords "student", "breakfast", and "nutrition". In this case, the first vector is (0.3, 0.1, 0, 0.2, 0) and the second vector is (0, 0.3, 0.25, 0, 0.05), both over the keywords "school", "student", "breakfast", "course", "nutrition".
Step 2: Compute the distance between the first vector and the second vector. That distance is the distance determined from the first keyword vector and the second keyword vector.
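The two steps can be sketched as follows. The merged keyword order used here (the first segment's keys, then any unseen keys of the second) differs from the example's ordering, which is harmless: the embodiments only require that both expanded vectors share one de-duplicated keyword list, and the distance is order-independent:

```python
import math

def expand_and_distance(vec1, keys1, vec2, keys2):
    """Expand two keyword vectors onto a shared keyword list (Step 1)
    and return the Euclidean distance between them (Step 2)."""
    merged = list(dict.fromkeys(keys1 + keys2))  # union, duplicates removed
    f1 = dict(zip(keys1, vec1))
    f2 = dict(zip(keys2, vec2))
    a = [f1.get(k, 0.0) for k in merged]  # first vector, zeros filled in
    b = [f2.get(k, 0.0) for k in merged]  # second vector, zeros filled in
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```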
Optionally, in some embodiments, the distance between the first vector and the second vector may be the Euclidean distance. Two consecutive voice information segments may share very few keywords; if the cosine distance were used, many zero values could therefore appear in the computation. Selecting the Euclidean distance as the distance between the first vector and the second vector may thus be more suitable.
Optionally, in other embodiments, the distance between the first vector and the second vector may be the cosine distance.
Besides using the word-frequency vectors of two adjacent voice information segments, other methods may also be used to determine whether two voice information segments are similar.
For example, the first keyword vector and the second keyword vector may also be built from term frequency-inverse document frequency, binary term frequency, and so on. Determining the distance between the first keyword vector and the second keyword vector may be determining their n-norm distance (n being a positive integer greater than or equal to 1), or determining their relative-entropy distance.
Taking the first vector (0.3, 0.1, 0, 0.2, 0) and the second vector (0, 0.3, 0.25, 0, 0.05) above as an example again, the two vectors may be binarized, giving (1, 1, 0, 1, 0) and (0, 1, 1, 0, 1). The 1-norm distance is then computed, yielding the repetition degree between the keywords of the first voice information segment and the keywords of the second voice information segment. The repetition degree of keywords may be regarded as a special form of distance, and may be used to determine whether the first voice information segment and the second voice information segment are similar: if the repetition degree is greater than or equal to a preset repetition degree, the two segments may be considered similar; if it is less than the preset repetition degree, they may be considered not similar. It can be seen that in this case the preset repetition degree acts as the similarity threshold.
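The binarize-then-1-norm computation sketched above is short enough to show directly; note that the 1-norm of the binarized vectors counts positions where exactly one of the two segments uses a keyword:

```python
def binary_l1(vec1, vec2):
    """Binarize two equal-length keyword vectors and return their
    1-norm distance, used as the repetition-degree measure above."""
    b1 = [1 if x > 0 else 0 for x in vec1]
    b2 = [1 if x > 0 else 0 for x in vec2]
    return sum(abs(x - y) for x, y in zip(b1, b2))
```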
Optionally, in other embodiments, keyword extraction may also be determined by term frequency-inverse document frequency. The term frequency may be determined by Formula 1.1. The inverse document frequency may be determined according to the following formula:
IDF(n) = log(Num_Doc / (Doc(n) + 1)),    Formula 1.2
where IDF(n) is the inverse document frequency of the n-th word, Num_Doc is the total number of documents in the corpus, and Doc(n) is the number of documents in the corpus containing the n-th word.
The term frequency-inverse document frequency may be determined according to the following formula:
TF-IDF(n) = TF(n) × IDF(n),    Formula 1.3
where TF-IDF(n) is the term frequency-inverse document frequency of the n-th word. If keywords are determined by term frequency-inverse document frequency, the first keyword vector is composed of the keywords' term frequency-inverse document frequency values.
When keywords are determined by term frequency-inverse document frequency, meaningless words need not be removed first.
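Formulas 1.1 through 1.3 compose as follows. The natural logarithm is an assumption here, since the embodiments do not fix the log base:

```python
import math
from collections import Counter

def tf_idf(doc_words, corpus):
    """Score every word of a document by TF-IDF per Formulas 1.1-1.3.

    doc_words: list of words in the document being scored.
    corpus: list of documents, each itself a list of words.
    """
    counts = Counter(doc_words)
    total = len(doc_words)
    num_doc = len(corpus)
    scores = {}
    for w in counts:
        tf = counts[w] / total                     # Formula 1.1
        doc_n = sum(1 for d in corpus if w in d)   # documents containing w
        idf = math.log(num_doc / (doc_n + 1))      # Formula 1.2
        scores[w] = tf * idf                       # Formula 1.3
    return scores
```

No stop-word filtering is needed beforehand, consistent with the remark above: very common words receive low IDF and thus low scores.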
Optionally, in other embodiments, keyword extraction may also be based on the word-graph text ranking (TextRank) method. If keywords are determined by word-graph TextRank, the first keyword vector may be composed of the words' weights.
When the first voice information segment and the second voice information segment are not similar, the video segmentation device may determine the segmentation point according to the first voice information segment.
The video segmentation device may first determine whether the first voice information segment includes a pause point. If the segment includes exactly one pause point, that pause point may be determined as the segmentation point. If the segment includes multiple pause points, the device may determine whether the word after each pause point is a preset word. The preset words include conjunctions with segmenting meaning, such as "next", "in the following", and "the next point". The word after a pause point is the word adjacent to and following the pause point. If the word after only one of the pause points is a preset word, that pause point may be determined as the segmentation point. If the words after at least two of the pause points are preset words, the one of those pause points with the longest pause duration may be determined as the segmentation point. If none of the words after the pause points is a preset word, the pause point with the longest pause duration may be determined as the segmentation point.
If the first voice information segment includes no pause point, the segmentation point may be determined from the pause points adjacent to the first voice information segment. It can be understood that there may be two such adjacent pause points, one before the first voice information segment and one after it. The device may determine the segmentation point according to the distances from these two pause points to the first voice information segment. If a pause point precedes the segment, its distance to the segment may be the number of words or the time difference between the pause point and the start of the segment; if a pause point follows the segment, its distance may be the number of words or the time difference between the pause point and the end of the segment. For ease of description, the adjacent pause point before the first voice information segment is called the front pause point, and its distance to the segment is called distance 1; the adjacent pause point after the segment is called the rear pause point, and its distance to the segment is called distance 2.
If distance 1 is less than distance 2, the front pause point may be determined as the segmentation point; if distance 1 is greater than distance 2, the rear pause point may be determined as the segmentation point. If distance 1 equals distance 2, the word after the front pause point (word 1) and the word after the rear pause point (word 2) may be determined: if word 1 is a preset word and word 2 is not, the front pause point is determined as the segmentation point; if word 1 is not a preset word and word 2 is, the rear pause point is determined as the segmentation point; if word 1 and word 2 are both preset words or neither is a preset word, the one of the front and rear pause points with the longer pause duration may be determined as the segmentation point.
As mentioned above, a pause point is a natural pause of the speaker and therefore has a certain duration. Optionally, in some embodiments, if a pause point is determined to be a segmentation point, the middle moment of the pause may be determined as the segmentation point. Optionally, in other embodiments, the end moment of the pause may be determined as the segmentation point. Optionally, in still other embodiments, the start moment of the pause may be determined as the segmentation point.
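The case where the first voice information segment contains one or more pause points can be sketched as follows. This is one reading of the rules above, under stated assumptions: each pause is a `(start, end, following_word)` tuple, the preset-word list is illustrative, and the middle-moment convention is chosen among the three options:

```python
def pick_pause_point(pauses, preset_words=("next", "in the following")):
    """Choose the segmentation pause point inside a voice segment.

    pauses: non-empty list of (start, end, following_word) tuples for the
    pause points inside the segment.  Returns the chosen pause's middle
    moment (one of the optional conventions above).
    """
    if len(pauses) == 1:
        chosen = pauses[0]
    else:
        flagged = [p for p in pauses if p[2] in preset_words]
        if len(flagged) == 1:
            chosen = flagged[0]
        elif flagged:  # several preset-word pauses: take the longest
            chosen = max(flagged, key=lambda p: p[1] - p[0])
        else:          # no preset-word pause: take the longest overall
            chosen = max(pauses, key=lambda p: p[1] - p[0])
    return (chosen[0] + chosen[1]) / 2
```

The no-pause case (choosing between the front and rear adjacent pause points by distance, then by preset word, then by duration) would follow the same pattern but is omitted for brevity.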
303. The video segmentation device segments the to-be-processed video according to the segmentation point.
If the segmentation point is the first segmentation point of the to-be-processed video, the start moment of the segment is the start moment of the to-be-processed video and the end moment of the segment is the segmentation point. If the segmentation point is the k-th segmentation point of the to-be-processed video (k being a positive integer greater than or equal to 2), the start moment of the segment is the (k-1)-th segmentation point and the end moment of the segment is the k-th segmentation point.
After a segment is determined, the video segmentation device may further determine a summary of the segment.
304. The video segmentation device may determine a summary of the segment according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of a target text. The target text includes at least one of the presentation and the content description information.
Optionally, in some embodiments, the video segmentation device may first determine a third keyword vector and then determine the summary of the segment according to the third keyword vector.
The device may determine the third keyword vector according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of the target text, where the content of the segment's voice information means all the sentences making up the voice information of the segment.
It is understood that if the text information includes only the presentation, the target text includes the presentation; if the text information includes only the content description information, the target text includes the content description information; if the text information includes both the presentation and the content description information, the target text includes both.
The manner in which the video segmentation device determines the keywords of the segment's voice information, and the manner in which it determines the keywords of the target text, are similar to the manner in which it determines the keywords of the first voice information segment.
可选的,在一些实施例中,若该视频分段装置比较该演示文稿在不同时刻(或者不同帧)在相同位置的像素值的变化超过预设变化值的个数P大于第二预设阈值P2(P2小于P1),则该视频分段装置可以根据在后的演示文稿,确定该目标文本的关键词。例如,该视频分段装置确定T1时刻的演示文稿和T2时刻的演示文稿(T2时刻晚于T1时刻)在相同位置的像素值的变化超过该预设变化值的个数大于P2且小于P1。在此情况下,该视频分段装置可以根据T2时刻的演示文稿确定该目标文本的关键词。Optionally, in some embodiments, if, when the video segmentation device compares the presentation at different times (or in different frames), the number P of positions at which the pixel-value change exceeds a preset change value is greater than a second preset threshold P2 (P2 is smaller than P1), the video segmentation device may determine the keywords of the target text according to the later presentation. For example, the video segmentation device determines that, between the presentation at time T1 and the presentation at time T2 (time T2 is later than time T1), the number of positions at which the pixel-value change exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation device may determine the keywords of the target text according to the presentation at time T2.
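As an illustrative sketch (not part of the original disclosure), the pixel-change comparison above can be expressed as follows; frames are modeled as 2-D lists of grayscale values, and all function names and thresholds are hypothetical:

```python
def count_changed_pixels(frame_a, frame_b, change_threshold):
    """Count positions whose pixel-value change between two frames
    exceeds the preset change value."""
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            if abs(pa - pb) > change_threshold:
                count += 1
    return count

def should_refresh_keywords(frame_a, frame_b, change_threshold, p2, p1):
    """Return True when the change count P lies in (P2, P1): the slide
    changed enough to refresh the target-text keywords from the later
    frame, but not enough to count as a full slide switch (P >= P1)."""
    p = count_changed_pixels(frame_a, frame_b, change_threshold)
    return p2 < p < p1
```

With P2 = 1 and P1 = 3, a frame pair in which exactly two pixels change significantly would trigger a keyword refresh under this sketch.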
例如,假设从该演示文稿确定出的关键词数目为L,从该内容描述信息中确定的关键词数目为M,从该分段语音信息中确定的关键词数目为Q,该L个关键词、该M个关键词和该Q个关键词中没有重复的关键词。For example, suppose that the number of keywords determined from the presentation is L, the number of keywords determined from the content description information is M, and the number of keywords determined from the segmented voice information is Q; there are no duplicate keywords among the L keywords, the M keywords, and the Q keywords.
具体地,该视频分段装置可以先从该内容描述信息中确定M个关键词,然后确定该演示文稿中出现频率最高的L个词。如果该L个词中的一个或多个词也属于该M个关键词,则将该一个或多个词从该L个词中删除,然后继续从该演示文稿中确定出现频率次高的词,直到确定出的L个关键词和该M个关键词没有交集。在此之后,该视频分段装置从该分段语音信息中确定出Q个词。如果该Q个词中的一个或多个词属于该M个关键词或该L个关键词,则将该一个或多个词从该Q个词中删除,然后继续从该分段语音信息中确定出现频率次高的词,直到确定出的Q个关键词与L个关键词和M个关键词都没有交集。Specifically, the video segmentation device may first determine M keywords from the content description information, and then determine the L words that appear most frequently in the presentation. If one or more of the L words also belong to the M keywords, the one or more words are deleted from the L words, and the words with the next-highest frequency continue to be determined from the presentation, until the determined L keywords have no intersection with the M keywords. After that, the video segmentation device determines Q words from the segmented voice information. If one or more of the Q words belong to the M keywords or the L keywords, the one or more words are deleted from the Q words, and the words with the next-highest frequency continue to be determined from the segmented voice information, until the determined Q keywords have no intersection with either the L keywords or the M keywords.
该第三关键词向量包括该Q个关键词、该L个关键词和该M个关键词在该分段语音信息中出现的频率。可以理解的是,如果该目标文本中不包括该内容描述信息,则M的值为0;如果该目标文本中不包括该演示文稿,则L的值为0。The third keyword vector includes the Q keywords, the L keywords and the frequency of the M keywords in the segmented voice information. It is understandable that if the target text does not include the content description information, the value of M is 0; if the target text does not include the presentation, the value of L is 0.
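The deduplicated keyword selection and the construction of the third keyword vector described above can be sketched as follows. This is an illustrative simplification, assuming pre-tokenized word lists; the function names are hypothetical and not from the disclosure:

```python
from collections import Counter

def top_keywords(words, k, exclude=()):
    """Pick the k most frequent words not already in `exclude`,
    mirroring the dedup rule: keep drawing the next-most-frequent
    word until the k picks are disjoint from earlier keyword sets."""
    excluded = set(exclude)
    counts = Counter(w for w in words if w not in excluded)
    return [w for w, _ in counts.most_common(k)]

def third_keyword_vector(segment_words, description_words, slide_words, m, l, q):
    """Return the combined keyword list and, as the third keyword vector,
    the frequency of each of the M + L + Q keywords in the segmented speech."""
    kw_m = top_keywords(description_words, m)
    kw_l = top_keywords(slide_words, l, exclude=kw_m)
    kw_q = top_keywords(segment_words, q, exclude=kw_m + kw_l)
    seg_counts = Counter(segment_words)
    keywords = kw_m + kw_l + kw_q
    return keywords, [seg_counts[w] for w in keywords]
```

If the target text lacks the content description information or the presentation, M or L is simply set to 0, and the corresponding sub-list is empty.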
该视频分段装置可以根据确定的该第三关键词向量,确定该分段的摘要。The video segmentation device may determine the summary of the segment according to the determined third keyword vector.
具体地,该视频分段装置可以根据该目标文本与该分段语音信息的内容,确定参考文本,其中该参考文本包括J个句子,J为大于或等于1的正整数;根据该分段语音信息的关键词、该目标文本的关键词和该J个句子中的每个句子,确定J个关键词向量;根据该第三关键词向量和该J个关键词向量,确定该分段的摘要。该J个关键词向量中的第j个关键词向量是该分段语音信息的关键词和该目标文本的关键词在第j个句子中出现的频率。Specifically, the video segmentation device may determine the reference text according to the content of the target text and the segmented voice information, where the reference text includes J sentences, and J is a positive integer greater than or equal to 1; according to the segmented voice The keywords of the information, the keywords of the target text, and each sentence in the J sentences, determine J keyword vectors; determine the abstract of the segment according to the third keyword vector and the J keyword vectors . The j-th keyword vector in the J keyword vectors is the frequency of occurrence of the keywords of the segmented voice information and the keywords of the target text in the j-th sentence.
在该目标文本中包括冗余的句子的情况下,将该目标文本中的该冗余的句子删除,得到修正目标文本并将该修正目标文本与该分段语音信息的内容合并,得到该参考文本;在该目标文本不包括该冗余的句子的情况下,将该目标文本与该分段语音信息的内容合并,得到该参考文本。换句话说,在该目标文本包括该演示文稿和该内容描述信息的情况下,该演示文稿中的一个或多个句子可能在该内容描述信息中也出现。在此情况下,将该演示文稿中与该内容描述信息相同的一个或多个句子删除,然后将删除了冗余的句子的演示文稿、内容描述信息和该分段语音信息的内容合并,得到该参考文本。如果该目标文本中不包括冗余的句子,例如该演示文稿中的任一个句子在该内容描述信息中均未出现,或者该目标文本中仅包括该演示文稿和该内容描述信息中的一个,则可以直接将该目标文本与该分段语音信息的内容进行合并,得到该参考文本。In the case that the target text includes redundant sentences, the redundant sentences in the target text are deleted to obtain a revised target text, and the revised target text is merged with the content of the segmented voice information to obtain the reference text; in the case that the target text does not include redundant sentences, the target text is merged with the content of the segmented voice information to obtain the reference text. In other words, when the target text includes the presentation and the content description information, one or more sentences in the presentation may also appear in the content description information. In this case, the one or more sentences in the presentation that are the same as those in the content description information are deleted, and then the presentation with the redundant sentences deleted, the content description information, and the content of the segmented voice information are merged to obtain the reference text. If the target text does not include redundant sentences, for example, no sentence in the presentation appears in the content description information, or the target text includes only one of the presentation and the content description information, the target text may be directly merged with the content of the segmented voice information to obtain the reference text.
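The merge-with-deduplication step can be sketched as below. Exact sentence-level matching is an assumption made for illustration; the disclosure does not fix a particular matching rule:

```python
def build_reference_text(slide_sentences, description_sentences, segment_sentences):
    """Merge the target text with the segment's speech content, dropping
    slide sentences that duplicate sentences in the content description."""
    seen = set(description_sentences)
    deduped_slides = [s for s in slide_sentences if s not in seen]
    return description_sentences + deduped_slides + segment_sentences
```

When either the presentation or the content description information is absent, the corresponding input list is empty and the function degenerates to a plain concatenation, matching the no-redundancy case above.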
该视频分段装置根据该第三关键词向量和该J个关键词向量,确定该分段的摘要,包括:该视频分段装置根据该第三关键词向量和该J个关键词向量,确定J个距离,其中该J个距离中的第j个距离是根据该第三关键词向量和该J个关键词向量中的第j个关键词向量确定的,j为大于或等于1且小于或等于J的正整数;确定该J个距离中距离最短的R个距离,R为大于或等于1且小于J的正整数;确定该分段的摘要,其中该分段的摘要包括与该R个距离对应的句子。该视频分段装置根据该第三关键词向量和第j个关键词向量确定第j个距离的具体实现方式与该视频分段装置根据该第一关键词向量和该第二关键词向量确定距离的实现方式类似,区别在于:根据该第三关键词向量和第j个关键词向量确定的第j个距离是欧氏距离;根据该第一关键词向量和该第二关键词向量确定的距离可以是欧氏距离,也可以是余弦距离。该第三关键词向量和第j个关键词向量确定的第j个距离不可以是余弦距离的原因是在计算余弦距离时会对第j个关键词向量进行归一化。但是第j个关键词向量的模长度恰好反映了关键词在句子j中的整体频率,因此不能被归一化。The video segmentation device determines the summary of the segment according to the third keyword vector and the J keyword vectors, including: the video segmentation device determines J distances according to the third keyword vector and the J keyword vectors, where the j-th distance among the J distances is determined according to the third keyword vector and the j-th keyword vector among the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determines the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determines the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances. The specific manner in which the video segmentation device determines the j-th distance according to the third keyword vector and the j-th keyword vector is similar to the manner in which it determines the distance according to the first keyword vector and the second keyword vector, with the following difference: the j-th distance determined according to the third keyword vector and the j-th keyword vector is a Euclidean distance, whereas the distance determined according to the first keyword vector and the second keyword vector may be either a Euclidean distance or a cosine distance. The reason why the j-th distance determined from the third keyword vector and the j-th keyword vector cannot be a cosine distance is that computing the cosine distance normalizes the j-th keyword vector; however, the norm of the j-th keyword vector reflects exactly the overall frequency of the keywords in sentence j, and therefore must not be normalized.
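A minimal sketch of the Euclidean selection above: compute the J distances against the segment-level keyword vector and keep the R closest sentences as the summary (names are illustrative, not from the disclosure):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two keyword-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_summary_sentences(third_vec, sentence_vecs, r):
    """Return the indices of the R sentences whose keyword vectors are
    closest (Euclidean) to the segment-level vector; these sentences
    form the segment summary."""
    dists = [(euclidean(third_vec, v), j) for j, v in enumerate(sentence_vecs)]
    dists.sort()
    return [j for _, j in dists[:r]]
```

Because Euclidean distance is used, a sentence in which the keywords occur with overall frequency close to that of the whole segment ranks high, which is exactly the property the cosine distance would normalize away.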
上述向量(例如第一关键词向量、第二关键词向量、第三关键词向量和第j个关键词向量)都是关键词在特定文本中出现的频率(即词频)。在另一些实施例中,上述向量也可以根据词到向量(word to vector,word2vec)确定的词向量确定。例如,第一关键词向量可以通过以下步骤确定:利用word2vec确定每个关键词的词向量;将所有关键词的词向量相加后取平均,得到该第一关键词向量。第二关键词向量和第一关键词向量的确定方式类似,在此就不必赘述。又如,第三关键词向量可以通过以下步骤确定:利用word2vec确定每个关键词的词向量;确定每个关键词的词频;根据每个关键词的词频,对全部关键词的词向量取加权平均,得到该第三关键词向量。又如,第j个关键词向量可以通过以下步骤确定:对第j个句子进行分词和去除停用词;利用word2vec确定剩下的每个词的词向量;将所有词向量相加取平均,得到第j个关键词向量。在关键词向量是基于word2vec确定的情况下,第三关键词向量和第j个关键词向量之间的距离可以是余弦距离。The above vectors (for example, the first keyword vector, the second keyword vector, the third keyword vector, and the j-th keyword vector) are all frequencies (that is, word frequencies) of keywords in a specific text. In other embodiments, the above vectors may also be determined from word vectors produced by word-to-vector (word2vec). For example, the first keyword vector may be determined through the following steps: use word2vec to determine the word vector of each keyword; add the word vectors of all keywords and take the average to obtain the first keyword vector. The second keyword vector is determined in a manner similar to the first keyword vector, and details are not repeated here. For another example, the third keyword vector may be determined through the following steps: use word2vec to determine the word vector of each keyword; determine the word frequency of each keyword; and take a weighted average of the word vectors of all keywords according to the word frequency of each keyword to obtain the third keyword vector. For another example, the j-th keyword vector may be determined through the following steps: segment the j-th sentence into words and remove stop words; use word2vec to determine the word vector of each remaining word; and add all the word vectors and take the average to obtain the j-th keyword vector. In the case where the keyword vectors are determined based on word2vec, the distance between the third keyword vector and the j-th keyword vector may be a cosine distance.
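The word2vec-based variants (plain averaging, frequency-weighted averaging, cosine distance) can be sketched as below. A real implementation would obtain per-word vectors from a trained word2vec model; here the embeddings are assumed to be given as plain lists:

```python
import math

def mean_vector(vectors):
    """Plain average of word vectors (e.g. for the first keyword vector)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def weighted_mean_vector(vectors, weights):
    """Word-frequency-weighted average (e.g. for the third keyword vector)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def cosine_distance(u, v):
    """1 - cosine similarity; acceptable here because word2vec-based
    vectors carry no frequency information in their norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Note the contrast with the word-frequency case above: once the averaging discards vector norms anyway, normalizing again inside the cosine distance loses nothing, so cosine distance becomes a valid choice.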
图4是根据本申请实施例提供的会议流程的示意图。Fig. 4 is a schematic diagram of a conference process provided according to an embodiment of the present application.
401,会议终端1向会议控制服务器传输音视频流1。401: The conference terminal 1 transmits audio and video stream 1 to the conference control server.
402,会议终端2向会议控制服务器传输音视频流2。402: The conference terminal 2 transmits the audio and video stream 2 to the conference control server.
403,会议终端3向会议控制服务器传输音视频流3。403: The conference terminal 3 transmits the audio and video stream 3 to the conference control server.
404,会议控制服务器确定主会场。404: The conference control server determines the main conference site.
假设会议控制服务器确定的主会场是会议终端1所在的会场。It is assumed that the main conference site determined by the conference control server is the conference site where the conference terminal 1 is located.
405,会议控制服务器将会议数据发送至会议终端2和会议终端3。405. The conference control server sends the conference data to the conference terminal 2 and the conference terminal 3.
406,会议终端2和会议终端3存储会议数据。406. The conference terminal 2 and the conference terminal 3 store conference data.
可选的,在一些实施例中,会议控制服务器也可以将会议数据发送至会议终端1,会议终端1也可以存储会议数据。Optionally, in some embodiments, the conference control server may also send conference data to the conference terminal 1, and the conference terminal 1 may also store the conference data.
407,会议控制服务器实时对音视频流1进行分段(即确定分段点)并提取各个分段的摘要。407. The conference control server segments the audio and video stream 1 in real time (that is, determines the segment point) and extracts a summary of each segment.
408,会议控制服务器将分段点和摘要发送至会议终端2和会议终端3。这样,会议终端2和会议终端3可以自主选择回看点播放回看视频。当然,在一些实现方式中,会议控制服务器也可以将分段点和摘要发送至会议终端1。408: The conference control server sends the segment point and summary to the conference terminal 2 and the conference terminal 3. In this way, the conference terminal 2 and the conference terminal 3 can independently select the review point to play the review video. Of course, in some implementations, the conference control server may also send the segment point and summary to the conference terminal 1.
图5是根据本申请实施例提供的视频分段方法的示意性流程图。Fig. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
501,视频分段装置确定会议预定中是否包括会议内容相关文字。换句话说,视频分段装置可以确定该待处理视频是否包括内容描述信息。若确定结果为是(即该待处理视频包括内容描述信息),则执行步骤502;若确定结果为否(即该待处理视频不包括内容描述信息),则执行步骤503。501. The video segmentation device determines whether the meeting reservation includes text related to the meeting content. In other words, the video segmentation device can determine whether the to-be-processed video includes content description information. If the result of the determination is yes (that is, the video to be processed includes content description information), step 502 is executed; if the result of the determination is no (that is, the video to be processed does not include content description information), then step 503 is executed.
502,该视频分段装置提取该会议内容相关文字的关键词。换句话说,该视频分段装置确定该内容描述信息的关键词。502. The video segmentation device extracts keywords related to the content of the conference. In other words, the video segmentation device determines the keywords of the content description information.
在确定了该内容描述信息的关键词后,可以执行步骤503。After the keywords of the content description information are determined, step 503 may be executed.
503,该视频分段装置确定待处理视频中是否有屏幕展示演示文稿。换句话说,该视频 分段装置可以确定该待处理视频是否包括演示文稿,且该演示文稿是通过屏幕展示的。若确定结果为是(即该待处理视频包括演示文稿),则执行步骤504。若确定结果为否(即该待处理视频不包括演示文稿),则执行步骤505。504,该视频分段装置确定用于展示该演示文稿的屏幕的位置。该视频分段装置在确定了该屏幕的位置之后,可以执行步骤506。503. The video segmentation device determines whether there is a screen display presentation in the to-be-processed video. In other words, the video segmentation device can determine whether the to-be-processed video includes a presentation, and the presentation is displayed on the screen. If the determination result is yes (that is, the to-be-processed video includes a presentation), step 504 is executed. If the determination result is no (that is, the to-be-processed video does not include a presentation), step 505 is executed. 504, the video segmentation device determines the position of the screen for displaying the presentation. After the video segmentation device determines the position of the screen, step 506 may be executed.
505,该视频分段装置确定是否有通过辅流传输的演示文稿。换句话说,在一些可能的实现方式中,会议发言人可能不会通过屏幕展示演示文稿,但是会通过辅流将演示文稿上传至会议控制服务器。其他会场中的会议终端可以根据该辅流获取该会议发言人在发言过程中使用的演示文稿。若确定结果为是(即有通过辅流传输的演示文稿),则执行步骤506。若确定结果为否(即没有通过辅流传输的演示文稿),则可以根据语音信息,确定该待处理视频的分段点。505. The video segmentation device determines whether there is a presentation transmitted through an auxiliary stream. In other words, in some possible implementations, the conference speaker may not display the presentation on the screen, but upload the presentation to the conference control server through the auxiliary stream. The conference terminal in the other conference site can obtain the presentation used by the conference speaker in the speech process according to the auxiliary stream. If the determination result is yes (that is, there is a presentation transmitted through the auxiliary stream), step 506 is executed. If the result of the determination is no (that is, there is no presentation transmitted through the auxiliary stream), the segmentation point of the to-be-processed video can be determined according to the voice information.
506,该视频分段装置确定上一分段点到当前时刻的时长是否超过第一预设时长。若该视频分段装置确定上一分段点到当前时刻的时长大于该第一预设时长(即确定结果为是),则执行步骤507。若该视频分段装置确定上一分段点到当前时刻的时长不大于该第一预设时长,则执行步骤508。可以理解的是,若该视频分段装置确定的分段点是第一个分段点,则上一分段点是指待处理视频的起始时刻。为了便于描述,可以将上一分段点到当前时刻的时长称为演示时长。506. The video segmentation device determines whether the duration from the previous segment point to the current moment exceeds a first preset duration. If the video segmentation device determines that the duration from the previous segment point to the current moment is greater than the first preset duration (that is, the determination result is yes), step 507 is performed. If the video segmentation device determines that the duration from the previous segment point to the current moment is not greater than the first preset duration, step 508 is performed. It can be understood that, if the segment point determined by the video segmentation device is the first segment point, the previous segment point refers to the start time of the video to be processed. For ease of description, the duration from the previous segment point to the current moment may be referred to as the presentation duration.
507,该视频分段装置根据内容描述信息和语音信息,确定该待处理视频的分段点。507. The video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information.
508,该视频分段装置根据演示文稿和语音信息,确定该待处理视频的分段点。该视频分段装置根据该演示文稿和语音信息,确定该待处理视频的分段点的具体实现方式,可以参考图3所示的实施例,在此就不必赘述。508. The video segmentation device determines the segmentation point of the video to be processed according to the presentation and voice information. For the specific implementation manner of the video segmentation device determining the segmentation point of the to-be-processed video based on the presentation and voice information, reference may be made to the embodiment shown in FIG. 3, and it is unnecessary to repeat it here.
该视频分段装置在确定了该待处理视频的分段点后,可以执行步骤509和步骤510。The video segmentation apparatus may perform step 509 and step 510 after determining the segmentation point of the video to be processed.
509,该视频分段装置确定分段语音信息以及该分段语音信息的关键词,分段语音信息是在分段点的上一个分段点和该分段点之间的语音信息。可以理解的是,若该分段点是待处理视频的第一个分段点,则该分段语音信息是该待处理视频的起始时刻到该分段点之间的语音信息。509. The video segmentation device determines segmented voice information and keywords of the segmented voice information, where the segmented voice information is the voice information between the previous segment point of the segment point and the segment point. It is understandable that if the segment point is the first segment point of the video to be processed, the segment voice information is the voice information from the start time of the video to be processed to the segment point.
510,该视频分段装置根据该分段语音信息,该分段语音信息的关键词和目标文本的关键词,确定分段摘要。步骤509和510的具体实现方式可以参考图3所示的实施例,在此就不必赘述。510. The video segmentation device determines a segmented summary according to the segmented voice information, keywords of the segmented voice information, and keywords of the target text. For the specific implementation of steps 509 and 510, reference may be made to the embodiment shown in FIG. 3, and details are not required here.
可以理解的是,在另一些可能的实现方式中,该视频分段装置在对视频进行分段和提取摘要的过程中,可以先确定待处理视频中是否有通过屏幕展示的演示文稿,然后再确定会议预定中是否包括会议内容相关文字,最后再确定是否有通过辅流传输的演示文稿。在另一些可能的实现方式中,该视频分段装置还可以先确定是否有通过辅流传输的演示文稿,然后再确定会议预定中是否包括会议内容相关文字,最后再确定待处理视频中是否有通过屏幕展示的演示文稿。It can be understood that, in other possible implementations, during the process of segmenting the video and extracting summaries, the video segmentation device may first determine whether the to-be-processed video includes a presentation displayed on a screen, then determine whether the meeting reservation includes text related to the meeting content, and finally determine whether there is a presentation transmitted through the auxiliary stream. In still other possible implementations, the video segmentation device may first determine whether there is a presentation transmitted through the auxiliary stream, then determine whether the meeting reservation includes text related to the meeting content, and finally determine whether the to-be-processed video includes a presentation displayed on a screen.
下面将结合图6对该视频分段装置如何根据内容描述信息和语音信息,确定该待处理视频的分段点进行描述。此外,该视频分段装置如何根据该语音信息,确定该待处理视频的分段点的实现方式也可以参见图6。How the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information will be described below in conjunction with FIG. 6. In addition, how the video segmentation apparatus determines the segmentation point of the to-be-processed video according to the voice information can also refer to FIG. 6.
图6是根据本申请实施例提供的一种视频分段的方法的示意性流程图。Fig. 6 is a schematic flowchart of a method for video segmentation according to an embodiment of the present application.
601,该视频分段装置以窗口长度W和步长S,持续在该语音信息上截取语音信息片段。601. The video segmentation device continuously intercepts voice information segments on the voice information with the window length W and the step size S.
602,该视频分段装置提取每个语音信息片段的关键词。具体地,该视频分段装置从每个语音信息片段中提取N个关键词。602. The video segmentation device extracts keywords for each voice information segment. Specifically, the video segmentation device extracts N keywords from each voice information segment.
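Steps 601 and 602 can be sketched as a sliding window over time-stamped speech, with window length W and step S. The (timestamp, text) representation is an assumption made for illustration:

```python
def speech_windows(utterances, window_len, step):
    """Slide a window of length W over time-stamped utterances with step S.

    `utterances` is a list of (start_time, text) pairs, with window
    boundaries in the same (arbitrary) time unit. Each returned list is
    one voice information segment from which N keywords would be drawn."""
    if not utterances:
        return []
    end = utterances[-1][0]
    windows, t = [], 0.0
    while t <= end:
        windows.append([txt for ts, txt in utterances if t <= ts < t + window_len])
        t += step
    return windows
```

Choosing S smaller than W makes consecutive windows overlap, so a topic change falling near a window boundary is still reflected in the distance between adjacent windows' keyword vectors.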
如果该视频分段装置提取过内容描述信息的关键词,则在步骤602之后可以执行步骤603;若该视频分段装置没有提取过内容描述信息的关键词,则在步骤602之后可以执行步骤604。该视频分段装置提取过内容描述信息的关键词意味着该视频分段装置确定该待处理视频包括内容描述信息。在此情况下,该视频分段装置确定的分段点是根据内容描述信息和语音信息确定的。该视频分段装置没有提取过内容描述信息的关键词意味着该视频分段装置确定该待处理视频不包括内容描述信息。在此情况下,该视频分段装置确定的分段点是根据语音信息确定的。If the video segmentation device has extracted keywords of the content description information, step 603 may be performed after step 602; if the video segmentation device has not extracted keywords of the content description information, step 604 may be performed after step 602. The fact that the video segmentation device has extracted keywords of the content description information means that it has determined that the to-be-processed video includes content description information. In this case, the segment point determined by the video segmentation device is determined according to the content description information and the voice information. The fact that the video segmentation device has not extracted keywords of the content description information means that it has determined that the to-be-processed video does not include content description information. In this case, the segment point determined by the video segmentation device is determined according to the voice information.
603,该视频分段装置确定第i个语音信息片段中的关键词和该内容描述信息的关键词在第i个语音信息片段中的词频向量C_i。603. The video segmentation device determines the keyword in the i-th voice information segment and the word frequency vector C_i of the keyword of the content description information in the i-th voice information segment.
用于确定第i个语音信息片段中的关键词的方法可以参见图3所示的实施例。具体地,可以参考图3所示实施例中确定该第一语音信息片段的关键词的确定方法,在此就不必赘述。用于确定该内容描述信息的关键词的方法可以参见图3所示的实施例,在此就不必赘述。该视频分段装置确定第i个语音信息片段中的关键词和该内容描述信息的关键词在第i个语音信息片段中的词频向量的实现方式可以参见图3所示实施例中第一关键词向量的确定方式,在此就不必赘述。For the method for determining the keywords in the i-th voice information segment, refer to the embodiment shown in FIG. 3. Specifically, refer to the method for determining the keywords of the first voice information segment in the embodiment shown in FIG. 3; details are not repeated here. For the method for determining the keywords of the content description information, refer to the embodiment shown in FIG. 3; details are not repeated here. For the manner in which the video segmentation device determines the word frequency vector, in the i-th voice information segment, of the keywords in the i-th voice information segment and the keywords of the content description information, refer to the manner of determining the first keyword vector in the embodiment shown in FIG. 3; details are not repeated here.
604,该视频分段装置确定第i个语音信息片段中的关键词在第i个语音信息片段中的词频向量C_i。第i个语音信息片段中的关键词在第i个语音信息片段中的词频向量的确定方式与第i个语音信息片段中的关键词和该内容描述信息的关键词在第i个语音信息片段中的词频向量C_i确定方式类似,在此就不必赘述。604. The video segmentation device determines the word frequency vector C_i of the keyword in the i-th voice information segment in the i-th voice information segment. The method of determining the word frequency vector of the keyword in the i-th voice information segment in the i-th voice information segment and the keyword in the i-th voice information segment and the keywords of the content description information are in the i-th voice information segment The method for determining the word frequency vector C_i in is similar, so there is no need to repeat it here.
该视频分段装置在执行了步骤603或步骤604之后,可以依次执行步骤605和步骤606。After performing step 603 or step 604, the video segmentation apparatus may perform step 605 and step 606 in sequence.
605,该视频分段装置确定C_i和C_(i-1)之间的距离。C_(i-1)是该视频分段装置确定的第i-1个语音信息片段的关键词(或者第i-1个语音信息片段的关键词和该内容描述信息的关键词)在第i-1个语音信息片段中的词频向量。第i-1个语音信息片段是第i个语音信息片段之前的一个语音信息片段。605. The video segmentation device determines the distance between C_i and C_(i-1). C_(i-1) is the word frequency vector, determined by the video segmentation device, of the keywords of the (i-1)-th voice information segment (or the keywords of the (i-1)-th voice information segment and the keywords of the content description information) in the (i-1)-th voice information segment. The (i-1)-th voice information segment is the voice information segment immediately before the i-th voice information segment.
606,若C_i和C_(i-1)之间的距离大于预设距离,则可以确定分段点位于第i个语音信息片段前后。该视频分段装置在确定出分段点位于第i个语音信息片段前后的情况下,可以根据停顿点确定该分段点。该视频分段装置根据停顿点确定分段点的具体实现方式可以参考图3所示的实施例,在此就不必赘述。606: If the distance between C_i and C_(i-1) is greater than the preset distance, it may be determined that the segment point is located before and after the i-th voice information segment. When the video segmentation device determines that the segment point is located before and after the i-th voice information segment, the segment point can be determined according to the pause point. For the specific implementation manner of the video segmentation device determining the segmentation point according to the pause point, reference may be made to the embodiment shown in FIG. 3, and it is not necessary to repeat it here.
若C_i和C_(i-1)之间的距离小于或等于该预设距离,则可以认为分段点不在第i个语音信息片段和第i-1个语音信息片段中。在此情况下,可以继续确定下一个语音信息片段的词频向量和第i个语音信息片段的词频向量。If the distance between C_i and C_(i-1) is less than or equal to the preset distance, it can be considered that the segment point is not in the i-th voice information segment and the i-1-th voice information segment. In this case, the word frequency vector of the next speech information segment and the word frequency vector of the i-th speech information segment can be determined continuously.
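Steps 603 to 606 hinge on comparing consecutive word-frequency vectors against a preset distance. A minimal sketch follows; the Euclidean distance, the keyword list, and the threshold value are all assumptions for illustration:

```python
import math

def detect_boundary(prev_counts, cur_counts, keywords, preset_distance):
    """Build C_(i-1) and C_i over a shared keyword list and report whether
    their distance exceeds the preset distance, which suggests a segment
    point near the i-th voice information segment."""
    c_prev = [prev_counts.get(w, 0) for w in keywords]
    c_cur = [cur_counts.get(w, 0) for w in keywords]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c_prev, c_cur)))
    return dist > preset_distance
```

When the distance stays at or below the threshold, the loop simply advances to the next window, exactly as described above; when it exceeds the threshold, the precise segment point is then refined using the pause points.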
图7是根据本申请实施例提供的视频分段装置的结构框图。如图7所示,视频分段装置700包括获取单元701和处理单元702。Fig. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application. As shown in FIG. 7, the video segmentation device 700 includes an acquisition unit 701 and a processing unit 702.
获取单元701,用于获取待处理视频的文本信息和该待处理视频的语音信息,其中该文本信息包括该待处理视频中的演示文稿和该待处理视频的内容描述信息中的至少一个。The acquiring unit 701 is configured to acquire text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed.
处理单元702,用于根据该文本信息和该语音信息,确定该待处理视频的分段点。The processing unit 702 is configured to determine the segment point of the video to be processed according to the text information and the voice information.
处理单元702,还用于根据该分段点,对该待处理视频进行分段。The processing unit 702 is further configured to segment the to-be-processed video according to the segmentation point.
获取单元701和处理单元702的具体功能和有益效果可以参见图3至图6所示的方法,在此就不再赘述。The specific functions and beneficial effects of the acquiring unit 701 and the processing unit 702 can be referred to the methods shown in FIG. 3 to FIG. 6, which will not be repeated here.
图8是根据本申请实施例提供的视频分段装置的结构框图。图8所示的视频分段装置800包括:处理器801、存储器802和收发器803。Fig. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application. The video segmentation device 800 shown in FIG. 8 includes a processor 801, a memory 802, and a transceiver 803.
处理器801、存储器802和收发器803之间通过内部连接通路互相通信,传递控制和/或数据信号。The processor 801, the memory 802, and the transceiver 803 communicate with each other through internal connection paths, and transfer control and/or data signals.
上述本申请实施例揭示的方法可以应用于处理器801中,或者由处理器801实现。处理器801可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器801中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器801可以是通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存取存储器(random access memory,RAM)、闪存、只读存储器(read-only memory,ROM)、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器802,处理器801读取存储器802中的指令,结合其硬件完成上述方法的步骤。The method disclosed in the foregoing embodiment of the present application may be applied to the processor 801 or implemented by the processor 801. The processor 801 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 801 or instructions in the form of software. The aforementioned processor 801 may be a general-purpose processor, a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (ASIC), a ready-made programmable gate array (field programmable gate array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. 
The software module may be located in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory, an electrically erasable programmable memory, a register, or another storage medium mature in the art. The storage medium is located in the memory 802; the processor 801 reads the instructions in the memory 802 and completes the steps of the above method in combination with its hardware.
可选的,在一些实施例中,存储器802可以存储用于执行如图3至图6所示方法中视频分段装置执行的方法的指令。处理器801可以执行存储器802中存储的指令结合其他硬件(例如收发器803)完成如图3至图6所示方法中视频分段装置的步骤,具体工作过程和有益效果可以参见图3至图6所示实施例中的描述。Optionally, in some embodiments, the memory 802 may store instructions for executing the method executed by the video segmentation apparatus in the methods shown in FIGS. 3 to 6. The processor 801 can execute the instructions stored in the memory 802 in combination with other hardware (such as the transceiver 803) to complete the steps of the video segmentation device in the method shown in FIG. 3 to FIG. 6. The specific working process and beneficial effects can be seen in FIGS. 3 to 6 shows the description in the embodiment.
本申请实施例还提供一种芯片,该芯片包括收发单元和处理单元。其中,收发单元可以是输入输出电路、通信接口;处理单元为该芯片上集成的处理器或者微处理器或者集成电路。该芯片可以执行上述方法实施例中视频分段装置的方法。An embodiment of the present application also provides a chip, which includes a transceiver unit and a processing unit. Among them, the transceiver unit may be an input/output circuit or a communication interface; the processing unit is a processor or microprocessor or integrated circuit integrated on the chip. The chip can execute the method of the video segmentation device in the above method embodiment.
本申请实施例还提供一种计算机可读存储介质,其上存储有指令,该指令被执行时执行上述方法实施例中视频分段装置的方法。The embodiment of the present application also provides a computer-readable storage medium on which an instruction is stored, and when the instruction is executed, the method of the video segmentation device in the foregoing method embodiment is executed.
本申请实施例还提供一种包含指令的计算机程序产品,该指令被执行时执行上述方法实施例中视频分段装置的方法。The embodiment of the present application also provides a computer program product containing instructions that, when executed, execute the method of the video segmentation device in the foregoing method embodiment.
本领域普通技术人员可以意识到,结合本申请中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。A person of ordinary skill in the art may realize that the units and algorithm steps described in the examples in combination with the embodiments disclosed in this application can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (24)

  1. A video segmentation method, comprising:
    obtaining, by a video segmentation apparatus, text information of a to-be-processed video and voice information of the to-be-processed video, wherein the text information comprises at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video;
    determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information; and
    segmenting, by the video segmentation apparatus, the to-be-processed video according to the segmentation point.
  2. The method according to claim 1, wherein in a case in which the text information comprises the presentation, the determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information comprises:
    determining a switching point of the presentation, wherein content presented by the presentation differs before and after the switching point;
    determining at least one pause point according to the voice information; and
    determining the segmentation point according to the switching point and the at least one pause point.
  3. The method according to claim 2, wherein the determining the segmentation point according to the switching point and the at least one pause point comprises:
    in a case in which the switching point is the same as one of the at least one pause point, determining the switching point as the segmentation point; and
    in a case in which none of the at least one pause point is the same as the switching point, determining, as the segmentation point, a pause point that is in the at least one pause point and that is closest to the switching point.
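Claims 2 and 3 together define a simple snapping rule: the slide-switching moment itself becomes the segmentation point when it coincides with a pause in the speech, and otherwise the nearest pause point is used. A minimal illustrative sketch of that rule (the function name and the float-seconds representation are assumptions; the claims do not prescribe any API):

```python
def choose_segment_point(switch_t, pause_ts):
    """Snap a slide-switching moment to a speech pause (claims 2-3).

    If the switching time coincides with a pause point, it is used
    directly; otherwise the pause point nearest in time is chosen.
    """
    if switch_t in pause_ts:
        return switch_t
    # No exact match: pick the pause point closest to the switch.
    return min(pause_ts, key=lambda p: abs(p - switch_t))
```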
  4. The method according to claim 2 or 3, wherein the determining a switching point of the presentation comprises: determining, as the switching point, a moment at which a switching signal used to instruct to switch the content of the presentation is obtained.
  5. The method according to any one of claims 2 to 4, wherein the text information further comprises the content description information, and before the determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information, the method further comprises:
    determining that a presentation duration of a current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
  6. The method according to claim 1, wherein in a case in which the text information comprises the content description information, the determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information comprises:
    determining the segmentation point of the to-be-processed video according to the voice information, a keyword of the content description information, and a pause point in the voice information.
  7. The method according to claim 6, wherein the voice information comprises a first voice information segment and a second voice information segment, the second voice information segment being a voice information segment that precedes and is adjacent to the first voice information segment, and
    the determining the segmentation point of the to-be-processed video according to the voice information, a keyword of the content description information, and a pause point in the voice information comprises:
    determining a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information, and the pause point in the voice information, wherein the segmentation point of the to-be-processed video comprises the first segmentation point.
  8. The method according to claim 7, wherein the determining a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information, and the pause point in the voice information comprises:
    determining a similarity between the first voice information segment and the second voice information segment according to a keyword of the first voice information segment, a keyword of the second voice information segment, content of the first voice information segment, content of the second voice information segment, and the keyword of the content description information;
    determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and
    determining the first segmentation point according to the pause point in the voice information.
  9. The method according to claim 8, wherein the pause point in the voice information comprises a pause point within the first voice information segment or a pause point adjacent to the first voice information segment, and the determining the first segmentation point according to the pause point in the voice information comprises:
    determining the first segmentation point according to at least one of a quantity of pause points within the first voice information segment, a quantity of pause points adjacent to the first voice information segment, a pause duration, and a word adjacent to a pause point.
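Claims 7 to 9 describe a two-stage decision: first, the similarity between two adjacent speech segments is compared against a threshold; only when it falls below the threshold is a segmentation point then placed using pause features. One plausible sketch of the similarity stage, in which words shared by both segments are weighted more heavily if they also appear in the content description (the formula, the weights, and the threshold value are assumptions; claim 8 only requires *some* similarity measure over these inputs):

```python
def keyword_similarity(seg_a_words, seg_b_words, topic_words):
    """Similarity between two adjacent speech segments (claim 8).

    Overlap of the segments' words, with words that also appear in
    the content description counted double.
    """
    a, b = set(seg_a_words), set(seg_b_words)
    if not a and not b:
        return 1.0
    weight = lambda w: 2.0 if w in topic_words else 1.0
    shared = sum(weight(w) for w in a & b)
    total = sum(weight(w) for w in a | b)
    return shared / total

SIM_THRESHOLD = 0.3  # hypothetical value; claim 8 only requires a threshold

def is_topic_boundary(seg_a_words, seg_b_words, topic_words):
    # Below-threshold similarity marks a candidate segmentation
    # point, whose exact position is then refined from pause
    # features (claim 9).
    return keyword_similarity(seg_a_words, seg_b_words, set(topic_words)) < SIM_THRESHOLD
```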
  10. The method according to any one of claims 6 to 9, wherein the text information further comprises the presentation, and before the determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information, the method further comprises:
    determining that a presentation duration of a current page of the presentation is greater than a first preset duration; or
    determining that a presentation duration of a current page of the presentation is less than or equal to a second preset duration.
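The preconditions of claims 5 and 10 are complementary: they route between the two segmentation paths based on how long the current slide stays on screen. A hypothetical selector making that routing explicit (the interpretation that a "normal" dwell time favors the slide-switch path is an assumption drawn from the two preconditions; names and thresholds are illustrative):

```python
def pick_strategy(page_duration, first_preset, second_preset):
    """Route between the two segmentation paths (claims 5 and 10).

    When a slide is shown for a "normal" time (more than
    second_preset, at most first_preset), slide switches are a
    trustworthy cue and the presentation-based path of claims 2-5
    applies; otherwise fall back to the speech-and-keyword path of
    claims 6-10.
    """
    if second_preset < page_duration <= first_preset:
        return "presentation"
    return "speech"
```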
  11. The method according to any one of claims 1 to 10, further comprising: determining, by the video segmentation apparatus, an abstract of a segment according to content of segment voice information, a keyword of the segment voice information, and a keyword of a target text, wherein the target text comprises at least one of the presentation and the content description information.
  12. The method according to any one of claims 1 to 11, wherein the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is voice information of the real-time video stream from a start moment of the real-time video stream or a previous segmentation point to a current moment.
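Claim 12 restricts the speech considered at any moment of a live stream to the window running from the stream start, or the most recent segmentation point, up to now. A toy accumulator illustrating that windowing (class and method names are hypothetical; a real system would hold audio buffers rather than timestamps):

```python
class LiveSegmenter:
    """Tracks the speech window described in claim 12: everything
    from the last segmentation point (or the stream start) to the
    current moment."""

    def __init__(self):
        self.window_start = 0.0  # stream start
        self.segments = []       # closed (start, end) windows

    def on_segment_point(self, t):
        # Close the current window and open a new one at t.
        self.segments.append((self.window_start, t))
        self.window_start = t

    def current_window(self, now):
        # The span of speech the segmenter may consult right now.
        return (self.window_start, now)
```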
  13. A video segmentation apparatus, comprising:
    an obtaining unit, configured to obtain text information of a to-be-processed video and voice information of the to-be-processed video, wherein the text information comprises at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video; and
    a processing unit, configured to determine a segmentation point of the to-be-processed video according to the text information and the voice information,
    wherein the processing unit is further configured to segment the to-be-processed video according to the segmentation point.
  14. The video segmentation apparatus according to claim 13, wherein the processing unit is specifically configured to: in a case in which the text information comprises the presentation, determine a switching point of the presentation according to the text information and the voice information, wherein content presented by the presentation differs before and after the switching point;
    determine at least one pause point according to the voice information; and
    determine the segmentation point according to the switching point and the at least one pause point.
  15. The video segmentation apparatus according to claim 14, wherein the processing unit is specifically configured to:
    in a case in which the switching point is the same as one of the at least one pause point, determine the switching point as the segmentation point; and
    in a case in which none of the at least one pause point is the same as the switching point, determine, as the segmentation point, a pause point that is in the at least one pause point and that is closest to the switching point.
  16. The video segmentation apparatus according to claim 14 or 15, wherein the processing unit is specifically configured to determine, as the switching point, a moment at which a switching signal used to instruct to switch the content of the presentation is obtained.
  17. The video segmentation apparatus according to any one of claims 14 to 16, wherein the processing unit is further configured to: in a case in which the text information further comprises the content description information, before determining the segmentation point of the to-be-processed video according to the text information and the voice information, determine that a presentation duration of a current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
  18. The video segmentation apparatus according to claim 13, wherein the processing unit is specifically configured to: in a case in which the text information comprises the content description information, determine the segmentation point of the to-be-processed video according to the voice information, a keyword of the content description information, and a pause point in the voice information.
  19. The video segmentation apparatus according to claim 18, wherein the voice information comprises a first voice information segment and a second voice information segment, the second voice information segment being a voice information segment that precedes and is adjacent to the first voice information segment, and
    the processing unit is specifically configured to determine a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information, and the pause point in the voice information, wherein the segmentation point of the to-be-processed video comprises the first segmentation point.
  20. The video segmentation apparatus according to claim 19, wherein the processing unit is specifically configured to: determine a similarity between the first voice information segment and the second voice information segment according to a keyword of the first voice information segment, a keyword of the second voice information segment, content of the first voice information segment, content of the second voice information segment, and the keyword of the content description information;
    determine that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and
    determine the first segmentation point according to the pause point in the voice information.
  21. The video segmentation apparatus according to claim 20, wherein the pause point in the voice information comprises a pause point within the first voice information segment or a pause point adjacent to the first voice information segment, and the processing unit is specifically configured to determine the first segmentation point according to at least one of a quantity of pause points within the first voice information segment, a quantity of pause points adjacent to the first voice information segment, a pause duration, and a word adjacent to a pause point.
  22. The video segmentation apparatus according to any one of claims 18 to 21, wherein the processing unit is further configured to: in a case in which the text information further comprises the presentation, before determining the segmentation point of the to-be-processed video according to the text information and the voice information, determine that a presentation duration of a current page of the presentation is greater than a first preset duration; or
    determine that a presentation duration of a current page of the presentation is less than or equal to a second preset duration.
  23. The video segmentation apparatus according to any one of claims 13 to 22, wherein the processing unit is further configured to determine an abstract of a segment according to content of segment voice information, a keyword of the segment voice information, and a keyword of a target text, wherein the target text comprises at least one of the presentation and the content description information.
  24. The video segmentation apparatus according to any one of claims 13 to 23, wherein the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is voice information of the real-time video stream from a start moment of the real-time video stream or a previous segmentation point to a current moment.
PCT/CN2020/083397 2019-05-07 2020-04-05 Video segmentation method and video segmentation device WO2020224362A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910376477.2 2019-05-07
CN201910376477.2A CN111918145B (en) 2019-05-07 2019-05-07 Video segmentation method and video segmentation device

Publications (1)

Publication Number Publication Date
WO2020224362A1 true WO2020224362A1 (en) 2020-11-12

Family

ID=73051391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083397 WO2020224362A1 (en) 2019-05-07 2020-04-05 Video segmentation method and video segmentation device

Country Status (2)

Country Link
CN (1) CN111918145B (en)
WO (1) WO2020224362A1 (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187642A1 (en) * 2002-03-29 2003-10-02 International Business Machines Corporation System and method for the automatic discovery of salient segments in speech transcripts
CN102547139A (en) * 2010-12-30 2012-07-04 北京新岸线网络技术有限公司 Method for splitting news video program, and method and system for cataloging news videos
US20130028574A1 (en) * 2011-07-29 2013-01-31 Xerox Corporation Systems and methods for enriching audio/video recordings
WO2013097101A1 (en) * 2011-12-28 2013-07-04 华为技术有限公司 Method and device for analysing video file
CN104519401A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Video division point acquiring method and equipment
CN104540044A (en) * 2014-12-30 2015-04-22 北京奇艺世纪科技有限公司 Video segmentation method and device
CN106982344A (en) * 2016-01-15 2017-07-25 阿里巴巴集团控股有限公司 video information processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU YINGYING, ZHOU DONGRU: "Vision, Speech and Text for Video Segmentation", COMPUTER ENGINEERING AND APPLICATIONS, no. 3, 1 February 2001 (2001-02-01), pages 85 - 87, XP055752287, ISSN: 1002-8331 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114051154A (en) * 2021-11-05 2022-02-15 新华智云科技有限公司 News video strip splitting method and system
CN114363695A (en) * 2021-11-11 2022-04-15 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN114363695B (en) * 2021-11-11 2023-06-13 腾讯科技(深圳)有限公司 Video processing method, device, computer equipment and storage medium
CN114173191A (en) * 2021-12-09 2022-03-11 上海开放大学 Multi-language question answering method and system based on artificial intelligence
CN114173191B (en) * 2021-12-09 2024-03-19 上海开放大学 Multi-language answering method and system based on artificial intelligence
CN114245229A (en) * 2022-01-29 2022-03-25 北京百度网讯科技有限公司 Short video production method, device, equipment and storage medium
CN114245229B (en) * 2022-01-29 2024-02-06 北京百度网讯科技有限公司 Short video production method, device, equipment and storage medium
CN115209233A (en) * 2022-06-25 2022-10-18 平安银行股份有限公司 Video playing method and related device and equipment
CN115209233B (en) * 2022-06-25 2023-08-25 平安银行股份有限公司 Video playing method, related device and equipment
CN118012979A (en) * 2024-04-10 2024-05-10 济南宝林信息技术有限公司 Intelligent acquisition and storage system for common surgical operation

Also Published As

Publication number Publication date
CN111918145B (en) 2022-09-09
CN111918145A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
WO2020224362A1 (en) Video segmentation method and video segmentation device
US10497382B2 (en) Associating faces with voices for speaker diarization within videos
US10037313B2 (en) Automatic smoothed captioning of non-speech sounds from audio
WO2021109678A1 (en) Video generation method and apparatus, electronic device, and storage medium
US9672827B1 (en) Real-time conversation model generation
US20170371496A1 (en) Rapidly skimmable presentations of web meeting recordings
WO2023011094A1 (en) Video editing method and apparatus, electronic device, and storage medium
US11281707B2 (en) System, summarization apparatus, summarization system, and method of controlling summarization apparatus, for acquiring summary information
JP6339529B2 (en) Conference support system and conference support method
CN111050201A (en) Data processing method and device, electronic equipment and storage medium
US20190199939A1 (en) Suggestion of visual effects based on detected sound patterns
US9563704B1 (en) Methods, systems, and media for presenting suggestions of related media content
CN105590627A (en) Image display apparatus, method for driving same, and computer readable recording medium
JP2016046705A (en) Conference record editing apparatus, method and program for the same, conference record reproduction apparatus, and conference system
US20200403816A1 (en) Utilizing volume-based speaker attribution to associate meeting attendees with digital meeting content
US20200151208A1 (en) Time code to byte indexer for partial object retrieval
JP6690442B2 (en) Presentation support device, presentation support system, presentation support method, and presentation support program
CN113301382B (en) Video processing method, device, medium, and program product
US10123090B2 (en) Visually representing speech and motion
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
KR102226427B1 (en) Apparatus for determining title of user, system including the same, terminal and method for the same
US11128927B2 (en) Content providing server, content providing terminal, and content providing method
EP4322090A1 (en) Information processing device and information processing method
CN116088675A (en) Virtual image interaction method, related device, equipment, system and medium
KR20160055511A (en) Apparatus, method and system for searching video using rhythm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20802337; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20802337; Country of ref document: EP; Kind code of ref document: A1