WO2020224362A1 - Video segmentation method and video segmentation device - Google Patents


Info

Publication number
WO2020224362A1
WO2020224362A1 (PCT/CN2020/083397)
Authority
WO
WIPO (PCT)
Prior art keywords
voice information
point
video
segment
presentation
Prior art date
Application number
PCT/CN2020/083397
Other languages
French (fr)
Chinese (zh)
Inventor
苏芸 (Su Yun)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2020224362A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 - Processing of audio elementary streams
    • H04N21/4394 - Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 - Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 - Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • H04N21/85 - Assembly of content; Generation of multimedia applications
    • H04N21/854 - Content authoring
    • H04N21/8549 - Creating video summaries, e.g. movie trailer

Definitions

  • This application relates to the field of information technology, and more specifically, to a video segmentation method and a video segmentation device.
  • a complete video can be divided into multiple segments. In this way, the user can directly watch the segment of interest.
  • a common video segmentation method is to segment the video based on the text information in the video.
  • the text information in the above video may be subtitles in the video, or text obtained by performing voice recognition on the video.
  • the basis for segmenting a video currently comes from the video itself.
  • the current video segmentation based on text information in the video needs to obtain all text information of the video.
  • the video stream of a live video is generated in real time, so all text information of the video can be obtained only after the live broadcast ends. The above method therefore cannot segment a live video in real time.
  • the above method segments the video according to its text information only, so the determined segmentation point is not necessarily a suitable one.
  • the present application provides a video segmentation method and a video segmentation device, which can improve the accuracy of video segmentation.
  • an embodiment of the present application provides a video segmentation method, including: a video segmentation device acquires text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed; the video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information; and the video segmentation device segments the video to be processed according to the segmentation point.
  • the above technical solutions can combine information other than the content of the to-be-processed video to segment the to-be-processed video, thereby improving the accuracy of segmentation.
  • the video segmentation device determining the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining a switching point of the presentation, where the presentation presents different content before and after the switching point; determining at least one pause point according to the voice information; and determining the segmentation point according to the switching point and the at least one pause point.
  • a switch of the presentation often means that the content of the speaker's speech has changed. The above technical solution therefore divides the to-be-processed video into different segments by considering presentation changes, and can determine the segmentation point of the to-be-processed video quickly and reasonably.
  • when determining the segmentation point of the video to be processed, the above technical solution only needs the switching point of the presentation and the pause points near the switching point. The video can therefore be segmented without first obtaining the complete video file; in other words, the video to be processed can be segmented in real time. The above technical solution can thus be applied to segmentation of live video.
  • the determining the segmentation point according to the switching point and the at least one pause point includes: when the switching point is the same as one of the at least one pause point, determining the switching point as the segmentation point; when the switching point differs from every one of the at least one pause point, determining the pause point closest to the switching point among the at least one pause point as the segmentation point.
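  • the two-branch rule above can be sketched in a few lines. This is an illustrative sketch only; the function name and the use of plain second-valued timestamps are our own assumptions, not part of the application.

```python
def choose_segment_point(switch_point, pause_points):
    """Pick the segment point for one presentation switch.

    switch_point: timestamp (seconds) at which the slide changes.
    pause_points: timestamps (seconds) of pauses detected in the speech.
    Rule from the text: if the switch coincides with a pause, the switch
    point itself is the segment point; otherwise use the pause point
    closest to the switch point.
    """
    if switch_point in pause_points:
        return switch_point
    return min(pause_points, key=lambda p: abs(p - switch_point))
```

  • for example, with a slide change at 10.0 s and pauses at 7.0 s and 11.5 s, the pause at 11.5 s is chosen because it is nearer to the switch.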
  • the determining the switching point of the presentation includes: determining, as the switching point, the moment at which a switching signal instructing to switch the content of the presentation is acquired.
  • the text information further includes the content description information
  • before the video segmentation device determines the segmentation point of the to-be-processed video based on the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
  • the video segmentation device determining the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining the segmentation point of the video to be processed according to the voice information, keywords of the content description information, and pause points in the voice information.
  • the content description information is information input by the user in advance to describe the video to be processed.
  • the content description information can usually include some key information in the video to be processed, such as keywords, key content, etc. Therefore, the key content described in different segments of the video to be processed can be determined more accurately based on the content description information, so that the video to be processed can be segmented more accurately.
  • the voice information includes a first voice information segment and a second voice information segment, where the second voice information segment precedes the first voice information segment, and determining the segmentation point of the to-be-processed video includes: determining a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the segmentation point of the to-be-processed video includes the first segmentation point.
  • the location of the segmentation point can be determined based only on the keywords of the content description information and the voice information of two adjacent video segments.
  • the division into video clips can be carried out with a fixed time window and step size. Therefore, during playback, the portion of the video already played can be divided into clips. In this way, the video can be segmented without first obtaining the complete video file; in other words, the video to be processed can be segmented in real time. The above technical solution can thus be applied to segmentation of live video.
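  • division by a fixed window and step size, as described above, can be sketched as follows (illustrative only; the function name and the half-open clip convention are assumptions):

```python
def split_into_clips(total_seconds, window, step):
    """Divide the already-played portion of a stream into (start, end)
    clips using a fixed window length and a fixed step size.

    With step < window the clips overlap; with step == window they tile
    the timeline.  The final clip is truncated at total_seconds.
    """
    clips = []
    start = 0
    while start < total_seconds:
        clips.append((start, min(start + window, total_seconds)))
        start += step
    return clips
```

  • for example, 10 s of played video with a 4 s window and 4 s step yields the clips (0, 4), (4, 8), (8, 10), which can be produced while the live stream is still running.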
  • determining the first segmentation point includes: determining the similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information; determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and determining the first segmentation point according to the pause points in the voice information.
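  • one plausible reading of "determine the similarity according to keywords" is a bag-of-words comparison. The sketch below uses cosine similarity over a fixed keyword vocabulary; the choice of cosine similarity is our assumption, as the application does not fix a particular similarity measure.

```python
from collections import Counter
import math

def keyword_vector(words, vocabulary):
    """Bag-of-words counts of `words` over a fixed keyword vocabulary,
    e.g. the keywords taken from the content description information."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if
    either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

  • two adjacent voice segments whose similarity falls below the threshold would then be treated as belonging to different topics, and a segmentation point is sought among the nearby pause points.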
  • the pause point in the voice information includes a pause point in the first voice information segment or a pause point adjacent to the first voice information segment
  • determining the first segmentation point includes: determining the first segmentation point according to at least one of the number of pause points in the first voice information segment, the number of pause points adjacent to the first voice information segment, the pause duration, and the words adjacent to the pause points.
  • the first voice information segment includes K pause points, or K pause points are adjacent to the first voice information segment. Determining the first segmentation point according to at least one of the number of pause points, the pause duration, and the words adjacent to the pause points includes: when K is equal to 1, determining that pause point as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words adjacent to the K pause points include exactly one preset word, determining the pause point adjacent to that preset word as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words include at least two preset words, determining, among the pause points adjacent to those preset words, the pause point with the longest pause duration as the segmentation point; and when K is a positive integer greater than or equal to 2 and the K words include no preset word, determining, among the K pause points, the pause point with the longest pause duration as the segmentation point.
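  • the branches above can be sketched as follows. Note that the source text is truncated for the case where none of the K adjacent words is a preset word, so the longest-pause fallback in the last branch is our assumption, as are all names and the dictionary layout.

```python
def pick_pause(pauses, preset_words):
    """Choose a segment point from K candidate pauses.

    `pauses` is a list of dicts {"t": time, "dur": pause duration,
    "word": the word adjacent to the pause}; `preset_words` is the set
    of preset words.  Branches: K == 1 -> that pause; exactly one
    preset word -> its pause; two or more preset words -> the longest
    pause among them; no preset word -> the longest pause overall
    (assumed fallback, truncated in the source text).
    """
    if len(pauses) == 1:
        return pauses[0]["t"]
    hits = [p for p in pauses if p["word"] in preset_words]
    if len(hits) == 1:
        return hits[0]["t"]
    candidates = hits if len(hits) >= 2 else pauses
    return max(candidates, key=lambda p: p["dur"])["t"]
```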
  • the text information further includes the presentation
  • before the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is greater than the first preset duration; or determining that the presentation duration of the current page of the presentation is less than or equal to the second preset duration.
  • the method further includes: the video segmentation device determining the summary of the segment according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of a target text, where the target text includes at least one of the presentation and the content description information.
  • the user can use the summary to quickly determine the desired location when reviewing the video.
  • the foregoing technical solution takes information other than the to-be-processed video into account when determining the summary, which can improve both the accuracy of the determined summary and the speed at which it is determined.
  • the video segmentation device determining the summary of the segment based on the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text includes: determining a third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text; and determining the summary of the segment according to the third keyword vector.
  • the video segmentation device determining the summary of the segment according to the third keyword vector includes: determining a reference text according to the target text and the segmented voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determining J keyword vectors according to the keywords of the segmented voice information, the keywords of the target text, and each of the J sentences; and determining the summary of the segment according to the third keyword vector and the J keyword vectors.
  • determining the reference text according to the target text and the segmented voice information includes: when the target text includes redundant sentences, deleting the redundant sentences from the target text to obtain a revised target text, and combining the revised target text with the segmented voice information to obtain the reference text; when the target text includes no redundant sentences, combining the target text with the segmented voice information to obtain the reference text.
  • determining the summary of the segment according to the third keyword vector and the J keyword vectors includes: determining J distances according to the third keyword vector and the J keyword vectors, where the j-th distance among the J distances is determined according to the third keyword vector and the j-th keyword vector among the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determining the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determining the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
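  • the nearest-R-sentences selection can be sketched as follows. Euclidean distance is our assumption here; the application says only "distance" without fixing a metric, and all names are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length keyword vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_summary(segment_vector, sentences, sentence_vectors, r):
    """Return the R sentences of the reference text whose keyword
    vectors are nearest to the segment-level ("third") keyword vector."""
    ranked = sorted(range(len(sentences)),
                    key=lambda j: euclidean(segment_vector, sentence_vectors[j]))
    return [sentences[j] for j in ranked[:r]]
```

  • with J = 3 candidate sentences and R = 1, the single sentence whose vector lies closest to the segment vector becomes the summary.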
  • the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is the voice information of the real-time video stream from the start time of the real-time video stream, or from the previous segmentation point, to the current moment.
  • an embodiment of the present application provides a video segmentation device, and the device includes a unit for executing the first aspect or any possible implementation manner of the first aspect.
  • the video segmentation apparatus of the second aspect may be a computer device, or may be a component (such as a chip or a circuit, etc.) that can be used in a computer device.
  • an embodiment of the present application provides a storage medium, and the storage medium stores instructions for implementing the first aspect or any one of the possible implementation manners of the first aspect.
  • the embodiments of the present application provide a computer program product containing instructions. When the computer program product runs on a computer, the computer executes the method described in the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by the embodiments of the present application can be applied;
  • FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by the embodiments of the present application can be applied;
  • FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a video conference process provided according to an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
  • FIG. 6 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;
  • FIG. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application;
  • FIG. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • “At least one” refers to one or more, and “multiple” refers to two or more.
  • “And/or” describes an association relationship between associated objects and indicates that three relationships are possible. For example, A and/or B can mean: A alone, both A and B, or B alone, where A and B may be singular or plural.
  • the character “/” generally indicates that the associated objects are in an "or” relationship.
  • “At least one of the following” or similar expressions refers to any combination of the listed items, including a single item or any combination of multiple items.
  • At least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be single or multiple.
  • words such as “first” and “second” do not limit the number and execution order.
  • computer-readable media may include, but are not limited to: magnetic storage devices (for example, hard disks, floppy disks, or tapes), optical discs (for example, compact discs (CD) or digital versatile discs (DVD)), smart cards, and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, or key drives).
  • various storage media described herein may represent one or more devices and/or other machine-readable media for storing information.
  • machine-readable medium may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
  • FIG. 1 is a schematic diagram of a system that can apply the video segmentation method provided by this application.
  • FIG. 1 shows a video conference system, which includes a conference control server 101, a conference terminal 111, a conference terminal 112, and a conference terminal 113.
  • the conference terminal 111, the conference terminal 112, and the conference terminal 113 can establish a conference through the conference control server 101.
  • a video conference usually involves at least two conference sites. Each conference site can access the conference control server through a conference terminal.
  • the conference terminal may be a device used to access a video conference.
  • the conference terminal can be used to receive conference data and present the conference content on the display device according to the conference data.
  • the conference terminal may include a host and a display device.
  • the host can receive conference data through a communication interface, generate a video signal according to the received conference data, and output the video signal to the display device in a wired or wireless manner.
  • the display device presents the content of the meeting according to the received video signal.
  • the display device may be built in the host.
  • the conference terminal may be an electronic device with a built-in display device, such as a notebook computer, a tablet computer, or a smart phone.
  • the display device may be a display device externally placed on the host.
  • the host may be a computer host, and the display device may be a display, a television, or a projector.
  • the display device used for presenting conference content may also be a display device external to the host.
  • the host may be a notebook computer, and the display device may be a monitor, a television or a projector externally connected to the notebook computer.
  • a video conference may include a main venue and at least one branch venue.
  • the conference terminal in the main conference site (for example, the conference terminal 111) can upload the collected media stream of the main conference site to the conference control server 101.
  • the conference control server 101 may generate conference data according to the received media stream, and send the conference data to the conference terminals (for example, the conference terminal 112 and the conference terminal 113) in the branch venue.
  • the conference terminal 112 and the conference terminal 113 may present the conference content on the display device according to the received conference data.
  • the conference terminal in each conference site can upload the collected media stream to the conference control server 101.
  • the conference terminal 111 is the conference terminal used to access the video conference in the conference room 1
  • the conference terminal 112 is the conference terminal used to access the video conference in the conference room 2
  • the conference terminal 113 is the conference terminal used to access the video conference in the conference room 3.
  • the conference terminal 111 can upload the collected media stream of the conference site 1 to the conference control server 101.
  • the conference control server 101 can generate conference data 1 according to the media stream of the conference site 1, and send the conference data 1 to the conference terminal 112 and the conference terminal 113.
  • the conference terminal 112 and the conference terminal 113 may present the conference content on the display device according to the received conference data 1.
  • the conference terminal 112 can also upload the collected media stream of the conference site 2 to the conference control server; the conference control server 101 can generate conference data 2 according to the media stream of the conference site 2 and send the conference data 2 to the conference terminal 111 and the conference terminal 113, which can then present the conference content on their display devices according to the received conference data 2.
  • similarly, the conference terminal 113 can upload the collected media stream of the conference site 3 to the conference control server; the conference control server 101 can generate conference data 3 according to the media stream of the conference site 3 and send the conference data 3 to the conference terminal 111 and the conference terminal 112, which can then present the conference content on their display devices according to the received conference data 3.
  • the media stream may be an audio stream.
  • the media stream may be a video stream.
  • the media device responsible for collecting media streams may be built in the conference terminal (for example, a camera and microphone in the conference terminal), or may be externally connected to the conference terminal, which is not limited in this embodiment of the application.
  • the speakers of the conference use presentations during the speech.
  • the media stream may be an audio stream of the speaker.
  • the presentation used by the speaker during the speaking process can be uploaded to the conference control server 101 through an auxiliary stream (also called a data stream or a computer screen stream).
  • the conference control server 101 generates conference data based on the received audio stream and auxiliary stream.
  • the conference data may include the received audio stream and auxiliary stream.
  • the conference data may include a processed audio stream obtained after processing the received audio stream and the auxiliary stream.
  • Processing the received audio stream can be a transcoding operation on the received audio stream, for example, the bit rate of the audio stream can be reduced, so as to reduce the amount of data required to transmit the audio stream to other conference terminals.
  • the conference data may include the received audio stream, an audio stream with a different bit rate from the received audio stream, and the auxiliary stream.
  • the conference terminal can select a suitable audio stream according to the network condition and/or the way of accessing the conference. For example, if the network condition of the conference terminal is good or the conference is accessed over Wi-Fi, an audio stream with a higher bit rate can be selected so that a clearer sound can be heard.
  • in addition to audio streams at one or more bit rates and the auxiliary stream, the conference data may include subtitles corresponding to the speaker's speech.
  • the subtitles can be generated by speech-to-text conversion based on speech recognition technology, recorded manually, or generated by speech-to-text conversion combined with manual correction.
  • the media stream may be a video stream of the speaker during the speech.
  • the media stream can include both the voice information and picture information of the speaker during the speech.
  • the media stream uploaded to the conference control server 101 is the video stream.
  • the speaker uses a presentation during the speech and uses an output device (such as a projector, a television, etc.) to show the presentation.
  • the screen information in the media stream includes the presentation displayed by the speaker. Therefore, the video stream uploaded to the conference control server 101 includes the presentation.
  • the conference control server 101 can directly determine the conference data according to the video stream.
  • the presentation used by the speaker during the speech can be uploaded to the conference control server 101 in the form of auxiliary streams.
  • the conference control server 101 may generate conference data according to the collected video stream and the auxiliary stream.
  • the conference data may include collected video streams and auxiliary streams.
  • the conference data may include a processed video obtained after processing the collected video stream and the auxiliary stream. Processing the captured video stream can be a transcoding operation on the captured video stream, for example, the resolution of the video stream can be reduced, so as to reduce the amount of data required to transmit the video stream to other conference terminals.
  • the conference data may include a collected video stream, a video stream with a resolution different from the collected video stream, and the auxiliary stream.
  • the conference terminal can select an appropriate video stream according to the network conditions and/or the way to access the conference. For example, if the network of the conference terminal is good or Wi-Fi is used to access the conference, a video stream with a higher resolution can be selected so that the audience can see a clearer picture. For another example, if the network condition of the conference terminal is poor, a video stream with a lower resolution can be selected, which can reduce the interruption of the conference live broadcast caused by the bad network condition. For another example, if the conference terminal uses a mobile network to access the conference, a video stream with a lower resolution can be selected, which can reduce data consumption.
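  • the stream-selection heuristic described above can be sketched as follows. The labels, resolutions, and the two boolean network hints are illustrative assumptions, not part of the application.

```python
def pick_stream(streams, network_good, on_mobile):
    """Choose a video stream by resolution given coarse network hints.

    `streams` maps a label to vertical resolution, e.g.
    {"hd": 1080, "sd": 480}.  A good network without a mobile
    connection -> highest resolution for a clearer picture; a poor
    network or a mobile connection -> lowest resolution, to limit
    playback interruptions and data consumption.
    """
    if network_good and not on_mobile:
        return max(streams, key=streams.get)
    return min(streams, key=streams.get)
```

  • for example, a terminal on good Wi-Fi would pick the 1080-line stream, while the same terminal on a mobile network would fall back to the 480-line stream.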
  • the conference data may also include subtitles corresponding to the speaker's speech.
  • the subtitles can be generated by voice-to-text conversion based on voice recognition technology, or they can be manually recorded by the speakers, or they can be generated based on voice-text conversion combined with manual modification. of.
  • FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by this application can be applied.
  • FIG. 2 shows a distance education system, which includes a course server 201, a main device 211, a client device 212, and a client device 213.
  • the main device 211 can upload the collected media stream to the course server 201.
  • the course server 201 can generate course data according to the media stream and send the course data to the client device 212 and the client device 213, and the client device 212 and the client device 213 can present the course content on the display device according to the received course data.
  • the main device 211 may be a notebook computer or a desktop computer.
  • the client device 212 and the client device 213 may be notebook computers, desktop computers, tablet computers, smart phones, and so on.
  • the teacher in charge of the lecture uses the presentation during the lecture.
  • the media stream may be an audio stream of the teacher's lecture.
  • the presentation used by the teacher during the lecture can be uploaded to the course server 201 through auxiliary streams.
  • the course server 201 generates course data according to the received audio stream and auxiliary stream.
  • the media stream may be a video stream of the teacher during the lecture.
  • the media stream can include both the audio information and the picture information of the teacher during the lecture.
  • the media stream uploaded to the course server 201 is the video stream.
  • the teacher uses a presentation during the lecture and uses an output device (such as a projector, a television, etc.) to show the presentation.
  • the picture information in the media stream includes the presentation presented by the teacher. Therefore, the presentation is included in the video stream uploaded to the course server 201.
  • the course server 201 can directly determine the course data according to the video stream.
  • the presentation used by the teacher during the lecture can be uploaded to the course server 201 by way of auxiliary streams.
  • the course server 201 can generate course data according to the collected video stream and the auxiliary stream.
  • the specific content of the course data is similar to the specific content of the conference data, and the details are not repeated here for brevity.
  • Fig. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
  • the method shown in FIG. 3 can be executed by a video segmentation device.
  • the video segmentation device may be a computer device that can implement the method provided in the embodiments of this application, such as a personal computer, a notebook computer, a tablet computer, or a server; it may also be internal hardware of such a computer device, such as a graphics card or a graphics processing unit (GPU), or a dedicated device for implementing the method provided in the embodiments of this application.
  • the video segmentation device may be the conference control server 101 in the system shown in FIG. 1.
  • the video segmentation device may be a conference terminal uploading a media stream in the system shown in FIG. 1 or a piece of hardware in the conference terminal.
  • the video segmentation apparatus may be the main device 211 in the system shown in FIG. 2 or a piece of hardware provided in the main device 211.
  • the video segmentation device may be the course server 201 or a piece of hardware in the course server 201 in the embodiment shown in FIG. 2.
  • the video segmentation device acquires text information of a to-be-processed video and voice information of the to-be-processed video, where the text information includes at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video.
  • the presentation refers to the presentation presented by the speaker of the conference during the speech.
  • the embodiment of the present application does not limit the file format of the presentation; any file displayed on the display device during the speaker's speech may serve as the presentation.
  • the presentation may be in ppt format or pptx format.
  • the presentation can be a PDF format.
  • the presentation may also be in word format or txt format.
  • the content description information is information, uploaded by the speaker or the host of the meeting before the meeting starts, that describes the content of the speech.
  • the content description information includes an outline, abstract, and/or key information of the speaker's speech content in the video conference.
  • the content description information may include keywords of the speaker's speech content.
  • the content description information may include a summary of the content of the speaker's speech.
  • the content of the speaker's speech may include multiple parts, and the content description information may include the subject, abstract, and/or keywords of each part of the multiple parts.
  • the voice information may include the text obtained by performing voice-to-text conversion on the speaker's speech.
  • the embodiment of the present application does not limit the specific implementation of the voice-text conversion, as long as the recognized voice can be converted into the corresponding text.
  • the voice information may also include at least one pause point obtained by performing voice recognition on the speaker's speech. The pause point represents the natural pause of the speaker in the process of speaking.
  • the video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information.
  • the text information may include at least one of the presentation and the content description information.
  • the text information can have the following three situations:
  • Case 1 The text information only includes the presentation
  • Case 2 The text information only includes the content description information
  • Case 3 The text information includes both the presentation and the content description information.
  • in some cases, the speaker may only show the presentation during the speech without uploading the content description information in advance; therefore, the above case 1 may occur. In other cases, the speaker may only upload the content description information in advance without showing the presentation during the speech; therefore, the above case 2 may occur. In still other cases, the speaker may both show the presentation during the speech and upload the content description information in advance; therefore, the above case 3 may occur.
  • the video segmentation device may determine the segmentation point of the video to be processed according to the presentation.
  • the video segmentation apparatus may determine the segmentation point of the video to be processed according to the content description information.
  • the video segmentation device may determine the segmentation point of the video to be processed according to one of the presentation and the content description information.
  • the video segmentation device can determine the segmentation point of the video to be processed according to the presentation or the content description information.
  • the video segmentation device may determine the presentation duration of the current page of the presentation, and based on that presentation duration, determine whether to determine the segmentation point of the video to be processed according to the presentation or according to the content description information.
  • the video segmentation device may determine the segmentation point of the video to be processed based on the content description information and the voice information when the presentation duration of the current page of the presentation is longer than the first preset duration. In this way, it is possible to avoid a video segment being too long because the speaker shows the same content for a long time.
  • the first preset duration can be set as required.
  • the first preset duration may be 10 minutes.
  • the first preset duration may be 15 minutes.
  • the video segmentation device may determine the segmentation point of the video to be processed according to the content description information and the voice information when the presentation duration of the current page of the presentation is less than or equal to the second preset duration.
  • the second preset duration can be set as needed.
  • the second preset duration may be 20 seconds.
  • the second preset duration may be 10 seconds.
  • the first preset duration is greater than the second preset duration.
  • the video segmentation device may, in the case where the presentation duration of the current page of the presentation is greater than the second preset duration and less than or equal to the first preset duration, according to the presentation The document and the voice information determine the segmentation point of the video to be processed.
  • only the first preset duration may be set. If the presentation duration of the presentation on the current page is greater than the first preset duration, the segmentation point of the video to be processed is determined according to the content description information and the voice information. If the presentation duration of the current page of the presentation is not greater than the first preset duration, the segmentation point of the video to be processed may be determined according to the presentation and the voice information. The presentation duration of the current page of the presentation is the duration of the presentation staying on the current page.
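The duration-based choice described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name and the example threshold values (15 minutes and 20 seconds, taken from the examples in this application) are hypothetical.

```python
def choose_basis(page_duration_s, first_preset=15 * 60, second_preset=20):
    """Pick which information drives segmentation for the current page."""
    if page_duration_s > first_preset:
        # The speaker lingered too long on one page: segment using the
        # content description information and the voice information.
        return "content_description"
    if page_duration_s <= second_preset:
        # The page was flipped through too quickly: also fall back to the
        # content description information and the voice information.
        return "content_description"
    # Normal dwell time: segment using the presentation and voice information.
    return "presentation"
```

With only the first preset duration configured, the second branch would simply be dropped.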
  • the start moment of the presentation duration of the current page of the presentation is the moment when the presentation is switched to the current page
  • the end moment of the presentation duration of the current page of the presentation is the moment when the presentation is switched from the current page to another page.
  • the video segmentation device may start timing from time T1. If the timing duration exceeds the first preset duration and the presentation has not been switched to page n+1, the video segmentation device may determine the segmentation point of the to-be-processed video based on the content description information and the voice information. If the presentation is switched to page n+1 at time T2 (T2 is greater than T1), and the duration from time T1 to time T2 is less than or equal to the second preset duration, the video segmentation device can determine the segmentation point of the video to be processed according to the content description information and the voice information.
  • the video segmentation device may determine the segmentation point of the to-be-processed video based on the presentation and the voice information. More specifically, the video segmentation device can determine the segmentation point of the video to be processed according to the nth page of the presentation and the voice information.
  • the start time of the presentation duration of the current page of the presentation may be the previous segment point
  • the end time of the presentation duration of the current page of the presentation is the moment when the presentation is switched from the current page to another page.
  • n is a positive integer equal to or greater than 1
  • the video segmentation device determines, according to the content description information and the voice information, that time T4 is a segmentation point of the video to be processed.
  • the video segmentation device may start timing from time T4. If the timing duration exceeds the first preset duration and the presentation has not been switched to page n+1, the video segmentation device may determine the segmentation point of the to-be-processed video based on the content description information and the voice information.
  • the video segmentation device can determine the segmentation point of the video to be processed according to the presentation and the voice information. More specifically, the video segmentation device can determine the segmentation point of the video to be processed according to the presentation on the nth page and the voice information.
  • the video segmentation device may determine the segmentation point of the to-be-processed video based on the presentation and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation device may refer only to the presentation and the voice information (that is, the content description information is not used) to determine the segmentation point of the video to be processed.
  • the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information.
  • the video segmentation device may refer only to the content description information and the voice information (that is, the presentation is not used) to determine the segmentation point of the video to be processed.
  • the video segmentation device determining the segmentation point of the video to be processed according to the presentation and the voice information may include: the video segmentation device determines the switching point of the presentation, where the content of the presentation differs before and after the switching point; the video segmentation device determines at least one pause point based on the voice information; and the video segmentation device determines the segmentation point based on the switching point and the at least one pause point.
  • the switching point of the presentation refers to the moment when the presentation is switched.
  • the switching of the presentation can refer to the page turning of the presentation. For example, switch from page 1 to page 2.
  • the switching of the presentation can also mean that the content of the presentation changes without turning pages.
  • the speaker may only show a part (for example, the upper half) of a certain page of the presentation, and then scroll to the remaining part (for example, the lower half) of the page.
  • although the presentation does not turn a page at this time, the content in the presentation has changed.
  • the video segmentation device may obtain a switching signal for instructing to switch the content of the presentation.
  • the video segmentation device may determine that the moment when the switching signal is acquired is the switching point.
  • the video segmentation device may obtain the content of the presentation.
  • the video segmentation device may determine the switching point according to the change of the content of the presentation. For example, the video segmentation device may determine that the first moment is the switching point when it determines that the content of the presentation presented at the first moment of the video to be processed is different from the content of the presentation presented at the second moment.
  • the first moment and the second moment are adjacent moments, and the first moment is before the second moment.
  • the first time is before the second time and the interval between the first time and the second time is less than a preset time length. In other words, in this case, the video segmentation device can detect whether the content presented by the presentation has changed every period of time.
  • the video segmentation device may determine the switching point in combination with the acquired switching signal for instructing to switch the content of the presentation and the content presented by the presentation.
  • the video segmentation device acquires the switching signal at time T1.
  • the video segmentation device may obtain the content presented by the presentation in the F1 frames before time T1 and in the F2 frames after time T1, where F1 and F2 are positive integers greater than or equal to 1. F1 and F2 may take smaller values; for example, F1 and F2 may both be equal to 2, which can reduce the amount of calculation. If the content presented by the presentation differs between two consecutive frames among the F1 and F2 frames, the moment of the frame where the content changes may be determined as the switching point.
  • the video segmentation device may determine whether the content presented by the presentation at different moments (or in different frames) is the same in the following manner: the video segmentation device counts the number P of positions at which the change of the pixel value at the same position in the presentation at different moments (or in different frames) exceeds a preset change value. If P is greater than a first preset threshold P1, the video segmentation device determines that the content presented by the presentation has changed.
  • the change in the pixel value can be determined by calculating the absolute value of the difference between the pixel gray values.
  • the change of the pixel value can be determined by calculating the sum of the absolute values of the differences in the three color channels.
  • the video segmentation device may determine keywords based on subsequent presentations. For example, the video segmentation device determines that, between the presentation at time T1 and the presentation at time T2 (time T2 is later than time T1), the number of positions at which the pixel value change at the same position exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation device may determine keywords according to the presentation at time T2.
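The pixel-change comparison above can be sketched as follows. This is a minimal illustration under stated assumptions: frames are modeled as flat sequences of grayscale values of equal length, and the change threshold and P1 are hypothetical parameters.

```python
def content_changed(frame_a, frame_b, change_thresh, p1):
    """Return True when the number P of co-located pixels whose grayscale
    values differ by more than change_thresh exceeds the threshold P1."""
    p = sum(1 for a, b in zip(frame_a, frame_b) if abs(a - b) > change_thresh)
    return p > p1
```

For color frames, the absolute grayscale difference could be replaced by the sum of absolute differences over the three color channels, as the text also suggests.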
  • the voice information may also include at least one pause point.
  • the at least one pause point used to determine the segment point may be all the pause points from the start time to the current time. If the segment point determined in step 302 is the first segment point of the to-be-processed video, the start time is the start time of the to-be-processed video. If the segment point determined in step 302 is the kth segment point of the to-be-processed video (k is a positive integer greater than or equal to 2), the start time is the moment at which the (k-1)th segment point is located.
  • the video segmentation device may also determine a pause point within a time range determined according to the moment at which the switching point is located, where the switching point falls within this time range. For example, if the switching point is at time T1, the video segmentation device can determine the pause point from time T1-t to time T1+t.
  • the video segmentation device determines that the switching point is the segmentation point when it determines that the switching point is the same as one of the at least one pause point. When it determines that the switching point is not the same as any one of the at least one pause point, the video segmentation device determines that the pause point closest to the switching point among the at least one pause point is the segmentation point.
  • the distance between a pause point and the switching point refers to the time difference between the pause point and the switching point. For example, assuming that the switching point is at time T1 and one pause point of the at least one pause point is at time T2, the distance is the difference t between T2 and T1.
  • the pause point at time T2 is the segment point. If the distances from two of the at least one pause point to the switching point are equal and less than the distances from the switching point to the other pause points, then either of the two pause points can be determined as the segment point.
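The alignment of the switching point with the nearest pause point can be sketched as follows; times are modeled as plain numbers, and a tie between two equally close pause points is resolved arbitrarily, since either candidate is acceptable per the text.

```python
def pick_segment_point(switch_t, pause_points):
    """Return the switching point itself when it coincides with a pause
    point; otherwise return the pause point closest in time to it."""
    if switch_t in pause_points:
        return switch_t
    return min(pause_points, key=lambda p: abs(p - switch_t))
```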
  • the video segmentation device determining the segmentation point of the video to be processed according to the content description information and the voice information may include: the video segmentation device determines the segmentation point of the video to be processed according to the voice information, the keywords of the content description information, and the pause points in the voice information.
  • the to-be-processed video may be divided into multiple voice information segments.
  • the first voice information segment and the second voice information segment are two consecutive voice information segments among the multiple voice information segments.
  • the first voice information segment follows the second voice information segment.
  • the video segmentation device can determine a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the first segmentation point is one of at least one segmentation point included in the to-be-processed video.
  • the video segmentation device can intercept text fields from the voice information using a window of length W and a step size S.
  • the video segmentation device can cut out at least one text field of length W.
  • Each text field of length W is a piece of voice information.
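The window-based interception above can be sketched as follows. This is a minimal sketch assuming W and S are measured in words; the text does not fix the unit, so this is an assumption.

```python
def text_windows(words, w, s):
    """Slice the recognized transcript into voice information segments
    using a window of length w words advanced by a step of s words.
    Transcripts shorter than w yield a single truncated window."""
    return [words[i:i + w] for i in range(0, max(len(words) - w + 1, 1), s)]
```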
  • the video segmentation device can determine whether the first voice information segment is similar to the second voice information segment. If the first voice information segment is not similar to the second voice information segment, it can be determined that a segment point of the to-be-processed video is near the first voice information segment. If the second voice information segment is similar to the first voice information segment, the device continues to determine whether a third voice information segment, which is adjacent to the first voice information segment and located after it, is similar to the first voice information segment.
  • the similarity degree can be used as a criterion for measuring whether the first voice information segment is similar to the second voice information segment. If the similarity between the first voice information segment and the second voice information segment is greater than or equal to a similarity threshold, the two segments can be considered similar; if the similarity is less than the similarity threshold, they can be considered not similar.
  • the video segmentation device may determine the similarity between the first voice information segment and the second voice information segment based on the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information.
  • the video segmentation device can determine the keywords of the first voice information segment. Assuming that the number of keywords determined from the first voice information segment is N, the number of keywords determined from the content description information is M, and there are no duplicate keywords among the M keywords and the N keywords.
  • the video segmentation device can determine keywords according to the following methods:
  • Step 1 According to a preset stop word list or according to the part of speech of each word in the text, remove words that do not carry actual meaning, such as "this", "then", and similar filler words. Stop words are manually specified characters or words rather than automatically generated ones; they carry no actual meaning and are filtered out before or after natural language data is processed. A set of stop words can be called a stop word list.
  • Step 2 Count the frequency of each of the remaining words in the text.
  • the frequency of each word in the text can be determined according to the following formula:
  • TF(n) = N(n) / All_N (formula 1.1)
  • where TF(n) represents the frequency in the text of the nth word among the words remaining after step 1, N(n) represents the number of times the nth word appears, and All_N represents the total number of remaining words.
  • Step 3 Determine at least one word with the highest frequency as a keyword of the text.
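Steps 1 to 3 can be sketched as follows. The stop-word list here is a toy example, and ranking by raw count is equivalent to ranking by the frequency TF(n) of formula 1.1 because the denominator All_N is shared by all words.

```python
from collections import Counter

STOP_WORDS = {"the", "this", "then", "to", "a"}  # toy stop-word list

def top_keywords(words, k):
    """Step 1: drop stop words; step 2: count occurrences; step 3: return
    the k most frequent remaining words as keywords of the text."""
    remaining = [w for w in words if w not in STOP_WORDS]
    counts = Counter(remaining)
    return [w for w, _ in counts.most_common(k)]
```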
  • the text is content description information
  • the M words with the highest appearance frequency are keywords of the content description information, where M is a positive integer greater than or equal to 1.
  • the text is the first voice information segment
  • the N words with the highest occurrence frequency are keywords of the first voice information segment
  • N is a positive integer greater than or equal to 1.
  • if it is determined that one or more of the N words of the first voice information segment are the same as keywords of the content description information, the repeated words are deleted from the N words, and subsequent words are selected as keywords of the first voice information segment. For example, assuming that N is equal to 2 and M is equal to 1, the keywords of the content description information include "student".
  • assuming the word with the highest frequency in the first voice information segment is "student", the device continues to determine the word with the second highest frequency. If the word with the second highest frequency is "school", it can be determined that "school" is a keyword of the first voice information segment, and the device continues to determine the word with the third highest frequency. Assuming that the word with the third highest frequency is "course", it can be determined that "course" is another keyword of the first voice information segment. If the text is the second voice information segment, it can be determined that the N words with the highest frequency are keywords of the second voice information segment, where N is a positive integer greater than or equal to 1. If one or more of the N words of the second voice information segment are the same as keywords of the content description information, the repeated words are deleted from the N words, and subsequent words are selected as keywords of the second voice information segment.
  • the video segmentation device may determine the first keyword vector based on the keywords of the first voice information segment, the keywords of the content description information, and the content of the first voice information segment. Specifically, the video segmentation device may determine the frequencies of the keywords of the first voice information segment and the keywords of the content description information in the content of the first voice information segment; these frequencies constitute the first keyword vector.
  • the content of the voice information fragment refers to all the words included in the voice information fragment. For example, suppose that the keyword of the content description information is "student", and the keywords of the first voice information fragment are "course" and "school”. Assuming that the frequencies of the above three keywords in the first voice information segment are 0.1, 0.2, and 0.3, respectively, the first keyword vector is (0.3, 0.2, 0.1).
  • the video segmentation device may also determine the second keyword vector based on the keywords of the second voice information segment, the keywords of the content description information, and the content of the second voice information segment. Specifically, the video segmentation device may determine the frequencies of the keywords of the second voice information segment and the keywords of the content description information in the content of the second voice information segment; these frequencies constitute the second keyword vector. For example, suppose that the keyword of the content description information is "student", and the keywords of the second voice information segment are "breakfast" and "nutrition". Assuming that the frequencies of the above three keywords appearing in the second voice information segment are 0.3, 0.25, and 0.05, respectively, the second keyword vector is (0.3, 0.25, 0.05).
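The keyword-vector construction can be sketched as follows; the word lists and keyword order are illustrative, not fixed by this application.

```python
def keyword_vector(keywords, segment_words):
    """Frequency of each keyword within the segment's word list, in the
    given keyword order; these frequencies form the keyword vector."""
    total = len(segment_words)
    return [segment_words.count(k) / total for k in keywords]
```

For instance, a ten-word segment containing "school" three times, "course" twice, and "student" once yields the vector (0.3, 0.2, 0.1) for the keywords ("school", "course", "student"), matching the example above.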
  • the video segmentation device determines the distance based on the first keyword vector and the second keyword vector. If the distance is greater than a preset distance, the similarity between the first voice information segment and the second voice information segment can be considered less than the similarity threshold. In this case, the video segmentation device determines the segmentation point according to the first voice information segment.
  • the video segmentation device may determine a distance based on the first keyword vector and the second keyword vector in the following manner:
  • Step 1 Expand the first keyword vector into a first vector, and expand the second keyword vector into a second vector, wherein the keywords corresponding to the first vector and the keywords corresponding to the second vector each include all the keywords corresponding to the first keyword vector and the second keyword vector
  • the first keyword vector is (0.3,0.2,0.1)
  • the corresponding keywords are “school”, “course” and “student”
  • the second keyword vector is (0.3,0.25,0.05)
  • the corresponding keywords are “student”, “breakfast” and "nutrition”.
  • the first vector is (0.3,0.1,0,0.2,0)
  • the corresponding keywords are "school”, “student”, “breakfast”, “course”, “nutrition”
  • the second vector is (0,0.3,0.25,0,0.05)
  • the corresponding keywords are "school”, “student”, “breakfast”, “course”, and “nutrition”.
  • Step 2 Calculate the distance between the first vector and the second vector.
  • the distance between the first vector and the second vector is the distance determined according to the first keyword vector and the second keyword vector.
  • the distance between the first vector and the second vector may be the Euclidean distance. Because the two voice information segments may share few identical keywords, using the cosine distance could yield many zero values in the calculation. Therefore, the Euclidean distance may be a more appropriate choice for the distance between the first vector and the second vector.
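Steps 1 and 2 can be sketched together as follows: both keyword vectors are expanded onto the union of their keywords (zeros filled in for absent keywords), and the Euclidean distance is taken, as the text prefers over the cosine distance. Sorting the union alphabetically is an arbitrary but consistent ordering choice.

```python
import math

def expanded_distance(vec_a, keys_a, vec_b, keys_b):
    """Expand two keyword vectors onto the union of their keywords and
    return the Euclidean distance between the expanded vectors."""
    union = sorted(set(keys_a) | set(keys_b))
    fa = dict(zip(keys_a, vec_a))
    fb = dict(zip(keys_b, vec_b))
    first = [fa.get(k, 0.0) for k in union]
    second = [fb.get(k, 0.0) for k in union]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(first, second)))
```

Using the document's example vectors (0.3, 0.2, 0.1) for ("school", "course", "student") and (0.3, 0.25, 0.05) for ("student", "breakfast", "nutrition"), the squared distance works out to 0.235.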
  • the distance between the first vector and the second vector may be a cosine distance.
  • the first keyword vector and the second keyword vector may also be term frequency-inverse document frequency, binary term frequency, etc.
  • Determining the distance between the first keyword vector and the second keyword vector may be determining the n-norm distance of the first keyword vector and the second keyword vector (n is a positive integer greater than or equal to 1), or determining the relative entropy distance between the first keyword vector and the second keyword vector.
  • the first vector and the second vector are binarized.
  • the first vector after the binarization process is (1,1,0,1,0)
  • the second vector after the binarization process is (0,1,1,0,1).
  • the 1-norm distance is calculated to obtain the repetition degree of the keywords of the first voice information segment and the keywords of the second voice information segment.
  • the repetition of keywords can be considered a special form of distance.
  • the degree of repetition of the keywords can be used to determine whether the first voice information segment is similar to the second voice information segment.
  • If the repetition degree of the keywords is greater than or equal to a preset repetition degree, it can be considered that the first voice information segment is similar to the second voice information segment; if the repetition degree of the keywords is less than the preset repetition degree, it can be considered that the first voice information segment and the second voice information segment are not similar. It can be seen that in this case, the preset degree of repetition can be regarded as the similarity threshold.
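The binarized 1-norm computation can be sketched as follows. Interpreting a smaller 1-norm distance as a higher keyword repetition degree is an assumption consistent with the text, which treats the repetition degree as a special form of distance.

```python
def keyword_l1(keys_a, keys_b):
    """Binarize both keyword sets over their union (1 if present, 0 if
    absent) and return the 1-norm distance: the number of keywords that
    appear in exactly one of the two segments."""
    union = set(keys_a) | set(keys_b)
    return sum(abs(int(k in keys_a) - int(k in keys_b)) for k in union)
```

For the binarized example vectors (1,1,0,1,0) and (0,1,1,0,1) above, four positions differ, giving a 1-norm distance of 4 with only one shared keyword ("student").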
  • the extraction of keywords may also be determined according to the term frequency-inverse document frequency.
  • the word frequency can be determined based on formula 1.1.
  • the inverse document frequency can be determined according to the following formula:
  • IDF(n) = log(Num_Doc/(Doc(n)+1)) (formula 1.2)
  • IDF(n) represents the inverse document frequency of the nth word
  • Num_Doc represents the total number of documents in the corpus
  • Doc(n) represents the number of documents containing the nth word in the corpus.
  • Term frequency-inverse document frequency can be determined according to the following formula:
  • TF-IDF(n) = TF(n) × IDF(n)
  • where TF-IDF(n) represents the term frequency-inverse document frequency of the nth word. If the keywords are determined based on the term frequency-inverse document frequency, the first keyword vector is composed of the term frequency-inverse document frequencies of the keywords.
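Formulas 1.1 and 1.2 combine into the TF-IDF weight; a minimal sketch follows, where the documents and corpus are toy examples and words are pre-tokenized lists.

```python
import math

def tf_idf(word, doc_words, corpus_docs):
    """TF-IDF of a word: TF(n) = N(n)/All_N (formula 1.1) multiplied by
    IDF(n) = log(Num_Doc/(Doc(n)+1)) (formula 1.2)."""
    tf = doc_words.count(word) / len(doc_words)
    doc_n = sum(1 for d in corpus_docs if word in d)  # Doc(n)
    idf = math.log(len(corpus_docs) / (doc_n + 1))
    return tf * idf
```

The +1 in the IDF denominator matches formula 1.2 and avoids division by zero for words absent from the corpus.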
  • the extraction of keywords may also be based on a text ranking (TextRank) method of word maps. If the keyword is determined according to the TextRank based on the word graph, the first keyword vector may be composed of the weight of the word.
  • the video segmentation device may determine the segmentation point according to the first voice information segment.
  • the video segmentation device may first determine whether a pause point is included in the first voice information segment. If the first voice information segment includes one pause point, it can be determined that the pause point is the segment point. If the first voice information segment includes multiple pause points, it can be determined whether the word after each of the multiple pause points is a preset word.
  • the preset words include conjunctions with segmentation meaning, such as "next", "below", "next point", and so on.
  • the word after the pause refers to the word adjacent to the pause after the pause. If only one word after one of the multiple pause points is a preset word, it can be determined that the pause point is a segmentation point.
  • If the word following at least two of the multiple pause points is a preset word, it can be determined that the pause point with the longest pause duration among those at least two pause points is the segment point. If none of the words following the multiple pause points is a preset word, it can be determined that the pause point with the longest pause duration among the multiple pause points is the segment point. If the first voice information segment does not include a pause point, the segment point may be determined according to the pause points adjacent to the first voice information segment. It can be understood that there may be two pause points adjacent to the first voice information segment, one located before the first voice information segment and the other located after it. The video segmentation device may determine the segmentation point according to the distances between these two pause points and the first voice information segment.
  • the distance between the pause point and the first voice information segment may be the number of words or the time difference between the pause point and the start position of the first voice information segment. If the pause point is after the first voice information segment, the distance between the pause point and the first voice information segment may be the number of words or the time difference between the pause point and the end position of the first voice information segment.
  • the pause point located before the first voice information segment and adjacent to it is referred to as the pre-pause point, and the distance between the pre-pause point and the first voice information segment is referred to as distance 1;
  • the pause point located after the first voice information segment and adjacent to it is called the post-pause point, and the distance between the post-pause point and the first voice information segment is called distance 2. If distance 1 is less than distance 2, the pre-pause point can be determined to be the segment point; if distance 1 is greater than distance 2, the post-pause point can be determined to be the segment point.
  • word 1: the word after the pre-pause point
  • word 2: the word after the post-pause point
  • the pause point is a natural pause of the speaker; therefore, a pause point spans a certain duration.
  • the intermediate moment of the pause point may be determined as the segment point.
  • the end time of the pause point may be determined as the segment point.
  • the starting moment of the pause point may be determined as the segment point.
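The pause-point rules above (a single pause point inside the segment wins outright; exactly one pause point followed by a preset word wins; otherwise the longest pause among the candidates wins; with no pause point in the segment, the nearer adjacent pause point wins) can be sketched as below. The dictionary layout of a pause point is an assumption for illustration, not part of the patent:

```python
def choose_pause_point(pauses_in_segment, preset_words,
                       pre_pause=None, post_pause=None,
                       dist_pre=None, dist_post=None):
    """Pick the segmentation pause point following the rules in the text.

    Each pause point is a dict (illustrative layout):
      {"t": time, "duration": pause length, "next_word": word right after it}.
    """
    if pauses_in_segment:
        if len(pauses_in_segment) == 1:
            # a single pause point inside the segment is the segment point
            return pauses_in_segment[0]
        flagged = [p for p in pauses_in_segment if p["next_word"] in preset_words]
        if len(flagged) == 1:
            # exactly one pause point is followed by a preset word
            return flagged[0]
        # two or more flagged pause points (or none): longest pause wins
        candidates = flagged if flagged else pauses_in_segment
        return max(candidates, key=lambda p: p["duration"])
    # no pause point inside the segment: take the nearer adjacent pause point
    return pre_pause if dist_pre < dist_post else post_pause
```

The returned pause point can then be mapped to its start, middle, or end moment as described above.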
  • the video segmentation device segments the to-be-processed video according to the segmentation point.
  • If the segment point is the first segment point, the start time of the segment is the start time of the video to be processed, and the end time of the segment is the segment point. If the segment point is the k-th segment point of the video to be processed (k is a positive integer greater than or equal to 2), then the start time of the segment is the (k-1)-th segment point, and the end time of the segment is the k-th segment point.
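The mapping from segment points to segments described above can be sketched as follows. How the final segment (from the last segment point to the end of the video) is closed is not specified in this passage, so the sketch only emits the segments bounded by segment points:

```python
def build_segments(video_start, segment_points):
    """Turn an ordered list of segment points into (start, end) intervals.

    The first segment starts at the video start and ends at the first segment
    point; segment k (k >= 2) starts at segment point k-1 and ends at
    segment point k, as described in the text.
    """
    segments = []
    start = video_start
    for point in segment_points:
        segments.append((start, point))
        start = point
    return segments
```

For example, segment points at 10 s, 25 s, and 40 s yield the intervals (0, 10), (10, 25), and (25, 40).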
  • the video segmentation device can also determine the summary of the segment.
  • the video segmentation device may determine a summary of the segment according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text.
  • the target text includes at least one of the presentation and the content description information.
  • the video segmentation device may first determine a third keyword vector, and then determine the segment summary according to the third keyword vector.
  • the video segmentation device can determine the third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text, where the content of the segmented voice information refers to all the sentences that compose the voice information of the segment.
  • If the text information only includes the presentation, the target text includes the presentation; if the text information only includes the content description information, the target text includes the content description information; if the text information includes both the presentation and the content description information, the target text includes the presentation and the content description information.
  • The manner in which the video segmentation device determines the keywords of the segmented voice information, the manner in which it determines the keywords of the target text, and the manner in which it determines the keywords of the first voice information segment are similar.
  • the video segmentation device can determine the keywords of the target text according to the later presentation. For example, the video segmentation device determines that, between the presentation at time T1 and the presentation at time T2 (time T2 being later than time T1), the number of pixels at the same position whose value change exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation device may determine the keywords of the target text based on the presentation at time T2.
  • the number of keywords determined from the presentation is L
  • the number of keywords determined from the content description information is M
  • the number of keywords determined from the segmented voice information is Q
  • there are no duplicate keywords among the L keywords, the M keywords, and the Q keywords.
  • the video segmentation device may first determine M keywords from the content description information, and then determine the L words that appear most frequently in the presentation. If one or more of the L words also belong to the M keywords, those words are deleted from the L words, and the next most frequent word is then determined from the presentation, until the determined L keywords have no intersection with the M keywords. After that, the video segmentation device determines Q words from the segmented voice information. If one or more of the Q words belong to the M keywords or the L keywords, those words are deleted from the Q words, and the next most frequent word is then determined from the segmented voice information, until the determined Q keywords overlap with neither the L keywords nor the M keywords.
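The selection procedure above — M keywords from the content description information first, then L from the presentation excluding those, then Q from the segmented voice information excluding both — can be sketched as follows, using raw word frequency as the ranking criterion (one of the options the document allows). The function names are illustrative:

```python
from collections import Counter

def top_keywords(words, k, exclude=frozenset()):
    """Pick the k most frequent words that were not already chosen elsewhere."""
    counts = Counter(w for w in words if w not in exclude)
    return [w for w, _ in counts.most_common(k)]

def select_disjoint_keywords(description_words, presentation_words, speech_words, m, l, q):
    """M, L, Q keyword sets with no duplicates across the three sources."""
    m_kw = top_keywords(description_words, m)
    l_kw = top_keywords(presentation_words, l, exclude=set(m_kw))
    q_kw = top_keywords(speech_words, q, exclude=set(m_kw) | set(l_kw))
    return m_kw, l_kw, q_kw
```

Excluding earlier picks before counting is equivalent to the "delete and take the next most frequent word" loop in the text.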
  • the third keyword vector includes the frequencies of the Q keywords, the L keywords, and the M keywords in the segmented voice information. It is understandable that if the target text does not include the content description information, the value of M is 0; if the target text does not include the presentation, the value of L is 0.
  • the video segmentation device may determine the summary of the segment according to the determined third keyword vector.
  • the video segmentation device may determine the reference text according to the target text and the content of the segmented voice information, where the reference text includes J sentences, and J is a positive integer greater than or equal to 1; determine J keyword vectors according to the keywords of the segmented voice information, the keywords of the target text, and each sentence in the J sentences; and determine the summary of the segment according to the third keyword vector and the J keyword vectors.
  • the j-th keyword vector in the J keyword vectors is the frequency of occurrence of the keywords of the segmented voice information and the keywords of the target text in the j-th sentence.
  • If the target text includes redundant sentences, the redundant sentences are deleted before the target text is combined with the content of the segmented voice information to obtain the reference text; if the target text does not include redundant sentences, the target text is directly combined with the content of the segmented voice information to obtain the reference text.
  • one or more sentences in the presentation may also appear in the content description information.
  • In this case, the video segmentation device may delete the one or more sentences in the presentation that are the same as those in the content description information, and then merge the presentation with the redundant sentences deleted, the content description information, and the content of the segmented voice information to obtain the reference text.
  • If the target text does not include redundant sentences, for example, no sentence in the presentation appears in the content description information, or the target text includes only one of the presentation and the content description information, then the target text can be directly combined with the content of the segmented voice information to obtain the reference text.
  • the video segmentation device determines the summary of the segment according to the third keyword vector and the J keyword vectors, including: the video segmentation device determines J distances according to the third keyword vector and the J keyword vectors, where the j-th distance in the J distances is determined according to the third keyword vector and the j-th keyword vector in the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determines the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determines the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
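The summary-extraction step above can be sketched as follows: compute the Euclidean distance from the third keyword vector to each of the J sentence vectors, and keep the sentences with the R shortest distances. Returning the chosen sentences in their original order is a presentational choice, not something the document specifies:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def extract_summary(third_vector, sentence_vectors, sentences, r):
    """Return the R sentences whose keyword vectors are closest to the segment vector."""
    distances = [(euclidean(third_vector, v), j) for j, v in enumerate(sentence_vectors)]
    closest = sorted(distances)[:r]
    # keep the selected sentences in their original order for readability
    return [sentences[j] for j in sorted(j for _, j in closest)]
```

A sentence whose vector equals the third keyword vector has distance 0 and is always selected first.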
  • The specific implementation manner in which the video segmentation device determines the j-th distance according to the third keyword vector and the j-th keyword vector is similar to the implementation in which it determines the distance according to the first keyword vector and the second keyword vector. The difference is: the j-th distance determined according to the third keyword vector and the j-th keyword vector is the Euclidean distance, whereas the distance determined according to the first keyword vector and the second keyword vector may be the Euclidean distance or the cosine distance.
  • The reason why the j-th distance determined by the third keyword vector and the j-th keyword vector cannot be the cosine distance is that computing the cosine distance normalizes the j-th keyword vector, but the modulus length of the j-th keyword vector reflects the overall frequency of the keywords in the j-th sentence, so it should not be normalized away.
  • In the above, the vectors are all composed of the frequencies (that is, the word frequencies) of the keywords in a specific text.
  • the aforementioned vectors may also be determined according to word vectors (word to vector, word2vec).
  • the first keyword vector can be determined by the following steps: use word2vec to determine the word vector of each keyword; add the word vectors of all keywords and average them to obtain the first keyword vector.
  • the second keyword vector and the first keyword vector are determined in a similar manner, so there is no need to repeat them here.
  • the third keyword vector can be determined by the following steps: use word2vec to determine the word vector of each keyword; determine the word frequency of each keyword; compute a frequency-weighted average of the word vectors of all keywords to obtain the third keyword vector.
  • the j-th keyword vector can be determined by the following steps: segment the j-th sentence into words and remove stop words; use word2vec to determine the word vector of each remaining word; add all the word vectors and average them to obtain the j-th keyword vector.
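The plain-average construction (for the first, second, and j-th keyword vectors) and the frequency-weighted average (for the third keyword vector) described above can be sketched as follows; the toy embedding table stands in for a real word2vec model:

```python
def average_vector(words, embeddings):
    """Plain mean of the word vectors of the given words."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def weighted_average_vector(words, freqs, embeddings):
    """Word-frequency-weighted mean of word vectors (third keyword vector)."""
    dim = len(next(iter(embeddings.values())))
    total = sum(freqs.get(w, 0) for w in words if w in embeddings)
    if total == 0:
        return [0.0] * dim
    out = [0.0] * dim
    for w in words:
        if w in embeddings:
            f = freqs.get(w, 0)
            for d in range(dim):
                out[d] += f * embeddings[w][d]
    return [x / total for x in out]
```

With a real model, `embeddings` would be replaced by lookups into a trained word2vec vocabulary.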
  • In this case, the distance between the third keyword vector and the j-th keyword vector may be the cosine distance.
  • Fig. 4 is a schematic diagram of a conference process provided according to an embodiment of the present application.
  • the conference terminal 1 transmits audio and video stream 1 to the conference control server.
  • the conference terminal 2 transmits the audio and video stream 2 to the conference control server.
  • the conference terminal 3 transmits the audio and video stream 3 to the conference control server.
  • the conference control server determines the main conference site.
  • the main conference site determined by the conference control server is the conference site where the conference terminal 1 is located.
  • the conference control server sends the conference data to the conference terminal 2 and the conference terminal 3.
  • the conference terminal 2 and the conference terminal 3 store conference data.
  • the conference control server may also send conference data to the conference terminal 1, and the conference terminal 1 may also store the conference data.
  • the conference control server segments the audio and video stream 1 in real time (that is, determines the segment point) and extracts a summary of each segment.
  • the conference control server sends the segment point and summary to the conference terminal 2 and the conference terminal 3. In this way, the conference terminal 2 and the conference terminal 3 can independently select the review point to play the review video. Of course, in some implementations, the conference control server may also send the segment point and summary to the conference terminal 1.
  • Fig. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
  • the video segmentation device determines whether the meeting reservation includes text related to the meeting content. In other words, the video segmentation device can determine whether the to-be-processed video includes content description information. If the result of the determination is yes (that is, the video to be processed includes content description information), step 502 is executed; if the result of the determination is no (that is, the video to be processed does not include content description information), then step 503 is executed.
  • the video segmentation device extracts keywords related to the content of the conference. In other words, the video segmentation device determines the keywords of the content description information.
  • After step 502 is executed, step 503 may be executed.
  • the video segmentation device determines whether there is a presentation displayed on a screen in the to-be-processed video. In other words, the video segmentation device can determine whether the to-be-processed video includes a presentation that is displayed on a screen. If the determination result is yes (that is, the to-be-processed video includes a presentation), step 504 is executed. If the determination result is no (that is, the to-be-processed video does not include a presentation), step 505 is executed. In step 504, the video segmentation device determines the position of the screen for displaying the presentation. After the video segmentation device determines the position of the screen, step 506 may be executed.
  • the video segmentation device determines whether there is a presentation transmitted through an auxiliary stream.
  • the conference speaker may not display the presentation on the screen, but upload the presentation to the conference control server through the auxiliary stream.
  • the conference terminal in the other conference site can obtain the presentation used by the conference speaker in the speech process according to the auxiliary stream. If the determination result is yes (that is, there is a presentation transmitted through the auxiliary stream), step 506 is executed. If the result of the determination is no (that is, there is no presentation transmitted through the auxiliary stream), the segmentation point of the to-be-processed video can be determined according to the voice information.
  • the video segmentation device determines whether the duration from the previous segment point to the current moment exceeds the first preset duration. If the video segmentation device determines that the duration from the previous segment point to the current moment is greater than the first preset duration (that is, the determination result is yes), step 507 is executed. If the video segmentation device determines that the duration from the previous segment point to the current moment is not greater than the first preset duration, step 508 is executed. It can be understood that, if the segment point determined by the video segmentation device is the first segment point, the previous segment point refers to the start time of the video to be processed. For ease of description, the length of time from the previous segment point to the current moment can be referred to as the presentation duration.
  • the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information.
  • the video segmentation device determines the segmentation point of the video to be processed according to the presentation and voice information.
  • For the specific implementation manner in which the video segmentation device determines the segmentation point of the to-be-processed video based on the presentation and the voice information, reference may be made to the embodiment shown in FIG. 3; it is not repeated here.
  • the video segmentation apparatus may perform step 509 and step 510 after determining the segmentation point of the video to be processed.
  • the video segmentation device determines segmented voice information and keywords of the segmented voice information, where the segmented voice information is the voice information between the previous segment point of the segment point and the segment point. It is understandable that if the segment point is the first segment point of the video to be processed, the segment voice information is the voice information from the start time of the video to be processed to the segment point.
  • the video segmentation device determines a segmented summary according to the segmented voice information, keywords of the segmented voice information, and keywords of the target text.
  • For steps 509 and 510, reference may be made to the embodiment shown in FIG. 3; details are not repeated here.
  • During the process of segmenting the video and extracting the summary, the video segmentation device can first determine whether there is a presentation displayed on the screen in the video to be processed, then determine whether text related to the meeting content is included in the meeting reservation, and finally determine whether a presentation is transmitted through the auxiliary stream.
  • Alternatively, the video segmentation device may first determine whether there is a presentation transmitted through the auxiliary stream, then determine whether text related to the meeting content is included in the meeting reservation, and finally determine whether there is a presentation displayed on the screen in the video to be processed.
  • How the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information will be described below in conjunction with FIG. 6.
  • how the video segmentation apparatus determines the segmentation point of the to-be-processed video according to the voice information can also refer to FIG. 6.
  • Fig. 6 is a schematic flowchart of a method for video segmentation according to an embodiment of the present application.
  • the video segmentation device continuously intercepts voice information segments on the voice information with the window length W and the step size S.
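Step 601 — intercepting voice information segments with window length W and step size S — can be sketched over a token sequence as follows. Whether the last, shorter window is kept is not specified here, so the sketch keeps it:

```python
def sliding_segments(tokens, window, step):
    """Intercept voice-information segments of length `window`, advancing by `step`.

    Consecutive segments overlap when step < window, which is what lets the
    keyword vectors of adjacent segments be compared for topic drift.
    """
    segments = []
    i = 0
    while i < len(tokens):
        segments.append(tokens[i:i + window])
        if i + window >= len(tokens):
            break  # this window already reaches the end of the stream
        i += step
    return segments
```

With 7 tokens, W = 4, and S = 2, the windows start at positions 0, 2, and 4, and the last one is truncated to 3 tokens.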
  • the video segmentation device extracts keywords for each voice information segment. Specifically, the video segmentation device extracts N keywords from each voice information segment.
  • If the video segmentation device has extracted the keywords of the content description information, step 603 can be performed after step 602; if the video segmentation device has not extracted the keywords of the content description information, step 604 can be performed after step 602.
  • That the video segmentation device has extracted the keywords of the content description information means that the video segmentation device has determined that the video to be processed includes the content description information. In this case, the segmentation point determined by the video segmentation device is determined based on the content description information and the voice information.
  • That the video segmentation device has not extracted the keywords of the content description information means that the video segmentation device has determined that the to-be-processed video does not include the content description information. In this case, the segmentation point determined by the video segmentation device is determined based on the voice information.
  • the video segmentation device determines the keyword in the i-th voice information segment and the word frequency vector C_i of the keyword of the content description information in the i-th voice information segment.
  • the method for determining the keyword in the i-th voice information segment refer to the embodiment shown in FIG. 3. Specifically, reference may be made to the method for determining the keyword of the first voice information segment in the embodiment shown in FIG. 3, which will not be repeated here.
  • For the method for determining the keywords of the content description information, reference may be made to the embodiment shown in FIG. 3; it is not repeated here.
  • The implementation manner in which the video segmentation device determines the word frequency vector, in the i-th voice information segment, of the keywords of the i-th voice information segment and the keywords of the content description information is similar to how the keyword vectors above are determined; there is no need to go into details here.
  • the video segmentation device determines the word frequency vector C_i of the keyword in the i-th voice information segment in the i-th voice information segment.
  • The method of determining the word frequency vector, in the i-th voice information segment, of the keywords of the i-th voice information segment is similar to the method of determining the word frequency vector C_i of the keywords of the i-th voice information segment and the keywords of the content description information in the i-th voice information segment, so there is no need to repeat it here.
  • the video segmentation apparatus may perform step 605 and step 606 in sequence.
  • the video segmentation device determines the distance between C_i and C_(i-1).
  • C_(i-1) is the word frequency vector, determined by the video segmentation device, of the keywords of the (i-1)-th voice information segment (or the keywords of the (i-1)-th voice information segment and the keywords of the content description information) in the (i-1)-th voice information segment; the (i-1)-th voice information segment is the voice information segment before the i-th voice information segment.
  • If the video segmentation device determines that a segment point is located before or after the i-th voice information segment, the segment point can be determined according to the pause point.
  • Otherwise, the distance between the word frequency vector of the next voice information segment and the word frequency vector of the i-th voice information segment can be determined, and the process continues.
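The comparison in steps 605 and 606 — measuring the distance between consecutive word-frequency vectors C_(i-1) and C_i and treating a large distance as a sign that a segment point lies nearby — can be sketched with the cosine distance (the earlier text allows either the Euclidean or the cosine distance for comparing keyword vectors). The threshold value is a hypothetical parameter:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (nu * nv)

def topic_changed(c_prev, c_curr, threshold):
    """Consecutive window vectors far apart suggest a topic change,
    so a segment point should then be located at a nearby pause point."""
    return cosine_distance(c_prev, c_curr) > threshold
```

Identical vectors have distance 0; orthogonal word-frequency vectors (no shared keywords) have distance 1.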
  • Fig. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • the video segmentation device 700 includes an acquisition unit 701 and a processing unit 702.
  • the acquiring unit 701 is configured to acquire text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed.
  • the processing unit 702 is configured to determine the segment point of the video to be processed according to the text information and the voice information.
  • the processing unit 702 is further configured to segment the to-be-processed video according to the segmentation point.
  • the specific functions and beneficial effects of the acquiring unit 701 and the processing unit 702 can be referred to the methods shown in FIG. 3 to FIG. 6, which will not be repeated here.
  • Fig. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
  • the video segmentation device 800 shown in FIG. 8 includes a processor 801, a memory 802, and a transceiver 803.
  • the processor 801, the memory 802, and the transceiver 803 communicate with each other through internal connection paths, and transfer control and/or data signals.
  • the method disclosed in the foregoing embodiment of the present application may be applied to the processor 801 or implemented by the processor 801.
  • the processor 801 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 801 or instructions in the form of software.
  • the aforementioned processor 801 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium mature in the field, such as random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory 802, and the processor 801 reads the instructions in the memory 802 and completes the steps of the above method in combination with its hardware.
  • the memory 802 may store instructions for executing the method executed by the video segmentation apparatus in the methods shown in FIGS. 3 to 6.
  • the processor 801 can execute the instructions stored in the memory 802 in combination with other hardware (such as the transceiver 803) to complete the steps of the video segmentation device in the method shown in FIG. 3 to FIG. 6.
  • For the specific working process and beneficial effects, reference may be made to the descriptions in the embodiments shown in FIG. 3 to FIG. 6.
  • An embodiment of the present application also provides a chip, which includes a transceiver unit and a processing unit.
  • the transceiver unit may be an input/output circuit or a communication interface
  • the processing unit is a processor or microprocessor or integrated circuit integrated on the chip.
  • the chip can execute the method of the video segmentation device in the above method embodiment.
  • the embodiment of the present application also provides a computer-readable storage medium on which an instruction is stored, and when the instruction is executed, the method of the video segmentation device in the foregoing method embodiment is executed.
  • the embodiment of the present application also provides a computer program product containing instructions that, when executed, execute the method of the video segmentation device in the foregoing method embodiment.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present application.
  • The aforementioned storage media include: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present application provides a video segmentation method and a video segmentation device. The method comprises: a video segmentation device segments a video to be processed according to at least one of content description information for describing content of the video to be processed and a presentation presented in the video to be processed as well as voice information of the video to be processed, wherein the content description information and the presentation are uploaded in advance. The technical solution can segment the video to be processed with reference to information except the content of the video to be processed, thereby improving the accuracy of segmentation.

Description

Video segmentation method and video segmentation device

Technical field

This application relates to the field of information technology, and more specifically, to a video segmentation method and a video segmentation device.

Background
To make videos convenient to watch, a complete video can be divided into multiple segments. In this way, a user can jump directly to a segment of interest.
At present, a common video segmentation method segments a video based on the text information in the video. The text information may be subtitles in the video, or text obtained by performing speech recognition on the video. In other words, the basis for segmenting a video currently comes entirely from the video itself. In addition, segmentation based on the text information of a video requires obtaining all of the video's text information. The video stream of a live video is generated in real time, so all of its text information is available only after the live broadcast ends. The above method therefore cannot segment a live video in real time. Moreover, because the above method segments a video only according to its text information, the determined segmentation points are not necessarily suitable segmentation points.
Summary of the invention

The present application provides a video segmentation method and a video segmentation device, which can improve the accuracy of video segmentation.
In a first aspect, an embodiment of the present application provides a video segmentation method, including: a video segmentation device acquires text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation presented in the video to be processed and content description information of the video to be processed; the video segmentation device determines a segmentation point of the video to be processed according to the text information and the voice information; and the video segmentation device segments the video to be processed according to the segmentation point. The above technical solution can segment the to-be-processed video with reference to information other than the content of the video itself, thereby improving the accuracy of segmentation.
With reference to the first aspect, in a possible implementation of the first aspect, in a case where the text information includes the presentation, determining, by the video segmentation device, the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining a switching point of the presentation, where the presentation presents different content before and after the switching point; determining at least one pause point according to the voice information; and determining the segmentation point according to the switching point and the at least one pause point. A switch of the presentation often means that the content of the speaker's speech has changed. Therefore, by taking presentation changes into account, the above technical solution divides the to-be-processed video into different segments and can quickly and reasonably determine its segmentation points. In addition, determining the segmentation point only requires the switching point of the presentation and the pause points near that switching point, so the video can be segmented without obtaining the complete video file. In other words, the to-be-processed video can be segmented in real time, and the above technical solution can therefore be applied to segmentation of live video.
With reference to the first aspect, in a possible implementation of the first aspect, determining the segmentation point according to the switching point and the at least one pause point includes: when the switching point is the same as one of the at least one pause point, determining the switching point as the segmentation point; and when none of the at least one pause point is the same as the switching point, determining the pause point that is closest to the switching point among the at least one pause point as the segmentation point.
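The rule above (use the switching point itself if it coincides with a pause point, otherwise fall back to the nearest pause point) can be sketched in a few lines. This is an illustrative sketch, not the claimed implementation; the function name and the representation of points as timestamps in seconds are assumptions.

```python
def choose_segment_point(switch_point, pause_points):
    """Pick a segmentation point from a presentation switching point and a
    list of voice pause points (all timestamps in seconds).

    If the switching point coincides with a pause point, it is used directly;
    otherwise the pause point nearest to the switching point is chosen."""
    if switch_point in pause_points:
        return switch_point
    # No coincident pause point: take the one with minimal time distance.
    return min(pause_points, key=lambda p: abs(p - switch_point))
```

For example, a slide switch at 10.0 s with pauses at 8.0 s and 11.2 s would yield 11.2 s as the segmentation point.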
With reference to the first aspect, in a possible implementation of the first aspect, determining the switching point of the presentation includes: determining, as the switching point, the moment at which a switching signal for instructing to switch the content of the presentation is acquired.
With reference to the first aspect, in a possible implementation of the first aspect, the text information further includes the content description information, and before the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
With reference to the first aspect, in a possible implementation of the first aspect, in a case where the text information includes the content description information, determining, by the video segmentation device, the segmentation point of the to-be-processed video according to the text information and the voice information includes: determining the segmentation point of the to-be-processed video according to the voice information, keywords of the content description information, and pause points in the voice information. The content description information is information input by a user in advance to describe the to-be-processed video, and usually includes some key information about the video, such as keywords and key content. Therefore, the key content described in different segments of the to-be-processed video can be determined more accurately based on the content description information, so that the video can be segmented more accurately.
With reference to the first aspect, in a possible implementation of the first aspect, the voice information includes a first voice information segment and a second voice information segment, where the second voice information segment is the voice information segment that precedes and is adjacent to the first voice information segment, and determining the segmentation point of the to-be-processed video according to the voice information, the keywords of the content description information, and the pause points in the voice information includes: determining a first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information, where the segmentation points of the to-be-processed video include the first segmentation point. In addition, when determining a segmentation point, the above technical solution only needs the keywords of the content description information and the voice information of two adjacent video clips. The division into video clips can be performed with a fixed duration and step size, so the already-played portion of a video can be divided into clips while the video is still playing. In this way, the video can be segmented without obtaining the complete video file. In other words, the to-be-processed video can be segmented in real time, and the above technical solution can therefore be applied to segmentation of live video.
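The fixed-duration, fixed-step division into clips mentioned above can be sketched as a simple sliding window over the elapsed audio. This is a hypothetical illustration; the 60-second window and 30-second step are example values not taken from the application.

```python
def voice_windows(duration_s, win_s=60.0, step_s=30.0):
    """Divide the elapsed audio (duration_s seconds so far) into fixed-length
    windows advanced by a fixed step, so that adjacent voice-information
    segments can be compared while the stream is still playing.

    Returns a list of (start, end) pairs in seconds."""
    t, out = 0.0, []
    while t + win_s <= duration_s:
        out.append((t, t + win_s))
        t += step_s
    return out
```

Because each window depends only on audio already received, this division works on a live stream without the complete file.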
With reference to the first aspect, in a possible implementation of the first aspect, determining the first segmentation point according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information includes: determining a similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information; determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and determining the first segmentation point according to the pause points in the voice information.
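One plausible way to realize the similarity test above is cosine similarity between keyword-weighted bag-of-words vectors of the two adjacent segments, with words that also appear in the content description keywords weighted more heavily. This is a sketch under those assumptions; the application does not prescribe a specific similarity measure, and the 2.0 weight is invented for illustration.

```python
from collections import Counter
import math

def keyword_similarity(seg_a_words, seg_b_words, description_keywords):
    """Cosine similarity between two adjacent voice-information segments,
    given their words as lists; words present in the content description
    keywords are up-weighted. Returns a value in [0, 1]."""
    def vec(words):
        counts = Counter(words)
        return {w: n * (2.0 if w in description_keywords else 1.0)
                for w, n in counts.items()}
    va, vb = vec(seg_a_words), vec(seg_b_words)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A similarity below the chosen threshold would indicate a topic change between the two segments, triggering the search for a pause point to use as the segmentation point.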
With reference to the first aspect, in a possible implementation of the first aspect, the pause points in the voice information include pause points within the first voice information segment or pause points adjacent to the first voice information segment, and determining the first segmentation point according to the pause points in the voice information includes: determining the first segmentation point according to at least one of the number of pause points within the first voice information segment, the number of pause points adjacent to the first voice information segment, pause durations, and the words adjacent to the pause points.
With reference to the first aspect, in a possible implementation of the first aspect, there are K pause points within the first voice information segment, or K pause points adjacent to the first voice information segment. Determining the first segmentation point according to at least one of the number of pause points within the first voice information segment, the number of pause points adjacent to the first voice information segment, the pause durations, and the words adjacent to the pause points includes: when K is equal to 1, determining that pause point as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words adjacent to the K pause points include one preset word, determining the pause point adjacent to that preset word as the segmentation point; when K is a positive integer greater than or equal to 2 and the K words include at least two of the preset words, determining, as the segmentation point, the pause point with the longest pause duration among the at least two pause points adjacent to those preset words; and when K is a positive integer greater than or equal to 2 and the K words do not include the preset word, determining the pause point with the longest pause duration among the K pause points as the segmentation point.
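The four-branch rule above maps directly onto a small selection function. The sketch below is illustrative only; representing each pause point as a (time, duration, adjacent_word) tuple is an assumption, not part of the application.

```python
def pick_pause_point(pauses, preset_words):
    """Select one pause point following the K-pause-point rule:
    - K == 1: use the only pause point;
    - exactly one pause is adjacent to a preset word: use it;
    - several pauses are adjacent to preset words: use the longest of those;
    - no pause is adjacent to a preset word: use the longest pause overall.

    `pauses` is a list of (time, duration, adjacent_word) tuples."""
    if len(pauses) == 1:
        return pauses[0][0]
    near_preset = [p for p in pauses if p[2] in preset_words]
    if len(near_preset) == 1:
        return near_preset[0][0]
    candidates = near_preset if near_preset else pauses
    return max(candidates, key=lambda p: p[1])[0]
```

Here `preset_words` would hold discourse markers (for example, transition words) that make a pause a natural topic boundary; which words are preset is left open by the application.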
With reference to the first aspect, in a possible implementation of the first aspect, the text information further includes the presentation, and before the video segmentation device determines the segmentation point of the to-be-processed video according to the text information and the voice information, the method further includes: determining that the presentation duration of the current page of the presentation is greater than a first preset duration; or determining that the presentation duration of the current page of the presentation is less than or equal to a second preset duration. The above technical solution can avoid unsuitable segmentation caused by a presentation that remains unchanged for a long time or that changes very rapidly.
With reference to the first aspect, in a possible implementation of the first aspect, the method further includes: determining, by the video segmentation device, a summary of a segment according to the content of the segment's voice information, the keywords of the segment's voice information, and keywords of a target text, where the target text includes at least one of the presentation and the content description information. Based on the above technical solution, a user reviewing the video can use the summary to quickly locate the position to be reviewed. In addition, the above technical solution takes information beyond the to-be-processed video itself into account when determining the summary, which can improve both the accuracy of the determined summary and the speed of determining it.
With reference to the first aspect, in a possible implementation of the first aspect, determining, by the video segmentation device, the summary of the segment according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text includes: determining a third keyword vector according to the content of the segmented voice information, the keywords of the segmented voice information, and the keywords of the target text; and determining the summary of the segment according to the third keyword vector.
With reference to the first aspect, in a possible implementation of the first aspect, determining, by the video segmentation device, the summary of the segment according to the third keyword vector includes: determining a reference text according to the target text and the segmented voice information, where the reference text includes J sentences and J is a positive integer greater than or equal to 1; determining J keyword vectors according to the keywords of the segmented voice information, the keywords of the target text, and each of the J sentences; and determining the summary of the segment according to the third keyword vector and the J keyword vectors.
With reference to the first aspect, in a possible implementation of the first aspect, determining the reference text according to the target text and the segmented voice information includes: in a case where the target text includes redundant sentences, deleting the redundant sentences from the target text to obtain a revised target text, and merging the revised target text with the segmented voice information to obtain the reference text; and in a case where the target text does not include redundant sentences, merging the target text with the segmented voice information to obtain the reference text.
With reference to the first aspect, in a possible implementation of the first aspect, determining the summary of the segment according to the third keyword vector and the J keyword vectors includes: determining J distances according to the third keyword vector and the J keyword vectors, where the j-th distance among the J distances is determined according to the third keyword vector and the j-th keyword vector among the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determining the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determining the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances.
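The selection of the R sentences nearest to the segment-level keyword vector can be sketched as below. This is an illustrative sketch only: the application does not fix the distance metric, so Euclidean distance and plain float-list vectors are assumptions.

```python
import math

def pick_summary_sentences(segment_vec, sentence_vecs, r):
    """Return the indices (in document order) of the r sentences whose
    keyword vectors are closest to the segment-level keyword vector.

    segment_vec:   the third keyword vector, a list of floats.
    sentence_vecs: the J per-sentence keyword vectors, same dimension."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(range(len(sentence_vecs)),
                    key=lambda j: dist(segment_vec, sentence_vecs[j]))
    # Keep the r nearest sentences, then restore document order for the summary.
    return sorted(ranked[:r])
```

The summary of the segment would then be the concatenation of the selected sentences.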
With reference to the first aspect, in a possible implementation of the first aspect, the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is the voice information of the real-time video stream from the start of the stream, or from the previous segmentation point, up to the current moment. The above technical solution can realize real-time segmentation of a video; that is, segmenting the video does not require obtaining its entire content. Therefore, the above technical solution can realize real-time segmentation of live video.
In a second aspect, an embodiment of the present application provides a video segmentation device, and the device includes units for executing the first aspect or any possible implementation of the first aspect.
Optionally, the video segmentation device of the second aspect may be a computer device, or may be a component (such as a chip or a circuit) usable in a computer device.
In a third aspect, an embodiment of the present application provides a storage medium that stores instructions for implementing the method described in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in the first aspect or any possible implementation of the first aspect.
Description of the drawings
FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by the embodiments of the present application can be applied;

FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by the embodiments of the present application can be applied;

FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a video conference process provided according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application;

FIG. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application;

FIG. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application.
Detailed description of embodiments
The technical solutions in this application will be described below in conjunction with the drawings.
In this application, "at least one" refers to one or more, and "multiple" refers to two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of a single item or multiple items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, in the embodiments of the present application, words such as "first" and "second" do not limit quantity or execution order.
It should be noted that in this application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as being more preferable or advantageous than other embodiments or designs. Rather, words such as "exemplary" or "for example" are intended to present related concepts in a specific manner.
Various aspects or features of this application can be implemented as methods, devices, or articles of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used in this application encompasses a computer program accessible from any computer-readable device, carrier, or medium. For example, computer-readable media may include, but are not limited to: magnetic storage devices (for example, hard disks, floppy disks, or tapes), optical discs (for example, compact discs (CD), digital versatile discs (DVD), etc.), and smart cards and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, or key drives). In addition, the various storage media described herein may represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
FIG. 1 is a schematic diagram of a system to which the video segmentation method provided by this application can be applied. FIG. 1 shows a video conference system that includes a conference control server 101, a conference terminal 111, a conference terminal 112, and a conference terminal 113. The conference terminals 111, 112, and 113 can establish a conference through the conference control server 101.
A video conference usually includes at least two conference sites. Each conference site can access the conference control server through a conference terminal. The conference terminal may be a device used to access a video conference; it can be used to receive conference data and present the conference content on a display device according to that data. The conference terminal may include a host and a display device. The host can receive conference data through a communication interface, generate a video signal according to the received conference data, and output the video signal to the display device in a wired or wireless manner. The display device presents the conference content according to the received video signal. Optionally, in some embodiments, the display device may be built into the host; for example, the conference terminal may be an electronic device with a built-in display, such as a notebook computer, a tablet computer, or a smartphone. Optionally, in other embodiments, the display device may be external to the host; for example, the host may be a desktop computer, and the display device may be a monitor, a television, or a projector. Moreover, even if the host has a built-in display, the display device used for presenting conference content may still be external to the host; for example, the host may be a notebook computer, and the display device a monitor, a television, or a projector connected to it.
In some cases, a video conference may include a main conference site and at least one branch site. In this case, the conference terminal in the main site (for example, the conference terminal 111) can upload the collected media stream of the main site to the conference control server 101. The conference control server 101 can generate conference data according to the received media stream and send it to the conference terminals in the branch sites (for example, the conference terminals 112 and 113), which can then present the conference content on their display devices according to the received conference data.
In other cases, there may be no distinction between primary and secondary among the at least two sites in a video conference; the conference terminal in each site can upload its collected media stream to the conference control server 101. For example, suppose the conference terminal 111 accesses the video conference from conference site 1, the conference terminal 112 from conference site 2, and the conference terminal 113 from conference site 3. The conference terminal 111 can upload the collected media stream of site 1 to the conference control server 101, which can generate conference data 1 according to that stream and send it to the conference terminals 112 and 113; these terminals can then present the conference content on their display devices according to the received conference data 1. Similarly, the conference terminal 112 can upload the collected media stream of site 2, from which the conference control server 101 generates conference data 2 and sends it to the conference terminals 111 and 113 for presentation on their display devices; and the conference terminal 113 can upload the collected media stream of site 3, from which the conference control server 101 generates conference data 3 and sends it to the conference terminals 111 and 112 for presentation on their display devices.
Optionally, in some embodiments, the media stream may be an audio stream. Optionally, in other embodiments, the media stream may be a video stream. The media device responsible for capturing the media stream may be built into the conference terminal (for example, a camera and a microphone inside the terminal) or externally connected to it; the embodiments of this application do not limit this.
Optionally, in some embodiments, the speaker of the conference uses a presentation during the speech. In this case, the media stream may be the audio stream of the speaker's speech. The presentation used by the speaker can be uploaded to the conference control server 101 through an auxiliary stream (also called a data stream or a computer-screen stream). The conference control server 101 generates conference data based on the received audio stream and auxiliary stream. Optionally, in some possible implementations, the conference data may include the received audio stream and the auxiliary stream. Optionally, in other possible implementations, the conference data may include a processed audio stream, obtained by processing the received audio stream, together with the auxiliary stream. Processing the received audio stream may be a transcoding operation, for example reducing the bit rate of the audio stream so as to reduce the amount of data required to transmit it to other conference terminals. Optionally, in still other possible implementations, the conference data may include the received audio stream, an audio stream with a bit rate different from that of the received audio stream, and the auxiliary stream. In this way, a conference terminal can select a suitable audio stream according to its network condition and/or the way it accesses the conference. For example, if the terminal's network condition is good, or it accesses the conference over Wi-Fi, it can select an audio stream with a higher bit rate so that clearer sound can be heard. If the terminal's network condition is poor, it can select an audio stream with a lower bit rate, which reduces interruptions of the live conference caused by the poor network condition. If the terminal accesses the conference over a mobile network, it can select a lower-bit-rate audio stream to reduce data consumption. Optionally, in still other possible implementations, in addition to the audio stream of at least one bit rate and the auxiliary stream, the conference data may also include subtitles corresponding to the speaker's speech. The subtitles may be generated by speech-to-text conversion based on speech recognition technology, may be a manual transcript of the speaker's speech, or may be generated by speech-to-text conversion combined with manual correction.
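The stream-selection behavior described above can be sketched as a small function. This is an illustrative assumption, not part of the patent text: the function name, the bit-rate values, and the idea of keying on an access-type string are all hypothetical.

```python
# Hypothetical sketch of selecting an audio stream by network condition and
# access type, as described above. Names and thresholds are assumptions.
def select_audio_stream(streams, access_type, network_good):
    """Pick a bit rate (kbit/s) from the available streams.

    streams      -- available audio bit rates, e.g. [24, 64, 128]
    access_type  -- "wifi" or "mobile" (hypothetical labels)
    network_good -- True if the terminal's network condition is good
    """
    if access_type == "wifi" and network_good:
        return max(streams)   # good link: clearer sound at a higher bit rate
    return min(streams)       # poor link or mobile data: save bandwidth

print(select_audio_stream([24, 64, 128], "wifi", True))    # → 128
print(select_audio_stream([24, 64, 128], "mobile", True))  # → 24
```

The same selection logic applies to video resolution in the video-stream embodiments below, with resolution taking the place of bit rate.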
Optionally, in other embodiments, the media stream may be a video stream of the speaker during the speech. In other words, the media stream may contain both the voice information and the picture information of the speaker during the speech. Correspondingly, the media stream uploaded to the conference control server 101 is this video stream. In some cases, suppose the speaker uses a presentation during the speech and shows it with an output device (for example a projector or a television). The picture information in the media stream then includes the presentation shown by the speaker, so the video stream uploaded to the conference control server 101 includes the presentation. In this case, the conference control server 101 can determine the conference data directly from the video stream. In other cases, the presentation used by the speaker during the speech may be uploaded to the conference control server 101 through an auxiliary stream, and the conference control server 101 may generate the conference data from the captured video stream and the auxiliary stream. Optionally, in some possible implementations, the conference data may include the captured video stream and the auxiliary stream. Optionally, in other possible implementations, the conference data may include a processed video stream, obtained by processing the captured video stream, together with the auxiliary stream. Processing the captured video stream may be a transcoding operation, for example reducing the resolution of the video stream so as to reduce the amount of data required to transmit it to other conference terminals. Optionally, in still other possible implementations, the conference data may include the captured video stream, a video stream with a resolution different from that of the captured video stream, and the auxiliary stream. In this way, a conference terminal can select a suitable video stream according to its network condition and/or the way it accesses the conference. For example, if the terminal's network condition is good, or it accesses the conference over Wi-Fi, it can select a higher-resolution video stream so that the audience sees a clearer picture. If the terminal's network condition is poor, it can select a lower-resolution video stream, which reduces interruptions of the live conference caused by the poor network condition. If the terminal accesses the conference over a mobile network, it can select a lower-resolution video stream to reduce data consumption. Optionally, in still other possible implementations, in addition to the video stream of at least one resolution and the auxiliary stream, the conference data may also include subtitles corresponding to the speaker's speech. The subtitles may be generated by speech-to-text conversion based on speech recognition technology, may be a manual transcript of the speaker's speech, or may be generated by speech-to-text conversion combined with manual correction.
FIG. 2 is a schematic diagram of another system to which the video segmentation method provided by this application can be applied. FIG. 2 shows a distance education system, which includes a course server 201, a main device 211, a client device 212, and a client device 213.
The main device 211 can upload the captured media stream to the course server 201. The course server 201 can generate course data from the media stream and send the course data to the client devices 212 and 213, which can present the course content on their display devices according to the received course data.
The main device 211 may be a notebook computer or a desktop computer. The client devices 212 and 213 may be notebook computers, desktop computers, tablet computers, smartphones, and so on.
Optionally, in some embodiments, the teacher giving the lecture uses a presentation during the lecture. In this case, the media stream may be the audio stream of the teacher's lecture. The presentation used by the teacher during the lecture can be uploaded to the course server 201 through an auxiliary stream. The course server 201 generates course data based on the received audio stream and auxiliary stream.
Optionally, in other embodiments, the media stream may be a video stream of the teacher during the lecture. In other words, the media stream may contain both the voice information and the picture information of the teacher during the lecture. Correspondingly, the media stream uploaded to the course server 201 is this video stream. In some cases, suppose the teacher uses a presentation during the lecture and shows it with an output device (for example a projector or a television). The picture information in the media stream then includes the presentation shown by the teacher, so the video stream uploaded to the course server 201 includes the presentation. In this case, the course server 201 can determine the course data directly from the video stream. In other cases, the presentation used by the teacher during the lecture may be uploaded to the course server 201 through an auxiliary stream, and the course server 201 can generate the course data from the captured video stream and the auxiliary stream.
The specific content of the course data is similar to that of the conference data and, for brevity, is not described again here.
FIG. 3 is a schematic flowchart of a video segmentation method provided according to an embodiment of this application. The method shown in FIG. 3 can be executed by a video segmentation device. The video segmentation device may be a computer device capable of implementing the method provided by the embodiments of this application, such as a personal computer, a notebook computer, a tablet computer, or a server; it may be hardware inside such a computer device, for example a graphics card or a graphics processing unit (GPU); or it may be a dedicated device for implementing the method provided by the embodiments of this application. For example, in some embodiments the video segmentation device may be the conference control server 101 in the system shown in FIG. 1, or a piece of hardware inside the conference control server 101. In other embodiments, it may be the conference terminal that uploads the media stream in the system shown in FIG. 1, or a piece of hardware in that conference terminal. In still other embodiments, it may be the main device 211 in the system shown in FIG. 2, or a piece of hardware inside the main device 211. In still other embodiments, it may be the course server 201 in the embodiment shown in FIG. 2, or a piece of hardware in the course server 201.
For ease of description, it is assumed below that the method shown in FIG. 3 is applied to the system shown in FIG. 1.
301: The video segmentation device obtains text information of a to-be-processed video and voice information of the to-be-processed video, where the text information includes at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video.
The presentation refers to the document presented by the speaker of the conference during the speech. The embodiments of this application do not limit the file format of the presentation: any document shown through a display device during the speaker's speech may be the presentation. For example, the presentation may be in the ppt or pptx format. As another example, it may be in the PDF format. As yet another example, it may be in the word or txt format.
The content description information is information describing the speech content, uploaded by the speaker or the host of the conference before the conference begins. Optionally, in some embodiments, the content description information includes an outline, an abstract, and/or key information of the speaker's speech content in the video conference. For example, the content description information may include keywords of the speaker's speech content. As another example, it may include an abstract of the speech content. As yet another example, the speech content may consist of multiple parts, and the content description information may include the topic, abstract, and/or keywords of each of those parts.
The voice information may include text obtained by performing speech-to-text conversion on the speaker's speech. The embodiments of this application do not limit the specific implementation of speech-to-text conversion, as long as the recognized speech can be converted into the corresponding text. The voice information may also include at least one pause point obtained by performing speech recognition on the speaker's speech. A pause point represents a natural pause of the speaker in the course of speaking.
302: The video segmentation device determines a segmentation point of the to-be-processed video according to the text information and the voice information.
As described above, the text information may include at least one of the presentation and the content description information. In other words, the text information may fall into one of the following three cases:
Case 1: the text information includes only the presentation;
Case 2: the text information includes only the content description information;
Case 3: the text information includes both the presentation and the content description information.
In other words, in some cases the speaker may only show the presentation during the speech, without uploading the content description information in advance; case 1 may therefore occur. In other cases the speaker may only upload the content description information in advance, without showing a presentation during the speech; case 2 may therefore occur. In still other cases the speaker may both show the presentation during the speech and upload the content description information in advance; case 3 may therefore occur.
For case 1 above, the video segmentation device may determine the segmentation point of the to-be-processed video according to the presentation.
For case 2 above, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information.
Optionally, in some embodiments, for case 3 above, the video segmentation device may determine the segmentation point of the to-be-processed video according to one of the presentation and the content description information. In other words, when the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point according to either the presentation or the content description information.
Optionally, in some embodiments, when the text information includes both the presentation and the content description information, the video segmentation device may determine the presentation duration of the current page of the presentation and, according to that duration, decide whether to determine the segmentation point of the to-be-processed video according to the presentation or according to the content description information.
Optionally, in some embodiments, when the presentation duration of the current page of the presentation is greater than a first preset duration, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information. This avoids a video segment being too long because the speaker presents the same content for a long time. The first preset duration can be set as required; for example, it may be 10 minutes, or 15 minutes.
Optionally, in some embodiments, when the presentation duration of the current page of the presentation is less than or equal to a second preset duration, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information. This avoids a video segment being too short because the speaker frequently switches the displayed content of the presentation. Like the first preset duration, the second preset duration can be set as required; for example, it may be 20 seconds, or 10 seconds.
The first preset duration is greater than the second preset duration.
Optionally, in some embodiments, when the presentation duration of the current page of the presentation is greater than the second preset duration and less than or equal to the first preset duration, the video segmentation device may determine the segmentation point of the to-be-processed video according to the presentation and the voice information.
Optionally, in other embodiments, only the first preset duration may be set. If the presentation duration of the current page is greater than the first preset duration, the segmentation point of the to-be-processed video is determined according to the content description information and the voice information; if it is not greater than the first preset duration, the segmentation point may be determined according to the presentation and the voice information. The presentation duration of the current page of the presentation is the length of time for which the presentation stays on the current page.
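The two-threshold decision described above can be summarized in a few lines. This is a minimal illustrative sketch, not the patent's implementation: the function name and the concrete values of the two preset durations are assumptions.

```python
# Sketch of choosing which information to combine with the voice information,
# per the two preset durations above. Values are assumed for illustration.
T_FIRST = 10 * 60   # first preset duration, e.g. 10 minutes (in seconds)
T_SECOND = 20       # second preset duration, e.g. 20 seconds

def choose_source(page_duration, have_description):
    """Return the segmentation source for the current presentation page."""
    too_long = page_duration > T_FIRST      # speaker lingered on one page
    too_short = page_duration <= T_SECOND   # speaker flipped pages quickly
    if (too_long or too_short) and have_description:
        return "content_description"
    return "presentation"

print(choose_source(15 * 60, True))  # → content_description (page too long)
print(choose_source(5 * 60, True))   # → presentation (between thresholds)
print(choose_source(10, True))       # → content_description (page too short)
```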
Optionally, in some embodiments, the start moment of the presentation duration of the current page is the moment at which the presentation switches to the current page, and the end moment is the moment at which the presentation switches from the current page to another page.
For example, if the presentation switches to page n (n being a positive integer greater than or equal to 1) at moment T1, the video segmentation device may start timing from T1. If the timed duration exceeds the first preset duration and the presentation has not yet switched to page n+1, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information. If the presentation switches to page n+1 at moment T2 (T2 being later than T1), and the duration from T1 to T2 is less than or equal to the second preset duration, the video segmentation device may determine the segmentation point according to the content description information and the voice information. If the duration from T1 to T2 is less than or equal to the first preset duration and greater than the second preset duration, the video segmentation device may determine the segmentation point according to the presentation and the voice information; more specifically, according to page n of the presentation and the voice information.
Optionally, in other embodiments, the start moment of the presentation duration of the current page may be the previous segmentation point, and the end moment is the moment at which the presentation switches from the current page to another page.
For example, suppose the presentation switches to page n (n being a positive integer greater than or equal to 1) at moment T3, and the presentation stays on page n for longer than the first preset duration. In this case, the video segmentation device determines, according to the content description information and the voice information, that one segmentation point of the to-be-processed video is moment T4. The video segmentation device may then start timing from T4. If the timed duration exceeds the first preset duration and the presentation has not yet switched to page n+1, the video segmentation device may determine the next segmentation point according to the content description information and the voice information. If the presentation switches to page n+1 at moment T5 (T5 being later than T4), and the duration from T4 to T5 is not greater than the first preset duration and is greater than the second preset duration, the video segmentation device may determine the segmentation point according to the presentation and the voice information; more specifically, according to page n of the presentation and the voice information.
Optionally, in other embodiments, when the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point of the to-be-processed video according to the presentation and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point with reference only to the presentation and the voice information (that is, without using the content description information).
Optionally, in still other embodiments, when the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point of the to-be-processed video according to the content description information and the voice information. In other words, even if the text information includes both the presentation and the content description information, the video segmentation device may determine the segmentation point with reference only to the content description information and the voice information (that is, without using the presentation).
Determining, by the video segmentation device, the segmentation point of the to-be-processed video according to the presentation and the voice information may include: determining, by the video segmentation device, a switching point of the presentation, where the content presented before the switching point differs from the content presented after it; determining, by the video segmentation device, at least one pause point according to the voice information; and determining, by the video segmentation device, the segmentation point according to the switching point and the at least one pause point.
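The patent states that the segmentation point is determined from the switching point and at least one pause point, without fixing the combination rule at this step. One plausible rule, stated here purely as an assumption for illustration, is to snap the switching point to the nearest pause point, so that a segment boundary never cuts the speaker mid-sentence:

```python
# Hypothetical combination rule (an assumption, not the patent's method):
# move the segmentation point to the pause point nearest the switching point.
def snap_to_pause(switch_point, pause_points):
    """Return the pause point (timestamp in seconds) closest to switch_point."""
    return min(pause_points, key=lambda p: abs(p - switch_point))

pauses = [12.0, 58.5, 61.2, 119.9]   # pause points from speech recognition
print(snap_to_pause(60.0, pauses))   # → 61.2
```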
The switching point of the presentation is the moment at which the presentation switches. A switch of the presentation may mean that the presentation turns a page, for example from page 1 to page 2. It may also mean that the content of the presentation changes without a page turn. For example, when the presentation is a text document, the speaker may show only part of a page (for example the upper half) and then scroll to the remaining part of the page (for example the lower half). Although no page is turned, the content shown in the presentation has changed.
Optionally, in some embodiments, the video segmentation device may obtain a switching signal used to indicate switching of the content of the presentation. In this case, the video segmentation device may determine that the moment at which the switching signal is obtained is the switching point.
Optionally, in some embodiments, the video segmentation device may obtain the content of the presentation. In this case, the video segmentation device may determine the switching point according to changes in the content of the presentation. For example, when the video segmentation device determines that the content of the presentation shown in the to-be-processed video at a first moment differs from the content shown at a second moment, it may determine that the first moment is the switching point. Optionally, in some embodiments, the first moment and the second moment are adjacent moments, with the first moment before the second. Optionally, in other embodiments, the first moment is before the second moment and the interval between them is less than a preset duration; in other words, in this case the video segmentation device may check at regular intervals whether the content presented by the presentation has changed.
Optionally, in some embodiments, the video segmentation device may determine the switching point by combining the obtained switching signal, which indicates switching of the content of the presentation, with the content actually presented. For example, suppose the video segmentation device obtains the switching signal at moment T1. The video segmentation device may then obtain the content presented in the F1 frames before T1 and in the F2 frames after T1, where F1 and F2 are positive integers greater than or equal to 1. Optionally, in some embodiments, F1 and F2 may take small values, for example both equal to 2, which reduces the amount of computation. If two consecutive frames among these F1 + F2 frames present different content, the moment of the frame at which the presented content changes may be determined as the switching point. For example, suppose F1 and F2 are both 2: if the second and third of the four frames present different content, the moment of the second frame may be determined as the switching point. Determining the switching point from both the switching signal and the presented content avoids an inaccurate switching point caused by the switching signal and the picture switch of the presentation being out of sync.
Optionally, in some embodiments, the video segmentation device may determine whether the content shown by the presentation at two different moments (or in two different frames) is the same in the following manner: the device counts the number P of positions at which the change in pixel value between the two moments (or frames) exceeds a preset change value; if P is greater than a first preset threshold P1, the device determines that the content shown by the presentation has changed. Optionally, in some embodiments, the change in pixel value may be determined by computing the absolute difference of the pixel gray values. Optionally, in other embodiments, the change in pixel value may be determined by computing the sum of the absolute differences over the three color channels.
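As a minimal sketch of this comparison, using the three-color-channel variant, the count P might be computed as follows. The concrete values of the preset change value and of the threshold P1 are illustrative assumptions, not taken from the embodiments:

```python
def content_changed(frame_a, frame_b, delta, p1):
    """Return True if two presentation frames show different content.

    frame_a, frame_b: equal-size 2-D grids of (R, G, B) tuples.
    delta: preset per-position change value.
    p1: first preset threshold P1 on the number of changed positions.
    """
    p = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for (r1, g1, b1), (r2, g2, b2) in zip(row_a, row_b):
            # change at this position: sum of absolute channel differences
            change = abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
            if change > delta:
                p += 1
    return p > p1
```

The gray-value variant mentioned above would differ only in computing `change` as the absolute difference of two gray values.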
Optionally, in some embodiments, if P is greater than a second preset threshold P2 (P2 being less than P1), the video segmentation device may determine keywords from the later presentation content. For example, suppose the device determines that the number of positions at which the pixel value change between the presentation at time T1 and the presentation at time T2 (T2 being later than T1) exceeds the preset change value is greater than P2 but less than P1. In this case, the device may determine keywords from the presentation content at time T2.
As mentioned above, the voice information may further include at least one pause point. Optionally, in some embodiments, the at least one pause point used to determine the segmentation point may be all pause points from a start moment to the current moment. If the segmentation point determined in step 302 is the first segmentation point of the to-be-processed video, the start moment is the start moment of the to-be-processed video. If the segmentation point determined in step 302 is the k-th segmentation point of the to-be-processed video (k being a positive integer greater than or equal to 2), the start moment is the moment of the (k-1)-th segmentation point. Optionally, in other embodiments, the video segmentation device may instead determine the pause points within a time range containing the switching point, according to the moment of the switching point. For example, if the switching point is at time T1, the device may determine the pause points between time T1 - t and time T1 + t.
When the video segmentation device determines that the switching point coincides with one of the at least one pause point, it determines that the switching point is the segmentation point. When it determines that the switching point coincides with none of the at least one pause point, it determines that the pause point closest to the switching point is the segmentation point. The distance between a pause point and the switching point is the time difference between them. For example, suppose the switching point is at time T1, one of the pause points is at time T2, and the difference between T2 and T1 is t. If the difference between every other pause point and T1 is greater than t, the pause point at T2 is the segmentation point. If two of the pause points are at the same distance from the switching point, and that distance is smaller than the distance from any other pause point to the switching point, either of the two pause points may be determined as the segmentation point.
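The selection rule above can be sketched as a one-line minimum over the candidate pause points. Breaking a tie by taking the earlier pause point is an arbitrary assumption here; the embodiments allow either of two equidistant pause points:

```python
def pick_segment_point(switch_t, pause_points):
    """Pick the segmentation point given a switching time and pause times.

    switch_t: time of the switching point (seconds).
    pause_points: non-empty list of pause-point times (seconds).
    Returns the pause point equal to, or closest in time to, the switching
    point; ties are broken toward the earlier pause point.
    """
    return min(pause_points, key=lambda p: (abs(p - switch_t), p))
```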
The video segmentation device determining the segmentation point of the to-be-processed video according to the content description information and the voice information may include: the device determining the segmentation point according to the voice information, the keywords of the content description information, and the pause points in the voice information.
Optionally, in some possible implementations, the to-be-processed video may be divided into multiple voice information segments. A first voice information segment and a second voice information segment are two consecutive segments among them, the first voice information segment following the second voice information segment. The device may determine a first segmentation point, which is one of the at least one segmentation point of the to-be-processed video, according to the first voice information segment, the second voice information segment, the keywords of the content description information, and the pause points in the voice information.
The video segmentation device may extract text fields from the voice information using a window of length W and a step size S. The device may thereby cut out at least one text field of length W; each text field of length W is one voice information segment.
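A minimal sketch of this windowing, treating the transcript as a plain string: the embodiments leave the unit of W and S open (characters versus words), so windowing over characters here is an assumption:

```python
def window_segments(text, w, s):
    """Cut a transcript into overlapping text fields of length w, step s.

    text: the transcribed voice information as a string.
    Returns the list of voice information segments (the last window may
    run past the end and is truncated by slicing).
    """
    return [text[i:i + w] for i in range(0, max(len(text) - w, 0) + 1, s)]
```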
The video segmentation device may determine whether the first voice information segment is similar to the second voice information segment. If the first voice information segment is not similar to the second voice information segment, the device may determine that a segmentation point of the to-be-processed video lies near the first voice information segment. If the second voice information segment is similar to the first voice information segment, the device continues by determining whether a third voice information segment, adjacent to and following the first voice information segment, is similar to the first voice information segment.
Similarity may serve as the criterion for whether the first voice information segment and the second voice information segment are similar. If the similarity between the two segments is greater than or equal to a similarity threshold, they may be considered similar; if the similarity is less than the threshold, they may be considered not similar.
Optionally, in some possible implementations, the video segmentation device may determine the similarity between the first voice information segment and the second voice information segment according to the keywords of the first voice information segment, the keywords of the second voice information segment, the content of the first voice information segment, the content of the second voice information segment, and the keywords of the content description information.
The video segmentation device may determine the keywords of the first voice information segment. Suppose the number of keywords determined from the first voice information segment is N and the number of keywords determined from the content description information is M; there are no duplicates between the M keywords and the N keywords.
The video segmentation device may determine keywords in the following manner:
Step 1: According to a preset stop-word list, or according to the part of speech of each word in the text, remove words that carry no actual meaning, such as "的" ("of"), "这个" ("this"), and "然后" ("then"). Stop words are manually entered characters or words, not automatically generated ones; they carry no actual meaning and are filtered out before or after natural-language data is processed. A set of stop words may be called a stop-word list.
Step 2: Count the frequency with which each remaining word appears in the text. The frequency of each word may be determined according to the following formula:
TF(n) = N(n) / All_N,    Formula 1.1
where TF(n) is the frequency in the text of the n-th remaining word after step 1, N(n) is the number of times the n-th word appears, and All_N is the total number of remaining words.
Step 3: Determine the word or words with the highest frequency as the keywords of the text.
For example, if the text is the content description information, the M most frequent words may be determined as its keywords, M being a positive integer greater than or equal to 1. If the text is the first voice information segment, the N most frequent words may be determined as its keywords, N being a positive integer greater than or equal to 1. If one or more of those N words coincide with keywords of the content description information, the duplicates are removed from the N words and the next most frequent words are selected instead as keywords of the first voice information segment. For example, suppose N equals 2, M equals 1, and the keywords of the content description information include "student". If the most frequent word in the first voice information segment is "student", the device continues with the second most frequent word. If that word is "school", "school" may be determined as one keyword of the first voice information segment, and the device continues with the third most frequent word. If that word is "course", "course" may be determined as the other keyword of the first voice information segment. If the text is the second voice information segment, the N most frequent words may likewise be determined as its keywords; if one or more of them coincide with keywords of the content description information, the duplicates are removed and the next most frequent words are selected instead as keywords of the second voice information segment.
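Steps 1 through 3 can be sketched as follows. The stop-word list here is a tiny illustrative stand-in for a real preset list, and `exclude` carries the keywords already taken from another text so that, as in the example above, duplicates are skipped in favor of the next most frequent word:

```python
from collections import Counter

STOP_WORDS = {"的", "这个", "然后", "the", "a", "then"}  # illustrative only

def top_keywords(words, k, exclude=()):
    """Return (keywords, tf) for the k most frequent non-stop words.

    Step 1: drop stop words.  Step 2: compute TF(n) = N(n) / All_N per
    Formula 1.1.  Step 3: take the highest-frequency words, skipping any
    word in `exclude` (keywords already chosen from another text).
    """
    kept = [w for w in words if w not in STOP_WORDS]
    total = len(kept)
    counts = Counter(kept)
    ranked = [w for w, _ in counts.most_common() if w not in exclude]
    keywords = ranked[:k]
    tf = {w: counts[w] / total for w in keywords}  # TF per Formula 1.1
    return keywords, tf
```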
Optionally, in some embodiments, the video segmentation device may determine a first keyword vector according to the keywords of the first voice information segment, the keywords of the content description information, and the content of the first voice information segment. Specifically, the device may determine the frequency with which the keywords of the first voice information segment and the keywords of the content description information appear in the content of the first voice information segment; those frequencies form the first keyword vector. The content of a voice information segment means all the words included in the segment. For example, suppose the keyword of the content description information is "student" and the keywords of the first voice information segment are "course" and "school". If these three keywords appear in the first voice information segment with frequencies 0.1, 0.2, and 0.3 respectively, the first keyword vector is (0.3, 0.2, 0.1).
Similarly, the device may determine a second keyword vector according to the keywords of the second voice information segment, the keywords of the content description information, and the content of the second voice information segment. Specifically, the device may determine the frequency with which the keywords of the second voice information segment and the keywords of the content description information appear in the content of the second voice information segment; those frequencies form the second keyword vector. For example, suppose the keyword of the content description information is "student" and the keywords of the second voice information segment are "breakfast" and "nutrition". If these three keywords appear in the second voice information segment with frequencies 0.3, 0.25, and 0.05 respectively, the second keyword vector is (0.3, 0.25, 0.05).
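Building such a keyword vector is a direct frequency count over the segment's words, for example:

```python
def keyword_vector(keywords, segment_words):
    """Frequency of each keyword within a voice information segment.

    keywords: the segment's own keywords plus the content description
    keywords, in the order the vector components should follow.
    segment_words: all words of the segment (its "content").
    """
    total = len(segment_words)
    return [segment_words.count(k) / total for k in keywords]
```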
If the distance that the video segmentation device determines from the first keyword vector and the second keyword vector is greater than a preset distance, the similarity between the first voice information segment and the second voice information segment may be considered less than the similarity threshold. In this case, the device determines the segmentation point according to the first voice information segment.
The video segmentation device may determine a distance from the first keyword vector and the second keyword vector in the following manner:
Step 1: Expand the first keyword vector into a first vector and the second keyword vector into a second vector, where the keywords corresponding to the first vector and the keywords corresponding to the second vector comprise the keywords of the first voice information segment, the keywords of the second voice information segment, and the keywords of the content description information, with no duplicates among the keywords corresponding to either vector.
For example, suppose the first keyword vector is (0.3, 0.2, 0.1) with corresponding keywords "school", "course", and "student", and the second keyword vector is (0.3, 0.25, 0.05) with corresponding keywords "student", "breakfast", and "nutrition". In this case, the first vector is (0.3, 0.1, 0, 0.2, 0) and the second vector is (0, 0.3, 0.25, 0, 0.05), both over the keywords "school", "student", "breakfast", "course", "nutrition".
Step 2: Compute the distance between the first vector and the second vector. That distance is the distance determined from the first keyword vector and the second keyword vector.
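The two steps can be sketched as follows. The merged keyword order used here (the first segment's keys, then any unseen keys of the second) differs from the example's ordering, which is harmless: the embodiments only require that both expanded vectors share one de-duplicated keyword list, and the distance is order-independent:

```python
import math

def expand_and_distance(vec1, keys1, vec2, keys2):
    """Expand two keyword vectors onto a shared keyword list (Step 1)
    and return the Euclidean distance between them (Step 2)."""
    merged = list(dict.fromkeys(keys1 + keys2))  # union, duplicates removed
    f1 = dict(zip(keys1, vec1))
    f2 = dict(zip(keys2, vec2))
    a = [f1.get(k, 0.0) for k in merged]  # first vector, zeros filled in
    b = [f2.get(k, 0.0) for k in merged]  # second vector, zeros filled in
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```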
Optionally, in some embodiments, the distance between the first vector and the second vector may be the Euclidean distance. Two consecutive voice information segments may share very few keywords; if the cosine distance were used, many zero values could therefore appear in the computation. Selecting the Euclidean distance as the distance between the first vector and the second vector may thus be more suitable.
Optionally, in other embodiments, the distance between the first vector and the second vector may be the cosine distance.
Besides using the word-frequency vectors of two adjacent voice information segments, other methods may also be used to determine whether two voice information segments are similar.
For example, the first keyword vector and the second keyword vector may also be built from term frequency-inverse document frequency, binary term frequency, and so on. Determining the distance between the first keyword vector and the second keyword vector may be determining their n-norm distance (n being a positive integer greater than or equal to 1), or determining their relative-entropy distance.
Taking the first vector (0.3, 0.1, 0, 0.2, 0) and the second vector (0, 0.3, 0.25, 0, 0.05) above as an example again, the two vectors may be binarized, giving (1, 1, 0, 1, 0) and (0, 1, 1, 0, 1). The 1-norm distance is then computed, yielding the repetition degree between the keywords of the first voice information segment and the keywords of the second voice information segment. The repetition degree of keywords may be regarded as a special form of distance, and may be used to determine whether the first voice information segment and the second voice information segment are similar: if the repetition degree is greater than or equal to a preset repetition degree, the two segments may be considered similar; if it is less than the preset repetition degree, they may be considered not similar. It can be seen that in this case the preset repetition degree acts as the similarity threshold.
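The binarize-then-1-norm computation sketched above is short enough to show directly; note that the 1-norm of the binarized vectors counts positions where exactly one of the two segments uses a keyword:

```python
def binary_l1(vec1, vec2):
    """Binarize two equal-length keyword vectors and return their
    1-norm distance, used as the repetition-degree measure above."""
    b1 = [1 if x > 0 else 0 for x in vec1]
    b2 = [1 if x > 0 else 0 for x in vec2]
    return sum(abs(x - y) for x, y in zip(b1, b2))
```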
Optionally, in other embodiments, keyword extraction may also be determined by term frequency-inverse document frequency. The term frequency may be determined by Formula 1.1. The inverse document frequency may be determined according to the following formula:
IDF(n) = log(Num_Doc / (Doc(n) + 1)),    Formula 1.2
where IDF(n) is the inverse document frequency of the n-th word, Num_Doc is the total number of documents in the corpus, and Doc(n) is the number of documents in the corpus containing the n-th word.
The term frequency-inverse document frequency may be determined according to the following formula:
TF-IDF(n) = TF(n) × IDF(n),    Formula 1.3
where TF-IDF(n) is the term frequency-inverse document frequency of the n-th word. If keywords are determined by term frequency-inverse document frequency, the first keyword vector is composed of the keywords' term frequency-inverse document frequency values.
When keywords are determined by term frequency-inverse document frequency, meaningless words need not be removed first.
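Formulas 1.1 through 1.3 compose as follows. The natural logarithm is an assumption here, since the embodiments do not fix the log base:

```python
import math
from collections import Counter

def tf_idf(doc_words, corpus):
    """Score every word of a document by TF-IDF per Formulas 1.1-1.3.

    doc_words: list of words in the document being scored.
    corpus: list of documents, each itself a list of words.
    """
    counts = Counter(doc_words)
    total = len(doc_words)
    num_doc = len(corpus)
    scores = {}
    for w in counts:
        tf = counts[w] / total                     # Formula 1.1
        doc_n = sum(1 for d in corpus if w in d)   # documents containing w
        idf = math.log(num_doc / (doc_n + 1))      # Formula 1.2
        scores[w] = tf * idf                       # Formula 1.3
    return scores
```

No stop-word filtering is needed beforehand, consistent with the remark above: very common words receive low IDF and thus low scores.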
Optionally, in other embodiments, keyword extraction may also be based on the word-graph text ranking (TextRank) method. If keywords are determined by word-graph TextRank, the first keyword vector may be composed of the words' weights.
When the first voice information segment and the second voice information segment are not similar, the video segmentation device may determine the segmentation point according to the first voice information segment.
The video segmentation device may first determine whether the first voice information segment includes a pause point. If the segment includes exactly one pause point, that pause point may be determined as the segmentation point. If the segment includes multiple pause points, the device may determine whether the word after each pause point is a preset word. The preset words include conjunctions with segmenting meaning, such as "next", "in the following", and "the next point". The word after a pause point is the word adjacent to and following the pause point. If the word after only one of the pause points is a preset word, that pause point may be determined as the segmentation point. If the words after at least two of the pause points are preset words, the one of those pause points with the longest pause duration may be determined as the segmentation point. If none of the words after the pause points is a preset word, the pause point with the longest pause duration may be determined as the segmentation point.
If the first voice information segment includes no pause point, the segmentation point may be determined from the pause points adjacent to the first voice information segment. It can be understood that there may be two such adjacent pause points, one before the first voice information segment and one after it. The device may determine the segmentation point according to the distances from these two pause points to the first voice information segment. If a pause point precedes the segment, its distance to the segment may be the number of words or the time difference between the pause point and the start of the segment; if a pause point follows the segment, its distance may be the number of words or the time difference between the pause point and the end of the segment. For ease of description, the adjacent pause point before the first voice information segment is called the front pause point, and its distance to the segment is called distance 1; the adjacent pause point after the segment is called the rear pause point, and its distance to the segment is called distance 2.
If distance 1 is less than distance 2, the front pause point may be determined as the segmentation point; if distance 1 is greater than distance 2, the rear pause point may be determined as the segmentation point. If distance 1 equals distance 2, the word after the front pause point (word 1) and the word after the rear pause point (word 2) may be determined: if word 1 is a preset word and word 2 is not, the front pause point is determined as the segmentation point; if word 1 is not a preset word and word 2 is, the rear pause point is determined as the segmentation point; if word 1 and word 2 are both preset words or neither is a preset word, the one of the front and rear pause points with the longer pause duration may be determined as the segmentation point.
As mentioned above, a pause point is a natural pause of the speaker and therefore has a certain duration. Optionally, in some embodiments, if a pause point is determined to be a segmentation point, the middle moment of the pause may be determined as the segmentation point. Optionally, in other embodiments, the end moment of the pause may be determined as the segmentation point. Optionally, in still other embodiments, the start moment of the pause may be determined as the segmentation point.
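The case where the first voice information segment contains one or more pause points can be sketched as follows. This is one reading of the rules above, under stated assumptions: each pause is a `(start, end, following_word)` tuple, the preset-word list is illustrative, and the middle-moment convention is chosen among the three options:

```python
def pick_pause_point(pauses, preset_words=("next", "in the following")):
    """Choose the segmentation pause point inside a voice segment.

    pauses: non-empty list of (start, end, following_word) tuples for the
    pause points inside the segment.  Returns the chosen pause's middle
    moment (one of the optional conventions above).
    """
    if len(pauses) == 1:
        chosen = pauses[0]
    else:
        flagged = [p for p in pauses if p[2] in preset_words]
        if len(flagged) == 1:
            chosen = flagged[0]
        elif flagged:  # several preset-word pauses: take the longest
            chosen = max(flagged, key=lambda p: p[1] - p[0])
        else:          # no preset-word pause: take the longest overall
            chosen = max(pauses, key=lambda p: p[1] - p[0])
    return (chosen[0] + chosen[1]) / 2
```

The no-pause case (choosing between the front and rear adjacent pause points by distance, then by preset word, then by duration) would follow the same pattern but is omitted for brevity.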
303. The video segmentation device segments the to-be-processed video according to the segmentation point.
If the segmentation point is the first segmentation point of the to-be-processed video, the start moment of the segment is the start moment of the to-be-processed video and the end moment of the segment is the segmentation point. If the segmentation point is the k-th segmentation point of the to-be-processed video (k being a positive integer greater than or equal to 2), the start moment of the segment is the (k-1)-th segmentation point and the end moment of the segment is the k-th segmentation point.
After a segment is determined, the video segmentation device may further determine a summary of the segment.
304. The video segmentation device may determine a summary of the segment according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of a target text. The target text includes at least one of the presentation and the content description information.
Optionally, in some embodiments, the video segmentation device may first determine a third keyword vector and then determine the summary of the segment according to the third keyword vector.
The device may determine the third keyword vector according to the content of the segment's voice information, the keywords of the segment's voice information, and the keywords of the target text, where the content of the segment's voice information means all the sentences making up the voice information of the segment.
It is understood that if the text information includes only the presentation, the target text includes the presentation; if the text information includes only the content description information, the target text includes the content description information; if the text information includes both the presentation and the content description information, the target text includes both.
The manner in which the video segmentation device determines the keywords of the segment's voice information, and the manner in which it determines the keywords of the target text, are similar to the manner in which it determines the keywords of the first voice information segment.
可选的,在一些实施例中,若该视频分段装置比较该演示文稿在不同时刻(或者不同帧)在相同位置的像素值的变化超过预设变化值的个数P大于第二预设阈值P2(P2小于P1),则该视频分段装置可以根据在后的演示文稿,确定该目标文本的关键词。例如,该视频分段装置确定T1时刻的演示文稿和T2时刻的演示文稿(T2时刻晚于T1时刻)在相同位置的像素值的变化超过该预设变化值的个数大于P2且小于P1。在此情况下,该视频分段装置可以根据T2时刻的演示文稿确定该目标文本的关键词。Optionally, in some embodiments, if, when the video segmentation device compares the presentation at different times (or in different frames), the number P of positions at which the pixel-value change exceeds a preset change value is greater than a second preset threshold P2 (P2 is smaller than P1), the video segmentation device may determine the keywords of the target text according to the later presentation. For example, the video segmentation device determines that, between the presentation at time T1 and the presentation at time T2 (time T2 is later than time T1), the number of positions at which the pixel-value change exceeds the preset change value is greater than P2 and less than P1. In this case, the video segmentation device may determine the keywords of the target text according to the presentation at time T2.
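As an illustrative sketch (not part of the original disclosure), the pixel-change comparison above can be expressed as follows; frames are modeled as 2-D lists of grayscale values, and all function names and thresholds are hypothetical:

```python
def count_changed_pixels(frame_a, frame_b, change_threshold):
    """Count positions whose pixel-value change between two frames
    exceeds the preset change value."""
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            if abs(pa - pb) > change_threshold:
                count += 1
    return count

def should_refresh_keywords(frame_a, frame_b, change_threshold, p2, p1):
    """Return True when the change count P lies in (P2, P1): the slide
    changed enough to refresh the target-text keywords from the later
    frame, but not enough to count as a full slide switch (P >= P1)."""
    p = count_changed_pixels(frame_a, frame_b, change_threshold)
    return p2 < p < p1
```

With P2 = 1 and P1 = 3, a frame pair in which exactly two pixels change significantly would trigger a keyword refresh under this sketch.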
例如,假设从该演示文稿确定出的关键词数目为L,从该内容描述信息中确定的关键词数目为M,从该分段语音信息中确定的关键词数目为Q,该L个关键词、该M个关键词和该Q个关键词中没有重复的关键词。For example, suppose that the number of keywords determined from the presentation is L, the number of keywords determined from the content description information is M, and the number of keywords determined from the segmented voice information is Q; there are no duplicate keywords among the L keywords, the M keywords, and the Q keywords.
具体地,该视频分段装置可以先从该内容描述信息中确定M个关键词,然后确定该演示文稿中出现频率最高的L个词。如果该L个词中的一个或多个词也属于该M个关键词,则将该一个或多个词从该L个词中删除,然后继续从该演示文稿中确定出现频率次高的词,直到确定出的L个关键词和该M个关键词没有交集。在此之后,该视频分段装置从该分段语音信息中确定出Q个词。如果该Q个词中的一个或多个词属于该M个关键词或该L个关键词,则将该一个或多个词从该Q个词中删除,然后继续从该分段语音信息中确定出现频率次高的词,直到确定出的Q个关键词与L个关键词和M个关键词都没有交集。Specifically, the video segmentation device may first determine M keywords from the content description information, and then determine the L words that appear most frequently in the presentation. If one or more of the L words also belong to the M keywords, the one or more words are deleted from the L words, and the words with the next-highest frequency continue to be determined from the presentation, until the determined L keywords have no intersection with the M keywords. After that, the video segmentation device determines Q words from the segmented voice information. If one or more of the Q words belong to the M keywords or the L keywords, the one or more words are deleted from the Q words, and the words with the next-highest frequency continue to be determined from the segmented voice information, until the determined Q keywords have no intersection with either the L keywords or the M keywords.
该第三关键词向量包括该Q个关键词、该L个关键词和该M个关键词在该分段语音信息中出现的频率。可以理解的是,如果该目标文本中不包括该内容描述信息,则M的值为0;如果该目标文本中不包括该演示文稿,则L的值为0。The third keyword vector includes the Q keywords, the L keywords and the frequency of the M keywords in the segmented voice information. It is understandable that if the target text does not include the content description information, the value of M is 0; if the target text does not include the presentation, the value of L is 0.
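The deduplicated keyword selection and the construction of the third keyword vector described above can be sketched as follows. This is an illustrative simplification, assuming pre-tokenized word lists; the function names are hypothetical and not from the disclosure:

```python
from collections import Counter

def top_keywords(words, k, exclude=()):
    """Pick the k most frequent words not already in `exclude`,
    mirroring the dedup rule: keep drawing the next-most-frequent
    word until the k picks are disjoint from earlier keyword sets."""
    excluded = set(exclude)
    counts = Counter(w for w in words if w not in excluded)
    return [w for w, _ in counts.most_common(k)]

def third_keyword_vector(segment_words, description_words, slide_words, m, l, q):
    """Return the combined keyword list and, as the third keyword vector,
    the frequency of each of the M + L + Q keywords in the segmented speech."""
    kw_m = top_keywords(description_words, m)
    kw_l = top_keywords(slide_words, l, exclude=kw_m)
    kw_q = top_keywords(segment_words, q, exclude=kw_m + kw_l)
    seg_counts = Counter(segment_words)
    keywords = kw_m + kw_l + kw_q
    return keywords, [seg_counts[w] for w in keywords]
```

If the target text lacks the content description information or the presentation, M or L is simply set to 0, and the corresponding sub-list is empty.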
该视频分段装置可以根据确定的该第三关键词向量,确定该分段的摘要。The video segmentation device may determine the summary of the segment according to the determined third keyword vector.
具体地,该视频分段装置可以根据该目标文本与该分段语音信息的内容,确定参考文本,其中该参考文本包括J个句子,J为大于或等于1的正整数;根据该分段语音信息的关键词、该目标文本的关键词和该J个句子中的每个句子,确定J个关键词向量;根据该第三关键词向量和该J个关键词向量,确定该分段的摘要。该J个关键词向量中的第j个关键词向量是该分段语音信息的关键词和该目标文本的关键词在第j个句子中出现的频率。Specifically, the video segmentation device may determine the reference text according to the content of the target text and the segmented voice information, where the reference text includes J sentences, and J is a positive integer greater than or equal to 1; according to the segmented voice The keywords of the information, the keywords of the target text, and each sentence in the J sentences, determine J keyword vectors; determine the abstract of the segment according to the third keyword vector and the J keyword vectors . The j-th keyword vector in the J keyword vectors is the frequency of occurrence of the keywords of the segmented voice information and the keywords of the target text in the j-th sentence.
在该目标文本中包括冗余的句子的情况下,将该目标文本中的该冗余的句子删除,得到修正目标文本并将该修正目标文本与该分段语音信息的内容合并,得到该参考文本;在该目标文本不包括该冗余的句子的情况下,将该目标文本与该分段语音信息的内容合并,得到该参考文本。换句话说,在该目标文本包括该演示文稿和该内容描述信息的情况下,该演示文稿中的一个或多个句子可能在该内容描述信息中也出现。在此情况下,将该演示文稿中与该内容描述信息相同的一个或多个句子删除,然后将删除了冗余的句子的演示文稿、内容描述信息和该分段语音信息的内容合并,得到该参考文本。如果该目标文本中不包括冗余的句子,例如该演示文稿中的任一个句子在该内容描述信息中均未出现,或者该目标文本中仅包括该演示文稿和该内容描述信息中的一个,则可以直接将该目标文本与该分段语音信息的内容进行合并,得到该参考文本。In the case that the target text includes redundant sentences, the redundant sentences in the target text are deleted to obtain a revised target text, and the revised target text is merged with the content of the segmented voice information to obtain the reference text; in the case that the target text does not include redundant sentences, the target text is merged with the content of the segmented voice information to obtain the reference text. In other words, when the target text includes the presentation and the content description information, one or more sentences in the presentation may also appear in the content description information. In this case, the one or more sentences in the presentation that are the same as those in the content description information are deleted, and then the presentation with the redundant sentences deleted, the content description information, and the content of the segmented voice information are merged to obtain the reference text. If the target text does not include redundant sentences, for example, no sentence in the presentation appears in the content description information, or the target text includes only one of the presentation and the content description information, the target text may be directly merged with the content of the segmented voice information to obtain the reference text.
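The merge-with-deduplication step can be sketched as below. Exact sentence-level matching is an assumption made for illustration; the disclosure does not fix a particular matching rule:

```python
def build_reference_text(slide_sentences, description_sentences, segment_sentences):
    """Merge the target text with the segment's speech content, dropping
    slide sentences that duplicate sentences in the content description."""
    seen = set(description_sentences)
    deduped_slides = [s for s in slide_sentences if s not in seen]
    return description_sentences + deduped_slides + segment_sentences
```

When either the presentation or the content description information is absent, the corresponding input list is empty and the function degenerates to a plain concatenation, matching the no-redundancy case above.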
该视频分段装置根据该第三关键词向量和该J个关键词向量,确定该分段的摘要,包括:该视频分段装置根据该第三关键词向量和该J个关键词向量,确定J个距离,其中该J个距离中的第j个距离是根据该第三关键词向量和该J个关键词向量中的第j个关键词向量确定的,j为大于或等于1且小于或等于J的正整数;确定该J个距离中距离最短的R个距离,R为大于或等于1且小于J的正整数;确定该分段的摘要,其中该分段的摘要包括与该R个距离对应的句子。该视频分段装置根据该第三关键词向量和第j个关键词向量确定第j个距离的具体实现方式与该视频分段装置根据该第一关键词向量和该第二关键词向量确定距离的实现方式类似,区别在于:根据该第三关键词向量和第j个关键词向量确定的第j个距离是欧氏距离;根据该第一关键词向量和该第二关键词向量确定的距离可以是欧氏距离,也可以是余弦距离。该第三关键词向量和第j个关键词向量确定的第j个距离不可以是余弦距离的原因是在计算余弦距离时会对第j个关键词向量进行归一化。但是第j个关键词向量的模长度恰好反映了关键词在句子j中的整体频率,因此不能被归一化。The video segmentation device determines the summary of the segment according to the third keyword vector and the J keyword vectors, including: the video segmentation device determines J distances according to the third keyword vector and the J keyword vectors, where the j-th distance among the J distances is determined according to the third keyword vector and the j-th keyword vector among the J keyword vectors, and j is a positive integer greater than or equal to 1 and less than or equal to J; determines the R shortest distances among the J distances, where R is a positive integer greater than or equal to 1 and less than J; and determines the summary of the segment, where the summary of the segment includes the sentences corresponding to the R distances. The specific manner in which the video segmentation device determines the j-th distance according to the third keyword vector and the j-th keyword vector is similar to the manner in which it determines the distance according to the first keyword vector and the second keyword vector, with the following difference: the j-th distance determined according to the third keyword vector and the j-th keyword vector is a Euclidean distance, whereas the distance determined according to the first keyword vector and the second keyword vector may be either a Euclidean distance or a cosine distance. The reason why the j-th distance determined from the third keyword vector and the j-th keyword vector cannot be a cosine distance is that computing the cosine distance normalizes the j-th keyword vector; however, the norm of the j-th keyword vector reflects exactly the overall frequency of the keywords in sentence j, and therefore must not be normalized.
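A minimal sketch of the Euclidean selection above: compute the J distances against the segment-level keyword vector and keep the R closest sentences as the summary (names are illustrative, not from the disclosure):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two keyword-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_summary_sentences(third_vec, sentence_vecs, r):
    """Return the indices of the R sentences whose keyword vectors are
    closest (Euclidean) to the segment-level vector; these sentences
    form the segment summary."""
    dists = [(euclidean(third_vec, v), j) for j, v in enumerate(sentence_vecs)]
    dists.sort()
    return [j for _, j in dists[:r]]
```

Because Euclidean distance is used, a sentence in which the keywords occur with overall frequency close to that of the whole segment ranks high, which is exactly the property the cosine distance would normalize away.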
上述向量(例如第一关键词向量、第二关键词向量、第三关键词向量和第j个关键词向量)都是关键词在特定文本中出现的频率(即词频)。在另一些实施例中,上述向量也可以根据词到向量(word to vector,word2vec)确定的词向量确定。例如,第一关键词向量可以通过以下步骤确定:利用word2vec确定每个关键词的词向量;将所有关键词的词向量相加后取平均,得到该第一关键词向量。第二关键词向量和第一关键词向量的确定方式类似,在此就不必赘述。又如,第三关键词向量可以通过以下步骤确定:利用word2vec确定每个关键词的词向量;确定每个关键词的词频;根据每个关键词的词频,对全部关键词的词向量取加权平均,得到该第三关键词向量。又如,第j个关键词向量可以通过以下步骤确定:对第j个句子进行分词和去除停用词;利用word2vec确定剩下的每个词的词向量;将所有词向量相加取平均,得到第j个关键词向量。在关键词向量是基于word2vec确定的情况下,第三关键词向量和第j个关键词向量之间的距离可以是余弦距离。The above vectors (for example, the first keyword vector, the second keyword vector, the third keyword vector, and the j-th keyword vector) are all frequencies (that is, word frequencies) of keywords in a specific text. In other embodiments, the above vectors may also be determined from word vectors produced by word-to-vector (word2vec). For example, the first keyword vector may be determined through the following steps: use word2vec to determine the word vector of each keyword; add the word vectors of all keywords and take the average to obtain the first keyword vector. The second keyword vector is determined in a manner similar to the first keyword vector, and details are not repeated here. For another example, the third keyword vector may be determined through the following steps: use word2vec to determine the word vector of each keyword; determine the word frequency of each keyword; and take a weighted average of the word vectors of all keywords according to the word frequency of each keyword to obtain the third keyword vector. For another example, the j-th keyword vector may be determined through the following steps: segment the j-th sentence into words and remove stop words; use word2vec to determine the word vector of each remaining word; and add all the word vectors and take the average to obtain the j-th keyword vector. In the case where the keyword vectors are determined based on word2vec, the distance between the third keyword vector and the j-th keyword vector may be a cosine distance.
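The word2vec-based variants (plain averaging, frequency-weighted averaging, cosine distance) can be sketched as below. A real implementation would obtain per-word vectors from a trained word2vec model; here the embeddings are assumed to be given as plain lists:

```python
import math

def mean_vector(vectors):
    """Plain average of word vectors (e.g. for the first keyword vector)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def weighted_mean_vector(vectors, weights):
    """Word-frequency-weighted average (e.g. for the third keyword vector)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def cosine_distance(u, v):
    """1 - cosine similarity; acceptable here because word2vec-based
    vectors carry no frequency information in their norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Note the contrast with the word-frequency case above: once the averaging discards vector norms anyway, normalizing again inside the cosine distance loses nothing, so cosine distance becomes a valid choice.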
图4是根据本申请实施例提供的会议流程的示意图。Fig. 4 is a schematic diagram of a conference process provided according to an embodiment of the present application.
401,会议终端1向会议控制服务器传输音视频流1。401: The conference terminal 1 transmits audio and video stream 1 to the conference control server.
402,会议终端2向会议控制服务器传输音视频流2。402: The conference terminal 2 transmits the audio and video stream 2 to the conference control server.
403,会议终端3向会议控制服务器传输音视频流3。403: The conference terminal 3 transmits the audio and video stream 3 to the conference control server.
404,会议控制服务器确定主会场。404: The conference control server determines the main conference site.
假设会议控制服务器确定的主会场是会议终端1所在的会场。It is assumed that the main conference site determined by the conference control server is the conference site where the conference terminal 1 is located.
405,会议控制服务器将会议数据发送至会议终端2和会议终端3。405. The conference control server sends the conference data to the conference terminal 2 and the conference terminal 3.
406,会议终端2和会议终端3存储会议数据。406. The conference terminal 2 and the conference terminal 3 store conference data.
可选的,在一些实施例中,会议控制服务器也可以将会议数据发送至会议终端1,会议终端1也可以存储会议数据。Optionally, in some embodiments, the conference control server may also send conference data to the conference terminal 1, and the conference terminal 1 may also store the conference data.
407,会议控制服务器实时对音视频流1进行分段(即确定分段点)并提取各个分段的摘要。407. The conference control server segments the audio and video stream 1 in real time (that is, determines the segment point) and extracts a summary of each segment.
408,会议控制服务器将分段点和摘要发送至会议终端2和会议终端3。这样,会议终端2和会议终端3可以自主选择回看点播放回看视频。当然,在一些实现方式中,会议控制服务器也可以将分段点和摘要发送至会议终端1。408: The conference control server sends the segment point and summary to the conference terminal 2 and the conference terminal 3. In this way, the conference terminal 2 and the conference terminal 3 can independently select the review point to play the review video. Of course, in some implementations, the conference control server may also send the segment point and summary to the conference terminal 1.
图5是根据本申请实施例提供的视频分段方法的示意性流程图。Fig. 5 is a schematic flowchart of a video segmentation method provided according to an embodiment of the present application.
501,视频分段装置确定会议预定中是否包括会议内容相关文字。换句话说,视频分段装置可以确定该待处理视频是否包括内容描述信息。若确定结果为是(即该待处理视频包括内容描述信息),则执行步骤502;若确定结果为否(即该待处理视频不包括内容描述信息),则执行步骤503。501. The video segmentation device determines whether the meeting reservation includes text related to the meeting content. In other words, the video segmentation device can determine whether the to-be-processed video includes content description information. If the result of the determination is yes (that is, the video to be processed includes content description information), step 502 is executed; if the result of the determination is no (that is, the video to be processed does not include content description information), then step 503 is executed.
502,该视频分段装置提取该会议内容相关文字的关键词。换句话说,该视频分段装置确定该内容描述信息的关键词。502. The video segmentation device extracts keywords related to the content of the conference. In other words, the video segmentation device determines the keywords of the content description information.
在确定了该内容描述信息的关键词后,可以执行步骤503。After the keywords of the content description information are determined, step 503 may be executed.
503,该视频分段装置确定待处理视频中是否有屏幕展示演示文稿。换句话说,该视频 分段装置可以确定该待处理视频是否包括演示文稿,且该演示文稿是通过屏幕展示的。若确定结果为是(即该待处理视频包括演示文稿),则执行步骤504。若确定结果为否(即该待处理视频不包括演示文稿),则执行步骤505。504,该视频分段装置确定用于展示该演示文稿的屏幕的位置。该视频分段装置在确定了该屏幕的位置之后,可以执行步骤506。503. The video segmentation device determines whether there is a screen display presentation in the to-be-processed video. In other words, the video segmentation device can determine whether the to-be-processed video includes a presentation, and the presentation is displayed on the screen. If the determination result is yes (that is, the to-be-processed video includes a presentation), step 504 is executed. If the determination result is no (that is, the to-be-processed video does not include a presentation), step 505 is executed. 504, the video segmentation device determines the position of the screen for displaying the presentation. After the video segmentation device determines the position of the screen, step 506 may be executed.
505,该视频分段装置确定是否有通过辅流传输的演示文稿。换句话说,在一些可能的实现方式中,会议发言人可能不会通过屏幕展示演示文稿,但是会通过辅流将演示文稿上传至会议控制服务器。其他会场中的会议终端可以根据该辅流获取该会议发言人在发言过程中使用的演示文稿。若确定结果为是(即有通过辅流传输的演示文稿),则执行步骤506。若确定结果为否(即没有通过辅流传输的演示文稿),则可以根据语音信息,确定该待处理视频的分段点。505. The video segmentation device determines whether there is a presentation transmitted through an auxiliary stream. In other words, in some possible implementations, the conference speaker may not display the presentation on the screen, but upload the presentation to the conference control server through the auxiliary stream. The conference terminal in the other conference site can obtain the presentation used by the conference speaker in the speech process according to the auxiliary stream. If the determination result is yes (that is, there is a presentation transmitted through the auxiliary stream), step 506 is executed. If the result of the determination is no (that is, there is no presentation transmitted through the auxiliary stream), the segmentation point of the to-be-processed video can be determined according to the voice information.
506,该视频分段装置确定上一分段点到当前时刻的时长是否超过第一预设时长。若该视频分段装置确定上一分段点到当前时刻的时长大于该第一预设时长(即确定结果为是),则执行步骤507。若该视频分段装置确定上一分段点到当前时刻的时长不大于该第一预设时长,则执行步骤508。可以理解的是,若该视频分段装置确定的分段点是第一个分段点,则上一分段点是指待处理视频的起始时刻。为了便于描述,可以将上一分段点到当前时刻的时长称为演示时长。506. The video segmentation device determines whether the duration from the previous segment point to the current moment exceeds a first preset duration. If the video segmentation device determines that the duration from the previous segment point to the current moment is greater than the first preset duration (that is, the determination result is yes), step 507 is performed. If the video segmentation device determines that the duration from the previous segment point to the current moment is not greater than the first preset duration, step 508 is performed. It can be understood that, if the segment point determined by the video segmentation device is the first segment point, the previous segment point refers to the start time of the video to be processed. For ease of description, the duration from the previous segment point to the current moment may be referred to as the presentation duration.
507,该视频分段装置根据内容描述信息和语音信息,确定该待处理视频的分段点。507. The video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information.
508,该视频分段装置根据演示文稿和语音信息,确定该待处理视频的分段点。该视频分段装置根据该演示文稿和语音信息,确定该待处理视频的分段点的具体实现方式,可以参考图3所示的实施例,在此就不必赘述。508. The video segmentation device determines the segmentation point of the video to be processed according to the presentation and voice information. For the specific implementation manner of the video segmentation device determining the segmentation point of the to-be-processed video based on the presentation and voice information, reference may be made to the embodiment shown in FIG. 3, and it is unnecessary to repeat it here.
该视频分段装置在确定了该待处理视频的分段点后,可以执行步骤509和步骤510。The video segmentation apparatus may perform step 509 and step 510 after determining the segmentation point of the video to be processed.
509,该视频分段装置确定分段语音信息以及该分段语音信息的关键词,分段语音信息是在分段点的上一个分段点和该分段点之间的语音信息。可以理解的是,若该分段点是待处理视频的第一个分段点,则该分段语音信息是该待处理视频的起始时刻到该分段点之间的语音信息。509. The video segmentation device determines segmented voice information and keywords of the segmented voice information, where the segmented voice information is the voice information between the previous segment point of the segment point and the segment point. It is understandable that if the segment point is the first segment point of the video to be processed, the segment voice information is the voice information from the start time of the video to be processed to the segment point.
510,该视频分段装置根据该分段语音信息,该分段语音信息的关键词和目标文本的关键词,确定分段摘要。步骤509和510的具体实现方式可以参考图3所示的实施例,在此就不必赘述。510. The video segmentation device determines a segmented summary according to the segmented voice information, keywords of the segmented voice information, and keywords of the target text. For the specific implementation of steps 509 and 510, reference may be made to the embodiment shown in FIG. 3, and details are not required here.
可以理解的是,在另一些可能的实现方式中,该视频分段装置在对视频进行分段和提取摘要的过程中,可以先确定待处理视频中是否有通过屏幕展示的演示文稿,然后再确定会议预定中是否包括会议内容相关文字,最后再确定是否有通过辅流传输的演示文稿。在另一些可能的实现方式中,该视频分段装置还可以先确定是否有通过辅流传输的演示文稿,然后再确定会议预定中是否包括会议内容相关文字,最后再确定待处理视频中是否有通过屏幕展示的演示文稿。It can be understood that, in other possible implementations, during the process of segmenting the video and extracting summaries, the video segmentation device may first determine whether the to-be-processed video includes a presentation displayed on a screen, then determine whether the meeting reservation includes text related to the meeting content, and finally determine whether there is a presentation transmitted through the auxiliary stream. In still other possible implementations, the video segmentation device may first determine whether there is a presentation transmitted through the auxiliary stream, then determine whether the meeting reservation includes text related to the meeting content, and finally determine whether the to-be-processed video includes a presentation displayed on a screen.
下面将结合图6对该视频分段装置如何根据内容描述信息和语音信息,确定该待处理视频的分段点进行描述。此外,该视频分段装置如何根据该语音信息,确定该待处理视频的分段点的实现方式也可以参见图6。How the video segmentation device determines the segmentation point of the video to be processed according to the content description information and the voice information will be described below in conjunction with FIG. 6. In addition, how the video segmentation apparatus determines the segmentation point of the to-be-processed video according to the voice information can also refer to FIG. 6.
图6是根据本申请实施例提供的一种视频分段的方法的示意性流程图。Fig. 6 is a schematic flowchart of a method for video segmentation according to an embodiment of the present application.
601,该视频分段装置以窗口长度W和步长S,持续在该语音信息上截取语音信息片段。601. The video segmentation device continuously intercepts voice information segments on the voice information with the window length W and the step size S.
602,该视频分段装置提取每个语音信息片段的关键词。具体地,该视频分段装置从每个语音信息片段中提取N个关键词。602. The video segmentation device extracts keywords for each voice information segment. Specifically, the video segmentation device extracts N keywords from each voice information segment.
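Steps 601 and 602 can be sketched as a sliding window over time-stamped speech, with window length W and step S. The (timestamp, text) representation is an assumption made for illustration:

```python
def speech_windows(utterances, window_len, step):
    """Slide a window of length W over time-stamped utterances with step S.

    `utterances` is a list of (start_time, text) pairs, with window
    boundaries in the same (arbitrary) time unit. Each returned list is
    one voice information segment from which N keywords would be drawn."""
    if not utterances:
        return []
    end = utterances[-1][0]
    windows, t = [], 0.0
    while t <= end:
        windows.append([txt for ts, txt in utterances if t <= ts < t + window_len])
        t += step
    return windows
```

Choosing S smaller than W makes consecutive windows overlap, so a topic change falling near a window boundary is still reflected in the distance between adjacent windows' keyword vectors.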
如果该视频分段装置提取过内容描述信息的关键词,则在步骤602之后可以执行步骤603;若该视频分段装置没有提取过内容描述信息的关键词,则在步骤602之后可以执行步骤604。该视频分段装置提取过内容描述信息的关键词意味着该视频分段装置确定该待处理视频包括内容描述信息。在此情况下,该视频分段装置确定的分段点是根据内容描述信息和语音信息确定的。该视频分段装置没有提取过内容描述信息的关键词意味着该视频分段装置确定该待处理视频不包括内容描述信息。在此情况下,该视频分段装置确定的分段点是根据语音信息确定的。If the video segmentation device has extracted keywords of the content description information, step 603 may be performed after step 602; if the video segmentation device has not extracted keywords of the content description information, step 604 may be performed after step 602. The fact that the video segmentation device has extracted keywords of the content description information means that it has determined that the to-be-processed video includes content description information. In this case, the segment point determined by the video segmentation device is determined according to the content description information and the voice information. The fact that the video segmentation device has not extracted keywords of the content description information means that it has determined that the to-be-processed video does not include content description information. In this case, the segment point determined by the video segmentation device is determined according to the voice information.
603,该视频分段装置确定第i个语音信息片段中的关键词和该内容描述信息的关键词在第i个语音信息片段中的词频向量C_i。603. The video segmentation device determines the keyword in the i-th voice information segment and the word frequency vector C_i of the keyword of the content description information in the i-th voice information segment.
用于确定第i个语音信息片段中的关键词的方法可以参见图3所示的实施例。具体地,可以参考图3所示实施例中确定该第一语音信息片段的关键词的确定方法,在此就不必赘述。用于确定该内容描述信息的关键词的方法可以参见图3所示的实施例,在此就不必赘述。该视频分段装置确定第i个语音信息片段中的关键词和该内容描述信息的关键词在第i个语音信息片段中的词频向量的实现方式可以参见图3所示实施例中第一关键词向量的确定方式,在此就不必赘述。For the method for determining the keywords in the i-th voice information segment, refer to the embodiment shown in FIG. 3. Specifically, refer to the method for determining the keywords of the first voice information segment in the embodiment shown in FIG. 3; details are not repeated here. For the method for determining the keywords of the content description information, refer to the embodiment shown in FIG. 3; details are not repeated here. For the manner in which the video segmentation device determines the word frequency vector, in the i-th voice information segment, of the keywords in the i-th voice information segment and the keywords of the content description information, refer to the manner of determining the first keyword vector in the embodiment shown in FIG. 3; details are not repeated here.
604,该视频分段装置确定第i个语音信息片段中的关键词在第i个语音信息片段中的词频向量C_i。第i个语音信息片段中的关键词在第i个语音信息片段中的词频向量的确定方式与第i个语音信息片段中的关键词和该内容描述信息的关键词在第i个语音信息片段中的词频向量C_i确定方式类似,在此就不必赘述。604. The video segmentation device determines the word frequency vector C_i of the keyword in the i-th voice information segment in the i-th voice information segment. The method of determining the word frequency vector of the keyword in the i-th voice information segment in the i-th voice information segment and the keyword in the i-th voice information segment and the keywords of the content description information are in the i-th voice information segment The method for determining the word frequency vector C_i in is similar, so there is no need to repeat it here.
该视频分段装置在执行了步骤603或步骤604之后,可以依次执行步骤605和步骤606。After performing step 603 or step 604, the video segmentation apparatus may perform step 605 and step 606 in sequence.
605,该视频分段装置确定C_i和C_(i-1)之间的距离。C_(i-1)是该视频分段装置确定的第i-1个语音信息片段的关键词(或者第i-1个语音信息片段的关键词和该内容描述信息的关键词)在第i-1个语音信息片段中的词频向量。第i-1个语音信息片段是第i个语音信息片段之前的一个语音信息片段。605. The video segmentation device determines the distance between C_i and C_(i-1). C_(i-1) is the word frequency vector, determined by the video segmentation device, of the keywords of the (i-1)-th voice information segment (or the keywords of the (i-1)-th voice information segment and the keywords of the content description information) in the (i-1)-th voice information segment. The (i-1)-th voice information segment is the voice information segment immediately before the i-th voice information segment.
606,若C_i和C_(i-1)之间的距离大于预设距离,则可以确定分段点位于第i个语音信息片段前后。该视频分段装置在确定出分段点位于第i个语音信息片段前后的情况下,可以根据停顿点确定该分段点。该视频分段装置根据停顿点确定分段点的具体实现方式可以参考图3所示的实施例,在此就不必赘述。606: If the distance between C_i and C_(i-1) is greater than the preset distance, it may be determined that the segment point is located before and after the i-th voice information segment. When the video segmentation device determines that the segment point is located before and after the i-th voice information segment, the segment point can be determined according to the pause point. For the specific implementation manner of the video segmentation device determining the segmentation point according to the pause point, reference may be made to the embodiment shown in FIG. 3, and it is not necessary to repeat it here.
若C_i和C_(i-1)之间的距离小于或等于该预设距离,则可以认为分段点不在第i个语音信息片段和第i-1个语音信息片段中。在此情况下,可以继续确定下一个语音信息片段的词频向量和第i个语音信息片段的词频向量。If the distance between C_i and C_(i-1) is less than or equal to the preset distance, it can be considered that the segment point is not in the i-th voice information segment and the i-1-th voice information segment. In this case, the word frequency vector of the next speech information segment and the word frequency vector of the i-th speech information segment can be determined continuously.
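Steps 603 to 606 hinge on comparing consecutive word-frequency vectors against a preset distance. A minimal sketch follows; the Euclidean distance, the keyword list, and the threshold value are all assumptions for illustration:

```python
import math

def detect_boundary(prev_counts, cur_counts, keywords, preset_distance):
    """Build C_(i-1) and C_i over a shared keyword list and report whether
    their distance exceeds the preset distance, which suggests a segment
    point near the i-th voice information segment."""
    c_prev = [prev_counts.get(w, 0) for w in keywords]
    c_cur = [cur_counts.get(w, 0) for w in keywords]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c_prev, c_cur)))
    return dist > preset_distance
```

When the distance stays at or below the threshold, the loop simply advances to the next window, exactly as described above; when it exceeds the threshold, the precise segment point is then refined using the pause points.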
图7是根据本申请实施例提供的视频分段装置的结构框图。如图7所示,视频分段装置700包括获取单元701和处理单元702。Fig. 7 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application. As shown in FIG. 7, the video segmentation device 700 includes an acquisition unit 701 and a processing unit 702.
获取单元701,用于获取待处理视频的文本信息和该待处理视频的语音信息,其中该文本信息包括该待处理视频中的演示文稿和该待处理视频的内容描述信息中的至少一个。The acquiring unit 701 is configured to acquire text information of a video to be processed and voice information of the video to be processed, where the text information includes at least one of a presentation in the video to be processed and content description information of the video to be processed.
处理单元702,用于根据该文本信息和该语音信息,确定该待处理视频的分段点。The processing unit 702 is configured to determine the segment point of the video to be processed according to the text information and the voice information.
处理单元702,还用于根据该分段点,对该待处理视频进行分段。The processing unit 702 is further configured to segment the to-be-processed video according to the segmentation point.
获取单元701和处理单元702的具体功能和有益效果可以参见图3至图6所示的方法,在此就不再赘述。The specific functions and beneficial effects of the acquiring unit 701 and the processing unit 702 can be referred to the methods shown in FIG. 3 to FIG. 6, which will not be repeated here.
图8是根据本申请实施例提供的视频分段装置的结构框图。图8所示的视频分段装置800包括:处理器801、存储器802和收发器803。Fig. 8 is a structural block diagram of a video segmentation device provided according to an embodiment of the present application. The video segmentation device 800 shown in FIG. 8 includes a processor 801, a memory 802, and a transceiver 803.
处理器801、存储器802和收发器803之间通过内部连接通路互相通信,传递控制和/或数据信号。The processor 801, the memory 802, and the transceiver 803 communicate with each other through internal connection paths, and transfer control and/or data signals.
上述本申请实施例揭示的方法可以应用于处理器801中,或者由处理器801实现。处理器801可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器801中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器801可以是通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存取存储器(random access memory,RAM)、闪存、只读存储器(read-only memory,ROM)、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器802,处理器801读取存储器802中的指令,结合其硬件完成上述方法的步骤。The method disclosed in the foregoing embodiment of the present application may be applied to the processor 801 or implemented by the processor 801. The processor 801 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 801 or instructions in the form of software. The aforementioned processor 801 may be a general-purpose processor, a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (ASIC), a ready-made programmable gate array (field programmable gate array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. 
The software module may be located in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory, an electrically erasable programmable memory, a register, or another storage medium mature in the art. The storage medium is located in the memory 802; the processor 801 reads the instructions in the memory 802 and completes the steps of the above method in combination with its hardware.
可选的,在一些实施例中,存储器802可以存储用于执行如图3至图6所示方法中视频分段装置执行的方法的指令。处理器801可以执行存储器802中存储的指令结合其他硬件(例如收发器803)完成如图3至图6所示方法中视频分段装置的步骤,具体工作过程和有益效果可以参见图3至图6所示实施例中的描述。Optionally, in some embodiments, the memory 802 may store instructions for executing the method executed by the video segmentation apparatus in the methods shown in FIGS. 3 to 6. The processor 801 can execute the instructions stored in the memory 802 in combination with other hardware (such as the transceiver 803) to complete the steps of the video segmentation device in the method shown in FIG. 3 to FIG. 6. The specific working process and beneficial effects can be seen in FIGS. 3 to 6 shows the description in the embodiment.
本申请实施例还提供一种芯片,该芯片包括收发单元和处理单元。其中,收发单元可以是输入输出电路、通信接口;处理单元为该芯片上集成的处理器或者微处理器或者集成电路。该芯片可以执行上述方法实施例中视频分段装置的方法。An embodiment of the present application also provides a chip, which includes a transceiver unit and a processing unit. Among them, the transceiver unit may be an input/output circuit or a communication interface; the processing unit is a processor or microprocessor or integrated circuit integrated on the chip. The chip can execute the method of the video segmentation device in the above method embodiment.
本申请实施例还提供一种计算机可读存储介质,其上存储有指令,该指令被执行时执行上述方法实施例中视频分段装置的方法。The embodiment of the present application also provides a computer-readable storage medium on which an instruction is stored, and when the instruction is executed, the method of the video segmentation device in the foregoing method embodiment is executed.
本申请实施例还提供一种包含指令的计算机程序产品,该指令被执行时执行上述方法实施例中视频分段装置的方法。The embodiment of the present application also provides a computer program product containing instructions that, when executed, execute the method of the video segmentation device in the foregoing method embodiment.
本领域普通技术人员可以意识到,结合本申请中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。A person of ordinary skill in the art may realize that the units and algorithm steps described in the examples in combination with the embodiments disclosed in this application can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (24)

  1. A video segmentation method, comprising:
    obtaining, by a video segmentation apparatus, text information of a to-be-processed video and voice information of the to-be-processed video, wherein the text information comprises at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video;
    determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information; and
    segmenting, by the video segmentation apparatus, the to-be-processed video according to the segmentation point.
  2. The method according to claim 1, wherein in a case in which the text information comprises the presentation, the determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information comprises:
    determining a switching point of the presentation, wherein content presented by the presentation differs before and after the switching point;
    determining at least one pause point according to the voice information; and
    determining the segmentation point according to the switching point and the at least one pause point.
  3. The method according to claim 2, wherein the determining the segmentation point according to the switching point and the at least one pause point comprises:
    in a case in which the switching point is the same as one of the at least one pause point, determining the switching point as the segmentation point; and
    in a case in which none of the at least one pause point is the same as the switching point, determining, as the segmentation point, a pause point that is in the at least one pause point and that is closest to the switching point.
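Claims 2 and 3 together define a simple snapping rule: the slide-switching moment itself becomes the segmentation point when it coincides with a pause in the speech, and otherwise the nearest pause point is used. A minimal illustrative sketch of that rule (the function name and the float-seconds representation are assumptions; the claims do not prescribe any API):

```python
def choose_segment_point(switch_t, pause_ts):
    """Snap a slide-switching moment to a speech pause (claims 2-3).

    If the switching time coincides with a pause point, it is used
    directly; otherwise the pause point nearest in time is chosen.
    """
    if switch_t in pause_ts:
        return switch_t
    # No exact match: pick the pause point closest to the switch.
    return min(pause_ts, key=lambda p: abs(p - switch_t))
```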
  4. The method according to claim 2 or 3, wherein the determining a switching point of the presentation comprises: determining, as the switching point, a moment at which a switching signal used to instruct to switch the content of the presentation is obtained.
  5. The method according to any one of claims 2 to 4, wherein the text information further comprises the content description information, and before the determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information, the method further comprises:
    determining that a presentation duration of a current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
  6. The method according to claim 1, wherein in a case in which the text information comprises the content description information, the determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information comprises:
    determining the segmentation point of the to-be-processed video according to the voice information, a keyword of the content description information, and a pause point in the voice information.
  7. The method according to claim 6, wherein the voice information comprises a first voice information segment and a second voice information segment, the second voice information segment being a voice information segment that precedes and is adjacent to the first voice information segment, and
    the determining the segmentation point of the to-be-processed video according to the voice information, a keyword of the content description information, and a pause point in the voice information comprises:
    determining a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information, and the pause point in the voice information, wherein the segmentation point of the to-be-processed video comprises the first segmentation point.
  8. The method according to claim 7, wherein the determining a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information, and the pause point in the voice information comprises:
    determining a similarity between the first voice information segment and the second voice information segment according to a keyword of the first voice information segment, a keyword of the second voice information segment, content of the first voice information segment, content of the second voice information segment, and the keyword of the content description information;
    determining that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and
    determining the first segmentation point according to the pause point in the voice information.
  9. The method according to claim 8, wherein the pause point in the voice information comprises a pause point within the first voice information segment or a pause point adjacent to the first voice information segment, and the determining the first segmentation point according to the pause point in the voice information comprises:
    determining the first segmentation point according to at least one of a quantity of pause points within the first voice information segment, a quantity of pause points adjacent to the first voice information segment, a pause duration, and a word adjacent to a pause point.
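Claims 7 to 9 describe a two-stage decision: first, the similarity between two adjacent speech segments is compared against a threshold; only when it falls below the threshold is a segmentation point then placed using pause features. One plausible sketch of the similarity stage, in which words shared by both segments are weighted more heavily if they also appear in the content description (the formula, the weights, and the threshold value are assumptions; claim 8 only requires *some* similarity measure over these inputs):

```python
def keyword_similarity(seg_a_words, seg_b_words, topic_words):
    """Similarity between two adjacent speech segments (claim 8).

    Overlap of the segments' words, with words that also appear in
    the content description counted double.
    """
    a, b = set(seg_a_words), set(seg_b_words)
    if not a and not b:
        return 1.0
    weight = lambda w: 2.0 if w in topic_words else 1.0
    shared = sum(weight(w) for w in a & b)
    total = sum(weight(w) for w in a | b)
    return shared / total

SIM_THRESHOLD = 0.3  # hypothetical value; claim 8 only requires a threshold

def is_topic_boundary(seg_a_words, seg_b_words, topic_words):
    # Below-threshold similarity marks a candidate segmentation
    # point, whose exact position is then refined from pause
    # features (claim 9).
    return keyword_similarity(seg_a_words, seg_b_words, set(topic_words)) < SIM_THRESHOLD
```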
  10. The method according to any one of claims 6 to 9, wherein the text information further comprises the presentation, and before the determining, by the video segmentation apparatus, a segmentation point of the to-be-processed video according to the text information and the voice information, the method further comprises:
    determining that a presentation duration of a current page of the presentation is greater than a first preset duration; or
    determining that a presentation duration of a current page of the presentation is less than or equal to a second preset duration.
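The preconditions of claims 5 and 10 are complementary: they route between the two segmentation paths based on how long the current slide stays on screen. A hypothetical selector making that routing explicit (the interpretation that a "normal" dwell time favors the slide-switch path is an assumption drawn from the two preconditions; names and thresholds are illustrative):

```python
def pick_strategy(page_duration, first_preset, second_preset):
    """Route between the two segmentation paths (claims 5 and 10).

    When a slide is shown for a "normal" time (more than
    second_preset, at most first_preset), slide switches are a
    trustworthy cue and the presentation-based path of claims 2-5
    applies; otherwise fall back to the speech-and-keyword path of
    claims 6-10.
    """
    if second_preset < page_duration <= first_preset:
        return "presentation"
    return "speech"
```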
  11. The method according to any one of claims 1 to 10, further comprising: determining, by the video segmentation apparatus, an abstract of a segment according to content of segment voice information, a keyword of the segment voice information, and a keyword of a target text, wherein the target text comprises at least one of the presentation and the content description information.
  12. The method according to any one of claims 1 to 11, wherein the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is voice information of the real-time video stream from a start moment of the real-time video stream or a previous segmentation point to a current moment.
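Claim 12 restricts the speech considered at any moment of a live stream to the window running from the stream start, or the most recent segmentation point, up to now. A toy accumulator illustrating that windowing (class and method names are hypothetical; a real system would hold audio buffers rather than timestamps):

```python
class LiveSegmenter:
    """Tracks the speech window described in claim 12: everything
    from the last segmentation point (or the stream start) to the
    current moment."""

    def __init__(self):
        self.window_start = 0.0  # stream start
        self.segments = []       # closed (start, end) windows

    def on_segment_point(self, t):
        # Close the current window and open a new one at t.
        self.segments.append((self.window_start, t))
        self.window_start = t

    def current_window(self, now):
        # The span of speech the segmenter may consult right now.
        return (self.window_start, now)
```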
  13. A video segmentation apparatus, comprising:
    an obtaining unit, configured to obtain text information of a to-be-processed video and voice information of the to-be-processed video, wherein the text information comprises at least one of a presentation in the to-be-processed video and content description information of the to-be-processed video; and
    a processing unit, configured to determine a segmentation point of the to-be-processed video according to the text information and the voice information,
    wherein the processing unit is further configured to segment the to-be-processed video according to the segmentation point.
  14. The video segmentation apparatus according to claim 13, wherein the processing unit is specifically configured to: in a case in which the text information comprises the presentation, determine a switching point of the presentation according to the text information and the voice information, wherein content presented by the presentation differs before and after the switching point;
    determine at least one pause point according to the voice information; and
    determine the segmentation point according to the switching point and the at least one pause point.
  15. The video segmentation apparatus according to claim 14, wherein the processing unit is specifically configured to:
    in a case in which the switching point is the same as one of the at least one pause point, determine the switching point as the segmentation point; and
    in a case in which none of the at least one pause point is the same as the switching point, determine, as the segmentation point, a pause point that is in the at least one pause point and that is closest to the switching point.
  16. The video segmentation apparatus according to claim 14 or 15, wherein the processing unit is specifically configured to determine, as the switching point, a moment at which a switching signal used to instruct to switch the content of the presentation is obtained.
  17. The video segmentation apparatus according to any one of claims 14 to 16, wherein the processing unit is further configured to: in a case in which the text information further comprises the content description information, before determining the segmentation point of the to-be-processed video according to the text information and the voice information, determine that a presentation duration of a current page of the presentation is less than or equal to a first preset duration and greater than a second preset duration.
  18. The video segmentation apparatus according to claim 13, wherein the processing unit is specifically configured to: in a case in which the text information comprises the content description information, determine the segmentation point of the to-be-processed video according to the voice information, a keyword of the content description information, and a pause point in the voice information.
  19. The video segmentation apparatus according to claim 18, wherein the voice information comprises a first voice information segment and a second voice information segment, the second voice information segment being a voice information segment that precedes and is adjacent to the first voice information segment, and
    the processing unit is specifically configured to determine a first segmentation point according to the first voice information segment, the second voice information segment, the keyword of the content description information, and the pause point in the voice information, wherein the segmentation point of the to-be-processed video comprises the first segmentation point.
  20. The video segmentation apparatus according to claim 19, wherein the processing unit is specifically configured to: determine a similarity between the first voice information segment and the second voice information segment according to a keyword of the first voice information segment, a keyword of the second voice information segment, content of the first voice information segment, content of the second voice information segment, and the keyword of the content description information;
    determine that the similarity between the first voice information segment and the second voice information segment is less than a similarity threshold; and
    determine the first segmentation point according to the pause point in the voice information.
  21. The video segmentation apparatus according to claim 20, wherein the pause point in the voice information comprises a pause point within the first voice information segment or a pause point adjacent to the first voice information segment, and the processing unit is specifically configured to determine the first segmentation point according to at least one of a quantity of pause points within the first voice information segment, a quantity of pause points adjacent to the first voice information segment, a pause duration, and a word adjacent to a pause point.
  22. The video segmentation apparatus according to any one of claims 18 to 21, wherein the processing unit is further configured to: in a case in which the text information further comprises the presentation, before determining the segmentation point of the to-be-processed video according to the text information and the voice information, determine that a presentation duration of a current page of the presentation is greater than a first preset duration; or
    determine that a presentation duration of a current page of the presentation is less than or equal to a second preset duration.
  23. The video segmentation apparatus according to any one of claims 13 to 22, wherein the processing unit is further configured to determine an abstract of a segment according to content of segment voice information, a keyword of the segment voice information, and a keyword of a target text, wherein the target text comprises at least one of the presentation and the content description information.
  24. The video segmentation apparatus according to any one of claims 13 to 23, wherein the to-be-processed video is a real-time video stream, and the voice information of the to-be-processed video is voice information of the real-time video stream from a start moment of the real-time video stream or a previous segmentation point to a current moment.
PCT/CN2020/083397 2019-05-07 2020-04-05 Video segmentation method and video segmentation device WO2020224362A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910376477.2 2019-05-07
CN201910376477.2A CN111918145B (en) 2019-05-07 2019-05-07 Video segmentation method and video segmentation device

Publications (1)

Publication Number Publication Date
WO2020224362A1 true WO2020224362A1 (en) 2020-11-12

Family

ID=73051391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083397 WO2020224362A1 (en) 2019-05-07 2020-04-05 Video segmentation method and video segmentation device

Country Status (2)

Country Link
CN (1) CN111918145B (en)
WO (1) WO2020224362A1 (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187642A1 (en) * 2002-03-29 2003-10-02 International Business Machines Corporation System and method for the automatic discovery of salient segments in speech transcripts
CN102547139A (en) * 2010-12-30 2012-07-04 北京新岸线网络技术有限公司 Method for splitting news video program, and method and system for cataloging news videos
US20130028574A1 (en) * 2011-07-29 2013-01-31 Xerox Corporation Systems and methods for enriching audio/video recordings
WO2013097101A1 (en) * 2011-12-28 2013-07-04 华为技术有限公司 Method and device for analysing video file
CN104519401A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Video division point acquiring method and equipment
CN104540044A (en) * 2014-12-30 2015-04-22 北京奇艺世纪科技有限公司 Video segmentation method and device
CN106982344A (en) * 2016-01-15 2017-07-25 阿里巴巴集团控股有限公司 video information processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU YINGYING, ZHOU DONGRU: "Vision, Speech and Text for Video Segmentation", COMPUTER ENGINEERING AND APPLICATIONS, no. 3, 1 February 2001 (2001-02-01), pages 85 - 87, XP055752287, ISSN: 1002-8331 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114051154A (en) * 2021-11-05 2022-02-15 新华智云科技有限公司 News video strip splitting method and system
CN114363695A (en) * 2021-11-11 2022-04-15 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN114363695B (en) * 2021-11-11 2023-06-13 腾讯科技(深圳)有限公司 Video processing method, device, computer equipment and storage medium
CN114173191A (en) * 2021-12-09 2022-03-11 上海开放大学 Multi-language question answering method and system based on artificial intelligence
CN114173191B (en) * 2021-12-09 2024-03-19 上海开放大学 Multi-language answering method and system based on artificial intelligence
CN114245229A (en) * 2022-01-29 2022-03-25 北京百度网讯科技有限公司 Short video production method, device, equipment and storage medium
CN114245229B (en) * 2022-01-29 2024-02-06 北京百度网讯科技有限公司 Short video production method, device, equipment and storage medium
CN115209233A (en) * 2022-06-25 2022-10-18 平安银行股份有限公司 Video playing method and related device and equipment
CN115209233B (en) * 2022-06-25 2023-08-25 平安银行股份有限公司 Video playing method, related device and equipment
CN118012979A (en) * 2024-04-10 2024-05-10 济南宝林信息技术有限公司 Intelligent acquisition and storage system for common surgical operation

Also Published As

Publication number Publication date
CN111918145B (en) 2022-09-09
CN111918145A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
WO2020224362A1 (en) Video segmentation method and video segmentation device
US10497382B2 (en) Associating faces with voices for speaker diarization within videos
US10037313B2 (en) Automatic smoothed captioning of non-speech sounds from audio
WO2021109678A1 (en) Video generation method and apparatus, electronic device, and storage medium
US9672827B1 (en) Real-time conversation model generation
US20170371496A1 (en) Rapidly skimmable presentations of web meeting recordings
WO2023011094A1 (en) Video editing method and apparatus, electronic device, and storage medium
US11281707B2 (en) System, summarization apparatus, summarization system, and method of controlling summarization apparatus, for acquiring summary information
JP6339529B2 (en) Conference support system and conference support method
CN111050201A (en) Data processing method and device, electronic equipment and storage medium
US20190199939A1 (en) Suggestion of visual effects based on detected sound patterns
US9563704B1 (en) Methods, systems, and media for presenting suggestions of related media content
CN105590627A (en) Image display apparatus, method for driving same, and computer readable recording medium
JP2016046705A (en) Conference record editing apparatus, method and program for the same, conference record reproduction apparatus, and conference system
US20200403816A1 (en) Utilizing volume-based speaker attribution to associate meeting attendees with digital meeting content
US20200151208A1 (en) Time code to byte indexer for partial object retrieval
JP6690442B2 (en) Presentation support device, presentation support system, presentation support method, and presentation support program
CN113301382B (en) Video processing method, device, medium, and program product
US10123090B2 (en) Visually representing speech and motion
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
KR102226427B1 (en) Apparatus for determining title of user, system including the same, terminal and method for the same
US11128927B2 (en) Content providing server, content providing terminal, and content providing method
EP4322090A1 (en) Information processing device and information processing method
CN116088675A (en) Virtual image interaction method, related device, equipment, system and medium
KR20160055511A (en) Apparatus, method and system for searching video using rhythm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20802337; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20802337; Country of ref document: EP; Kind code of ref document: A1