CN114501112B - Method, apparatus, device, medium, and article for generating video notes


Info

Publication number
CN114501112B
CN114501112B
Authority
CN
China
Prior art keywords: note, video, audio, determining, real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210076832.6A
Other languages
Chinese (zh)
Other versions
CN114501112A (en)
Inventor
高炳楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210076832.6A priority Critical patent/CN114501112B/en
Publication of CN114501112A publication Critical patent/CN114501112A/en
Application granted granted Critical
Publication of CN114501112B publication Critical patent/CN114501112B/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/4334 Content storage operation; Recording operations
    • H04N21/4355 Processing of additional data involving reformatting operations of additional data, e.g. HTML pages on a television screen
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/4398 Processing of audio elementary streams involving reformatting operations of audio signals

Abstract

The present disclosure provides methods, apparatus, devices, media, and products for generating video notes, relating to the field of computer technology, and in particular to the field of speech processing technology. The specific implementation is as follows: acquiring a target video; determining a video note category corresponding to the target video; determining note audio based on the video note category; and generating a video note corresponding to the target video based on the note audio. This implementation can improve the degree of intelligence with which video notes are generated.

Description

Method, apparatus, device, medium, and article for generating video notes
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to the field of speech processing technology.
Background
At present, with the development of information technology, people often extract important information by taking notes when acquiring information, which facilitates subsequent review and reference.
In practice, more and more information is presented in video form, and for such information people often take notes by manually transcribing the video or manually capturing screenshots of it. However, this way of taking notes suffers from a low degree of intelligence.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, medium, and article for generating video notes.
According to an aspect of the present disclosure, there is provided a method for generating video notes, comprising: acquiring a target video; determining a video note category corresponding to the target video; determining note audio based on the video note category; and generating video notes corresponding to the target video based on the note audio.
According to another aspect of the present disclosure, there is provided an apparatus for generating video notes, comprising: a video acquisition unit configured to acquire a target video; a category determining unit configured to determine a video note category corresponding to the target video; an audio determining unit configured to determine note audio based on the video note category; and a note generation unit configured to generate a video note corresponding to the target video based on the note audio.
According to another aspect of the present disclosure, there is provided an electronic device including: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for generating video notes as any of the above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for generating video notes of any one of the above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method for generating video notes as in any of the above.
The technology of the present disclosure provides a method for generating video notes that can improve the degree of intelligence with which video notes are generated.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method for generating video notes according to the present disclosure;
FIG. 3 is a schematic illustration of one application scenario of a method for generating video notes according to the present disclosure;
FIG. 4 is a flow chart of another embodiment of a method for generating video notes according to the present disclosure;
FIG. 5 is a schematic structural diagram of one embodiment of an apparatus for generating video notes according to the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing a method for generating video notes in accordance with an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 can play the target video, detect an operation instruction triggered by man-machine interaction operation with a user in the playing process of the target video, and send the operation instruction to the server 105 through the network 104, so that the server 105 can generate a video note corresponding to the target video based on the operation instruction.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, cell phones, computers, tablets, etc. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple software programs or software modules (e.g., to provide distributed services), or as a single software program or software module. This is not specifically limited herein.
The server 105 may be a server that provides various services, for example, the server 105 may receive an operation instruction sent by the terminal devices 101, 102, 103 through the network 104, parse the operation instruction, determine a video note category corresponding to the operation instruction, determine note audio based on the video note category, generate a video note corresponding to the video based on the note audio, and return the video note to the terminal devices 101, 102, 103 through the network 104.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple software programs or software modules (e.g., to provide distributed services), or as a single software program or software module. This is not specifically limited herein.
It should be noted that, the method for generating a video note provided in the embodiment of the present disclosure may be performed by the terminal devices 101, 102, 103, or may be performed by the server 105, and the apparatus for generating a video note may be provided in the terminal devices 101, 102, 103, or may be provided in the server 105, which is not limited in the embodiment of the present disclosure.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating video notes according to the present disclosure is shown. The method for generating video notes of the embodiment comprises the following steps:
in step 201, a target video is acquired.
In this embodiment, the executing body (such as the terminal devices 101, 102, 103 or the server 105 in fig. 1) may acquire the target video from local storage or from other electronic devices to which it is connected, and may control the target video to be output in video-playing application software, so that during the playing of the target video the user can generate a video note for it through man-machine interaction. The target video may be a knowledge video that the user needs to learn, a work information video that the user needs to acquire during work, a conference video recorded during a conference, etc.; the specific content of the target video is not limited in this embodiment.
Step 202, determining a video note category corresponding to the target video.
In the present embodiment, the video note category may be used to describe the type of note generated for the video, and may include, but is not limited to, a complete video note, a partial video note, a real-time video note, a key content video note, etc., which is not limited in this embodiment.
The complete video note may be a note generated for all content of the target video, the partial video note may be a note generated for part of the content of the target video, the real-time video note may be a note generated at the current playing time of the target video, and the key content note may be a note generated for key content in the target video. Optionally, the key content for the key content note may be preset in a customized manner, so as to meet the user's need to acquire specified key content.
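As a minimal illustration only, these categories could be modeled as an enumeration; the class and member names below are hypothetical and simply mirror the categories described above.

from enum import Enum

class VideoNoteCategory(Enum):
    # Illustrative names for the note categories described in this embodiment.
    FULL = "full_video_note"                # notes for all content of the target video
    PARTIAL = "partial_video_note"          # notes for part of the content
    REAL_TIME = "real_time_video_note"      # notes for the current playing time
    KEY_CONTENT = "key_content_video_note"  # notes for preset key content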
In step 203, based on the video note category, note audio is determined.
In this embodiment, the execution body may determine, based on the video note category, the note audio corresponding to the target video, and use the note audio as the basis for generating the video note. The note audio may be audio extracted from a specified portion of the target video, audio input by the user, or audio obtained by applying some integration processing to the audio of a specified portion of the target video, which is not limited in this embodiment.
In some optional implementations of the present embodiment, determining the note audio based on the video note category may include: in response to determining that the video note category is the key content note, acquiring key content keywords; determining at least one video segment matching the key content keywords from the target video; taking the audio corresponding to the at least one video segment as initial note audio; and integrating the initial note audio based on the knowledge graph corresponding to the key content keywords to obtain the final note audio. With this alternative implementation, when a video note needs to be generated for the key content of a video, the executing entity can determine at least one video segment from the video based on the key content keywords and take the audio corresponding to that segment as the initial note audio. The initial note audio can then undergo integration processing such as extraction and summarization or supplementary explanation based on the knowledge graph; for example, associated content in the knowledge graph that is related to the knowledge covered by the initial note audio can serve as supplementary explanation content and be converted into corresponding audio output, so that the key content note can be generated quickly.
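A minimal sketch of this alternative implementation is given below; the segment structure, the keyword match on transcripts, and the knowledge-graph lookup are all assumptions made for illustration, not the claimed implementation itself.

def build_key_content_note_audio(segments, keyword, knowledge_graph):
    # segments: hypothetical list of (start, end, transcript, audio) tuples.
    # knowledge_graph: hypothetical mapping from keyword to associated content.
    matched = [seg for seg in segments if keyword in seg[2]]
    # Audio of the matching clips forms the initial note audio.
    initial_note_audio = [seg[3] for seg in matched]
    # Associated content from the knowledge graph serves as supplementary
    # explanation, to be converted into audio and appended later.
    supplements = knowledge_graph.get(keyword, [])
    return initial_note_audio, supplements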
Step 204, based on the note audio, generating a video note corresponding to the target video.
In this embodiment, the execution subject may directly take the note audio as the video note corresponding to the target video. Alternatively, the execution body may perform speech recognition on the note audio to obtain the text corresponding to the note audio, and take the text as the video note corresponding to the target video. Alternatively, the execution subject may associate the note audio with the text and use both together as the video note.
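For instance, converting note audio into note text could be sketched with an off-the-shelf recognizer; the use of the SpeechRecognition package and of Google's free recognizer below is an assumption for illustration, and any ASR service could stand in.

import speech_recognition as sr

def note_audio_to_text(wav_path: str, language: str = "zh-CN") -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole note audio file
    return recognizer.recognize_google(audio, language=language)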
In some alternative implementations of the present embodiment, the following steps may also be performed: the executing body stores the target video and the video notes in association.
In other optional implementations of this embodiment, storing the target video and the video note association may further include: and in response to determining that the video note type is the real-time video note, determining a video time point corresponding to the audio or text input by the user in real time, and storing the video time point in the target video in association with the real-time video note.
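Such an association could be kept in a record like the following; the field names are hypothetical and only reflect the elements named above (a video playing time point stored together with the user's real-time voice or text).

from dataclasses import dataclass
from typing import Optional

@dataclass
class RealTimeNoteEntry:
    play_time: float                   # video playing time point of the note
    voice_path: Optional[str] = None   # real-time speech input, if any
    text: Optional[str] = None         # real-time note text, if any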
With continued reference to fig. 3, a schematic diagram of one application scenario of the method for generating video notes according to the present disclosure is shown. In the application scenario of fig. 3, during the playing of the target video, the executing body may determine the video note category indicated by an operation instruction triggered by the user's man-machine interaction. Then, if the video note category is the complete video note category or the partial video note category, the whole video or a partial video 301 of the target video is selected, and the corresponding video audio 302 is determined as the basis for generating the video note 306. During the playing of the target video, the user may also pause playback and choose to record a real-time note whose video note category is the real-time video note category. The real-time note here may be speech 303 input by the user in real time, or text 305 input by the user in real time.
In addition, in the case that the video note category includes both the complete video note and the real-time video note, or both the partial video note and the real-time video note, the execution subject may fuse the speech 303 input by the user in real time with the video audio 302 according to the time sequence of the target video, so as to obtain fused audio, and then perform audio recognition 304 on the fused audio to obtain the video note 306. Alternatively, if text 305 input by the user in real time is present, the executing body may generate the video note 306 based on the audio recognition result of audio recognition 304 combined with that text 305.
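The fig. 3 flow could be sketched as follows, under the assumption that audio clips are carried as (video_time_point, audio) pairs and that recognize is a hypothetical ASR callable; none of these names come from the embodiment itself.

def generate_video_note(video_audio_clips, realtime_voice_clips,
                        realtime_texts, recognize):
    # Fuse video audio and real-time voice in target-video time order.
    fused = sorted(video_audio_clips + realtime_voice_clips,
                   key=lambda clip: clip[0])
    # Audio recognition on the fused audio, then combination with typed notes.
    recognized = [recognize(audio) for _, audio in fused]
    return "\n".join(recognized + realtime_texts)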
According to the method for generating video notes provided by the embodiments of the present disclosure, for a target video, the corresponding video note category can be determined, note audio can be determined based on that category, and a video note corresponding to the target video can be generated automatically based on the note audio, improving the degree of intelligence with which video notes are generated.
With continued reference to fig. 4, a flow 400 of another embodiment of a method for generating video notes according to the present disclosure is shown. As shown in fig. 4, the method for generating a video note of the present embodiment may include the steps of:
step 401, obtaining a target video.
Step 402, determining a video note category corresponding to the target video, wherein the video note category comprises at least one of the following: full video notes, partial video notes, real-time video notes.
In this embodiment, the complete video note may be a note generated for all video content of the target video, the partial video note may be a note generated for partial video content of the target video, and the real-time video note may be a note generated for the current playing time of the target video. The executing body can determine the video note category based on an operation instruction triggered by man-machine interaction operation between the user and the target video.
Optionally, when the video note category is determined to be the real-time video note based on an operation instruction triggered by the user's man-machine interaction with the target video, the execution subject may pause the playing of the target video and store the current video playing time point in association with the voice or text input by the user in real time.
For detailed descriptions of steps 401 to 402, please refer to the detailed descriptions of steps 201 to 202; they are not repeated here.
In step 403, in response to determining that the video note category is a complete video note, audio corresponding to the target video is determined to be note audio.
In this embodiment, if the video note category is a complete video note, the execution subject may extract audio corresponding to the complete video of the target video, and take the audio as the note audio.
In response to determining that the video note category is a partial video note, a target video clip is determined from the target video, step 404.
In this embodiment, if the video note category is a partial video note, the execution subject may determine a start time and an end time of the partial video from the target video, and extract a video clip between the start time and the end time as the target video clip.
Wherein the number of target video clips is at least one, and for each target video clip, a start time and an end time of the target video clip can be determined.
In some optional implementations of this embodiment, determining the target video clip from the target video may include: detecting an operation instruction triggered by touch operation of a user on the target video, and determining the moment of clicking the target video by the user as the starting moment of a target video segment in response to determining that the target video is clicked by the user; and in response to determining that the user clicks on the target video again, determining the time at which the user clicks on the target video again as the end time of one of the target video segments.
In other optional implementations of this embodiment, determining the target video clip from the target video may include: detecting an operation instruction triggered by the touch operation of the user on the target video, determining the starting time of the long press of the user as the starting time of one target video segment and the releasing time of the long press of the user as the ending time of one target video segment in response to determining the long press of the user on the target video.
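A sketch of the click-based selection described in the first of these implementations might look as follows; the class is illustrative, assuming only that the player reports the playing time of each click.

class SegmentSelector:
    def __init__(self):
        self.segments = []    # completed (start_time, end_time) pairs
        self._pending = None  # start time awaiting its matching end click

    def on_click(self, play_time: float):
        # The first click marks a segment start; the next click marks its end.
        if self._pending is None:
            self._pending = play_time
        else:
            self.segments.append((self._pending, play_time))
            self._pending = None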
And step 405, determining the audio corresponding to the target video clip as the note audio.
In this embodiment, the execution subject may use the audio corresponding to the at least one target video clip as the note audio.
In step 406, in response to determining that the video note category is a real-time video note, real-time speech is acquired.
In this embodiment, if the video note type is a real-time video note, the executing body may acquire real-time voice input by the user, and record a video playing time point in the target video at the time when the real-time voice is acquired.
Alternatively, the executing body may store the real-time voice in association with a video playing time point in the corresponding target video.
In step 407, the real-time speech is determined to be note audio.
In this embodiment, the execution subject may determine the real-time voice input by the user as the note audio.
In step 408, in response to determining that the video note category is a real-time video note, real-time note text is obtained.
In this embodiment, if the video note type is real-time video note, the executing body may further acquire real-time note text input by the user, and record a video playing time point in the target video at a time when the real-time note text is acquired.
Alternatively, the executing body may store the real-time note text in association with the video playing time point in the corresponding target video.
Step 409, converting the real-time note text into note audio.
In this embodiment, the execution body may convert the real-time note text into the note audio, as the generation basis of the video note.
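As one hedged example, the text-to-audio conversion could rely on an offline TTS engine such as pyttsx3; this choice is an assumption for illustration, and any speech synthesis service would serve equally well.

import pyttsx3

def note_text_to_audio(note_text: str, out_path: str = "note_audio.wav") -> str:
    engine = pyttsx3.init()
    engine.save_to_file(note_text, out_path)  # synthesize the note text to a file
    engine.runAndWait()
    return out_path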
Alternatively, the executing body may not perform the step of converting the real-time note text into the note audio, and may directly generate the video note based on the real-time note text, which is not limited in the embodiment.
Step 410, obtaining the category of the note language corresponding to the target video.
In this embodiment, the execution body may acquire the note language parameters configured for the target video and parse them to obtain the note language category. The note language category may be the language in which the user needs the video note; for example, it may be Chinese, English, etc., which is not limited in this embodiment.
In step 411, the note audio is converted into video notes of the note language category, the video notes including video notes in audio format and video notes in text format.
In this embodiment, the execution body may convert the above-mentioned note audio into a corresponding note text, and translate the note text to obtain a translated note text matching the note language category. After that, the execution body may take the translation note text as a video note, or may convert the translation note text into translation note audio, and take the translation note audio as a video note. The translation note text is a text format video note, and the translation note audio is an audio format video note.
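Taken together, the conversion described above amounts to a pipeline of speech recognition, translation, and speech synthesis; a sketch is shown below, reusing the hypothetical note_audio_to_text and note_text_to_audio helpers from the earlier examples and an assumed translate(text, target_language) callable.

def convert_note_language(note_audio_path: str, target_language: str, translate):
    note_text = note_audio_to_text(note_audio_path)           # speech recognition
    translated_text = translate(note_text, target_language)   # text-format video note
    translated_audio = note_text_to_audio(translated_text)    # audio-format video note
    return translated_text, translated_audio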
Alternatively, for the real-time note text, the execution body may not convert the real-time note text into the note audio, but directly translate the real-time note text to obtain the translated note text. And converting the translation note text into translation note audio.
In some alternative implementations of the present embodiment, the executing body may store video notes in audio format, video notes in text format, and target video associations.
In other optional implementations of this embodiment, if the video note category corresponding to the target video includes both the complete video note and the real-time video note, the note audio at this time may include the note audio corresponding to the complete video note and the note audio corresponding to the real-time video note. The execution body may determine a video playing time point associated with the note audio corresponding to the real-time video note, and insert the note audio corresponding to the real-time video note into the note audio corresponding to the complete video note based on the video playing time point.
In other optional implementations of this embodiment, if the video note category corresponding to the target video includes both a partial video note and a real-time video note, the note audio at this time may include note audio corresponding to the partial video note and note audio corresponding to the real-time video note. The execution body can determine a video playing time point associated with the note audio corresponding to the real-time video note, and based on the video playing time point and the video playing time point corresponding to the partial video note, fuse the note audio corresponding to the real-time video note with the note audio corresponding to the partial video note to obtain fused note audio. And converting the fused note audio into video notes of the note language type.
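A concrete way to fuse the two sets of note audio is sketched below, under the assumptions that each clip is a (video_play_time, wav_path) pair and that the pydub package is available; both are illustrative choices, not the claimed implementation.

from pydub import AudioSegment

def fuse_note_audio(partial_note_clips, realtime_note_clips):
    ordered = sorted(partial_note_clips + realtime_note_clips,
                     key=lambda clip: clip[0])
    fused = AudioSegment.empty()
    for _, wav_path in ordered:
        fused += AudioSegment.from_wav(wav_path)  # splice in playback order
    return fused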
The method for generating video notes provided in the above embodiments of the present disclosure can, for a complete video note, generate the video note directly based on the audio of the target video; for a partial video note, select at least one video clip of the target video and generate the video note based on the audio of that clip; and for a real-time video note, generate the video note based on voice or text input by the user, thereby improving the diversity of video note generation. Moreover, when generating a video note, a video note in the required note language category can be produced through language conversion, improving the practicality of video notes.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of an apparatus for generating video notes, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to an electronic device such as a terminal device, a server, or the like.
As shown in fig. 5, the apparatus 500 for generating a video note of the present embodiment includes: a video acquisition unit 501, a category determination unit 502, an audio determination unit 503, and a note generation unit 504.
The video acquisition unit 501 is configured to acquire a target video.
The category determination unit 502 is configured to determine a video note category corresponding to the target video.
The audio determination unit 503 is configured to determine note audio based on the video note category.
The note generation unit 504 is configured to generate a video note corresponding to the target video based on the note audio.
In some optional implementations of the present embodiment, the video note category includes at least one of: full video notes, partial video notes, real-time video notes.
In some optional implementations of the present embodiment, the audio determination unit 503 is further configured to: and in response to determining that the video note category is a complete video note, determining the audio corresponding to the target video as note audio.
In some optional implementations of the present embodiment, the audio determination unit 503 is further configured to: in response to determining that the video note category is a partial video note, determining a target video clip from the target video; and determining the audio corresponding to the target video clip as note audio.
In some optional implementations of the present embodiment, the audio determination unit 503 is further configured to: acquiring real-time voice in response to determining that the video note category is a real-time video note; real-time speech is determined to be note audio.
In some optional implementations of the present embodiment, the audio determination unit 503 is further configured to: acquiring real-time note text in response to determining that the video note category is real-time video notes; the real-time note text is converted to note audio.
In some optional implementations of the present embodiment, the note generation unit 504 is further configured to: acquiring a note language category corresponding to a target video; the note audio is converted to video notes of the note language class.
In some alternative implementations of the present embodiment, the video notes include video notes in audio format and video notes in text format.
It should be understood that the units 501 to 504 recited in the apparatus 500 for generating a video note correspond to the respective steps in the method described with reference to fig. 2. Thus, the operations and features described above with respect to the method for generating video notes are equally applicable to the apparatus 500 and the units contained therein, and are not repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as a method for generating video notes. For example, in some embodiments, the method for generating video notes may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the method for generating video notes described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method for generating video notes by any other suitable means (e.g., by means of firmware).
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the user's personal information comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A method for generating video notes, comprising:
acquiring a target video;
determining a video note category indicated by an operation instruction of a user as a video note category corresponding to the target video;
determining note audio based on the video note category;
generating a video note corresponding to the target video based on the note audio;
wherein the determining note audio based on the video note category comprises: in response to determining that the video note category is a key content note, acquiring key content keywords; determining at least one video segment matching the key content keywords from the target video; taking the audio corresponding to the at least one video segment as initial note audio; and integrating the initial note audio based on the knowledge graph corresponding to the key content keywords to obtain final note audio;
in response to determining that the video note type is a partial video note, determining the moment when a user clicks on a target video as the starting moment of a target video segment, determining the moment when the user clicks on the target video again as the ending moment of the target video segment, and determining the audio corresponding to the target video segment as note audio;
in response to determining that the video note category is a real-time video note, acquiring real-time voice input by a user, and determining the real-time voice as note audio;
and in response to determining that the video note category comprises a partial video note and a real-time video note, according to video playing time points corresponding to note audios corresponding to the partial video note and the real-time video note respectively in the target video, fusing the note audios corresponding to the partial video note and the real-time video note respectively to obtain fused note audios.
2. The method of claim 1, wherein the video note category comprises at least one of: full video notes, partial video notes, real-time video notes.
3. The method of claim 2, wherein the determining note audio based on the video note category comprises:
and in response to determining that the video note category is the complete video note, determining the audio corresponding to the target video as the note audio.
4. The method of claim 2, wherein the determining note audio based on the video note category comprises:
acquiring real-time note text in response to determining that the video note category is the real-time video note;
and converting the real-time note text into the note audio.
5. The method of claim 1, wherein the generating the video note corresponding to the target video based on the note audio comprises:
acquiring a note language category corresponding to the target video;
converting the note audio to the video notes of the note language category.
6. The method of claim 1, wherein the video notes include video notes in audio format and video notes in text format.
7. An apparatus for generating video notes, comprising:
a video acquisition unit configured to acquire a target video;
a category determining unit configured to determine a video note category indicated by an operation instruction of a user as a video note category corresponding to the target video;
an audio determining unit configured to determine note audio based on the video note category;
a note generation unit configured to generate a video note corresponding to the target video based on the note audio;
wherein the audio determining unit is further configured to: in response to determining that the video note category is a key content note, acquiring key content keywords; determining at least one video segment matching the key content keywords from the target video; taking the audio corresponding to the at least one video segment as initial note audio; and integrating the initial note audio based on the knowledge graph corresponding to the key content keywords to obtain final note audio;
in response to determining that the video note type is a partial video note, determining the moment when a user clicks on a target video as the starting moment of a target video segment, determining the moment when the user clicks on the target video again as the ending moment of the target video segment, and determining the audio corresponding to the target video segment as note audio;
in response to determining that the video note category is a real-time video note, acquiring real-time voice input by a user, and determining the real-time voice as note audio;
and in response to determining that the video note category comprises a partial video note and a real-time video note, according to video playing time points corresponding to note audios corresponding to the partial video note and the real-time video note respectively in the target video, fusing the note audios corresponding to the partial video note and the real-time video note respectively to obtain fused note audios.
8. The apparatus of claim 7, wherein the video note category comprises at least one of: full video notes, partial video notes, real-time video notes.
9. The apparatus of claim 8, wherein the audio determination unit is further configured to:
and in response to determining that the video note category is the complete video note, determining the audio corresponding to the target video as the note audio.
10. The apparatus of claim 8, wherein the audio determination unit is further configured to:
acquiring real-time note text in response to determining that the video note category is the real-time video note;
and converting the real-time note text into the note audio.
11. The apparatus of claim 7, wherein the note generation unit is further configured to:
acquiring a note language category corresponding to the target video;
converting the note audio to the video notes of the note language category.
12. The apparatus of claim 7, wherein the video notes comprise video notes in audio format and video notes in text format.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202210076832.6A 2022-01-24 2022-01-24 Method, apparatus, device, medium, and article for generating video notes Active CN114501112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210076832.6A CN114501112B (en) 2022-01-24 2022-01-24 Method, apparatus, device, medium, and article for generating video notes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210076832.6A CN114501112B (en) 2022-01-24 2022-01-24 Method, apparatus, device, medium, and article for generating video notes

Publications (2)

Publication Number Publication Date
CN114501112A CN114501112A (en) 2022-05-13
CN114501112B true CN114501112B (en) 2024-03-22

Family

ID: 81473144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210076832.6A Active CN114501112B (en) 2022-01-24 2022-01-24 Method, apparatus, device, medium, and article for generating video notes

Country Status (1)

Country Link
CN (1) CN114501112B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102664984A (en) * 2012-04-20 2012-09-12 上海合合信息科技发展有限公司 Voice note creating method and system
CN105957531A (en) * 2016-04-25 2016-09-21 上海交通大学 Speech content extracting method and speech content extracting device based on cloud platform
CN106653077A (en) * 2016-12-30 2017-05-10 网易(杭州)网络有限公司 Method and device for recording voice notes as well as readable storage medium
CN110381382A (en) * 2019-07-23 2019-10-25 腾讯科技(深圳)有限公司 Video takes down notes generation method, device, storage medium and computer equipment
CN111276018A (en) * 2020-03-24 2020-06-12 深圳市多亲科技有限公司 Network course recording method and device and terminal
CN111556371A (en) * 2020-05-20 2020-08-18 维沃移动通信有限公司 Note recording method and electronic equipment
CN112115301A (en) * 2020-08-31 2020-12-22 湖北美和易思教育科技有限公司 Video annotation method and system based on classroom notes
CN113099256A (en) * 2021-04-01 2021-07-09 读书郎教育科技有限公司 Method and system for playing back videos and adding voice notes in smart classroom

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10468018B2 (en) * 2017-12-29 2019-11-05 Dish Network L.L.C. Methods and systems for recognizing audio played and recording related video for viewing

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102664984A (en) * 2012-04-20 2012-09-12 上海合合信息科技发展有限公司 Voice note creating method and system
CN105957531A (en) * 2016-04-25 2016-09-21 上海交通大学 Speech content extracting method and speech content extracting device based on cloud platform
CN106653077A (en) * 2016-12-30 2017-05-10 网易(杭州)网络有限公司 Method and device for recording voice notes as well as readable storage medium
CN110381382A (en) * 2019-07-23 2019-10-25 腾讯科技(深圳)有限公司 Video takes down notes generation method, device, storage medium and computer equipment
CN111276018A (en) * 2020-03-24 2020-06-12 深圳市多亲科技有限公司 Network course recording method and device and terminal
CN111556371A (en) * 2020-05-20 2020-08-18 维沃移动通信有限公司 Note recording method and electronic equipment
WO2021233293A1 (en) * 2020-05-20 2021-11-25 维沃移动通信有限公司 Note recording method and electronic device
CN112115301A (en) * 2020-08-31 2020-12-22 湖北美和易思教育科技有限公司 Video annotation method and system based on classroom notes
CN113099256A (en) * 2021-04-01 2021-07-09 读书郎教育科技有限公司 Method and system for playing back videos and adding voice notes in smart classroom

Also Published As

Publication number Publication date
CN114501112A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
KR20210038449A (en) Question and answer processing, language model training method, device, equipment and storage medium
US10796224B2 (en) Image processing engine component generation method, search method, terminal, and system
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
WO2023142451A1 (en) Workflow generation methods and apparatuses, and electronic device
CN115982376A (en) Method and apparatus for training models based on text, multimodal data and knowledge
CN116150339A (en) Dialogue method, dialogue device, dialogue equipment and dialogue storage medium
CN115481227A (en) Man-machine interaction dialogue method, device and equipment
CN116894078A (en) Information interaction method, device, electronic equipment and medium
CN114501112B (en) Method, apparatus, device, medium, and article for generating video notes
CN116049370A (en) Information query method and training method and device of information generation model
CN113360672B (en) Method, apparatus, device, medium and product for generating knowledge graph
CN112632241A (en) Method, device, equipment and computer readable medium for intelligent conversation
CN114171063A (en) Real-time telephone traffic customer emotion analysis assisting method and system
CN113852835A (en) Live broadcast audio processing method and device, electronic equipment and storage medium
JP2022050309A (en) Information processing method, device, system, electronic device, storage medium, and computer program
US20230117749A1 (en) Method and apparatus for processing audio data, and electronic device
CN114818748B (en) Method for generating translation model, translation method and device
CN113111658B (en) Method, device, equipment and storage medium for checking information
CN115905490B (en) Man-machine interaction dialogue method, device and equipment
CN115965018B (en) Training method of information generation model, information generation method and device
CN114462364B (en) Method and device for inputting information
CN115828915B (en) Entity disambiguation method, device, electronic equipment and storage medium
CN114078478B (en) Voice interaction method and device, electronic equipment and storage medium
CN113836291B (en) Data processing method, device, equipment and storage medium
CN114281981B (en) News brief report generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant