CN113569705B - Scene segmentation point judging method, system, storage medium and electronic equipment - Google Patents

Scene segmentation point judging method, system, storage medium and electronic equipment

Info

Publication number
CN113569705B
CN113569705B (application CN202110835243.7A)
Authority
CN
China
Prior art keywords
video
sample
equal part
feature
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110835243.7A
Other languages
Chinese (zh)
Other versions
CN113569705A (en)
Inventor
胡郡郡
唐大闰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd filed Critical Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202110835243.7A priority Critical patent/CN113569705B/en
Publication of CN113569705A publication Critical patent/CN113569705A/en
Application granted granted Critical
Publication of CN113569705B publication Critical patent/CN113569705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a scene segmentation point judging method, system, storage medium and electronic equipment, wherein the scene segmentation point judging method comprises the following steps: a video feature acquisition step: dividing a video to obtain a plurality of video equal parts, extracting features from each video equal part through a deep learning pre-training model, and obtaining the first video feature corresponding to each video equal part; a model processing step: inputting the plurality of first video features into an attention-based video scene segmentation point judgment model for processing to obtain the classification probability corresponding to each video equal part; and a judging step: judging the classification probability of each video equal part against a threshold value to determine the scene segmentation points. The invention realizes a global self-attention mechanism by using bert and achieves scene segmentation through the stronger attention capability of the network.

Description

Scene segmentation point judging method, system, storage medium and electronic equipment
Technical Field
The invention belongs to the field of scene segmentation point judgment, and particularly relates to a scene segmentation point judgment method, a scene segmentation point judgment system, a storage medium and electronic equipment.
Background
One existing approach is an event-detection-based method (Dense Boundary Generator). However, the event time regions produced by this approach overlap, whereas scene segmentation requires that the segments have no temporal overlap.
Disclosure of Invention
The embodiment of the application provides a scene segmentation point judging method, system, storage medium and electronic equipment, so as to at least solve the problem that the event time regions produced by existing scene segmentation point judging methods overlap.
The invention provides a scene segmentation point judging method, which comprises the following steps:
a video feature acquisition step: dividing a video to obtain a plurality of video equal parts, extracting features from each video equal part through a deep learning pre-training model, and obtaining a first video feature corresponding to each video equal part;
model processing step: inputting a plurality of first video features into a video scene segmentation point judgment model based on an attention mechanism to process so as to obtain classification probability corresponding to each video equal part;
and judging the classification probability of each video equal part through a threshold value to determine scene division points.
The method for judging the scene division points, wherein the step of obtaining the video features comprises the following steps:
a video equal part obtaining step: dividing the video into a plurality of the video equal parts according to time;
and a video feature obtaining step: performing feature extraction on each video equal part by using the deep learning pre-training model to obtain the first video feature corresponding to each video equal part.
The scene segmentation point judging method described above, wherein the model processing step comprises the following steps:
a sample video equal part obtaining step: dividing a sample video into a plurality of sample video equal parts according to time;
a sample video feature obtaining step: extracting features from each sample video equal part by using the deep learning pre-training model to obtain the first sample features corresponding to each sample video equal part;
model construction: constructing the segmentation point judgment model and training the segmentation point judgment model through the first sample characteristics;
the classification probability obtaining step: and obtaining the classification probability corresponding to each video equal part through the trained segmentation point judgment model according to the first video features.
The scene segmentation point judging method described above, wherein the model construction step comprises the following steps:
a segmentation point construction step: constructing a second sample feature for each scene segmentation point according to the first sample features;
a second sample feature processing step: constructing a bert network layer and processing the second sample features through it to obtain third sample features;
a prediction step: constructing a Predictor network and predicting on the third sample features through it to obtain sample scene segmentation points;
a constraint step: constraining the sample scene segmentation points through a classification loss function and a consistency regularization loss function.
The invention also provides a scene segmentation point judging system, which comprises:
the video feature acquisition module divides the video to obtain a plurality of video equal parts, extracts features of each video equal part through a deep learning pre-training model, and obtains first video features corresponding to each video equal part;
the model processing module inputs a plurality of first video features into a video scene segmentation point judgment model based on an attention mechanism to process so as to obtain classification probability corresponding to each video equal part;
and the judging module judges the classification probability of each video equal part through a threshold value to determine the scene segmentation points.
The scene division point judging system, wherein the video feature obtaining module comprises:
a video equal share obtaining unit that divides the video into a plurality of the video equal shares according to time;
and a video feature obtaining unit that performs feature extraction on each video equal part by using the deep learning pre-training model to obtain the first video feature corresponding to each video equal part.
The scene segmentation point judging system, wherein the model processing module comprises:
a sample video equal part obtaining unit which divides a sample video into a plurality of sample video equal parts according to time;
the method comprises the steps that a sample video feature dimension unit is obtained, the sample video feature dimension unit extracts features from each sample video equal part by using a deep learning pre-training model, and first sample features corresponding to each sample video equal part are obtained;
the model construction unit is used for constructing the segmentation point judgment model and training the segmentation point judgment model through the first sample characteristics;
and the classification probability obtaining unit obtains the classification probability of each video equal part through the trained segmentation point judgment model according to the first video characteristic.
The scene division point judgment system described above, wherein the model construction unit includes:
a segmentation point construction component that constructs a second sample feature for each scene segmentation point from the first sample features;
a second sample feature processing component that constructs a bert network layer and processes the second sample features through it to obtain third sample features;
a prediction component that constructs a Predictor network and predicts on the third sample features through it to obtain the sample scene segmentation points;
and the constraint component is used for constraining the sample scene segmentation points through a classification loss function and a consistency regularization loss function.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the scene segmentation point judging method described in any one of the above when executing the computer program.
A storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the scene segmentation point judging method described in any one of the above.
The invention has the beneficial effects that:
the invention belongs to the field of computer vision in a deep learning technology. The invention can realize the global autonomous attention mechanism by using the bert, can realize scene segmentation by using the stronger attention capability of the network, and has simpler and clearer network arrangement structure and better effect compared with other segmentation networks. .
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application.
In the drawings:
FIG. 1 is a flow chart of a scene cut point determination method of the present invention;
FIG. 2 is a flow chart of substep S1 of the present invention;
FIG. 3 is a flow chart of substep S2 of the present invention;
FIG. 4 is a flow chart of substep S23 of the present invention;
FIG. 5 is a video scene segmentation map of the present invention;
FIG. 6 is a schematic diagram of a model of the present invention;
FIG. 7 is a schematic diagram of a scene cut point determination system according to the present invention;
fig. 8 is a frame diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art, based on the embodiments provided herein and without creative effort, fall within the scope of protection of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and a person of ordinary skill in the art may apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and time-consuming, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and should not be construed as a departure from the scope of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by a person of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar terms herein do not denote a limitation of quantity and may refer to the singular or the plural. The terms "comprising," "including," "having," and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "And/or" describes an association relationship between associated objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The terms "first," "second," "third," and the like, as used herein, merely distinguish similar objects and do not represent a particular ordering of the objects.
The present invention will be described in detail below with reference to the embodiments shown in the drawings, but it should be understood that the embodiments are not limited to the present invention, and functional, method, or structural equivalents and alternatives according to the embodiments are within the scope of protection of the present invention by those skilled in the art.
Before explaining the various embodiments of the invention in detail, the core inventive concepts of the invention are summarized and described in detail by the following examples.
Embodiment one:
referring to fig. 1, fig. 1 is a flowchart of a scene segmentation point determination method. As shown in fig. 1, the scene segmentation point determination method of the present invention includes:
video feature acquisition step S1: dividing a video to obtain a plurality of video equal parts, extracting features from each video equal part through a deep learning pre-training model, and obtaining the first video feature corresponding to each video equal part;
model processing step S2: inputting a plurality of first video features into a video scene segmentation point judgment model based on an attention mechanism to process so as to obtain classification probability corresponding to each video equal part;
and a judging step S3, namely judging the classification probability of each video equal part through a threshold value to determine scene division points.
Referring to fig. 2, fig. 2 is a flowchart of a video feature acquisition step S1. As shown in fig. 2, the video feature obtaining step S1 includes:
video equal part obtaining step S11: dividing the video into a plurality of the video equal parts according to time;
and video feature obtaining step S12: performing feature extraction on each video equal part by using the deep learning pre-training model to obtain the first video feature corresponding to each video equal part.
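As a purely illustrative, non-limiting sketch of the video equal part obtaining step S11, the following Python code divides a video into L equal parts according to time and samples a few frames from each part. The use of OpenCV, the function name split_video_into_clips, and the number of frames sampled per equal part are assumptions made for this sketch and are not specified by the disclosure.

import cv2
import numpy as np

def split_video_into_clips(path, num_clips, frames_per_clip=8):
    # Divide the video into num_clips equal parts by frame index (i.e. by time)
    # and sample frames_per_clip frames from each part.
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    boundaries = np.linspace(0, total, num_clips + 1, dtype=int)
    clips = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        idx = np.linspace(start, max(start, end - 1), frames_per_clip, dtype=int)
        frames = []
        for i in idx:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        clips.append(frames)
    cap.release()
    return clips  # a list of L equal parts, each a list of sampled frames

Each element of the returned list corresponds to one video equal part (clip) and is subsequently fed to the pre-training model of step S12 for feature extraction.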
Referring to fig. 3, fig. 3 is a flowchart of a model processing step S2. As shown in fig. 3, the model processing step S2 includes:
sample video equal part obtaining step S21: dividing the sample video into a plurality of sample video equal parts according to time;
sample video feature obtaining step S22: extracting features from each sample video equal part by using the deep learning pre-training model to obtain the first sample features corresponding to each sample video equal part;
model construction step S23: constructing the segmentation point judgment model and training the segmentation point judgment model through the first sample characteristics;
classification probability obtaining step S24: and obtaining the classification probability corresponding to each video equal part through the trained segmentation point judgment model according to the first video features.
Referring to fig. 4, fig. 4 is a flowchart of the model construction step S23. As shown in fig. 4, the model construction step S23 includes:
segmentation point construction step S231: constructing a second sample feature for each scene segmentation point according to the first sample features;
second sample feature processing step S232: constructing a bert network layer and processing the second sample features through it to obtain third sample features;
prediction step S233: constructing a Predictor network and predicting on the third sample features through it to obtain the sample scene segmentation points;
constraint step S234: constraining the sample scene segmentation points through a classification loss function and a consistency regularization loss function.
The video scene segmentation map is shown in fig. 5, and the overall model scheme is shown in fig. 6.
Specifically, the training phase includes:
Step 1: dividing a video into L equal parts according to time, where each equal part of the video is called a clip.
Step 2: extracting features from each clip by using a deep learning pre-training model, so that each clip obtains a 1*D feature expression (D is the feature dimension), and the L clips together yield L*D features.
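The disclosure does not name a specific pre-training model, so the following sketch uses a torchvision ResNet-50 backbone purely as a placeholder; the backbone choice, the per-frame averaging, and the resulting feature dimension D = 2048 are assumptions of this example.

import torch
import torchvision.models as models
import torchvision.transforms as T

# Placeholder pre-training model (assumption): ResNet-50 with its classifier
# removed, giving a D-dimensional feature per frame (D = 2048 here).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def clip_feature(frames):
    # Average the per-frame features into a single 1*D expression for the clip.
    batch = torch.stack([preprocess(f) for f in frames])  # (T, 3, 224, 224)
    feats = backbone(batch)                                # (T, D)
    return feats.mean(dim=0)                               # (D,)

# Stacking the L clip features yields the L*D representation of the video:
# video_feats = torch.stack([clip_feature(c) for c in clips])  # (L, D)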
Step 3: constructing the feature of each point, where each point takes the feature F of the clip in which it is located, with dimension 1*D.
Step 4: constructing a bert network layer. The point features are taken as the input of the bert: the input comprises L point feature tokens (EMB) in total, and a classification token (CLS) is additionally added. The dimension of each EMB token is 1*D and the dimension of the CLS token is also 1*D. The input CLS token of the first bert layer is randomly initialized, and the input CLS token of the second layer is the output CLS token of the first layer; the bert may be stacked into multiple layers, and the last layer outputs a single CLS token (the model is shown in FIG. 6).
Step 5: the Predictor network is built, here designed as two MLP layers, with an output dimension of 1*L, and each value represents whether the point at the corresponding position is a segmentation point.
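Steps 4 and 5 can be sketched in PyTorch as follows. An nn.TransformerEncoder stands in for the bert network layer, and the CLS output is fed to a two-layer MLP Predictor with output dimension 1*L; the hidden sizes, the number of attention heads, and the use of the CLS token as the Predictor input are assumptions made where the description is not explicit.

import torch
import torch.nn as nn

class SegmentationPointModel(nn.Module):
    # bert-style encoder over L point feature tokens (EMB) plus one CLS token,
    # followed by a two-layer MLP Predictor that scores each of the L points.
    def __init__(self, num_points, feat_dim, num_layers=4, num_heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, feat_dim))  # randomly initialised CLS
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.predictor = nn.Sequential(               # two MLP layers
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_points))          # output dimension 1*L

    def forward(self, point_feats):                   # point_feats: (B, L, D)
        b = point_feats.size(0)
        cls = self.cls_token.expand(b, -1, -1)        # (B, 1, D)
        tokens = torch.cat([cls, point_feats], dim=1) # (B, L+1, D)
        out = self.encoder(tokens)
        return self.predictor(out[:, 0])              # per-point logits, (B, L)

Applying a sigmoid to the returned logits gives the per-point classification probabilities used in the loss and at inference time below.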
Step 6: designing the loss function. The classification loss function is
L_cls = g_mask · log(p) + (1 - g_mask) · log(1 - p)
where p is the predicted classification probability of a point, and g_mask is defined as follows: when the distance between a point and the nearest ground truth segmentation point is less than or equal to 1, the point is regarded as a positive example; otherwise it is regarded as a negative example.
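A hedged sketch of the g_mask construction and the classification loss follows; binary cross-entropy (the negative of the expression above, so that it is minimised) is used, and the helper names are hypothetical.

import torch
import torch.nn.functional as F

def build_g_mask(num_points, gt_points):
    # g_mask: a point is a positive example when its distance to the nearest
    # ground-truth segmentation point is <= 1, and a negative example otherwise.
    idx = torch.arange(num_points, dtype=torch.float32)           # (L,)
    gt = torch.as_tensor(gt_points, dtype=torch.float32)          # (K,)
    dist = (idx[:, None] - gt[None, :]).abs().min(dim=1).values   # (L,)
    return (dist <= 1).float()

def classification_loss(logits, g_mask):
    # Binary cross-entropy between the predicted probability p of each point
    # being a segmentation point and its g_mask label.
    return F.binary_cross_entropy_with_logits(logits, g_mask)

# Hypothetical usage: L = 10 clips, ground-truth segmentation points at 3 and 7.
# g = build_g_mask(10, [3, 7]); loss = classification_loss(model_output[0], g)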
Step 7: training the model by back propagation.
Inference phase:
Step 1: obtaining the L*D features of each video in the same manner as in the training phase.
Step 2: performing forward propagation through the bert network and the Predictor network to obtain the classification probability of each point, and applying a threshold to judge whether each point is a segmentation point.
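The inference phase can be sketched as below; the 0.5 threshold and the function name are assumptions, and video_feats denotes the L*D features obtained in the same way as in the training phase.

import torch

@torch.no_grad()
def predict_segmentation_points(model, video_feats, threshold=0.5):
    # Forward propagation through the bert and Predictor networks, followed by
    # thresholding of the per-point classification probability.
    model.eval()
    logits = model(video_feats.unsqueeze(0))   # (1, L)
    probs = torch.sigmoid(logits).squeeze(0)   # (L,)
    return (probs > threshold).nonzero(as_tuple=True)[0].tolist()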
Embodiment two:
referring to fig. 7, fig. 7 is a schematic diagram of a scene segmentation point determination system according to the present invention. The scene cut point judgment system according to the present invention as shown in fig. 7, includes:
the video feature acquisition module is used for dividing videos to obtain a plurality of video equal parts, extracting features from each video equal part through a deep learning pre-training model, and obtaining first video features corresponding to each video equal part;
the model processing module inputs a plurality of first video features into a video scene segmentation point judgment model based on an attention mechanism to process so as to obtain classification probability corresponding to each video equal part;
and the judging module judges the classification probability of each video equal part through a threshold value to determine the scene segmentation points.
Wherein, the video feature acquisition module includes:
a video equal share obtaining unit that divides the video into a plurality of the video equal shares according to time;
and a video feature obtaining unit that performs feature extraction on each video equal part by using the deep learning pre-training model to obtain the first video feature corresponding to each video equal part.
Wherein the model processing module comprises:
a sample video equal part obtaining unit which divides a sample video into a plurality of sample video equal parts according to time;
the method comprises the steps that a sample video feature dimension unit is obtained, the sample video feature dimension unit extracts features from each sample video equal part by using a deep learning pre-training model, and first sample features corresponding to each sample video equal part are obtained;
the model construction unit is used for constructing the segmentation point judgment model and training the segmentation point judgment model through the first sample characteristics;
and the classification probability obtaining unit obtains the classification probability of each video equal part through the trained segmentation point judgment model according to the first video characteristic.
Wherein the model construction unit includes:
a segmentation point construction component that constructs a second sample feature for each scene segmentation point from the first sample features;
a second sample feature processing component that constructs a bert network layer and processes the second sample features through it to obtain third sample features;
a prediction component that constructs a Predictor network and predicts on the third sample features through it to obtain the sample scene segmentation points;
and the constraint component is used for constraining the sample scene segmentation points through a classification loss function and a consistency regularization loss function.
Embodiment III:
referring to fig. 8, a specific implementation of an electronic device is disclosed in this embodiment. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
In particular, the processor 81 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 82 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 82 may comprise a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of the foregoing. The memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a Non-Volatile memory. In a particular embodiment, the memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a Programmable ROM (PROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), an Electrically Alterable ROM (EAROM), or a FLASH memory, or a combination of two or more of these. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random Access Memory (DRAM), where the DRAM may be a Fast Page Mode DRAM (FPMDRAM), an Extended Data Out DRAM (EDODRAM), a Synchronous DRAM (SDRAM), or the like, where appropriate.
Memory 82 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 81.
The processor 81 reads and executes the computer program instructions stored in the memory 82 to implement any one of the scene cut point determination methods of the above embodiments.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 8, the processor 81, the memory 82, and the communication interface 83 are connected to each other via the bus 80 and perform communication with each other.
The communication interface 83 is used to implement communication between the modules, devices, units and/or components in the embodiments of the present application. The communication interface 83 may also implement data communication with other components such as external devices, image/data acquisition devices, databases, external storage and image/data processing workstations.
Bus 80 includes hardware, software, or both that couple the components of the electronic device to one another. Bus 80 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, or a local bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of the foregoing. Bus 80 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
The electronic device may perform scene segmentation point judgment, thereby implementing the methods described in connection with FIGS. 1 to 4.
In addition, in combination with the scene segmentation point judging method in the above embodiments, an embodiment of the present application may be implemented by providing a computer readable storage medium. The computer readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any one of the scene segmentation point judging methods of the above embodiments.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above-described embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered to be within the scope of this specification.
In summary, the beneficial effects of the present application are that a global self-attention mechanism can be realized by using bert and scene segmentation can be achieved through the stronger attention capability of the network; compared with other segmentation networks, the network arrangement is simpler and clearer and the effect is better.
The above examples merely represent a few embodiments of the present application, which are described in detail but are not to be construed as limiting the scope of the invention. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the present application, and such modifications and improvements fall within the scope of protection of the present application. The scope of protection should therefore be determined with reference to the appended claims.

Claims (6)

1. The scene segmentation point judging method is characterized by comprising the following steps of:
a video feature acquisition step: dividing a video to obtain a plurality of video equal parts, extracting features from each video equal part through a deep learning pre-training model, and obtaining a first video feature corresponding to each video equal part;
model processing step: inputting a plurality of first video features into a video scene segmentation point judgment model based on an attention mechanism to process so as to obtain classification probability corresponding to each video equal part;
judging the classification probability of each video equal part through a threshold value to determine scene division points;
wherein the model processing step includes:
a sample video equal part obtaining step: dividing a sample video into a plurality of sample video equal parts according to time;
a sample video feature obtaining step: extracting features from each sample video equal part by using the deep learning pre-training model to obtain first sample features corresponding to each sample video equal part;
model construction: constructing the segmentation point judgment model and training the segmentation point judgment model through the first sample characteristics;
the classification probability obtaining step: obtaining classification probability corresponding to each video equal part through the trained segmentation point judgment model according to the first video features;
wherein the model construction step includes:
a segmentation point construction step: constructing a second sample feature for each scene segmentation point according to the first sample features;
a second sample feature processing step: constructing a bert network layer and processing the second sample features through it to obtain third sample features;
a prediction step: constructing a Predictor network and predicting on the third sample features through it to obtain sample scene segmentation points;
a constraint step: and constraining the sample scene segmentation points through a classification loss function and a consistency regularization loss function.
2. The scene cut point determination method according to claim 1, wherein the video feature acquisition step includes:
a video equal part obtaining step: dividing the video into a plurality of the video equal parts according to time;
and a video feature obtaining step: performing feature extraction on each video equal part by using the deep learning pre-training model to obtain the first video feature corresponding to each video equal part.
3. A scene cut point judgment system, characterized by comprising:
the video feature acquisition module divides the video to obtain a plurality of video equal parts, extracts features of each video equal part through a deep learning pre-training model, and obtains first video features corresponding to each video equal part;
the model processing module inputs a plurality of first video features into a video scene segmentation point judgment model based on an attention mechanism to process so as to obtain classification probability corresponding to each video equal part;
the judging module judges the classification probability of each video equal part through a threshold value to determine scene division points;
wherein the model processing module comprises:
a sample video equal part obtaining unit which divides a sample video into a plurality of sample video equal parts according to time;
a sample video feature obtaining unit that extracts features from each sample video equal part by using the deep learning pre-training model to obtain first sample features corresponding to each sample video equal part;
the model construction unit is used for constructing the segmentation point judgment model and training the segmentation point judgment model through the first sample characteristics;
the classification probability obtaining unit obtains the classification probability of each video equal part through the trained segmentation point judgment model according to the first video features;
wherein the model construction unit includes:
a segmentation point construction component that constructs a second sample feature for each scene segmentation point from the first sample features;
a second sample feature processing component that constructs a bert network layer and processes the second sample features through it to obtain third sample features;
a prediction component that constructs a Predictor network and predicts on the third sample features through it to obtain the sample scene segmentation points;
and the constraint component is used for constraining the sample scene segmentation points through a classification loss function and a consistency regularization loss function.
4. The scene cut point determination system according to claim 3, wherein the video feature acquisition module comprises:
a video equal share obtaining unit that divides the video into a plurality of the video equal shares according to time;
and a video feature obtaining unit that performs feature extraction on each video equal part by using the deep learning pre-training model to obtain the first video feature corresponding to each video equal part.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the scene cut point determination method according to any of claims 1 to 2 when executing the computer program.
6. A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the scene cut point determination method according to any one of claims 1 to 2.
CN202110835243.7A 2021-07-23 2021-07-23 Scene segmentation point judging method, system, storage medium and electronic equipment Active CN113569705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110835243.7A CN113569705B (en) 2021-07-23 2021-07-23 Scene segmentation point judging method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110835243.7A CN113569705B (en) 2021-07-23 2021-07-23 Scene segmentation point judging method, system, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113569705A (en) 2021-10-29
CN113569705B (en) 2024-04-02

Family

ID=78166649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110835243.7A Active CN113569705B (en) 2021-07-23 2021-07-23 Scene segmentation point judging method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113569705B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363695B (en) * 2021-11-11 2023-06-13 腾讯科技(深圳)有限公司 Video processing method, device, computer equipment and storage medium
CN114697763B (en) 2022-04-07 2023-11-21 脸萌有限公司 Video processing method, device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147846A (en) * 2019-05-23 2019-08-20 软通智慧科技有限公司 Video segmentation method, device, equipment and storage medium
CN112115849A (en) * 2020-09-16 2020-12-22 中国石油大学(华东) Video scene identification method based on multi-granularity video information and attention mechanism
CN112183334A (en) * 2020-09-28 2021-01-05 南京大学 Video depth relation analysis method based on multi-modal feature fusion
CN112364810A (en) * 2020-11-25 2021-02-12 深圳市欢太科技有限公司 Video classification method and device, computer readable storage medium and electronic equipment


Also Published As

Publication number Publication date
CN113569705A (en) 2021-10-29

Similar Documents

Publication Publication Date Title
US11263116B2 (en) Champion test case generation
CN113569705B (en) Scene segmentation point judging method, system, storage medium and electronic equipment
CN108416327B (en) Target detection method and device, computer equipment and readable storage medium
CN111666960B (en) Image recognition method, device, electronic equipment and readable storage medium
US20200242011A1 (en) Combinatoric set completion through unique test case generation
JP2019528502A (en) Method and apparatus for optimizing a model applicable to pattern recognition and terminal device
CN115546601B (en) Multi-target recognition model and construction method, device and application thereof
CN112686317A (en) Neural network training method and device, electronic equipment and storage medium
CN111275054A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112907628A (en) Video target tracking method and device, storage medium and electronic equipment
CN113569704B (en) Segmentation point judging method, system, storage medium and electronic equipment
CN113569703B (en) Real division point judging method, system, storage medium and electronic equipment
CN113743277A (en) Method, system, equipment and storage medium for short video frequency classification
CN113569706B (en) Video scene segmentation point judging method, system, storage medium and electronic equipment
CN111815654A (en) Method, apparatus, device and computer readable medium for processing image
CN112257726B (en) Target detection training method, system, electronic equipment and computer readable storage medium
KR102591314B1 Apparatus for retrieval of semantic moment in video and method using the same
CN114897126A (en) Time delay prediction method and device, electronic equipment and storage medium
CN116994266A (en) Word processing method, word processing device, electronic equipment and storage medium
WO2021237727A1 (en) Method and apparatus of image processing
CN114254563A (en) Data processing method and device, electronic equipment and storage medium
CN113535338A (en) Interaction method, system, storage medium and electronic device for data access
CN113591655A (en) Video contrast loss calculation method, system, storage medium and electronic device
CN116596043B (en) Convolutional neural network calculation method, system, electronic equipment and storage medium
CN114021568A (en) Model fusion method, system, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant