CN113569703B - Real division point judging method, system, storage medium and electronic equipment - Google Patents

Real division point judging method, system, storage medium and electronic equipment

Info

Publication number
CN113569703B
CN113569703B
Authority
CN
China
Prior art keywords
video
features
point
segmentation point
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110835226.3A
Other languages
Chinese (zh)
Other versions
CN113569703A (en)
Inventor
胡郡郡
唐大闰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd filed Critical Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202110835226.3A
Publication of CN113569703A
Application granted
Publication of CN113569703B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The application discloses a real partition point judging method, system, storage medium and electronic equipment, wherein the real partition point judging method comprises the following steps: a video feature dimension obtaining step: dividing a video into a plurality of video equal parts according to time and extracting features from the equal parts using a deep learning pre-training model to obtain video features; a model processing step: inputting the video features into a real segmentation point judgment model to obtain the classification probability of each candidate segmentation point; and a judging step: judging the candidate segmentation points according to the classification probability to determine the real segmentation points of the scene. The invention uses a global consistency loss that pulls features of the same scene closer together (raising their similarity) and pushes features of different scenes apart (lowering their similarity), which yields a very good representation; the model converges gradually and the loss does not rise.

Description

Real division point judging method, system, storage medium and electronic equipment
Technical Field
The invention belongs to the field of real partition point judgment, and particularly relates to a real partition point judging method, system, storage medium and electronic equipment.
Background
Existing approaches are based on event detection, for example the Dense Boundary Generator (DBG). However, the events produced by such methods have overlapping time regions, whereas scene segmentation requires that the segments have no temporal overlap.
Disclosure of Invention
The embodiments of the present application provide a real partition point judging method, system, storage medium and electronic equipment, so as to at least solve the problem that events have overlapping time regions in existing real partition point judging methods.
The invention provides a method for judging real division points, which comprises the following steps:
video feature dimension acquisition: dividing a video into a plurality of video equal parts according to time, extracting features from the video equal parts by using a deep learning pre-training model, and obtaining video features;
model processing step: inputting the video features into a real segmentation point judgment model to process so as to obtain the classification probability of each candidate segmentation point;
and a judging step: judging the candidate segmentation points according to the classification probability to determine the real segmentation points of the scene.
In the above real partition point judging method, the video feature obtaining step comprises the following steps:
a video equal part obtaining step: dividing the video into a plurality of video equal parts according to time;
a video feature obtaining step: extracting features from each video equal part using a deep learning pre-training model to obtain a first feature corresponding to each equal part.
In the above real partition point judging method, the model processing step comprises the following steps:
a sample video equal part obtaining step: dividing a sample video into a plurality of sample video equal parts according to time;
a sample video feature obtaining step: extracting features from each sample video equal part using a deep learning pre-training model to obtain a plurality of sample video features of the video;
a candidate segmentation point feature constructing step: for each candidate segmentation point, taking the sample video features of the equal part where the point is located together with the features between it and the previous candidate point and between it and the next candidate point, building an Encoder network and a Predictor network in sequence, designing a loss function, and then constructing the real segmentation point judgment model;
a classification probability obtaining step: obtaining the classification probability of each candidate segmentation point from the video features through the real segmentation point judgment model.
In the above real partition point judging method, the judging step comprises: judging the classification probability of each candidate segmentation point against a set threshold to determine whether the candidate segmentation point is a real segmentation point of the scene.
The invention also provides a true partition point judging system, which comprises:
the video feature dimension acquisition module divides a video into a plurality of video equal parts according to time, and extracts features from the video equal parts by using a deep learning pre-training model to obtain video features;
the model processing module inputs the video features into a real segmentation point judgment model to process so as to obtain the classification probability of each candidate segmentation point;
and the judging module judges the candidate segmentation points according to the classification probability to determine the real segmentation points of the scene.
In the above real partition point judging system, the video feature obtaining module comprises:
a video equal part obtaining unit, which divides the video into a plurality of video equal parts according to time;
a video feature obtaining unit, which extracts features from each video equal part using a deep learning pre-training model to obtain a first feature corresponding to each equal part.
In the above real partition point judging system, the model processing module comprises:
a sample video equal part obtaining unit, which divides a sample video into a plurality of sample video equal parts according to time;
a sample video feature obtaining unit, which extracts features from each sample video equal part using a deep learning pre-training model to obtain a plurality of sample video features of the video;
a candidate segmentation point feature constructing unit, which, for each candidate segmentation point, takes the sample video features of the equal part where the point is located together with the features between it and the previous candidate point and between it and the next candidate point, builds an Encoder network and a Predictor network in sequence, designs a loss function, and then constructs the real segmentation point judgment model;
a classification probability obtaining unit, which obtains the classification probability of each candidate segmentation point from the video features through the real segmentation point judgment model.
In the above real partition point judging system, the judging module judges the classification probability of each candidate segmentation point against a set threshold to determine whether the candidate segmentation point is a real segmentation point of the scene.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the real partition point determination method of any of the above when executing the computer program.
A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements a real division point judgment method as described in any one of the above.
The invention has the beneficial effects that:
the invention belongs to the field of computer vision in a deep learning technology. The invention uses global consistency loss, reduces the similarity of the same scene, improves the similarity of different scenes, can obtain very good expression, gradually converges the model, and does not generate loss rise; the invention also uses a transducer to realize automatic attention and learn the relation inside the video sequence.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application.
In the drawings:
FIG. 1 is a flow chart of a true segmentation point determination method of the present invention;
FIG. 2 is a flow chart of step S1 of the present invention;
FIG. 3 is a flow chart of step S2 of the present invention;
FIG. 4 is a video scene segmentation map of the present invention;
FIG. 5 is a model diagram of the present invention;
FIG. 6 is a schematic diagram of a real division point judgment system according to the present invention;
fig. 7 is a frame diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments obtained by one of ordinary skill in the art without inventive effort, based on the embodiments provided herein, fall within the scope of protection of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art may apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The present invention will be described in detail below with reference to the embodiments shown in the drawings, but it should be understood that the embodiments are not limited to the present invention, and functional, method, or structural equivalents and alternatives according to the embodiments are within the scope of protection of the present invention by those skilled in the art.
Before explaining the various embodiments of the invention in detail, the core inventive concepts of the invention are summarized and described in detail by the following examples.
Embodiment one:
referring to fig. 1, fig. 1 is a flowchart of a real partition point determination method. As shown in fig. 1, the true segmentation point judging method of the present invention includes:
video feature dimension acquisition step S1: dividing a video into a plurality of video equal parts according to time, extracting features from the video equal parts by using a deep learning pre-training model, and obtaining video features;
model processing step S2: inputting the video features into a real segmentation point judgment model to process so as to obtain the classification probability of each candidate segmentation point;
A judging step S3: judging the candidate segmentation points according to the classification probability to determine the real segmentation points of the scene.
Referring to fig. 2, fig. 2 is a flowchart of a video feature dimension obtaining step S1. As shown in fig. 2, the video feature dimension obtaining step S1 includes:
video equal part obtaining step S11: dividing the video into a plurality of video equal parts according to time;
and S12, extracting features from each video equal part by using a deep learning pre-training model to obtain first features corresponding to each video equal part.
Referring to fig. 3, fig. 3 is a flowchart of a model processing step S2. As shown in fig. 3, the model processing step S2 includes:
sample video aliquot obtaining step S21: dividing the sample video into a plurality of sample video equal parts according to time;
A sample video feature obtaining step S22: extracting features from each sample video equal part using a deep learning pre-training model to obtain a plurality of sample video features of the video;
A candidate segmentation point feature constructing step S23: for each candidate segmentation point, taking the sample video features of the equal part where the point is located together with the features between it and the previous candidate point and between it and the next candidate point, building an Encoder network and a Predictor network in sequence, designing a loss function, and then constructing the real segmentation point judgment model;
classification probability obtaining step S24: and obtaining the classification probability of each candidate segmentation point through a real segmentation point judgment model according to the video features.
The judging step comprises: judging the classification probability of each candidate segmentation point against a set threshold to determine whether the candidate segmentation point is a real segmentation point of the scene.
Specifically, as shown in fig. 4 and 5, the training phase includes:
step 1, dividing a video into L equal parts according to time, wherein each equal part of video is called a clip.
Step 2: extract features from each clip using a deep learning pre-training model. Each clip yields a 1×D feature vector (D is the feature dimension), so the L clips yield an L×D feature matrix.
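For illustration only, the following is a minimal Python (PyTorch) sketch of steps 1 and 2. The choice of backbone (a torchvision ResNet-50 with its classifier head removed, averaged over the frames of each clip) and the function name extract_clip_features are assumptions made for the example; the patent only requires some deep learning pre-training model.

```python
import torch
import torchvision.models as models

def extract_clip_features(frames: torch.Tensor, num_clips: int) -> torch.Tensor:
    """frames: (T, 3, H, W) decoded and normalized video frames; returns (L, D)."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()  # drop the classifier head; D becomes 2048
    backbone.eval()

    clips = torch.chunk(frames, num_clips, dim=0)  # L (near-)equal parts along time
    feats = []
    with torch.no_grad():
        for clip in clips:
            per_frame = backbone(clip)        # (frames_in_clip, 2048)
            feats.append(per_frame.mean(0))   # average over frames -> one 1xD vector
    return torch.stack(feats)                 # (L, 2048)
```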
Step 3: construct an Encoder network, whose aim is to re-express the features of each clip with higher-level semantics and to reduce the dimension D to 128.
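A minimal sketch of such an Encoder follows; the two-layer MLP architecture and hidden width are assumptions, since the patent only states that D is reduced to 128 dimensions.

```python
import torch.nn as nn

class ClipEncoder(nn.Module):
    """Maps each clip feature to a higher-level 128-d representation (step 3)."""
    def __init__(self, in_dim: int = 2048, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),  # hidden width 512 is an illustrative assumption
            nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, x):   # x: (L, in_dim)
        return self.net(x)  # (L, 128)
```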
Step 4: construct the features of each candidate partition point: each candidate point takes the features of the clip where it is located, together with the features between it and the previous candidate point and between it and the next candidate point. As shown in fig. 5, the feature of division point P5 is taken from [F3, F4, F5, F6, F7, F8].
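The window construction can be sketched as below; the exact slice boundaries are an assumption, chosen so that with 0-based clip indices a candidate at clip 5 whose neighbouring candidates sit at clips 2 and 8 receives [F3 … F8], matching the fig. 5 example.

```python
import torch

def candidate_window(feats: torch.Tensor, prev_pt: int, next_pt: int) -> torch.Tensor:
    """feats: (L, 128) encoded clip features; prev_pt / next_pt: clip indices of
    the neighbouring candidate points. Returns the feature window for the
    current candidate, covering its own clip and the clips on either side."""
    return feats[prev_pt + 1 : next_pt + 1]
```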
Step 5: and (3) constructing a transducer network, outputting the characteristics of the step (4), adding a classification CLS token, and directly using the output CLS token to judge whether the classification is a real division point.
Step 6: design the loss function, which contains two parts. The classification loss function is L_cls = g_mask·log(p) + (1 − g_mask)·log(1 − p), where g_mask is defined as follows: a point whose distance to the ground truth is less than or equal to 1 is considered a positive example, and otherwise a negative example.
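A sketch of this part of the loss follows, written as the standard binary cross-entropy (the negative of the printed expression, on the assumption that the usual minimized form is intended):

```python
import torch

def classification_loss(p: torch.Tensor, points: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """p: (N,) predicted probabilities for N candidate points; points: (N,) candidate
    clip indices; gt: (N,) index of the nearest ground-truth point for each candidate."""
    g_mask = ((points - gt).abs() <= 1).float()  # positive if within distance 1 of ground truth
    p = p.clamp(1e-7, 1 - 1e-7)                  # numerical safety for the logarithms
    return -(g_mask * torch.log(p) + (1 - g_mask) * torch.log(1 - p)).mean()
```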
The consistency regularization loss function is defined in terms of the following quantities: i and j are any two clips; cosine⟨F_i, F_j⟩⁺ is the cosine similarity of clips i and j when they belong to the same scene; cosine⟨F_i, F_j⟩⁻ is their cosine similarity when they do not belong to the same scene; m is the number of clip pairs belonging to the same scene; and n is the number of clip pairs not belonging to the same scene.
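Since the exact formula is not given above, the following sketch is an assumption: a plausible contrastive form consistent with the stated quantities, raising the mean similarity of the m same-scene pairs while lowering that of the n different-scene pairs. It is not presented as the patent's exact equation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(feats: torch.Tensor, scene_ids: torch.Tensor) -> torch.Tensor:
    """feats: (L, D) clip features; scene_ids: (L,) scene label of each clip.
    Assumes both same-scene and different-scene pairs are present."""
    f = F.normalize(feats, dim=1)
    sim = f @ f.t()                                   # pairwise cosine similarities
    same = scene_ids.unsqueeze(0) == scene_ids.unsqueeze(1)
    off_diag = ~torch.eye(len(f), dtype=torch.bool)
    pos = sim[same & off_diag]                        # the m same-scene pairs
    neg = sim[~same]                                  # the n different-scene pairs
    return neg.mean() - pos.mean()                    # push apart, pull together
```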
Step 7: train the model by back-propagation.
The reasoning stage comprises:
and step 1, obtaining the characteristic L.times.D of each video according to the same training stage.
Step 2: perform a forward pass through the Encoder network and the Transformer network to obtain the classification probability of each candidate segmentation point, and apply a threshold to judge whether each candidate is a real segmentation point.
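The thresholding at the end of the inference stage amounts to the following; the 0.5 default is an illustrative assumption, as the patent only requires a certain threshold.

```python
def select_real_points(probs, candidates, threshold: float = 0.5):
    """Keep the candidate points whose classification probability clears the threshold."""
    return [pt for pt, p in zip(candidates, probs) if p >= threshold]
```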
The overall model scheme is shown in fig. 5.
Embodiment two:
referring to fig. 6, fig. 6 is a schematic structural diagram of a real segmentation point determination system according to the present invention. The real dividing point judging system of the present invention as shown in fig. 6, wherein it comprises:
the video feature dimension acquisition module divides a video into a plurality of video equal parts according to time, and extracts features from the video equal parts by using a deep learning pre-training model to obtain video features;
the model processing module inputs the video features into a real segmentation point judgment model to process so as to obtain the classification probability of each candidate segmentation point;
and the judging module judges the candidate segmentation points according to the classification probability to determine the real segmentation points of the scene.
Wherein, the video feature acquisition module includes:
the video equal part obtaining unit divides the video into a plurality of video equal parts according to time;
and obtaining video feature units, wherein the video feature obtaining units use a deep learning pre-training model to extract features from each video equal part to obtain first features corresponding to each video equal part.
Wherein the model processing module comprises:
the sample video equal part obtaining unit divides the sample video into a plurality of sample video equal parts according to time;
the method comprises the steps that a sample video feature unit is obtained, the sample video feature unit extracts features from each sample video equal part by using a deep learning pre-training model, and a plurality of sample video features of the video screen are obtained;
constructing candidate segmentation point feature units, wherein the candidate segmentation point feature units take sample video features of video equal parts where the candidate segmentation points are located, sample video features between the last candidate segmentation point and sample video features between the next candidate segmentation point for each candidate segmentation point, sequentially building an Encoder network and a Predictor network, designing a loss function, and then constructing a real segmentation point judgment model;
and the classification probability obtaining unit obtains the classification probability of each candidate segmentation point through a real segmentation point judgment model according to the video features.
The judging module judges the classification probability of each candidate segmentation point against a set threshold to determine whether the candidate segmentation point is a real segmentation point of the scene.
Embodiment III:
referring to fig. 7, a specific implementation of an electronic device is disclosed in this embodiment. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
In particular, the processor 81 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
Memory 82 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 82 may comprise a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of the foregoing. The memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a non-volatile memory. In a particular embodiment, the memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), an Electrically Alterable ROM (EAROM), or FLASH memory, or a combination of two or more of these. Where appropriate, the RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPMDRAM), Extended Data Out DRAM (EDODRAM), Synchronous DRAM (SDRAM), or the like.
Memory 82 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 81.
The processor 81 reads and executes the computer program instructions stored in the memory 82 to implement any of the real division point judgment methods of the above embodiments.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 7, the processor 81, the memory 82, and the communication interface 83 are connected to each other through the bus 80 and perform communication with each other.
The communication interface 83 is used to implement communication between the modules, devices, units and/or apparatuses in the embodiments of the present application. The communication interface 83 may also implement data communication with other components, such as external devices, image/data acquisition equipment, databases, external storage, and image/data processing workstations.
Bus 80 includes hardware, software, or both that couple the components of the electronic device to one another. Bus 80 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, or a local bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of the foregoing. Bus 80 may include one or more buses, where appropriate. Although the embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
The electronic device may perform judgment based on the real segmentation points, thereby implementing the methods described in connection with fig. 1 to fig. 3.
In addition, in combination with the method for determining the true partition point in the above embodiment, the embodiment of the application may be implemented by providing a computer readable storage medium. The computer readable storage medium has stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the real segmentation point determination methods of the above embodiments.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
In summary, the beneficial effects are that the method uses a global consistency loss that pulls features of the same scene closer together (raising their similarity) and pushes features of different scenes apart (lowering their similarity), which yields a very good representation; the model converges gradually and the loss does not rise. The invention also uses a Transformer to realize self-attention and learn the relationships inside the video sequence.
The above examples merely represent a few embodiments of the present application; their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the present application, and these all fall within the protection scope of the present application. The scope of protection should therefore be determined by the appended claims.

Claims (6)

1. A real dividing point judging method, characterized by comprising the following steps: a video feature dimension obtaining step: dividing a video into a plurality of video equal parts according to time, and extracting features from the video equal parts using a deep learning pre-training model to obtain video features;
model processing step: inputting the video features into a real segmentation point judgment model to process so as to obtain the classification probability of each candidate segmentation point;
judging the classification probability of each candidate partition point by setting a threshold value to determine whether the candidate partition point is a real partition point of a scene;
the model processing step includes the steps of,
a sample video equal part obtaining step of dividing the sample video into a plurality of sample video equal parts according to time,
a sample video feature obtaining step of extracting features from each of the sample video equal parts using a deep learning pre-training model to obtain a plurality of sample video features of the video,
a candidate segmentation point feature constructing step in which, for each candidate segmentation point, the sample video features of the video equal part where the candidate segmentation point is located are taken together with the features between it and the previous candidate segmentation point and between it and the next candidate segmentation point, an Encoder network and a Predictor network are built in sequence, a loss function is designed, and the real segmentation point judgment model is then constructed,
a classification probability obtaining step of obtaining the classification probability of each of the candidate division points by a true division point judgment model based on the video features,
the loss function comprising a consistency regularization loss function, which is defined in terms of the following quantities: i and j are any two clips; cosine⟨F_i, F_j⟩⁺ is the cosine similarity of clips i and j when they belong to the same scene; cosine⟨F_i, F_j⟩⁻ is their cosine similarity when they do not belong to the same scene; m is the number of clip pairs belonging to the same scene; n is the number of clip pairs not belonging to the same scene; and g_mask is set such that a point whose distance to the ground truth is 1 or less is considered a positive example, and otherwise a negative example.
2. The real dividing point judging method according to claim 1, wherein the video feature obtaining step comprises: a video equal part obtaining step: dividing the video into a plurality of video equal parts according to time; and a video feature obtaining step: extracting features from each video equal part using a deep learning pre-training model to obtain a first feature corresponding to each video equal part.
3. A true segmentation point judgment system, characterized by comprising: the video feature dimension acquisition module divides a video into a plurality of video equal parts according to time, and extracts features from the video equal parts by using a deep learning pre-training model to obtain video features;
the model processing module inputs the video features into a real segmentation point judgment model to process so as to obtain the classification probability of each candidate segmentation point;
the judging module judges the classification probability of each candidate segmentation point by setting a threshold value so as to determine whether the candidate segmentation point is a real segmentation point of a scene;
the model processing module comprises a model processing module and a model processing module,
a sample video equal part obtaining unit which divides the sample video into a plurality of sample video equal parts according to time,
a sample video feature obtaining unit that extracts features from each of the sample video equal parts using a deep learning pre-training model to obtain a plurality of sample video features of the video,
a candidate segmentation point feature constructing unit which, for each candidate segmentation point, takes the sample video features of the video equal part where the candidate segmentation point is located together with the features between it and the previous candidate segmentation point and between it and the next candidate segmentation point, builds an Encoder network and a Predictor network in sequence, designs a loss function, and then constructs the real segmentation point judgment model,
a classification probability obtaining unit that obtains the classification probability of each of the candidate division points by a true division point judgment model based on the video feature,
the loss function comprising a consistency regularization loss function, which is defined in terms of the following quantities: i and j are any two clips; cosine⟨F_i, F_j⟩⁺ is the cosine similarity of clips i and j when they belong to the same scene; cosine⟨F_i, F_j⟩⁻ is their cosine similarity when they do not belong to the same scene; m is the number of clip pairs belonging to the same scene; n is the number of clip pairs not belonging to the same scene; and g_mask is set such that a point whose distance to the ground truth is 1 or less is considered a positive example, and otherwise a negative example.
4. The true segmentation point judgment system according to claim 3, wherein the video feature obtaining module comprises: a video equal part obtaining unit which divides the video into a plurality of video equal parts according to time; and a video feature obtaining unit which extracts features from each video equal part using a deep learning pre-training model to obtain a first feature corresponding to each video equal part.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the real partition point determination method according to any one of claims 1 to 2 when executing the computer program.
6. A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the real division point judgment method according to any one of claims 1 to 2.
CN202110835226.3A 2021-07-23 2021-07-23 Real division point judging method, system, storage medium and electronic equipment Active CN113569703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110835226.3A CN113569703B (en) 2021-07-23 2021-07-23 Real division point judging method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110835226.3A CN113569703B (en) 2021-07-23 2021-07-23 Real division point judging method, system, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113569703A CN113569703A (en) 2021-10-29
CN113569703B true CN113569703B (en) 2024-04-16

Family

ID=78166575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110835226.3A Active CN113569703B (en) 2021-07-23 2021-07-23 Real division point judging method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113569703B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114697763B (en) 2022-04-07 2023-11-21 脸萌有限公司 Video processing method, device, electronic equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537134A (en) * 2018-03-16 2018-09-14 北京交通大学 A kind of video semanteme scene cut and mask method
CN110213670A (en) * 2019-05-31 2019-09-06 北京奇艺世纪科技有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN112906649A (en) * 2018-05-10 2021-06-04 北京影谱科技股份有限公司 Video segmentation method, device, computer device and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016134702A (en) * 2015-01-16 2016-07-25 富士通株式会社 Video data file generation program, video data file generation method, and video data file generation device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537134A (en) * 2018-03-16 2018-09-14 北京交通大学 A kind of video semanteme scene cut and mask method
CN112906649A (en) * 2018-05-10 2021-06-04 北京影谱科技股份有限公司 Video segmentation method, device, computer device and medium
CN110213670A (en) * 2019-05-31 2019-09-06 北京奇艺世纪科技有限公司 Method for processing video frequency, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113569703A (en) 2021-10-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant