CN116229328A - Video file description text generation method, device and storage medium - Google Patents

Video file description text generation method, device and storage medium

Info

Publication number
CN116229328A
CN116229328A (application CN202310258197.8A)
Authority
CN
China
Prior art keywords
feature
video file
analyzed
global
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310258197.8A
Other languages
Chinese (zh)
Inventor
常志
陈永录
孙彦南
董甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202310258197.8A
Publication of CN116229328A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method, a device and a storage medium for generating a video file description text, and relates to the technical field of big data. The method comprises: acquiring a video file to be analyzed, and extracting global static features, global dynamic features and local features from the video file to be analyzed; determining action behavior features of a target object in the video file to be analyzed according to the global static features, the global dynamic features and the local features; and generating the description text according to the global static features, the global dynamic features, the local features and the action behavior features of the target object. By adopting this technical scheme, the method and the device can assist workers in quickly locating the required video files, thereby realizing quick acquisition of user behaviors through the video files and improving working efficiency.

Description

Video file description text generation method, device and storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a method and an apparatus for generating a description text of a video file, and a storage medium.
Background
At present, many surveillance cameras are installed in banks. Customer behaviors can be analyzed through the surveillance videos, behavior data of customers in a bank business hall can be obtained in real time, and sudden problems can be identified in time, so that accident investigation can be carried out.
Currently, if related content is to be searched for in the surveillance videos of a bank, the videos need to be checked one by one, which consumes huge manpower and time.
Therefore, a method for generating a description text of a video file is needed, which can assist a worker in quickly locating a required video file, thereby realizing quick acquisition of user behaviors through the video file and improving working efficiency.
Disclosure of Invention
The method, the device and the storage medium for generating the video file description text provided by the application can assist workers in quickly locating the required video file, thereby realizing quick acquisition of the user behaviors through the video file and improving working efficiency.
In a first aspect, the present application provides a method for generating a description text of a video file, including:
acquiring a video file to be analyzed;
extracting global static features, global dynamic features and local features in the video file to be analyzed; the global static feature is used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used for representing the features of a preset area of the video file to be analyzed;
determining action behavior characteristics of a target object in the video file to be analyzed according to the global static characteristics, the global dynamic characteristics and the local characteristics;
generating description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
In one example, the determining the action behavior feature of the target object in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature includes:
determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature; the first probability value represents probability distribution conditions of objects in the video file to be analyzed; the second probability value represents the probability distribution condition of the action of each object in the video file to be analyzed; the third probability value represents the probability distribution condition of the behavior of each object in the video file to be analyzed;
determining action behavior characteristics of a target object in the video file to be analyzed according to the first probability value, the second probability value and the third probability value; wherein the target object is one of the plurality of objects; wherein the objects of the first probability value, the second probability value and the third probability value are the same.
In one example, the determining the first probability value of the plurality of objects in the video file to be analyzed, the second probability value of each of the object actions, and the third probability value of each of the object behaviors according to the global static feature, the global dynamic feature, and the local feature includes:
constructing scene features according to the global static features and the local features;
determining each object according to the local characteristics;
and determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the scene characteristics, each object and the global dynamic characteristics.
In one example, the generating descriptive text from the global static feature, the global dynamic feature, the local feature, and the action behavior feature of the target object includes:
determining a feature average value according to the global static feature, the global dynamic feature and the local feature;
and generating descriptive text according to the characteristic average value and the action behavior characteristic of the target object.
In one example, the generating descriptive text according to the feature average value and the action behavior feature of the target object includes:
fusing the characteristic average value and the action behavior characteristic of the target object to obtain a fused characteristic;
inputting the fusion characteristics into an encoder to generate word probability distribution;
and generating descriptive text according to the word probability distribution.
In one example, after the generating the descriptive text, further comprising:
and storing the description text and the generation time of the video file to be analyzed into a preset database.
In one example, the method further comprises:
responding to the index message of the user;
inquiring a video file in the preset database according to the index message;
and feeding the video file back to the user.
In a second aspect, the present application provides a video file description text generating apparatus, the apparatus including:
the acquisition unit is used for acquiring the video file to be analyzed;
the extraction unit is used for extracting global static features, global dynamic features and local features in the video file to be analyzed; the global static feature is used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used for representing the features of a preset area of the video file to be analyzed;
the determining unit is used for determining action behavior characteristics of a target object in the video file to be analyzed according to the global static characteristics, the global dynamic characteristics and the local characteristics;
the generation unit is used for generating a description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
In one example, the determining unit includes:
the first determining module is used for determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature; the first probability value represents probability distribution conditions of objects in the video file to be analyzed; the second probability value represents the probability distribution condition of the action of each object in the video file to be analyzed; the third probability value represents the probability distribution condition of the behavior of each object in the video file to be analyzed;
the second determining module is used for determining action behavior characteristics of a target object in the video file to be analyzed according to the first probability value, the second probability value and the third probability value; wherein the target object is one of the plurality of objects; wherein the objects of the first probability value, the second probability value and the third probability value are the same.
In one example, a first determination module includes:
the construction submodule is used for constructing scene features according to the global static features and the local features;
the first determining submodule is used for determining each object according to the local characteristics;
and the second determining submodule is used for determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the scene characteristics, each object and the global dynamic characteristics.
In one example, a generating unit includes:
the third determining module is used for determining a feature average value according to the global static feature, the global dynamic feature and the local feature;
and the generation module is used for generating descriptive text according to the characteristic average value and the action behavior characteristic of the target object.
In one example, a generation module includes:
the fusion sub-module is used for fusing the characteristic average value and the action behavior characteristic of the target object to obtain a fusion characteristic;
the first generation submodule is used for inputting the fusion characteristics into the encoder to generate word probability distribution;
and the second generation sub-module is used for generating descriptive text according to the word probability distribution.
In one example, the apparatus includes:
and the storage unit is used for storing the description text and the generation time of the video file to be analyzed into a preset database.
In one example, the apparatus includes:
a response unit for responding to the index message of the user;
the query unit is used for querying the video file in the preset database according to the index message;
and the feedback unit is used for feeding the video file back to the user.
In a third aspect, the present application provides an electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method as described in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions for performing the method according to the first aspect when executed by a processor.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.
The method, the device and the storage medium for generating the video file description text acquire the video file to be analyzed and extract global static features, global dynamic features and local features in the video file to be analyzed; the global static feature is used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used for representing the features of a preset area of the video file to be analyzed. Action behavior characteristics of a target object in the video file to be analyzed are determined according to the global static characteristics, the global dynamic characteristics and the local characteristics, and the description text is generated according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file. By adopting this technical scheme, the method and the device can assist workers in quickly locating the required video files, thereby realizing quick acquisition of the user behaviors through the video files and improving working efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flow chart of a method for generating a video file description text according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a method for generating a video file description text according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of a video file description text generating apparatus according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a video file description text generating apparatus according to a fourth embodiment of the present application;
fig. 5 is a block diagram of an electronic device, according to an example embodiment.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for generating a video file description text according to an embodiment of the present application. The first embodiment comprises the following steps:
s101, acquiring a video file to be analyzed.
In one example, the video file to be analyzed is retrieved from a preset database of the bank and input into a bank monitoring system, so that the video file to be analyzed can be analyzed by the bank monitoring system.
S102, extracting global static features, global dynamic features and local features in a video file to be analyzed; the global static features are used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used to characterize a preset region of the video file to be analyzed.
In this embodiment, the video file to be analyzed contains a plurality of objects, which may be, for example, tables or different persons. The global static feature is the background feature of the video file to be analyzed, for example a scene feature inside the bank. The global dynamic feature is the movement feature of each object in the video file to be analyzed; for example, if person A moves from position A to the position of window B, the trajectory information from position A to window B is a global dynamic feature. It should be noted that the global dynamic feature describes the movement of the plurality of objects rather than the movement of one specified object. The local feature refers to the feature of a preset area, where the preset area may be a partial region; for example, the local feature may be the hand feature of a designated person in the bank.
S103, determining action behavior characteristics of the target object in the video file to be analyzed according to the global static characteristics, the global dynamic characteristics and the local characteristics.
In this embodiment, the action behavior feature of the target object refers to the action performed by the target object; more specifically, the action behavior feature of the target object covers the target object, the action taken (the verb) and the object on which the action is performed. In this embodiment, the action behavior feature of the target object is determined from the global static feature, the global dynamic feature and the local feature.
S104, generating a description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
In this embodiment, the description text includes keywords and text content, and the content of the video file can be obtained through the description text. The global static features, the global dynamic features and the local features are input into a multi-head attention encoder for processing, and the action behavior features of the target object in the video file to be analyzed are then output.
According to the method for generating the video file description text, the video file to be analyzed is obtained, global static features, global dynamic features and local features in the video file to be analyzed are extracted, and then the action behavior features of the target object in the video file to be analyzed are determined according to the global static features, the global dynamic features and the local features; generating a description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file. By adopting the technical scheme, the method and the device can assist workers to quickly locate the required video files, further realize quick acquisition of the user behaviors through the video files and improve the working efficiency.
Fig. 2 is a flowchart of a method for generating a video file description text according to a second embodiment of the present application. The second embodiment includes the following steps:
s201, acquiring a video file to be analyzed.
For example, this step may refer to step S101, and will not be described in detail.
S202, extracting global static features, global dynamic features and local features in a video file to be analyzed; the global static features are used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used to characterize a preset region of the video file to be analyzed.
In this embodiment, global static features in the video file to be analyzed are extracted by the two-dimensional convolutional neural network InceptionResnetV2, and the global static features can be denoted as V_r; global dynamic features in the video file to be analyzed are extracted by the three-dimensional convolutional neural network C3D, and the global dynamic features can be denoted as V_c; and local features in the video file to be analyzed are extracted by the object detector Faster R-CNN, and the local features can be denoted as V_o.
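The patent text gives no reference code, so the following is a minimal sketch of the three feature streams using off-the-shelf PyTorch models as stand-ins for the networks named above (timm's InceptionResnetV2, torchvision's r3d_18 in place of C3D, and torchvision's Faster R-CNN). All function names, shapes and model choices are illustrative assumptions, not the patented implementation.

```python
# Hedged sketch: extract V_r (global static), V_c (global dynamic) and V_o (local)
# features from a clip of sampled, suitably resized video frames.
import torch
import timm
from torchvision.models.video import r3d_18
from torchvision.models.detection import fasterrcnn_resnet50_fpn

static_net = timm.create_model("inception_resnet_v2", pretrained=False, num_classes=0)  # pooled 2D features
dynamic_net = r3d_18()                # 3D CNN stand-in for C3D (pretrained weights would be loaded in practice)
detector = fasterrcnn_resnet50_fpn()  # region detector for objects / the preset area
static_net.eval(); dynamic_net.eval(); detector.eval()

@torch.no_grad()
def extract_features(frames: torch.Tensor):
    """frames: (T, 3, H, W) float tensor of sampled video frames in [0, 1]."""
    v_r = static_net(frames).mean(dim=0)            # V_r: per-frame 2D features averaged over time
    clip = frames.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, T, H, W) for the 3D CNN
    v_c = dynamic_net(clip).squeeze(0)              # V_c: clip-level dynamic feature (penultimate layer in practice)
    dets = detector(list(frames))                   # one dict of boxes / labels / scores per frame
    v_o = [(d["boxes"], d["labels"], d["scores"]) for d in dets]  # V_o: local (region) features
    return v_r, v_c, v_o
```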
S203, determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature.
In this embodiment, the first probability value indicates the probability of each object being present; for example, the first probability value of object A is P_1(s_1) and the first probability value of object B is P_1(s_2). The second probability value characterizes the probability of each object's action; for example, if the action of object A is transacting business at a window and the action of object B is transacting business on a self-service machine, the second probability value of object A is P_2(a1) and the second probability value of object B is P_2(a2). The third probability value characterizes the object on which each object's action is performed, which may be an item; for example, the third probability value of object A is P_3(o1) and the third probability value of object B is P_3(o2).
In one example, determining a first probability value, a second probability value, and a third probability value in a video file to be analyzed based on global static features, global dynamic features, and local features includes:
constructing scene features according to the global static features and the local features;
determining each object according to the local characteristics;
and determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the scene characteristics, each object and the global dynamic characteristics.
In this embodiment, the scene features are used to represent the current scene information; specifically, they can be formed by concatenating the global static feature matrix and the local feature matrix. In this embodiment, a plurality of objects can be determined by using the local features V_o and the object detector Faster R-CNN. After the plurality of objects are determined, the relative position information of the plurality of objects is input into the scene features, and the position information of each object is determined, which can be expressed by the following formula:
L_i = Relu(W_L [X, Y, W, H]);
where L_i denotes the position information of the object, Relu is the activation function, X denotes the abscissa of the object, Y denotes the ordinate of the object, W denotes the width of the object, and H denotes the height of the object. Further, the determined objects are combined with the global dynamic features to obtain the second probability value of each object's action. Specifically, this can be realized by the following formula:
P_2(a) = softmax(W_a Relu[Emb(s_n) : V_c]);
where s_n is an object, V_c is the global dynamic feature, Emb denotes the embedding of a word in the vocabulary, [:] denotes the concatenation of two matrices, and Relu is the activation function.
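A hedged sketch of these two formulas is given below, assuming learned projections W_L and W_a and an embedding table Emb; the class name, dimensions and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectActionHead(nn.Module):
    def __init__(self, vocab_size, emb_dim, dyn_dim, pos_dim, n_actions):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # Emb(.)
        self.w_l = nn.Linear(4, pos_dim)                    # W_L over [X, Y, W, H]
        self.w_a = nn.Linear(emb_dim + dyn_dim, n_actions)  # W_a

    def position_info(self, boxes):
        # L_i = Relu(W_L [X, Y, W, H]) for each detected object box; boxes: (n_objects, 4)
        return F.relu(self.w_l(boxes))

    def action_distribution(self, object_ids, v_c):
        # P_2(a) = softmax(W_a Relu[Emb(s_n) : V_c])
        s_n = self.emb(object_ids)                          # (n_objects, emb_dim)
        v_c = v_c.expand(s_n.size(0), -1)                   # broadcast the global dynamic feature
        fused = F.relu(torch.cat([s_n, v_c], dim=-1))       # concatenation [Emb(s_n) : V_c]
        return F.softmax(self.w_a(fused), dim=-1)           # (n_objects, n_actions)
```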
S204, determining action behavior characteristics of a target object in the video file to be analyzed according to the first probability value, the second probability value and the third probability value; wherein the target object is one of a plurality of objects.
In this embodiment, the action behavior features of the target object in the video file to be analyzed are determined by a self-attention mechanism according to the following formulas:
P_1(s_n) = SelfAttention(Emb(s_n), V_o, V_o);
P_2(a) = SelfAttention(Emb(a), V_c, V_c);
P_3(o) = SelfAttention(Emb(o), V_o, V_o);
F_action = P(α) · Emb(β);
where s_n, a and o denote the object, the action taken and the object on which the action is performed, respectively; V_o and V_c denote the local features and the global dynamic features, respectively; α ∈ {s_n, a, o} and β ∈ {s_n, a, o}; [·] is the dot product operation; F_action denotes the action behavior feature of the target object; and Emb denotes the embedding of a word in the vocabulary.
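The sketch below illustrates one way this self-attention step and the F_action computation could look, using PyTorch's nn.MultiheadAttention as the SelfAttention operator; the scoring and the weighted-sum fusion are assumptions made for illustration, not the patented formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionBehaviorFeature(nn.Module):
    def __init__(self, vocab_size, dim, n_heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def _prob(self, token_ids, memory):
        # SelfAttention(Emb(x), V, V): query = candidate word embeddings, key = value = visual features
        q = self.emb(token_ids).unsqueeze(0)             # (1, n_candidates, dim)
        out, _ = self.attn(q, memory, memory)            # (1, n_candidates, dim)
        return F.softmax(out.squeeze(0).sum(-1), dim=0)  # crude score -> one probability per candidate

    def forward(self, subj_ids, act_ids, obj_ids, v_o, v_c):
        # v_o: (n_regions, dim) local features; v_c: (n_clips, dim) global dynamic features
        p1 = self._prob(subj_ids, v_o.unsqueeze(0))      # P_1(s_n)
        p2 = self._prob(act_ids, v_c.unsqueeze(0))       # P_2(a)
        p3 = self._prob(obj_ids, v_o.unsqueeze(0))       # P_3(o)
        # F_action = P(alpha) . Emb(beta): probability-weighted embeddings of subject, action and object
        f_action = (p1.unsqueeze(-1) * self.emb(subj_ids)).sum(0) \
                 + (p2.unsqueeze(-1) * self.emb(act_ids)).sum(0) \
                 + (p3.unsqueeze(-1) * self.emb(obj_ids)).sum(0)
        return f_action
```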
S205, determining a feature average value according to the global static feature, the global dynamic feature and the local feature.
In this embodiment, the feature average value is the average of the global static feature, the global dynamic feature and the local feature, and can be denoted as F_visual.
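A one-line sketch of this step, under the assumption that the three features have already been projected to a common dimension:

```python
import torch

def feature_average(v_r: torch.Tensor, v_c: torch.Tensor, v_o: torch.Tensor) -> torch.Tensor:
    # F_visual: element-wise mean of the global static, global dynamic and (pooled) local features
    return torch.stack([v_r, v_c, v_o], dim=0).mean(dim=0)
```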
S206, generating a description text according to the characteristic average value and the action behavior characteristic of the target object.
In one example, generating descriptive text from the feature mean and the action behavior feature of the target object includes:
fusing the characteristic average value and the action behavior characteristic of the target object to obtain a fused characteristic;
inputting the fusion characteristics into an encoder to generate word probability distribution;
and generating descriptive text according to the word probability distribution.
In this embodiment, the feature average value and the action behavior feature of the target object are fused to obtain the fused features according to fusion formulas (published only as images in the original document and not reproduced here), where F_action denotes the action behavior feature of the target object and F_visual denotes the feature average value. The fused features are input into two fully-connected layers and then into the encoder to obtain the word probability distribution, which can be obtained through the following formula:
P(w_t) = softmax(F_n);
where P(w_t) is the word probability distribution.
In this embodiment, after determining the probability distribution of the words, the word with the highest probability value is used as the description text.
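Because the exact fusion formulas appear only as images in the original publication, the sketch below substitutes a simple gated fusion of F_visual and F_action, followed by the two fully-connected layers, a softmax over the vocabulary, and a greedy choice of the highest-probability word; every formula and name here is an illustrative assumption, not the patented method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptionDecoderStep(nn.Module):
    """Assumed fusion + two fully-connected layers + softmax over the vocabulary."""
    def __init__(self, dim, hidden, vocab_size):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # assumed gating; the patent's fusion formula is not recoverable
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, vocab_size)

    def forward(self, f_visual, f_action):
        g = torch.sigmoid(self.gate(torch.cat([f_visual, f_action], dim=-1)))
        fused = g * f_visual + (1 - g) * f_action  # fused feature (assumed form)
        f_n = self.fc2(F.relu(self.fc1(fused)))    # two fully-connected layers
        return F.softmax(f_n, dim=-1)              # P(w_t): word probability distribution

# Usage sketch: pick the highest-probability word as the next word of the description text.
# step = DescriptionDecoderStep(dim=512, hidden=1024, vocab_size=10000)
# next_word_id = step(f_visual, f_action).argmax(dim=-1)
```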
S207, storing the description text and the generation time of the video file to be analyzed into a preset database.
In this embodiment, the preset database includes a plurality of description texts and a generation time of each video file, and is used for subsequent query of a user, so that the video files required by quick positioning can be realized.
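A minimal sketch of this storage step using SQLite; the database path, table name and columns are illustrative assumptions rather than anything specified by the patent.

```python
import sqlite3

def store_description(db_path: str, video_id: str, description_text: str, generated_at: str) -> None:
    """Store the description text and generation time of a video file to be analyzed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video_descriptions ("
        "video_id TEXT, description TEXT, generated_at TEXT)"
    )
    conn.execute(
        "INSERT INTO video_descriptions (video_id, description, generated_at) VALUES (?, ?, ?)",
        (video_id, description_text, generated_at),
    )
    conn.commit()
    conn.close()
```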
In one example, the index message is responsive to a user.
In this embodiment, the index message of the user is a keyword, a key field, and a time range entered by the user.
In one example, a video file is queried in a preset database according to an index message.
In this embodiment, the keywords typed by the user are compared with the description texts or the generation times in the preset database; if they match, the related sentences and times are screened out and displayed in the information field on the right, and the corresponding time marks are displayed in the time field of the bank monitoring system, so that the video file is determined.
In one example, the video file is fed back to the user.
In this embodiment, the video file is displayed to the user through the interface of the monitoring system.
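A hedged sketch of the index and feedback steps against the same illustrative SQLite table used above, matching a keyword and an optional time range; the field names and query shape are assumptions.

```python
import sqlite3

def query_videos(db_path: str, keyword: str, start_time: str = None, end_time: str = None):
    """Return (video_id, description, generated_at) rows matching the user's index message."""
    conn = sqlite3.connect(db_path)
    sql = "SELECT video_id, description, generated_at FROM video_descriptions WHERE description LIKE ?"
    params = [f"%{keyword}%"]
    if start_time and end_time:
        sql += " AND generated_at BETWEEN ? AND ?"
        params += [start_time, end_time]
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows  # shown to the user, e.g. through the monitoring system interface
```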
According to the method for generating the video file description text provided in this embodiment, the first probability value of the plurality of objects in the video file to be analyzed, the second probability value of each object's action and the third probability value of each object's behavior are determined according to the global static features, the global dynamic features and the local features; the action behavior features of the target object in the video file to be analyzed are determined according to the first probability value, the second probability value and the third probability value; the feature average value is determined according to the global static features, the global dynamic features and the local features; the description text is generated according to the feature average value and the action behavior features of the target object; the description text and the generation time of the video file to be analyzed are stored in the preset database; and the video file is fed back to the user according to the user's subsequent query. By adopting this technical scheme, emergencies and accidents can be investigated by searching for and locating related content through keywords, so that quick retrieval of bank surveillance video is realized even with huge amounts of video data; and since video files occupy a large amount of storage space and cannot be kept for a long time, the information can instead be stored as the description texts of the video files to assist later verification.
Fig. 3 is a schematic structural diagram of a video file description text generating apparatus according to a third embodiment of the present application. Specifically, the apparatus 30 of the third embodiment includes:
an obtaining unit 301, configured to obtain a video file to be analyzed.
The extracting unit 302 is configured to extract global static features, global dynamic features and local features in the video file to be analyzed; the global static features are used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used to characterize a preset region of the video file to be analyzed.
The determining unit 303 is configured to determine an action behavior feature of the target object in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature.
The generating unit 304 is configured to generate a description text according to the global static feature, the global dynamic feature, the local feature, and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the above-described apparatus may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Fig. 4 is a schematic structural diagram of a video file description text generating apparatus according to a fourth embodiment of the present application. Specifically, the apparatus 40 of the fourth embodiment includes:
an obtaining unit 401, configured to obtain a video file to be analyzed.
An extracting unit 402, configured to extract global static features, global dynamic features and local features in a video file to be analyzed; the global static features are used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used to characterize a preset region of the video file to be analyzed.
The determining unit 403 is configured to determine an action behavior feature of the target object in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature.
A generating unit 404, configured to generate a description text according to the global static feature, the global dynamic feature, the local feature, and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
In one example, the determining unit 403 includes:
a first determining module 4031, configured to determine a first probability value, a second probability value, and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature, and the local feature; the first probability value represents probability distribution conditions of objects in the video file to be analyzed; the second probability value represents the probability distribution condition of the action of each object in the video file to be analyzed; the third probability value characterizes the probability distribution of the behavior of each object in the video file to be analyzed;
A second determining module 4032, configured to determine an action behavior feature of the target object in the video file to be analyzed according to the first probability value, the second probability value, and the third probability value; wherein the target object is one of a plurality of objects; wherein the objects of the first probability value, the second probability value and the third probability value are the same.
In one example, the first determination module 4031 includes:
a building sub-module 40311 is configured to build scene features from the global static features and the local features.
A first determining submodule 40312 is configured to determine each object according to the local feature.
The second determining submodule 40313 is used for determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the scene characteristics, each object and the global dynamic characteristics.
In one example, the generating unit 404 includes:
the third determining module 4041 is configured to determine a feature average value according to the global static feature, the global dynamic feature, and the local feature.
The generating module 4042 is configured to generate descriptive text according to the feature average value and the action behavior feature of the target object.
In one example, the generation module 4042 includes:
The fusion submodule 40421 is used for fusing the characteristic average value and the action behavior characteristic of the target object to obtain a fusion characteristic.
A first generation sub-module 40422 for inputting the fusion features into the encoder to generate a word probability distribution.
A second generation sub-module 40423 is configured to generate descriptive text according to the word probability distribution.
In one example, the apparatus 40 includes:
and the storage unit 405 is configured to store the description text and the generation time of the video file to be analyzed in a preset database.
In one example, the apparatus 40 includes:
and a response unit 406 for responding to the index message of the user.
The query unit 407 is configured to query the preset database for the video file according to the index message.
And a feedback unit 408, configured to feed back the video file to the user.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the above-described apparatus may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Fig. 5 is a block diagram of an electronic device, which may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like, in accordance with an exemplary embodiment.
The apparatus 500 may include one or more of the following components: a processing component 502, a memory 504, a power supply component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.
The processing component 502 generally controls overall operation of the apparatus 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interactions between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operations at the apparatus 500. Examples of such data include instructions for any application or method operating on the apparatus 500, contact data, phonebook data, messages, pictures, videos, and the like. The memory 504 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 506 provides power to the various components of the device 500. The power components 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 500.
The multimedia component 508 includes a screen between the device 500 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation. In some embodiments, the multimedia component 508 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the apparatus 500 is in an operational mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 504 or transmitted via the communication component 516. In some embodiments, the audio component 510 further comprises a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 514 includes one or more sensors for providing status assessment of various aspects of the apparatus 500. For example, the sensor assembly 514 may detect the on/off state of the device 500, the relative positioning of the components, such as the display and keypad of the device 500, the sensor assembly 514 may also detect a change in position of the device 500 or a component of the device 500, the presence or absence of user contact with the device 500, the orientation or acceleration/deceleration of the device 500, and a change in temperature of the device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate communication between the apparatus 500 and other devices in a wired or wireless manner. The apparatus 500 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 516 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 504, including instructions executable by processor 520 of apparatus 500 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer-readable storage medium, instructions in which, when executed by a processor of an electronic device, cause the electronic device to perform the video file description text generation method described above.
The application also discloses a computer program product comprising a computer program which, when executed by a processor, implements a method as described in the present embodiment.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or electronic device.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data electronic device), or that includes a middleware component (e.g., an application electronic device), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and an electronic device. The client and the electronic device are generally remote from each other and typically interact through a communication network. The relationship of client and electronic devices arises by virtue of computer programs running on the respective computers and having a client-electronic device relationship to each other. The electronic equipment can be cloud electronic equipment, also called cloud computing electronic equipment or cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service (Virtual Private Server or VPS for short) are overcome. The electronic device may also be an electronic device of a distributed system or an electronic device that incorporates a blockchain. It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (11)

1. A method for generating a video file description text, the method comprising:
acquiring a video file to be analyzed;
extracting global static features, global dynamic features and local features in the video file to be analyzed; the global static feature is used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used for representing the features of a preset area of the video file to be analyzed;
determining action behavior characteristics of a target object in the video file to be analyzed according to the global static characteristics, the global dynamic characteristics and the local characteristics;
generating description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
2. The method according to claim 1, wherein the determining the action behavior feature of the target object in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature comprises:
determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature; the first probability value represents probability distribution conditions of objects in the video file to be analyzed; the second probability value represents the probability distribution condition of the action of each object in the video file to be analyzed; the third probability value represents the probability distribution condition of the behavior of each object in the video file to be analyzed;
determining action behavior characteristics of a target object in the video file to be analyzed according to the first probability value, the second probability value and the third probability value; wherein the target object is one of the plurality of objects; wherein the objects of the first probability value, the second probability value and the third probability value are the same.
3. The method of claim 2, wherein determining the first probability value, the second probability value and the third probability value in the video file to be analyzed according to the global static features, the global dynamic features and the local features comprises:
constructing scene features according to the global static features and the local features;
determining each object according to the local features;
and determining the first probability value, the second probability value and the third probability value in the video file to be analyzed according to the scene features, each object and the global dynamic features.
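Illustratively, the scene-feature construction of claim 3 could be sketched as below, assuming concatenation followed by a linear projection; the fusion operator and all names are assumptions, not taken from the application.

import torch
import torch.nn as nn

DIM = 512
project = nn.Linear(2 * DIM, DIM)   # hypothetical projection back to the common feature size

def build_scene_feature(global_static, local_feats):
    # pool the region-level local features and fuse them with the global static (background) feature
    pooled_local = local_feats.mean(dim=0)
    return project(torch.cat([global_static, pooled_local], dim=-1))

scene = build_scene_feature(torch.randn(DIM), torch.randn(8, DIM))   # dummy inputs
objects = torch.randn(8, DIM)   # per-region object features taken from the local features
# the scene features, the objects and the global dynamic features would then feed the probability heads of claim 2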
4. The method of claim 1, wherein generating the description text according to the global static features, the global dynamic features, the local features and the action behavior features of the target object comprises:
determining a feature average value according to the global static features, the global dynamic features and the local features;
and generating the description text according to the feature average value and the action behavior features of the target object.
5. The method of claim 4, wherein generating the description text according to the feature average value and the action behavior features of the target object comprises:
fusing the feature average value and the action behavior features of the target object to obtain a fused feature;
inputting the fused feature into an encoder to generate a word probability distribution;
and generating the description text according to the word probability distribution.
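As an illustration of claims 4 and 5, the sketch below averages the three feature groups, fuses the average with the action behavior feature, and lets a recurrent cell stand in for the encoder of claim 5, emitting a word probability distribution that is greedily decoded into text; the toy vocabulary, module names and dimensions are assumptions.

import torch
import torch.nn as nn

DIM = 512
VOCAB = ["<eos>", "a", "person", "opens", "the", "door"]   # toy vocabulary

fuse = nn.Linear(2 * DIM, DIM)
rnn = nn.GRUCell(DIM, DIM)
to_vocab = nn.Linear(DIM, len(VOCAB))

def generate(static_f, dynamic_f, local_f, action_f, max_len=10):
    feat_avg = torch.stack([static_f, dynamic_f, local_f]).mean(dim=0)   # claim 4: feature average value
    fused = fuse(torch.cat([feat_avg, action_f], dim=-1)).unsqueeze(0)   # claim 5: fused feature
    h, words = torch.zeros(1, DIM), []
    for _ in range(max_len):
        h = rnn(fused, h)                          # stand-in for the encoder named in claim 5
        word_probs = to_vocab(h).softmax(dim=-1)   # word probability distribution
        idx = int(word_probs.argmax())
        if VOCAB[idx] == "<eos>":
            break
        words.append(VOCAB[idx])
    return " ".join(words)

print(generate(torch.randn(DIM), torch.randn(DIM), torch.randn(DIM), torch.randn(DIM)))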
6. The method of claim 1, further comprising, after generating the description text:
storing the description text and the generation time of the video file to be analyzed in a preset database.
7. The method of claim 6, further comprising:
in response to an index message from a user, querying the preset database for a video file according to the index message; and
feeding the video file back to the user.
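For claims 6 and 7, a minimal sketch of storing the description text with its generation time and querying it back, using SQLite purely as a stand-in for the preset database; the table name, column names and the index-message matching rule are invented for this example.

import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")   # stand-in for the preset database
conn.execute("CREATE TABLE video_text (video_id TEXT, description TEXT, generated_at TEXT)")

def store(video_id, description):
    # claim 6: store the description text and the generation time of the analyzed video file
    conn.execute("INSERT INTO video_text VALUES (?, ?, ?)",
                 (video_id, description, datetime.now().isoformat()))

def query(index_message):
    # claim 7: query the preset database according to the user's index message and feed the result back
    cur = conn.execute("SELECT video_id, description FROM video_text WHERE description LIKE ?",
                       (f"%{index_message}%",))
    return cur.fetchall()

store("clip_001.mp4", "a person opens the door and walks in")
print(query("door"))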
8. A video file description text generation apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the video file to be analyzed;
the extraction unit is used for extracting global static features, global dynamic features and local features from the video file to be analyzed; wherein the global static features are used to represent background features of the video file to be analyzed, the global dynamic features are used to represent motion features of each object in the video file to be analyzed, and the local features are used to represent features of a preset area of the video file to be analyzed;
the determining unit is used for determining action behavior features of a target object in the video file to be analyzed according to the global static features, the global dynamic features and the local features;
the generation unit is used for generating a description text according to the global static features, the global dynamic features, the local features and the action behavior features of the target object; wherein the description text is used to characterize the content of the video file to be analyzed.
9. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1-7.
10. A computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1-7.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202310258197.8A 2023-03-10 2023-03-10 Video file description text generation method, device and storage medium Pending CN116229328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310258197.8A CN116229328A (en) 2023-03-10 2023-03-10 Video file description text generation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310258197.8A CN116229328A (en) 2023-03-10 2023-03-10 Video file description text generation method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116229328A true CN116229328A (en) 2023-06-06

Family

ID=86589078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310258197.8A Pending CN116229328A (en) 2023-03-10 2023-03-10 Video file description text generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116229328A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination