CN116229328A - Video file description text generation method, device and storage medium - Google Patents

Video file description text generation method, device and storage medium

Info

Publication number
CN116229328A
CN116229328A (application CN202310258197.8A)
Authority
CN
China
Prior art keywords
feature
video file
analyzed
global
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310258197.8A
Other languages
Chinese (zh)
Inventor
常志
陈永录
孙彦南
董甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202310258197.8A
Publication of CN116229328A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method, a device and a storage medium for generating a video file description text, and relates to the technical field of big data. The method comprises: acquiring a video file to be analyzed, and extracting global static features, global dynamic features and local features from the video file to be analyzed; determining action behavior features of a target object in the video file to be analyzed according to the global static features, the global dynamic features and the local features; and generating the description text according to the global static features, the global dynamic features, the local features and the action behavior features of the target object. By adopting this technical scheme, the method and the device can assist workers in quickly locating the required video files, thereby realizing quick acquisition of user behaviors through the video files and improving working efficiency.

Description

Video file description text generation method, device and storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a method and an apparatus for generating a description text of a video file, and a storage medium.
Background
At present, many surveillance cameras are installed in banks. Customer behaviors can be analyzed through the surveillance videos, behavior data of customers in a bank business hall can be obtained in real time, and sudden problems can be identified in time, so that accident investigation can be carried out.
Currently, if related content is to be searched for in the surveillance videos of a bank, the videos need to be checked one by one, which consumes huge manpower and time.
Therefore, a method for generating a description text of a video file is needed, which can assist a worker in quickly locating a required video file, thereby realizing quick acquisition of user behaviors through the video file and improving working efficiency.
Disclosure of Invention
The method, the device and the storage medium for generating the video file description text provided by the application can assist workers in quickly locating the required video file, thereby realizing quick acquisition of the user behaviors through the video file and improving working efficiency.
In a first aspect, the present application provides a method for generating a description text of a video file, including:
acquiring a video file to be analyzed;
extracting global static features, global dynamic features and local features in the video file to be analyzed; the global static feature is used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used for representing the features of a preset area of the video file to be analyzed;
determining action behavior characteristics of a target object in the video file to be analyzed according to the global static characteristics, the global dynamic characteristics and the local characteristics;
generating description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
In one example, the determining the action behavior feature of the target object in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature includes:
determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature; the first probability value represents probability distribution conditions of objects in the video file to be analyzed; the second probability value represents the probability distribution condition of the action of each object in the video file to be analyzed; the third probability value represents the probability distribution condition of the behavior of each object in the video file to be analyzed;
determining action behavior characteristics of a target object in the video file to be analyzed according to the first probability value, the second probability value and the third probability value; wherein the target object is one of the plurality of objects; wherein the objects of the first probability value, the second probability value and the third probability value are the same.
In one example, the determining the first probability value of the plurality of objects in the video file to be analyzed, the second probability value of each of the object actions, and the third probability value of each of the object behaviors according to the global static feature, the global dynamic feature, and the local feature includes:
constructing scene features according to the global static features and the local features;
determining each object according to the local characteristics;
and determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the scene characteristics, each object and the global dynamic characteristics.
In one example, the generating descriptive text from the global static feature, the global dynamic feature, the local feature, and the action behavior feature of the target object includes:
determining a feature average value according to the global static feature, the global dynamic feature and the local feature;
and generating descriptive text according to the characteristic average value and the action behavior characteristic of the target object.
In one example, the generating descriptive text according to the feature average value and the action behavior feature of the target object includes:
fusing the characteristic average value and the action behavior characteristic of the target object to obtain a fused characteristic;
inputting the fusion characteristics into an encoder to generate word probability distribution;
and generating descriptive text according to the word probability distribution.
In one example, after the generating the descriptive text, further comprising:
and storing the description text and the generation time of the video file to be analyzed into a preset database.
In one example, the method further comprises:
responding to the index message of the user;
inquiring a video file in the preset database according to the index message;
and feeding the video file back to the user.
In a second aspect, the present application provides a video file description text generating apparatus, the apparatus including:
the acquisition unit is used for acquiring the video file to be analyzed;
the extraction unit is used for extracting global static features, global dynamic features and local features in the video file to be analyzed; the global static feature is used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used for representing the features of a preset area of the video file to be analyzed;
the determining unit is used for determining action behavior characteristics of a target object in the video file to be analyzed according to the global static characteristics, the global dynamic characteristics and the local characteristics;
the generation unit is used for generating a description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
In one example, the determining unit includes:
the first determining module is used for determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature; the first probability value represents probability distribution conditions of objects in the video file to be analyzed; the second probability value represents the probability distribution condition of the action of each object in the video file to be analyzed; the third probability value represents the probability distribution condition of the behavior of each object in the video file to be analyzed;
the second determining module is used for determining action behavior characteristics of a target object in the video file to be analyzed according to the first probability value, the second probability value and the third probability value; wherein the target object is one of the plurality of objects; wherein the objects of the first probability value, the second probability value and the third probability value are the same.
In one example, a first determination module includes:
the construction submodule is used for constructing scene features according to the global static features and the local features;
the first determining submodule is used for determining each object according to the local characteristics;
and the second determining submodule is used for determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the scene characteristics, each object and the global dynamic characteristics.
In one example, a generating unit includes:
the third determining module is used for determining a feature average value according to the global static feature, the global dynamic feature and the local feature;
and the generation module is used for generating descriptive text according to the characteristic average value and the action behavior characteristic of the target object.
In one example, a generation module includes:
the fusion sub-module is used for fusing the characteristic average value and the action behavior characteristic of the target object to obtain a fusion characteristic;
the first generation submodule is used for inputting the fusion characteristics into the encoder to generate word probability distribution;
and the second generation sub-module is used for generating descriptive text according to the word probability distribution.
In one example, the apparatus includes:
and the storage unit is used for storing the description text and the generation time of the video file to be analyzed into a preset database.
In one example, the apparatus includes:
a response unit for responding to the index message of the user;
the query unit is used for querying the video file in the preset database according to the index message;
and the feedback unit is used for feeding the video file back to the user.
In a third aspect, the present application provides an electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method as described in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions for performing the method according to the first aspect when executed by a processor.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.
The method, the device and the storage medium for generating the video file description text acquire the video file to be analyzed and extract global static features, global dynamic features and local features in the video file to be analyzed; the global static feature is used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used for representing the features of a preset area of the video file to be analyzed. Action behavior characteristics of a target object in the video file to be analyzed are determined according to the global static characteristics, the global dynamic characteristics and the local characteristics, and the description text is generated according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file. By adopting this technical scheme, the method and the device can assist workers in quickly locating the required video files, thereby realizing quick acquisition of the user behaviors through the video files and improving working efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flow chart of a method for generating a video file description text according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a method for generating a video file description text according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of a video file description text generating apparatus according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a video file description text generating apparatus according to a fourth embodiment of the present application;
fig. 5 is a block diagram of an electronic device, according to an example embodiment.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for generating a video file description text according to an embodiment of the present application. The first embodiment comprises the following steps:
s101, acquiring a video file to be analyzed.
In one example, the video file to be analyzed is retrieved from a preset database of the bank and input into a bank monitoring system, so that the video file to be analyzed can be analyzed by the bank monitoring system.
S102, extracting global static features, global dynamic features and local features in a video file to be analyzed; the global static features are used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used to characterize a preset region of the video file to be analyzed.
In this embodiment, the video file to be analyzed contains a plurality of objects, which may be, for example, tables or different persons. The global static feature is the background feature of the video file to be analyzed, for example a scene feature inside the bank. The global dynamic feature is the movement feature of each object in the video file to be analyzed; for example, if person A moves from position A to the position of window B, the trajectory information from position A to window B is a global dynamic feature. It should be noted that the global dynamic feature describes the movement of the plurality of objects rather than the movement of one specified object. The local feature refers to the feature of a preset area, where the preset area may be a partial region; for example, the local feature may be the hand feature of a designated person in the bank.
S103, determining action behavior characteristics of the target object in the video file to be analyzed according to the global static characteristics, the global dynamic characteristics and the local characteristics.
In this embodiment, the action behavior feature of the target object refers to the action performed by the target object; more specifically, the action behavior feature of the target object covers the target object, the action taken (the verb) and the object on which the action is performed. In this embodiment, the action behavior feature of the target object is determined from the global static feature, the global dynamic feature and the local feature.
S104, generating a description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
In this embodiment, the description text includes keywords and text content, and the content of the video file can be obtained through the description text. The global static features, the global dynamic features and the local features are input into a multi-head attention encoder for processing, and the action behavior features of the target object in the video file to be analyzed are then output.
According to the method for generating the video file description text, the video file to be analyzed is obtained, global static features, global dynamic features and local features in the video file to be analyzed are extracted, and then the action behavior features of the target object in the video file to be analyzed are determined according to the global static features, the global dynamic features and the local features; generating a description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file. By adopting the technical scheme, the method and the device can assist workers to quickly locate the required video files, further realize quick acquisition of the user behaviors through the video files and improve the working efficiency.
Fig. 2 is a flowchart of a method for generating a video file description text according to a second embodiment of the present application. The second embodiment includes the following steps:
s201, acquiring a video file to be analyzed.
For example, this step may refer to step S101, and will not be described in detail.
S202, extracting global static features, global dynamic features and local features in a video file to be analyzed; the global static features are used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used to characterize a preset region of the video file to be analyzed.
In this embodiment, global static features in the video file to be analyzed are extracted by the two-dimensional convolutional neural network InceptionResnetV2, and the global static features can be denoted as V_r; global dynamic features in the video file to be analyzed are extracted by the three-dimensional convolutional neural network C3D, and the global dynamic features can be denoted as V_c; and local features in the video file to be analyzed are extracted by the object detector Faster R-CNN, and the local features can be denoted as V_o.
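The patent text gives no reference code, so the following is a minimal sketch of the three feature streams using off-the-shelf PyTorch models as stand-ins for the networks named above (timm's InceptionResnetV2, torchvision's r3d_18 in place of C3D, and torchvision's Faster R-CNN). All function names, shapes and model choices are illustrative assumptions, not the patented implementation.

```python
# Hedged sketch: extract V_r (global static), V_c (global dynamic) and V_o (local)
# features from a clip of sampled, suitably resized video frames.
import torch
import timm
from torchvision.models.video import r3d_18
from torchvision.models.detection import fasterrcnn_resnet50_fpn

static_net = timm.create_model("inception_resnet_v2", pretrained=False, num_classes=0)  # pooled 2D features
dynamic_net = r3d_18()                # 3D CNN stand-in for C3D (pretrained weights would be loaded in practice)
detector = fasterrcnn_resnet50_fpn()  # region detector for objects / the preset area
static_net.eval(); dynamic_net.eval(); detector.eval()

@torch.no_grad()
def extract_features(frames: torch.Tensor):
    """frames: (T, 3, H, W) float tensor of sampled video frames in [0, 1]."""
    v_r = static_net(frames).mean(dim=0)            # V_r: per-frame 2D features averaged over time
    clip = frames.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, T, H, W) for the 3D CNN
    v_c = dynamic_net(clip).squeeze(0)              # V_c: clip-level dynamic feature (penultimate layer in practice)
    dets = detector(list(frames))                   # one dict of boxes / labels / scores per frame
    v_o = [(d["boxes"], d["labels"], d["scores"]) for d in dets]  # V_o: local (region) features
    return v_r, v_c, v_o
```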
S203, determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature.
In this embodiment, the first probability value indicates the probability of each object being present; for example, the first probability value of object A is P_1(s_1) and the first probability value of object B is P_1(s_2). The second probability value characterizes the probability of each object's action; for example, if the action of object A is transacting business at a window and the action of object B is transacting business on a self-service machine, the second probability value of object A is P_2(a1) and the second probability value of object B is P_2(a2). The third probability value characterizes the object on which each object's action is performed, which may be an item; for example, the third probability value of object A is P_3(o1) and the third probability value of object B is P_3(o2).
In one example, determining a first probability value, a second probability value, and a third probability value in a video file to be analyzed based on global static features, global dynamic features, and local features includes:
constructing scene features according to the global static features and the local features;
determining each object according to the local characteristics;
and determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the scene characteristics, each object and the global dynamic characteristics.
In this embodiment, the scene features are used to represent the current scene information; specifically, they can be formed by concatenating the global static feature matrix and the local feature matrix. In this embodiment, a plurality of objects can be determined by using the local features V_o and the object detector Faster R-CNN. After the plurality of objects are determined, the relative position information of the plurality of objects is input into the scene features, and the position information of each object is determined, which can be expressed by the following formula:
L_i = Relu(W_L [X, Y, W, H]);
where L_i denotes the position information of the object, Relu is the activation function, X denotes the abscissa of the object, Y denotes the ordinate of the object, W denotes the width of the object, and H denotes the height of the object. Further, the determined objects are combined with the global dynamic features to obtain the second probability value of each object's action. Specifically, this can be realized by the following formula:
P_2(a) = softmax(W_a Relu[Emb(s_n) : V_c]);
where s_n is an object, V_c is the global dynamic feature, Emb denotes the embedding of a word in the vocabulary, [:] denotes the concatenation of two matrices, and Relu is the activation function.
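A hedged sketch of these two formulas is given below, assuming learned projections W_L and W_a and an embedding table Emb; the class name, dimensions and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectActionHead(nn.Module):
    def __init__(self, vocab_size, emb_dim, dyn_dim, pos_dim, n_actions):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # Emb(.)
        self.w_l = nn.Linear(4, pos_dim)                    # W_L over [X, Y, W, H]
        self.w_a = nn.Linear(emb_dim + dyn_dim, n_actions)  # W_a

    def position_info(self, boxes):
        # L_i = Relu(W_L [X, Y, W, H]) for each detected object box; boxes: (n_objects, 4)
        return F.relu(self.w_l(boxes))

    def action_distribution(self, object_ids, v_c):
        # P_2(a) = softmax(W_a Relu[Emb(s_n) : V_c])
        s_n = self.emb(object_ids)                          # (n_objects, emb_dim)
        v_c = v_c.expand(s_n.size(0), -1)                   # broadcast the global dynamic feature
        fused = F.relu(torch.cat([s_n, v_c], dim=-1))       # concatenation [Emb(s_n) : V_c]
        return F.softmax(self.w_a(fused), dim=-1)           # (n_objects, n_actions)
```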
S204, determining action behavior characteristics of a target object in the video file to be analyzed according to the first probability value, the second probability value and the third probability value; wherein the target object is one of a plurality of objects.
In this embodiment, the action behavior features of the target object in the video file to be analyzed are determined by a self-attention mechanism according to the following formulas:
P_1(s_n) = SelfAttention(Emb(s_n), V_o, V_o);
P_2(a) = SelfAttention(Emb(a), V_c, V_c);
P_3(o) = SelfAttention(Emb(o), V_o, V_o);
F_action = P(α) · Emb(β);
where s_n, a and o denote the object, the action taken and the object on which the action is performed, respectively; V_o and V_c denote the local features and the global dynamic features, respectively; α ∈ {s_n, a, o} and β ∈ {s_n, a, o}; [·] is the dot product operation; F_action denotes the action behavior feature of the target object; and Emb denotes the embedding of a word in the vocabulary.
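The sketch below illustrates one way this self-attention step and the F_action computation could look, using PyTorch's nn.MultiheadAttention as the SelfAttention operator; the scoring and the weighted-sum fusion are assumptions made for illustration, not the patented formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionBehaviorFeature(nn.Module):
    def __init__(self, vocab_size, dim, n_heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def _prob(self, token_ids, memory):
        # SelfAttention(Emb(x), V, V): query = candidate word embeddings, key = value = visual features
        q = self.emb(token_ids).unsqueeze(0)             # (1, n_candidates, dim)
        out, _ = self.attn(q, memory, memory)            # (1, n_candidates, dim)
        return F.softmax(out.squeeze(0).sum(-1), dim=0)  # crude score -> one probability per candidate

    def forward(self, subj_ids, act_ids, obj_ids, v_o, v_c):
        # v_o: (n_regions, dim) local features; v_c: (n_clips, dim) global dynamic features
        p1 = self._prob(subj_ids, v_o.unsqueeze(0))      # P_1(s_n)
        p2 = self._prob(act_ids, v_c.unsqueeze(0))       # P_2(a)
        p3 = self._prob(obj_ids, v_o.unsqueeze(0))       # P_3(o)
        # F_action = P(alpha) . Emb(beta): probability-weighted embeddings of subject, action and object
        f_action = (p1.unsqueeze(-1) * self.emb(subj_ids)).sum(0) \
                 + (p2.unsqueeze(-1) * self.emb(act_ids)).sum(0) \
                 + (p3.unsqueeze(-1) * self.emb(obj_ids)).sum(0)
        return f_action
```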
S205, determining a feature average value according to the global static feature, the global dynamic feature and the local feature.
In this embodiment, the feature average value is the average of the global static feature, the global dynamic feature and the local feature, and can be denoted as F_visual.
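A one-line sketch of this step, under the assumption that the three features have already been projected to a common dimension:

```python
import torch

def feature_average(v_r: torch.Tensor, v_c: torch.Tensor, v_o: torch.Tensor) -> torch.Tensor:
    # F_visual: element-wise mean of the global static, global dynamic and (pooled) local features
    return torch.stack([v_r, v_c, v_o], dim=0).mean(dim=0)
```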
S206, generating a description text according to the characteristic average value and the action behavior characteristic of the target object.
In one example, generating descriptive text from the feature mean and the action behavior feature of the target object includes:
fusing the characteristic average value and the action behavior characteristic of the target object to obtain a fused characteristic;
inputting the fusion characteristics into an encoder to generate word probability distribution;
and generating descriptive text according to the word probability distribution.
In this embodiment, the feature average value and the action behavior feature of the target object are fused to obtain the fused features according to fusion formulas (published only as images in the original document and not reproduced here), where F_action denotes the action behavior feature of the target object and F_visual denotes the feature average value. The fused features are input into two fully-connected layers and then into the encoder to obtain the word probability distribution, which can be obtained through the following formula:
P(w_t) = softmax(F_n);
where P(w_t) is the word probability distribution.
In this embodiment, after determining the probability distribution of the words, the word with the highest probability value is used as the description text.
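Because the exact fusion formulas appear only as images in the original publication, the sketch below substitutes a simple gated fusion of F_visual and F_action, followed by the two fully-connected layers, a softmax over the vocabulary, and a greedy choice of the highest-probability word; every formula and name here is an illustrative assumption, not the patented method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptionDecoderStep(nn.Module):
    """Assumed fusion + two fully-connected layers + softmax over the vocabulary."""
    def __init__(self, dim, hidden, vocab_size):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # assumed gating; the patent's fusion formula is not recoverable
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, vocab_size)

    def forward(self, f_visual, f_action):
        g = torch.sigmoid(self.gate(torch.cat([f_visual, f_action], dim=-1)))
        fused = g * f_visual + (1 - g) * f_action  # fused feature (assumed form)
        f_n = self.fc2(F.relu(self.fc1(fused)))    # two fully-connected layers
        return F.softmax(f_n, dim=-1)              # P(w_t): word probability distribution

# Usage sketch: pick the highest-probability word as the next word of the description text.
# step = DescriptionDecoderStep(dim=512, hidden=1024, vocab_size=10000)
# next_word_id = step(f_visual, f_action).argmax(dim=-1)
```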
S207, storing the description text and the generation time of the video file to be analyzed into a preset database.
In this embodiment, the preset database includes a plurality of description texts and a generation time of each video file, and is used for subsequent query of a user, so that the video files required by quick positioning can be realized.
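A minimal sketch of this storage step using SQLite; the database path, table name and columns are illustrative assumptions rather than anything specified by the patent.

```python
import sqlite3

def store_description(db_path: str, video_id: str, description_text: str, generated_at: str) -> None:
    """Store the description text and generation time of a video file to be analyzed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video_descriptions ("
        "video_id TEXT, description TEXT, generated_at TEXT)"
    )
    conn.execute(
        "INSERT INTO video_descriptions (video_id, description, generated_at) VALUES (?, ?, ?)",
        (video_id, description_text, generated_at),
    )
    conn.commit()
    conn.close()
```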
In one example, the index message is responsive to a user.
In this embodiment, the index message of the user is a keyword, a key field, and a time range entered by the user.
In one example, a video file is queried in a preset database according to an index message.
In this embodiment, the keywords typed by the user are compared with the description texts or the generation times in the preset database; if they match, the related sentences and times are screened out and displayed in the information field on the right, and the corresponding time marks are displayed in the time field of the bank monitoring system, so that the video file is determined.
In one example, the video file is fed back to the user.
In this embodiment, the video file is displayed to the user through the interface of the monitoring system.
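A hedged sketch of the index and feedback steps against the same illustrative SQLite table used above, matching a keyword and an optional time range; the field names and query shape are assumptions.

```python
import sqlite3

def query_videos(db_path: str, keyword: str, start_time: str = None, end_time: str = None):
    """Return (video_id, description, generated_at) rows matching the user's index message."""
    conn = sqlite3.connect(db_path)
    sql = "SELECT video_id, description, generated_at FROM video_descriptions WHERE description LIKE ?"
    params = [f"%{keyword}%"]
    if start_time and end_time:
        sql += " AND generated_at BETWEEN ? AND ?"
        params += [start_time, end_time]
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows  # shown to the user, e.g. through the monitoring system interface
```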
According to the method for generating the video file description text provided in this embodiment, the first probability value of the plurality of objects in the video file to be analyzed, the second probability value of each object's action and the third probability value of each object's behavior are determined according to the global static features, the global dynamic features and the local features; the action behavior features of the target object in the video file to be analyzed are determined according to the first probability value, the second probability value and the third probability value; the feature average value is determined according to the global static features, the global dynamic features and the local features; the description text is generated according to the feature average value and the action behavior features of the target object; the description text and the generation time of the video file to be analyzed are stored in the preset database; and the video file is fed back to the user according to the user's subsequent query. By adopting this technical scheme, emergencies and accidents can be investigated by searching for and locating related content through keywords, so that quick retrieval of bank surveillance video is realized even with huge amounts of video data; and since video files occupy a large amount of storage space and cannot be kept for a long time, the information can instead be stored as the description texts of the video files to assist later verification.
Fig. 3 is a schematic structural diagram of a video file description text generating apparatus according to a third embodiment of the present application. Specifically, the apparatus 30 of the third embodiment includes:
an obtaining unit 301, configured to obtain a video file to be analyzed.
The extracting unit 302 is configured to extract global static features, global dynamic features and local features in the video file to be analyzed; the global static features are used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used to characterize a preset region of the video file to be analyzed.
The determining unit 303 is configured to determine an action behavior feature of the target object in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature.
The generating unit 304 is configured to generate a description text according to the global static feature, the global dynamic feature, the local feature, and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the above-described apparatus may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Fig. 4 is a schematic structural diagram of a video file description text generating apparatus according to a fourth embodiment of the present application. Specifically, the apparatus 40 of the fourth embodiment includes:
an obtaining unit 401, configured to obtain a video file to be analyzed.
An extracting unit 402, configured to extract global static features, global dynamic features and local features in a video file to be analyzed; the global static features are used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used to characterize a preset region of the video file to be analyzed.
The determining unit 403 is configured to determine an action behavior feature of the target object in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature.
A generating unit 404, configured to generate a description text according to the global static feature, the global dynamic feature, the local feature, and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
In one example, the determining unit 403 includes:
a first determining module 4031, configured to determine a first probability value, a second probability value, and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature, and the local feature; the first probability value represents probability distribution conditions of objects in the video file to be analyzed; the second probability value represents the probability distribution condition of the action of each object in the video file to be analyzed; the third probability value characterizes the probability distribution of the behavior of each object in the video file to be analyzed;
A second determining module 4032, configured to determine an action behavior feature of the target object in the video file to be analyzed according to the first probability value, the second probability value, and the third probability value; wherein the target object is one of a plurality of objects; wherein the objects of the first probability value, the second probability value and the third probability value are the same.
In one example, the first determination module 4031 includes:
a building sub-module 40311 is configured to build scene features from the global static features and the local features.
A first determining submodule 40312 is configured to determine each object according to the local feature.
The second determining submodule 40313 is used for determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the scene characteristics, each object and the global dynamic characteristics.
In one example, the generating unit 404 includes:
the third determining module 4041 is configured to determine a feature average value according to the global static feature, the global dynamic feature, and the local feature.
The generating module 4042 is configured to generate descriptive text according to the feature average value and the action behavior feature of the target object.
In one example, the generation module 4042 includes:
The fusion submodule 40421 is used for fusing the characteristic average value and the action behavior characteristic of the target object to obtain a fusion characteristic.
A first generation sub-module 40422 for inputting the fusion features into the encoder to generate a word probability distribution.
A second generation sub-module 40423 is configured to generate descriptive text according to the word probability distribution.
In one example, the apparatus 40 includes:
and the storage unit 405 is configured to store the description text and the generation time of the video file to be analyzed in a preset database.
In one example, the apparatus 40 includes:
and a response unit 406 for responding to the index message of the user.
The query unit 407 is configured to query the preset database for the video file according to the index message.
And a feedback unit 408, configured to feed back the video file to the user.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the above-described apparatus may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Fig. 5 is a block diagram of an electronic device, which may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like, in accordance with an exemplary embodiment.
The apparatus 500 may include one or more of the following components: a processing component 502, a memory 504, a power supply component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.
The processing component 502 generally controls overall operation of the apparatus 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interactions between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operations at the apparatus 500. Examples of such data include instructions for any application or method operating on the apparatus 500, contact data, phonebook data, messages, pictures, videos, and the like. The memory 504 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 506 provides power to the various components of the device 500. The power components 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 500.
The multimedia component 508 includes a screen between the device 500 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation. In some embodiments, the multimedia component 508 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the apparatus 500 is in an operational mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 504 or transmitted via the communication component 516. In some embodiments, the audio component 510 further comprises a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 514 includes one or more sensors for providing status assessment of various aspects of the apparatus 500. For example, the sensor assembly 514 may detect the on/off state of the device 500, the relative positioning of the components, such as the display and keypad of the device 500, the sensor assembly 514 may also detect a change in position of the device 500 or a component of the device 500, the presence or absence of user contact with the device 500, the orientation or acceleration/deceleration of the device 500, and a change in temperature of the device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate communication between the apparatus 500 and other devices in a wired or wireless manner. The apparatus 500 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 516 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 504, including instructions executable by processor 520 of apparatus 500 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer-readable storage medium, instructions in which, when executed by a processor of an electronic device, cause the electronic device to perform the video file description text generation method described above.
The application also discloses a computer program product comprising a computer program which, when executed by a processor, implements a method as described in the present embodiment.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or electronic device.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data electronic device), or that includes a middleware component (e.g., an application electronic device), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and an electronic device. The client and the electronic device are generally remote from each other and typically interact through a communication network. The relationship of client and electronic devices arises by virtue of computer programs running on the respective computers and having a client-electronic device relationship to each other. The electronic equipment can be cloud electronic equipment, also called cloud computing electronic equipment or cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service (Virtual Private Server or VPS for short) are overcome. The electronic device may also be an electronic device of a distributed system or an electronic device that incorporates a blockchain. It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (11)

1. A method for generating a video file description text, the method comprising:
acquiring a video file to be analyzed;
extracting global static features, global dynamic features and local features in the video file to be analyzed; the global static feature is used for representing background features of the video file to be analyzed; the global dynamic feature is used for representing the moving feature of each object in the video file to be analyzed; the local features are used for representing the features of a preset area of the video file to be analyzed;
determining action behavior characteristics of a target object in the video file to be analyzed according to the global static characteristics, the global dynamic characteristics and the local characteristics;
generating description text according to the global static feature, the global dynamic feature, the local feature and the action behavior feature of the target object; wherein the descriptive text is used to characterize the content of the video file.
2. The method according to claim 1, wherein the determining the action behavior feature of the target object in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature comprises:
determining a first probability value, a second probability value and a third probability value in the video file to be analyzed according to the global static feature, the global dynamic feature and the local feature; the first probability value represents probability distribution conditions of objects in the video file to be analyzed; the second probability value represents the probability distribution condition of the action of each object in the video file to be analyzed; the third probability value represents the probability distribution condition of the behavior of each object in the video file to be analyzed;
determining action behavior characteristics of a target object in the video file to be analyzed according to the first probability value, the second probability value and the third probability value; wherein the target object is one of the plurality of objects; wherein the objects of the first probability value, the second probability value and the third probability value are the same.
3. The method of claim 2, wherein determining the first probability value, the second probability value and the third probability value in the video file to be analyzed according to the global static features, the global dynamic features and the local features comprises:
constructing scene features according to the global static features and the local features;
determining each object according to the local features;
and determining the first probability value, the second probability value and the third probability value in the video file to be analyzed according to the scene features, each object and the global dynamic features.
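Illustratively, the scene-feature construction of claim 3 could be sketched as below, assuming concatenation followed by a linear projection; the fusion operator and all names are assumptions, not taken from the application.

import torch
import torch.nn as nn

DIM = 512
project = nn.Linear(2 * DIM, DIM)   # hypothetical projection back to the common feature size

def build_scene_feature(global_static, local_feats):
    # pool the region-level local features and fuse them with the global static (background) feature
    pooled_local = local_feats.mean(dim=0)
    return project(torch.cat([global_static, pooled_local], dim=-1))

scene = build_scene_feature(torch.randn(DIM), torch.randn(8, DIM))   # dummy inputs
objects = torch.randn(8, DIM)   # per-region object features taken from the local features
# the scene features, the objects and the global dynamic features would then feed the probability heads of claim 2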
4. The method of claim 1, wherein generating the description text according to the global static features, the global dynamic features, the local features and the action behavior features of the target object comprises:
determining a feature average value according to the global static features, the global dynamic features and the local features;
and generating the description text according to the feature average value and the action behavior features of the target object.
5. The method of claim 4, wherein generating the description text according to the feature average value and the action behavior features of the target object comprises:
fusing the feature average value and the action behavior features of the target object to obtain a fused feature;
inputting the fused feature into an encoder to generate a word probability distribution;
and generating the description text according to the word probability distribution.
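As an illustration of claims 4 and 5, the sketch below averages the three feature groups, fuses the average with the action behavior feature, and lets a recurrent cell stand in for the encoder of claim 5, emitting a word probability distribution that is greedily decoded into text; the toy vocabulary, module names and dimensions are assumptions.

import torch
import torch.nn as nn

DIM = 512
VOCAB = ["<eos>", "a", "person", "opens", "the", "door"]   # toy vocabulary

fuse = nn.Linear(2 * DIM, DIM)
rnn = nn.GRUCell(DIM, DIM)
to_vocab = nn.Linear(DIM, len(VOCAB))

def generate(static_f, dynamic_f, local_f, action_f, max_len=10):
    feat_avg = torch.stack([static_f, dynamic_f, local_f]).mean(dim=0)   # claim 4: feature average value
    fused = fuse(torch.cat([feat_avg, action_f], dim=-1)).unsqueeze(0)   # claim 5: fused feature
    h, words = torch.zeros(1, DIM), []
    for _ in range(max_len):
        h = rnn(fused, h)                          # stand-in for the encoder named in claim 5
        word_probs = to_vocab(h).softmax(dim=-1)   # word probability distribution
        idx = int(word_probs.argmax())
        if VOCAB[idx] == "<eos>":
            break
        words.append(VOCAB[idx])
    return " ".join(words)

print(generate(torch.randn(DIM), torch.randn(DIM), torch.randn(DIM), torch.randn(DIM)))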
6. The method of claim 1, further comprising, after generating the description text:
storing the description text and the generation time of the video file to be analyzed in a preset database.
7. The method of claim 6, further comprising:
in response to an index message from a user, querying the preset database for a video file according to the index message; and
feeding the video file back to the user.
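For claims 6 and 7, a minimal sketch of storing the description text with its generation time and querying it back, using SQLite purely as a stand-in for the preset database; the table name, column names and the index-message matching rule are invented for this example.

import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")   # stand-in for the preset database
conn.execute("CREATE TABLE video_text (video_id TEXT, description TEXT, generated_at TEXT)")

def store(video_id, description):
    # claim 6: store the description text and the generation time of the analyzed video file
    conn.execute("INSERT INTO video_text VALUES (?, ?, ?)",
                 (video_id, description, datetime.now().isoformat()))

def query(index_message):
    # claim 7: query the preset database according to the user's index message and feed the result back
    cur = conn.execute("SELECT video_id, description FROM video_text WHERE description LIKE ?",
                       (f"%{index_message}%",))
    return cur.fetchall()

store("clip_001.mp4", "a person opens the door and walks in")
print(query("door"))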
8. A video file description text generation apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the video file to be analyzed;
the extraction unit is used for extracting global static features, global dynamic features and local features from the video file to be analyzed; wherein the global static features are used to represent background features of the video file to be analyzed, the global dynamic features are used to represent motion features of each object in the video file to be analyzed, and the local features are used to represent features of a preset area of the video file to be analyzed;
the determining unit is used for determining action behavior features of a target object in the video file to be analyzed according to the global static features, the global dynamic features and the local features;
the generation unit is used for generating a description text according to the global static features, the global dynamic features, the local features and the action behavior features of the target object; wherein the description text is used to characterize the content of the video file to be analyzed.
9. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1-7.
10. A computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1-7.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202310258197.8A 2023-03-10 2023-03-10 Video file description text generation method, device and storage medium Pending CN116229328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310258197.8A CN116229328A (en) 2023-03-10 2023-03-10 Video file description text generation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310258197.8A CN116229328A (en) 2023-03-10 2023-03-10 Video file description text generation method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116229328A true CN116229328A (en) 2023-06-06

Family

ID=86589078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310258197.8A Pending CN116229328A (en) 2023-03-10 2023-03-10 Video file description text generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116229328A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination