CN113821681A - Video tag generation method, device and equipment - Google Patents

Video tag generation method, device and equipment

Info

Publication number
CN113821681A
Authority
CN
China
Prior art keywords
picture
video
information
security event
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111091260.0A
Other languages
Chinese (zh)
Other versions
CN113821681B (en)
Inventor
曹军伟
徐高峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen ZNV Technology Co Ltd
Nanjing ZNV Software Co Ltd
Original Assignee
Shenzhen ZNV Technology Co Ltd
Nanjing ZNV Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen ZNV Technology Co Ltd, Nanjing ZNV Software Co Ltd filed Critical Shenzhen ZNV Technology Co Ltd
Priority to CN202111091260.0A priority Critical patent/CN113821681B/en
Publication of CN113821681A publication Critical patent/CN113821681A/en
Application granted granted Critical
Publication of CN113821681B publication Critical patent/CN113821681B/en
Legal status: Active (granted)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using original textual content or text extracted from visual content or transcript of audio data

Abstract

A video tag generation method, apparatus, and device are provided. Pictures are acquired from a video at a preset frequency, and each acquired picture is input into a pre-trained security event recognition model to determine whether it contains preset security event identification information. Image recognition is performed on each picture that contains the preset security event identification information to obtain structured text information describing the picture, and the picture and its corresponding structured text information are stored in association in a view library. For each picture in the view library, security event key information is extracted from its structured text information by a natural language processing (NLP) recognition algorithm to generate a video tag for the picture, where the security event key information includes the occurrence time, the occurrence place, and the content of the security event. Because a large volume of surveillance video no longer needs to be played back and tagged manually, the generation efficiency of video tags is improved.

Description

Video tag generation method, device and equipment
Technical Field
The invention relates to the technical field of security video surveillance, and in particular to a video tag generation method, apparatus, and device.
Background
With the development of artificial intelligence, AI technology has been applied to many security video surveillance services, such as post-event playback, real-time in-process response, and pre-event warning. A video tag is a video index: during post-event playback, the video content at a specific time is summarized and recorded, which facilitates subsequent retrieval and lookup and greatly improves the efficiency of accurately accessing surveillance video.
At present, video tags can be generated and entered manually: an operator plays back the surveillance video and identifies the events at specific times after the fact. However, this approach relies on manual playback and manual tag entry, so video tags are generated inefficiently.
Disclosure of Invention
The embodiments of the invention provide a video tag generation method, apparatus, and device to improve the generation efficiency of video tags.
According to a first aspect, there is provided in one embodiment a method of video tag generation, the method comprising:
acquiring pictures from a video at a preset frequency, inputting each acquired picture into a pre-trained security event recognition model, and determining whether each picture contains preset security event identification information, wherein the security event recognition model is trained on sample pictures labeled with whether they contain the preset security event identification information;
performing image recognition on each picture containing the preset security event identification information to obtain structured text information describing the picture, and storing the picture and its corresponding structured text information in association in a view library;
for each picture in the view library, extracting security event key information from the structured text information of the picture through a Natural Language Processing (NLP) recognition algorithm to generate a video tag of the picture, wherein the security event key information comprises: the time of occurrence of the security event, the location of occurrence of the security event, and the content of the security event.
Optionally, the method further includes:
acquiring target retrieval text information of a user;
respectively calculating a first matching degree of the target retrieval text information and the structured text information of each picture in the view library;
determining the pictures in the view library with the first matching degree greater than or equal to a first preset threshold value as target pictures;
and determining a security event according to the target picture.
Optionally, the video tag includes a recording time of a picture in the video, and the method further includes:
if every first matching degree is smaller than the first preset threshold, calculating a second matching degree between the target retrieval text information and the video tag of each picture in the view library;
determining the video recording time included in each video tag whose second matching degree is greater than a second preset threshold as a target video recording time;
determining a video clip needing to be played back according to the target video recording time;
and determining a security event according to the video clip needing to be played back.
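The two-stage retrieval above (match structured text first, then fall back to video tags and return recording times for playback) can be sketched as follows. This is a hypothetical illustration: the "matching degree" is computed here as plain token overlap (Jaccard similarity), standing in for whatever text-matching method a real implementation would use, and the view-library entries are plain dictionaries.

```python
def matching_degree(query, text):
    """Token-overlap (Jaccard) similarity, a stand-in for a real text matcher."""
    q, t = set(query.split()), set(text.split())
    return len(q & t) / len(q | t) if q | t else 0.0

def retrieve(query, view_library, first_threshold=0.5, second_threshold=0.3):
    """view_library: list of dicts with 'structured_text', 'video_tag', 'record_time'."""
    # Stage 1: match against the structured text; return matching pictures directly.
    targets = [e for e in view_library
               if matching_degree(query, e["structured_text"]) >= first_threshold]
    if targets:
        return ("pictures", targets)
    # Stage 2: fall back to the video tags; return recording times for playback.
    times = [e["record_time"] for e in view_library
             if matching_degree(query, e["video_tag"]) > second_threshold]
    return ("playback_times", times)
```

A query that matches a stored description well returns pictures directly; otherwise the tag match narrows down which video clips need playback.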
Optionally, the obtaining target retrieval text information of the user includes:
acquiring original retrieval information input by a user;
and extracting key information of the retrieval condition from the original retrieval information through the NLP recognition algorithm to generate the target retrieval text information.
Optionally, the method further includes:
and displaying the generated video label on the video in a preset display mode.
Optionally, the method further includes:
and verifying the video tag of each picture in the view library, and correcting any erroneous video tag to obtain a corrected video tag.
According to a second aspect, an embodiment provides a video tag generation apparatus, the apparatus comprising:
the apparatus comprises a judging module, an acquisition module, and a generating module, wherein the judging module is configured to acquire pictures from a video at a preset frequency, input each acquired picture into a pre-trained security event recognition model, and determine whether each picture contains preset security event identification information, and the security event recognition model is trained on sample pictures labeled with whether they contain the preset security event identification information;
the acquisition module is used for respectively carrying out image recognition on each picture containing the preset security event identification information to obtain structured text information for describing the picture, and storing the picture and the corresponding structured text information in a view library in an associated manner;
a generating module, configured to, for each picture in the view library, extract security event key information from the structured text information of the picture through a natural language processing NLP recognition algorithm, and generate a video tag of the picture, where the security event key information includes: the time of occurrence of the security event, the location of occurrence of the security event, and the content of the security event.
Optionally, the apparatus further comprises: the determining module is used for acquiring target retrieval text information of a user; respectively calculating a first matching degree of the target retrieval text information and the structured text information of each picture in the view library; determining the pictures in the view library with the first matching degree greater than or equal to a first preset threshold value as target pictures; and determining a security event according to the target picture.
Optionally, the video tag includes a recording time of a picture in the video, and the determining module is further configured to calculate a second matching degree between the target retrieval text information and the video tag of each picture in the view library, respectively, if each first matching degree is smaller than the first preset threshold; determining the video recording time included by the video label with the second matching degree larger than a second preset threshold value as target video recording time; determining a video clip needing to be played back according to the target video recording time; and determining a security event according to the video clip needing to be played back.
Optionally, the determining module is specifically configured to obtain original retrieval information input by a user, and extract key information of the retrieval condition from the original retrieval information through the NLP recognition algorithm to generate the target retrieval text information.
Optionally, the apparatus further comprises: and the display module is used for displaying the generated video label on the video in a preset display mode.
Optionally, the apparatus further comprises: and the correction module is used for verifying the video label of each picture in the view library and modifying the video label with errors to obtain a modified video label.
According to a third aspect, there is provided in one embodiment an electronic device comprising: a memory for storing a program; a processor configured to execute the program stored in the memory to implement the video tag generation method according to any one of the first aspect.
According to a fourth aspect, an embodiment provides a computer-readable storage medium having a program stored thereon, the program being executable by a processor to implement the video tag generation method of any of the first aspect described above.
The embodiments of the invention provide a video tag generation method, apparatus, and device. Pictures are acquired from a video at a preset frequency, each acquired picture is input into a pre-trained security event recognition model, and it is determined whether each picture contains preset security event identification information, where the security event recognition model is trained on sample pictures labeled with whether they contain the preset security event identification information. Image recognition is performed on each picture containing the preset security event identification information to obtain structured text information describing the picture, and the picture and its corresponding structured text information are stored in association in a view library. For each picture in the view library, security event key information, including the occurrence time, the occurrence place, and the content of the security event, is extracted from the structured text information of the picture by a natural language processing (NLP) recognition algorithm to generate a video tag for the picture. Because a large volume of surveillance video no longer needs to be played back and tagged manually, the labor cost and the time required for generating video tags are saved, and the generation efficiency of video tags is improved.
Drawings
Fig. 1 is a schematic flowchart of a first embodiment of a video tag generation method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the NLP technique;
fig. 3 is a schematic flowchart of a second embodiment of a video tag generation method according to the present invention;
fig. 4 is a schematic flowchart of a third embodiment of a video tag generation method according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of a fourth embodiment of a video tag generation method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a video tag generation apparatus according to an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the detailed description and the accompanying drawings, wherein like elements in different embodiments share like reference numbers. In the following description, numerous details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of these features may be omitted, or replaced by other elements, materials, or methods, in different instances. In some instances, certain operations related to the present application are not shown or described in detail, to avoid obscuring the core of the application with excessive description; a detailed account of these operations is unnecessary, since those skilled in the art can fully understand them from the description in the specification and the general knowledge in the art.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Likewise, the steps or actions in the method descriptions may be reordered or transposed in a manner apparent to those skilled in the art. Thus, the sequences in the specification and drawings serve only to clearly describe certain embodiments and do not imply a required order, unless it is otherwise stated that a particular sequence must be followed.
Ordinal terms such as "first" and "second" are used herein only to distinguish the described objects and carry no sequential or technical meaning. Unless otherwise indicated, the terms "connected" and "coupled" as used in this application include both direct and indirect connections (couplings).
In the prior art, video tags can be generated and entered manually: the surveillance video at a specific time is played back after the fact and the events in it are identified. However, this relies on manual playback and manual tag generation, so video tags are generated inefficiently. To improve the generation efficiency of video tags, embodiments of the present invention provide a video tag generation method, apparatus, and device, which are described in detail below.
Fig. 1 is a flowchart illustrating a first embodiment of a video tag generation method according to an embodiment of the present invention, where an execution subject in the embodiment of the present invention is any device with processing capability. As shown in fig. 1, the video tag generation method provided in this embodiment may include:
s101, acquiring pictures from a video at a preset frequency.
Specifically, pictures can be obtained from the video at a preset frequency by frame extraction. For example, one picture may be captured every minute, or every 5 minutes; the capture frequency can be set according to actual needs.
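The sampling schedule described above can be sketched as follows. `sample_indices` is a hypothetical helper that only computes which frame numbers to grab; in practice a video decoder (for example OpenCV's `VideoCapture`) would read the frames at those indices and hand them to the recognition model.

```python
def sample_indices(total_frames, fps, interval_s):
    """Frame numbers to capture: one frame every `interval_s` seconds.

    A real system would seek to each returned index in the decoded video
    and pass the decoded frame on to the security event recognition model.
    """
    step = max(1, int(fps * interval_s))  # frames between two captures
    return list(range(0, total_frames, step))
```

For a 25 fps video sampled every 5 seconds, every 125th frame is captured.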
S102, input each acquired picture into the pre-trained security event recognition model, and determine whether each picture contains preset security event identification information.
If yes, executing S103; if not, go to step S106.
The security event recognition model may be trained on a plurality of sample pictures, each labeled with whether it contains the preset security event identification information. The specific training process may follow a general model training procedure and is not limited here.
The preset security event identification information may include, for example: a driver not wearing a seat belt, a driver making a phone call, a missing vehicle annual inspection mark, or a pedestrian running a red light. Optionally, the security event recognition model may output result information indicating whether a picture contains the preset security event identification information, and may also output the category of the security event corresponding to the picture.
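As a sketch of this filtering step, assume the trained recognizer exposes a `predict(picture)` method returning a flag and an event category; both the interface and the category names below are illustrative stand-ins, not taken from the patent.

```python
# Illustrative category names for the preset security events listed above.
PRESET_EVENTS = {"no_seat_belt", "phone_call", "no_inspection_mark", "red_light_runner"}

def filter_event_pictures(pictures, model):
    """Keep only the pictures the model flags as containing a preset event."""
    kept = []
    for pic in pictures:
        contains, category = model.predict(pic)  # assumed model interface
        if contains and category in PRESET_EVENTS:
            kept.append((pic, category))
    return kept
```

Pictures the model rejects are simply dropped, which matches S106 below.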
S103, perform image recognition on each picture containing the preset security event identification information to obtain structured text information describing the picture.
In a specific implementation, each picture containing the preset security event identification information can be recognized by an existing image recognition algorithm to obtain its structured text information. For example, the video information of interest in security surveillance mainly concerns persons, vehicles, and behaviors. For a person, attributes such as facial features, gender, age, clothing, direction of motion, and whether the person wears a hat or glasses, carries a backpack or a bag, holds an umbrella, or rides a bicycle can be described in a structured way, yielding the structured text information for the persons in the picture. For a vehicle, attributes such as the license plate number, plate type and color, vehicle brand, model, and body color, the sun visor, whether the seat belt is fastened, whether the driver is making a phone call, whether the annual inspection mark is present, whether decorative pendants are present, and the driver's face can be described in a structured way, yielding the structured text information for the vehicles in the picture. For behaviors, person behaviors (such as crossing a boundary, loitering, lingering, staying, and gathering) and vehicle behaviors (such as stopping, crossing lane lines, running a red light, and yielding to pedestrians) can be described in a structured way, yielding the structured text information for the behaviors in the picture.
Optionally, the structured text information may also include descriptions of other content in the picture, for example whether the picture contains a traffic light, a sidewalk, or a large tree.
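A minimal sketch of assembling such structured text, assuming the image-recognition step yields a flat attribute dictionary (the field names here are illustrative):

```python
def describe_vehicle(attrs):
    """Assemble one structured text line from recognized vehicle attributes."""
    parts = [f"license plate {attrs['plate']}",
             f"{attrs['body_color']} {attrs['vehicle_type']}"]
    if not attrs.get("seat_belt", True):
        parts.append("driver not wearing a seat belt")
    if attrs.get("phone_call"):
        parts.append("driver making a phone call")
    return ", ".join(parts)
```

Analogous helpers for persons and behaviors would concatenate their recognized attributes in the same way.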
S104, store the picture and its corresponding structured text information in association in the view library.
Each picture containing the preset security event identification information is associated with the structured text information obtained for it in S103, and the two are stored together in the view library, which facilitates subsequent retrieval and lookup. Moreover, because the amount of data stored in the view library is far smaller than the storage required for the video itself, this approach saves storage space.
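The picture-to-text association can be sketched with a small SQLite table; SQLite is a stand-in choice here, since the patent does not specify a storage engine. Each picture is keyed by its path and stores its structured text alongside:

```python
import sqlite3

def create_view_library(path=":memory:"):
    """Open (or create) the view library as a single SQLite table."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS view_library (
                    picture_path TEXT PRIMARY KEY,
                    structured_text TEXT NOT NULL)""")
    return db

def store(db, picture_path, structured_text):
    """Associate a picture with its structured text description."""
    db.execute("INSERT OR REPLACE INTO view_library VALUES (?, ?)",
               (picture_path, structured_text))
    db.commit()

def lookup(db, picture_path):
    row = db.execute("SELECT structured_text FROM view_library WHERE picture_path = ?",
                     (picture_path,)).fetchone()
    return row[0] if row else None
```

Only the sampled event pictures and their short text descriptions are stored, which is why the view library is far smaller than the raw video.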
S105, for each picture in the view library, extract the security event key information from the structured text information of the picture through an NLP recognition algorithm, and generate a video tag for the picture.
The security event key information may include the occurrence time, the occurrence place, and the content of the security event. Once these are obtained, a sentence (the video tag) containing the occurrence time, the occurrence place, and the content of the security event can be generated, which facilitates subsequent analysis.
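A sketch of composing the tag sentence once the three key fields are available; in the patent the extraction itself is done by an NLP recognition algorithm, which is simplified here to picking known fields out of a dictionary:

```python
def extract_key_info(structured):
    """Keep only the three key fields a video tag needs."""
    return {k: structured[k] for k in ("time", "place", "content")}

def make_video_tag(structured):
    """Compose the tag sentence: occurrence time, place, and event content."""
    info = extract_key_info(structured)
    return f"{info['time']}, {info['place']}: {info['content']}"
```

The resulting one-line sentence is the searchable video tag stored for the picture.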
Optionally, each picture including the preset security event identification information and the video tag corresponding to each picture obtained through S105 may be stored in the video tag library in an associated manner, so as to facilitate subsequent retrieval and search.
S106, discard any picture that does not contain the preset security event identification information.
In the video tag generation method provided by the embodiment of the invention, pictures are acquired from a video at a preset frequency, each acquired picture is input into a pre-trained security event recognition model, and it is determined whether each picture contains preset security event identification information, where the security event recognition model is trained on sample pictures labeled with whether they contain the preset security event identification information. Image recognition is performed on each picture containing the preset security event identification information to obtain structured text information describing the picture, and the picture and its corresponding structured text information are stored in association in a view library. For each picture in the view library, security event key information, including the occurrence time, the occurrence place, and the content of the security event, is extracted from the structured text information by a natural language processing (NLP) recognition algorithm to generate a video tag for the picture. Because a large volume of surveillance video no longer needs to be played back and tagged manually, the labor cost and the time required for generating video tags are saved, and the generation efficiency of video tags is improved.
Specifically, natural language processing (NLP) studies language problems in human-computer interaction. Fig. 2 is a schematic flowchart of the NLP technique, as shown in Fig. 2:
s201, obtaining the corpus.
The corpus is the material an NLP task studies; usually a text collection serves as the corpus. It can be obtained from existing data, public data sets, web crawling, and other means.
S202, data preprocessing.
In a specific implementation, the corpus preprocessing may include the following steps:
step a: and (3) corpus cleaning: the useful data is retained and the noisy data is deleted. Common cleaning methods are as follows: manual duplicate removal, alignment, deletion, labeling and the like; or the content can be extracted through a preset rule, the matching is carried out through a regular expression, the extraction is carried out according to the part of speech and the named entity, and scripts or codes are written in batch processing.
Step b, word segmentation: divide the text into words, for example by string-matching-based, understanding-based, rule-based, or statistics-based segmentation methods.
Step c, part-of-speech tagging: label each word with a part-of-speech tag, such as noun, verb, or adjective. Common methods are rule-based and statistics-based algorithms, for example maximum-entropy part-of-speech tagging, tagging by the statistically most probable part of speech, and tagging based on Hidden Markov Models (HMMs).
Step d, stop-word removal: remove words that contribute nothing to the text features, such as punctuation marks and tone particles.
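Steps b and d can be sketched together. Forward maximum matching is a classic string-matching-based segmenter: it greedily takes the longest dictionary word at each position, falling back to a single character. The toy vocabulary and stop list below are illustrative.

```python
# Illustrative stop list; a real one would also include tone particles, etc.
STOP_WORDS = {",", ".", "!", "?", "the", "a"}

def forward_max_match(text, vocab, max_len=4):
    """Greedily take the longest dictionary word at each position (step b)."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def remove_stop_words(words):
    """Drop tokens that carry no text features (step d)."""
    return [w for w in words if w not in STOP_WORDS]
```

In practice a statistics-based segmenter would usually replace the greedy matcher, but the preprocessing pipeline shape is the same.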
S203, feature engineering.
This step represents the segmented words in a form a computer can compute on, typically a vector. Common representation models include the bag-of-words model, for example with the TF-IDF weighting scheme widely used in information retrieval and data mining, and word vectors, for example one-hot encoding or word2vec (a family of models for producing word vectors).
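The TF-IDF weighting mentioned above can be computed directly. A minimal sketch, using raw term frequency normalized by document length and a natural-log IDF with no smoothing:

```python
import math

def tf_idf(docs):
    """docs: list of token lists -> one {token: tf-idf weight} dict per document."""
    n = len(docs)
    df = {}  # document frequency of each token
    for doc in docs:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    weights = []
    for doc in docs:
        scores = {}
        for tok in set(doc):
            tf = doc.count(tok) / len(doc)   # term frequency in this document
            idf = math.log(n / df[tok])      # rarer tokens weigh more
            scores[tok] = tf * idf
        weights.append(scores)
    return weights
```

A token appearing in every document gets weight zero, which is exactly why TF-IDF favors discriminative terms.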
S204, feature selection.
Feature selection chooses, from the features produced by the feature engineering of S203, suitable features with strong expressive power. Common feature selection methods include: Document Frequency (DF), Mutual Information (MI), Information Gain (IG), Weighted Frequency and Odds (WFO), and the like.
S205, model training.
Once the features are selected, a model must be chosen for training. Common machine learning models include K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Naive Bayes, decision trees, K-means clustering, and the like; common deep learning models include Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), Seq2Seq (used when the output length is not fixed), the FastText classifier, TextCNN (strong at extracting shallow text features), and the like.
During training of the selected model, overfitting and underfitting must be avoided. Overfitting: the model's learning capacity is so strong that it also learns the features of the noisy data, reducing its generalization ability; it performs well on the training set but poorly on the test set. Common remedies are: increasing the amount of training data; adding regularization terms, such as L1 or L2 regularization; screening features manually or with a feature selection algorithm; and using dropout. Underfitting: the model is too simple to fit the data well. Common remedies are: adding more feature terms; increasing model complexity, for example adding layers to a neural network or adding polynomial terms to a linear model to strengthen its generalization ability; and reducing the regularization parameters (regularization prevents overfitting, so when the model underfits, the regularization parameters should be reduced). For neural networks, the vanishing gradient and exploding gradient problems also require attention.
S206, model evaluation.
The main evaluation indices of a model include: error rate, precision, recall, the F1 score, the Receiver Operating Characteristic (ROC) curve, the Area Under the ROC Curve (AUC), and the like.
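The count-based indices above follow directly from the confusion-matrix entries (true positives, false positives, false negatives); a small sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it equals both when they coincide.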
S207, online application of the model.
There are two main ways to put a model into production: one is to train the model offline and then deploy it online to provide a service; the other is to train the model online and persist it after training finishes.
In specific implementations, the research directions of NLP can be roughly divided as follows:
(1) Information extraction: extract important information such as time, place, people, events, causes, results, numbers, dates, currency, and proper nouns from a given text. Colloquially, it determines who did what to whom, when, where, why, and with what result.
(2) Text generation: enabling machines to express themselves and write in natural language like humans. Depending on the input, text generation techniques mainly include data-to-text generation, which converts data containing key-value pairs into natural language text, and text-to-text generation, which transforms and processes input text to produce new text.
(3) Question answering: given a question expressed in natural language, a question-answering system gives a precise answer. This requires some degree of semantic analysis of the natural language query, including entity linking and relation recognition to form a logical expression, then finding candidate answers in a knowledge base and selecting the best answer through a ranking mechanism.
(4) Dialog systems: the system chats with the user, answers questions, and completes tasks through a series of conversational turns. The techniques involved include user-intent understanding, general chat engines, question-answering engines, and dialog management. In addition, multi-turn conversation capability is required to reflect contextual dependence.
(5) Text mining: includes text clustering, classification, and sentiment analysis, as well as the visualization and interactive presentation of the mined information and knowledge. The current mainstream techniques are based on statistical machine learning.
(6) Speech recognition and generation: speech recognition converts speech input to the computer into a written-language representation; speech generation, also known as text-to-speech conversion or speech synthesis, automatically converts written text into the corresponding speech.
(7) Information filtering: document information meeting specific conditions is automatically identified and filtered by a computer system. It generally refers to the automatic identification and filtering of harmful information on a network and is mainly used for information security and protection, network content management, and the like.
(8) Public opinion analysis: massive amounts of information are collected and processed, and network public opinion is analyzed automatically so that it can be responded to in time.
(9) Information retrieval: large-scale document collections are indexed. An index can be built simply by assigning different weights to the words in a document, or a deeper index can be established. At query time, an input query expression such as a search term or sentence is analyzed, matching candidate documents are looked up in the index, the candidates are ranked by a ranking mechanism, and the highest-ranked documents are output.
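The weighted index, lookup, and ranking steps just described can be sketched as a minimal TF-IDF inverted index; the documents and the whitespace tokenizer are illustrative stand-ins for a real retrieval system.

```python
import math
from collections import Counter, defaultdict

docs = {
    "d1": "man in red jacket enters warehouse at night",
    "d2": "vehicle parked near warehouse entrance",
    "d3": "man climbs fence at night",
}

# Document frequency of each term, then a TF-IDF weighted inverted index.
tokenized = {d: text.split() for d, text in docs.items()}
df = Counter()
for toks in tokenized.values():
    df.update(set(toks))

index = defaultdict(dict)  # term -> {doc_id: weight}
for d, toks in tokenized.items():
    for term, count in Counter(toks).items():
        index[term][d] = count * math.log(len(docs) / df[term])

def search(query):
    """Sum each document's weights over the query terms; rank by score."""
    scores = Counter()
    for term in query.split():
        for d, w in index.get(term, {}).items():
            scores[d] += w
    return [d for d, _ in scores.most_common()]

# "man at night" matches d1 and d3; d2 shares no query term.
```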
(10) Machine translation: the input source-language text is automatically translated into text in another language. Machine translation has gradually developed a rigorous methodology, from the earliest rule-based approaches, through some twenty years of statistics-based approaches, to today's neural-network (encoder-decoder) approaches.
As a possible implementation manner, on the basis of the first embodiment, the video tag generation method may further include: displaying the generated video tag on the video in a preset display mode. For example, a red dot may be marked at a certain moment of the video, and the video tag corresponding to that moment is displayed after the red dot is touched.
As a possible implementation manner, on the basis of the first embodiment, the video tag generation method may further include: and verifying the video label of each picture in the video library, and modifying the video label with errors to obtain the modified video label.
Fig. 3 is a flowchart illustrating a second embodiment of a video tag generation method according to an embodiment of the present invention, and as shown in fig. 3, on the basis of the first embodiment, the video tag generation method according to the present embodiment may further include:
S301, acquiring target retrieval text information of the user.
In specific implementation, the original retrieval information input by the user can be acquired first; it may be voice information or text information. When the user inputs voice information, the audio content can be converted into text content through an existing speech recognition algorithm. Then, through an NLP recognition algorithm, retrieval-condition key information is extracted from the original retrieval information, and the target retrieval text information is generated. The retrieval-condition key information may include: time, place, people, events, and the like. Moreover, extracting the retrieval-condition key information from the original retrieval information through the NLP recognition algorithm improves both retrieval efficiency and retrieval accuracy.
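A minimal sketch of this extraction step, using simple patterns as a hypothetical stand-in for the NLP recognition algorithm; the time pattern and the place/event vocabularies are invented for illustration, and a real system would use a trained named-entity recognizer.

```python
import re

# Hypothetical stand-ins for the learned NLP extractor: a time pattern plus
# small place/event vocabularies. All three are illustrative, not real data.
TIME_PAT = re.compile(r"\b(?:yesterday|today|\d{1,2}:\d{2}|\d{4}-\d{2}-\d{2})\b")
PLACES = {"gate", "warehouse", "parking lot", "lobby"}
EVENTS = {"intrusion", "fire", "fight", "theft"}

def extract_search_conditions(raw_query):
    """Pull retrieval-condition key information (time/place/event) from a query."""
    q = raw_query.lower()
    return {
        "time": TIME_PAT.findall(q),
        "place": sorted(p for p in PLACES if p in q),
        "event": sorted(e for e in EVENTS if e in q),
    }

cond = extract_search_conditions("Show me the intrusion at the warehouse yesterday")
# cond -> {"time": ["yesterday"], "place": ["warehouse"], "event": ["intrusion"]}
```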
S302, respectively calculating a first matching degree of the target retrieval text information and the structured text information of each picture in the view library.
In specific implementation, the target retrieval text information can be matched against the structured text information of the pictures in the view library using a conventional text matching method.
And S303, determining the picture in the view library with the first matching degree greater than or equal to a first preset threshold value as a target picture.
When the first matching degree is greater than or equal to the first preset threshold value, the picture in the view library corresponding to that first matching degree is considered highly likely to reflect the event content that the user needs to retrieve.
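As one possible "conventional text matching method" (an assumption, since the description does not fix one), token-set Jaccard overlap gives a first matching degree in [0, 1]; the view-library entries and the threshold below are illustrative.

```python
def matching_degree(query_text, structured_text):
    """Jaccard overlap of the two token sets, in [0, 1]."""
    a = set(query_text.lower().split())
    b = set(structured_text.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative view library: picture id -> structured text information.
view_library = {
    "pic_001": "man red jacket warehouse night",
    "pic_002": "car parked street daytime",
}

query = "man warehouse night"
first_threshold = 0.5
target_pictures = [pid for pid, text in view_library.items()
                   if matching_degree(query, text) >= first_threshold]
# pic_001 shares 3 of 5 distinct tokens with the query (degree 0.6) and is selected.
```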
And S304, determining the security event according to the target picture.
In specific implementation, the target picture can be reviewed manually to determine the security event corresponding to the target picture; alternatively, an existing security event recognition model may be used to perform security event recognition on the target picture, so as to obtain the security event type corresponding to the target picture.
According to the video tag generation method provided by this embodiment of the invention, the target retrieval text information of the user is acquired; the first matching degree between the target retrieval text information and the structured text information of each picture in the view library is calculated; the pictures in the view library whose first matching degree is greater than or equal to the first preset threshold value are determined as target pictures; and the security event is determined according to the target pictures. The user therefore does not need to spend a large amount of time playing back the surveillance video and can retrieve the relevant video quickly.
Fig. 4 is a schematic flowchart of a third embodiment of a method for generating a video tag according to an embodiment of the present invention, and as shown in fig. 4, on the basis of the second embodiment, when the video tag includes a video recording time of a picture in a video, the method for generating a video tag according to this embodiment may further include:
S401, if the first matching degrees are all smaller than the first preset threshold value, respectively calculating second matching degrees of the target retrieval text information and the video tags of the pictures in the view library.
If the first matching degrees are all smaller than the first preset threshold value, each picture in the view library matches the retrieval condition input by the user poorly; in this case, the target retrieval text information can be text-matched against the video tag of each picture in the view library.
S402, determining the video recording time included by the video label with the second matching degree larger than a second preset threshold value as target video recording time.
Specifically, the recording time in the video of the picture included in the video tag may be a relative time; for example, if the duration of the video is 2 hours, the recording time of the picture corresponding to the video tag may be 1 hour 14 minutes into the video.
And S403, determining the video clip needing to be played back according to the target video recording time.
After the target video recording time is determined, the video clip required to be played back can be determined according to a preset rule. For example, for a video with a duration of 2 hours, when the target recording time is determined to be 1 hour 14 minutes, a video clip of 1 hour 9 minutes to 1 hour 19 minutes may be determined as a video clip that needs to be played back.
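The preset rule in the example above, a window of 5 minutes on either side of the target recording time clamped to the video bounds, can be sketched as:

```python
def playback_window(target_time_s, video_duration_s, half_window_s=5 * 60):
    """Return the [start, end] of the clip to play back around the target
    recording time (all times in seconds from the start of the video)."""
    start = max(0, target_time_s - half_window_s)
    end = min(video_duration_s, target_time_s + half_window_s)
    return start, end

# 2-hour video, target recording time 1 h 14 min -> clip 1 h 09 min to 1 h 19 min.
assert playback_window(74 * 60, 120 * 60) == (69 * 60, 79 * 60)
```

The clamping keeps the rule well-defined when the target time falls near the start or end of the recording.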
S404, according to the video clip needing to be played back, the security event is determined.
In specific implementation, the security event can be determined by manually viewing the video segment to be played back; alternatively, other existing video recognition methods may be used to determine the security events included in the video segment that needs to be played back.
According to the video tag generation method provided by this embodiment of the invention, if each first matching degree is smaller than the first preset threshold value, the second matching degrees of the target retrieval text information and the video tags of the pictures in the view library are respectively calculated; the video recording time included in a video tag whose second matching degree is greater than the second preset threshold value is determined as the target video recording time; the video clip that needs to be played back is determined according to the target video recording time; and the security event is determined according to that video clip. The security event can thus be determined from richer monitoring information without playing back the entire video, which speeds up case handling.
The following describes the video tag generation method provided by an embodiment of the present invention by taking a specific implementation manner as an example. Fig. 5 is a schematic flow chart of a fourth embodiment of the video tag generation method according to an embodiment of the present invention. As shown in Fig. 5, a video is obtained by a camera. On one hand, pictures are obtained from the video by video inspection and image analysis is performed to obtain the structured text of each picture; the picture and its corresponding structured text are stored in the view library in an associated manner, key information is extracted through the NLP recognition algorithm, and the generated video tag of the picture is stored in the video tag library. On the other hand, the video itself is stored; during subsequent video playback, if the video tags in the video tag library turn out to be insufficiently accurate or rich, they can be revised, supplemented, and perfected, making the video tags more accurate and convenient to reuse. When a user retrieves a video, the user can use voice or text as input; key information is extracted with the NLP recognition algorithm to obtain the retrieval condition, which is then matched with the data (pictures and their corresponding structured texts) in the view library to obtain a target picture. If the information in the target picture is not rich enough, the NLP recognition algorithm can further be used to search the video tags in the video tag library for the target video recording time whose description information matches best, and the user can then view the video clip that needs to be played back according to the target video recording time.
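The tag-generation half of the flow in Fig. 5 can be sketched with stub components; every model named in the description (the event recognition model, the image-to-text analysis, the NLP key-information extraction) is replaced here by a hypothetical stand-in, so only the control flow reflects the method.

```python
# Stub pipeline: frames are dicts; a real system would operate on images.

def sample_frames(video, every_n=2):
    """Sample pictures from the video at a preset frequency (stride)."""
    return video[::every_n]

def contains_security_event(frame):
    """Stand-in for the trained security event recognition model."""
    return frame.get("event") is not None

def describe(frame):
    """Stand-in for image analysis producing structured text."""
    return f"{frame['time']} {frame['place']} {frame['event']}"

def extract_key_info(text):
    """Stand-in for the NLP recognition algorithm extracting key information."""
    time, place, event = text.split(" ", 2)
    return {"time": time, "place": place, "event": event}

def generate_tags(video):
    view_library, tag_library = {}, {}
    for i, frame in enumerate(sample_frames(video)):
        if not contains_security_event(frame):
            continue
        text = describe(frame)
        view_library[i] = (frame, text)          # picture + structured text
        tag_library[i] = extract_key_info(text)  # the picture's video tag
    return view_library, tag_library

# Ten one-minute frames; only minute 4 contains an event.
video = [{"time": f"00:{m:02d}", "place": "gate",
          "event": "intrusion" if m == 4 else None} for m in range(10)]
views, tags = generate_tags(video)
```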
Fig. 6 is a schematic structural diagram of a video tag generating apparatus according to an embodiment of the present invention, and as shown in fig. 6, the video tag generating apparatus 60 may include:
the determining module 610 is configured to obtain pictures from a video at a preset frequency, input each obtained picture into a security event recognition model trained in advance, and determine whether each picture contains preset security event identification information, where the security event recognition model is obtained by training based on a sample picture labeled with whether the sample picture contains the preset security event identification information.
The obtaining module 620 is configured to perform image recognition on each picture including the preset security event identification information, obtain structured text information for describing the picture, and store the picture and the corresponding structured text information in a view library in an associated manner.
A generating module 630, configured to, for each picture in the view library, extract security event key information from the structured text information of the picture through a natural language processing NLP recognition algorithm, and generate a video tag of the picture, where the security event key information includes: the time of occurrence of the security event, the location of occurrence of the security event, and the content of the security event.
According to the video tag generation device provided by this embodiment of the invention, the judging module acquires pictures from a video at a preset frequency, inputs each acquired picture into a pre-trained security event recognition model, and judges whether each picture contains the preset security event identification information, the security event recognition model being trained on sample pictures labeled with whether they contain the preset security event identification information; the acquisition module performs image recognition on each picture containing the preset security event identification information to obtain structured text information describing the picture, and stores the picture and the corresponding structured text information in the view library in an associated manner; and the generation module, for each picture in the view library, extracts security event key information from the structured text information of the picture through a natural language processing (NLP) recognition algorithm and generates the video tag of the picture, the security event key information including: the occurrence time of the security event, the occurrence place of the security event, and the content of the security event. A large amount of surveillance video therefore does not need to be played back, and video tags do not need to be generated by a large amount of manpower, which saves the labor cost and the time of generating video tags and improves the generation efficiency of video tags.
Optionally, the apparatus may further include: a determining module (not shown in the figure) which can be used for obtaining target retrieval text information of the user; respectively calculating a first matching degree of the target retrieval text information and the structured text information of each picture in the view library; determining pictures in the view library with the first matching degree greater than or equal to a first preset threshold value as target pictures; and determining the security event according to the target picture.
Optionally, when the video tag includes the video recording time of the picture in the video, the determining module may be further configured to: if each first matching degree is smaller than the first preset threshold value, respectively calculate the second matching degrees of the target retrieval text information and the video tags of the pictures in the view library; determine the video recording time included in a video tag whose second matching degree is greater than the second preset threshold value as the target video recording time; determine the video clip that needs to be played back according to the target video recording time; and determine the security event according to the video clip that needs to be played back.
Optionally, the determining module may be specifically configured to acquire the original retrieval information input by the user, and extract retrieval-condition key information from the original retrieval information through the NLP recognition algorithm to generate the target retrieval text information.
Optionally, the apparatus may further include: and a display module (not shown in the figure) configured to display the generated video tag on the video in a preset display manner.
Optionally, the apparatus may further include: a correction module (not shown in the figure), which may be configured to verify the video tag of each picture in the view library and modify any erroneous video tag to obtain a modified video tag.
In addition, corresponding to the video tag generation method provided by the above embodiment, an embodiment of the present invention further provides an electronic device, where the electronic device may include: a memory for storing a program; and a processor for executing the program stored in the memory to implement all the steps of the video tag generation method provided by the embodiment of the invention.
In addition, corresponding to the video tag generation method provided in the foregoing embodiment, an embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when the computer-executable instructions are executed by a processor, all the steps of the video tag generation method according to the embodiment of the present invention are implemented.
Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above embodiments are implemented by a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated in a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above embodiments may be implemented.
The present invention has been described in terms of specific examples, which are provided to aid understanding of the invention and are not intended to be limiting. For a person skilled in the art to which the invention pertains, several simple deductions, modifications or substitutions may be made according to the idea of the invention.

Claims (10)

1. A method for generating a video tag, the method comprising:
acquiring pictures from a video at a preset frequency, respectively inputting each acquired picture into a pre-trained security event recognition model, and judging whether each picture contains preset security event identification information, wherein the security event recognition model is obtained by training based on sample pictures labeled with whether they contain the preset security event identification information;
respectively carrying out image recognition on each picture containing preset security event identification information to obtain structured text information for describing the picture, and storing the picture and the corresponding structured text information in a view library in an associated manner;
for each picture in the view library, extracting security event key information from the structured text information of the picture through a Natural Language Processing (NLP) recognition algorithm to generate a video tag of the picture, wherein the security event key information comprises: the time of occurrence of the security event, the location of occurrence of the security event, and the content of the security event.
2. The method of claim 1, wherein the method further comprises:
acquiring target retrieval text information of a user;
respectively calculating a first matching degree of the target retrieval text information and the structured text information of each picture in the view library;
determining the pictures in the view library with the first matching degree greater than or equal to a first preset threshold value as target pictures;
and determining a security event according to the target picture.
3. The method of claim 2, wherein the video tag comprises a recording time of a picture in the video, the method further comprising:
if the first matching degrees are smaller than the first preset threshold value, respectively calculating second matching degrees of the target retrieval text information and the video tags of the pictures in the view library;
determining the video recording time included by the video label with the second matching degree larger than a second preset threshold value as target video recording time;
determining a video clip needing to be played back according to the target video recording time;
and determining a security event according to the video clip needing to be played back.
4. The method of claim 2, wherein the obtaining the target retrieval text information of the user comprises:
acquiring original retrieval information input by a user;
and extracting key information of the retrieval condition from the original retrieval information through the NLP identification algorithm to generate target retrieval text information.
5. The method of claim 1, wherein the method further comprises:
and displaying the generated video label on the video in a preset display mode.
6. The method of claim 1, wherein the method further comprises:
and verifying the video label of each picture in the view library, and modifying the video label with errors to obtain a modified video label.
7. An apparatus for generating a video tag, the apparatus comprising:
a judging module, configured to acquire pictures from a video at a preset frequency, respectively input each acquired picture into a pre-trained security event recognition model, and judge whether each picture contains preset security event identification information, wherein the security event recognition model is obtained by training based on sample pictures labeled with whether they contain the preset security event identification information;
the acquisition module is used for respectively carrying out image recognition on each picture containing the preset security event identification information to obtain structured text information for describing the picture, and storing the picture and the corresponding structured text information in a view library in an associated manner;
a generating module, configured to, for each picture in the view library, extract security event key information from the structured text information of the picture through a natural language processing NLP recognition algorithm, and generate a video tag of the picture, where the security event key information includes: the time of occurrence of the security event, the location of occurrence of the security event, and the content of the security event.
8. The apparatus of claim 7, wherein the apparatus further comprises: the determining module is used for acquiring target retrieval text information of a user; respectively calculating a first matching degree of the target retrieval text information and the structured text information of each picture in the view library; determining the pictures in the view library with the first matching degree greater than or equal to a first preset threshold value as target pictures; and determining a security event according to the target picture.
9. An electronic device, comprising:
a memory for storing a program;
a processor for implementing the method of any one of claims 1-6 by executing a program stored by the memory.
10. A computer-readable storage medium, characterized in that the medium has stored thereon a program which is executable by a processor to implement the method according to any one of claims 1-6.
CN202111091260.0A 2021-09-17 2021-09-17 Video tag generation method, device and equipment Active CN113821681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111091260.0A CN113821681B (en) 2021-09-17 2021-09-17 Video tag generation method, device and equipment


Publications (2)

Publication Number Publication Date
CN113821681A true CN113821681A (en) 2021-12-21
CN113821681B CN113821681B (en) 2023-09-26

Family

ID=78922296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111091260.0A Active CN113821681B (en) 2021-09-17 2021-09-17 Video tag generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN113821681B (en)

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169547A (en) * 2011-03-01 2011-08-31 武汉鹏晔科技有限公司 Object intelligent monitoring system based on RFID (Radio Frequency Identification)
CN102509084A (en) * 2011-11-18 2012-06-20 中国科学院自动化研究所 Multi-examples-learning-based method for identifying horror video scene
CN102778883A (en) * 2012-08-08 2012-11-14 洁星环保科技投资(上海)有限公司 Crop production and transportation safety monitoring system and method
CN103632147A (en) * 2013-12-10 2014-03-12 公安部第三研究所 System and method for implementing standardized semantic description of facial features
CN103810476A (en) * 2014-02-20 2014-05-21 中国计量学院 Method for re-identifying pedestrians in video monitoring network based on small-group information correlation
CN103854016A (en) * 2014-03-27 2014-06-11 北京大学深圳研究生院 Human body behavior classification and identification method and system based on directional common occurrence characteristics
US8837839B1 (en) * 2010-11-03 2014-09-16 Hrl Laboratories, Llc Method for recognition and pose estimation of multiple occurrences of multiple objects in visual images
CN108737530A (en) * 2018-05-11 2018-11-02 深圳双猴科技有限公司 A kind of content share method and system
CN109325148A (en) * 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 The method and apparatus for generating information
CN109766859A (en) * 2019-01-17 2019-05-17 平安科技(深圳)有限公司 Campus monitoring method, device, equipment and storage medium based on micro- expression
CN110674772A (en) * 2019-09-29 2020-01-10 国家电网有限公司技术学院分公司 Intelligent safety control auxiliary system and method for electric power operation site
CN111047147A (en) * 2019-11-20 2020-04-21 深圳市法本信息技术股份有限公司 Automatic acquisition method for business process and intelligent terminal
US20200273548A1 (en) * 2019-02-21 2020-08-27 Theator inc. Video Used to Automatically Populate a Postoperative Report
CN111695432A (en) * 2020-05-19 2020-09-22 中国电子科技网络信息安全有限公司 Artificial intelligent face abnormity detection system and method under video monitoring scene
CN111783712A (en) * 2020-07-09 2020-10-16 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium
CN111967302A (en) * 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Video tag generation method and device and electronic equipment
CN112446342A (en) * 2020-12-07 2021-03-05 北京邮电大学 Key frame recognition model training method, recognition method and device
CN112579744A (en) * 2020-12-28 2021-03-30 北京智能工场科技有限公司 Method for controlling risk in online psychological consultation
CN112733057A (en) * 2020-11-27 2021-04-30 杭州安恒信息安全技术有限公司 Network content security detection method, electronic device and storage medium
US20210142127A1 (en) * 2019-11-12 2021-05-13 Lg Electronics Inc. Artificial intelligence apparatus and method for recognizing object included in image data
CN113377850A (en) * 2021-06-09 2021-09-10 深圳前海墨斯科技有限公司 Big data technology platform of cognitive Internet of things


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BI Lin et al.: "Recognition of unsafe behaviors of mining truck drivers based on video sequences", Gold Science and Technology (《黄金科学技术》), pages 14 - 24 *

Also Published As

Publication number Publication date
CN113821681B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
US10303768B2 (en) Exploiting multi-modal affect and semantics to assess the persuasiveness of a video
Kamper et al. Semantic speech retrieval with a visually grounded model of untranscribed speech
CN110147726A (en) Business quality detecting method and device, storage medium and electronic device
CN112465008B (en) Voice and visual relevance enhancement method based on self-supervision course learning
Kim et al. Lexicon-free fingerspelling recognition from video: Data, models, and signer adaptation
Şen et al. Multimodal deception detection using real-life trial data
Lee et al. Feature selection in multimedia: the state-of-the-art review
KR20200125682A (en) How and system to search video time segment
CN113035311B (en) Medical image report automatic generation method based on multi-mode attention mechanism
Ristea et al. Emotion recognition system from speech and visual information based on convolutional neural networks
CN114519809A (en) Audio-visual video analysis device and method based on multi-scale semantic network
Duarte et al. Sign language video retrieval with free-form textual queries
Boishakhi et al. Multi-modal hate speech detection using machine learning
CN114661951A (en) Video processing method and device, computer equipment and storage medium
Pasad et al. On the contributions of visual and textual supervision in low-resource semantic speech retrieval
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Naert et al. Per channel automatic annotation of sign language motion capture data
CN113821681B (en) Video tag generation method, device and equipment
CN114565804A (en) NLP model training and recognizing system
Singh Classification of animal sound using convolutional neural network
Gomes Jr et al. Framework for knowledge discovery in educational video repositories
CN116955699B (en) Video cross-mode search model training method, searching method and device
Yan et al. Multi-modal video concept extraction using co-training
CN116522212B (en) Lie detection method, device, equipment and medium based on image text fusion
Preethi et al. Video Captioning using Pre-Trained CNN and LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant