CN112749672A - Photo album video identification method, system, equipment and storage medium - Google Patents

Photo album video identification method, system, equipment and storage medium

Info

Publication number
CN112749672A
CN112749672A (application number CN202110068139.XA)
Authority
CN
China
Prior art keywords
video
layer
album
model
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110068139.XA
Other languages
Chinese (zh)
Inventor
范博
罗超
成丹妮
邹宇
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Travel Network Technology Shanghai Co Ltd
Original Assignee
Ctrip Travel Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Travel Network Technology Shanghai Co Ltd
Priority to CN202110068139.XA
Publication of CN112749672A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an album video identification method, system, device and storage medium. The method comprises the following steps: constructing a video identification model based on deep learning, the model comprising a feature extraction layer, a feature mining layer and an output layer connected in series, wherein the input of the feature extraction layer is video data and the output of the output layer is the probability that the input video data is an album video; inputting the video data to be identified into the video identification model; and determining, from the output of the output layer, whether the video data to be identified is an album video. By using a deep learning method to identify whether a video is an album video, the invention filters album videos quickly and accurately, finds inherent defects of video content in time, greatly reduces operation and maintenance costs, ensures accurate front-end video display and video recommendation, and effectively improves the service experience of users.

Description

Photo album video identification method, system, equipment and storage medium
Technical Field
The invention relates to the technical field of image recognition, and in particular to an album video identification method, system, device and storage medium.
Background
In the current internet environment, video is an important information medium. In an OTA (Online Travel Agency) scenario, the quality of promotional videos for scenic spots or hotels directly affects user experience, and because video sources are complex and varied, guaranteeing video quality is a great challenge for online travel agencies. Video quality control currently relies mainly on manual review, and as the number of hotel-industry videos grows day by day, such maintenance consumes ever greater labor costs. In the OTA industry, album videos uploaded by users are often mixed in with normally shot videos, and experiments show that album videos are far less attractive to users than professionally produced videos. However, the prior art has no good way to detect and identify album videos.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide an album video identification method, system, device and storage medium that, for the situation in which album videos are mixed with normally shot promotional videos in an existing video library, identify album videos quickly and accurately and improve the accuracy of video recommendation.
An embodiment of the invention provides an album video identification method, comprising the following steps:
constructing a video identification model based on deep learning, wherein the video identification model comprises a feature extraction layer, a feature mining layer and an output layer which are sequentially connected in series, the input of the feature extraction layer is video data, and the output of the output layer is the probability that the input video data is predicted to be an album video;
inputting video data to be recognized into the video recognition model;
and determining whether the video data to be identified is an album video or not according to the output of the output layer of the video identification model.
In some embodiments, building the video identification model based on deep learning includes building the video identification model based on a long-term recurrent convolutional network (LRCN) model, the video identification model including a feature extraction layer based on a convolutional neural network and a feature mining layer based on a recurrent neural network.
In some embodiments, after the video identification model is built based on the long-term recurrent convolutional network model, the method further includes the following steps:
training a residual network layer for extracting image features;
and replacing the feature extraction layer in the long-term recurrent convolutional network model with the residual network layer.
In some embodiments, training the residual network layer for extracting image features includes training the residual network layer using image sets from a plurality of scenes.
In some embodiments, replacing the feature extraction layer in the long-term recurrent convolutional network model with the residual network layer comprises the following steps:
modifying the input size of the feature extraction layer in the long-term recurrent convolutional network model to suit the input size of the residual network layer;
removing the output layer of the trained residual network layer, and adjusting the global pooling vector output by the residual network layer to the dimension expected by the feature mining layer;
and connecting the residual network layer and the feature mining layer in series.
In some embodiments, inputting the video data to be identified into the video identification model includes the following step:
inputting multiple frames of images within a preset time range of the video data to be identified into the residual network layer of the video identification model.
In some embodiments, after the video identification model is built based on the long-term recurrent convolutional network model, the method further includes the following step:
replacing the long short-term memory network in the feature mining layer with a bidirectional long short-term memory network.
In some embodiments, the feature extraction layer comprises a trained residual network layer, and/or the feature mining layer comprises a bidirectional long short-term memory network layer.
In some embodiments, after determining whether the video data to be identified is an album video according to the output of the output layer of the video identification model, the method further includes the following steps:
acquiring a manual feedback result, where the manual feedback result includes a human judgment of whether the video data to be identified is an album video;
comparing the manual feedback result with the output of the video identification model, and judging from the comparison whether the model prediction is accurate;
and if the model prediction is not accurate, further training the video identification model based on the manual feedback result.
An embodiment of the invention also provides an album video identification system for implementing the album video identification method, the system comprising:
a model building module, used for building a video identification model based on deep learning, where the video identification model comprises a feature extraction layer, a feature mining layer and an output layer connected in series, the input of the feature extraction layer is video data, and the output of the output layer is the probability that the input video data is an album video;
the video input module is used for inputting video data to be identified into the video identification model;
and the video identification module is used for determining whether the video data to be identified is the album video according to the output of the output layer of the video identification model.
An embodiment of the present invention further provides an album video identification apparatus, including:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the album video identification method via execution of the executable instructions.
An embodiment of the invention also provides a computer-readable storage medium for storing a program, where the program, when executed by a processor, implements the steps of the album video identification method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The album video identification method, system, device and storage medium have the following beneficial effects:
Working on the massive videos of an online travel agency scenario, and addressing the situation in which album videos are mixed with normally shot promotional videos in the existing video library, the invention uses a deep learning method to identify whether a video is an album video. Album videos can thus be filtered quickly and accurately and inherent defects of video content found in time, which greatly reduces operation and maintenance costs, ensures accurate front-end video display and video recommendation, and effectively improves the service experience of users in the online travel agency scenario. The album video identification model and identification method can be used not only for video screening in online travel agency scenarios but also in many other scenarios, such as shopping platforms and cultural publicity, effectively improving video recommendation accuracy.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method for identifying album videos according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an initially constructed video recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an improved video recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an album video identification system according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an album video identification apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
As shown in fig. 1, an embodiment of the present invention provides an album video identification method, comprising the following steps:
s100: constructing a video identification model based on deep learning, wherein the video identification model comprises a feature extraction layer, a feature mining layer and an output layer which are sequentially connected in series, the input of the feature extraction layer is video data, and the output of the output layer is the probability that the input video data is predicted to be an album video;
s200: inputting video data to be recognized into the video recognition model;
s300: and determining whether the video data to be identified is an album video or not according to the output of the output layer of the video identification model.
Working on the massive videos of the online travel agency scenario, and addressing the situation in which album videos in the existing video library are mixed with normally shot promotional videos, the album video identification method constructs, in step S100, a video identification model that predicts whether a video is an album video, and in steps S200 and S300 uses the constructed model to identify and predict the video data to be identified. By identifying album videos with a deep learning method in this way, album videos can be filtered quickly and accurately and inherent defects of video content found in time, which greatly reduces operation and maintenance costs, ensures accurate front-end video display and video recommendation, and effectively improves the service experience of users in the online travel agency scenario. The method can be used to screen the promotional videos recommended for hotels, scenic spots and the like by online travel agencies: album videos are identified quickly and accurately and kept out of the recommendation system, so that they do not harm user experience.
An album video is a video generated from shot photographs by a video album technique. Each frame of an album video corresponds to a relatively independent photo, so the correlation and continuity between adjacent frames are low, the viewing value is lower than that of a formally shot promotional video, and the viewing experience of users suffers. The counterpart of the album video is the normal video, that is, a video obtained with normal video shooting means, in which the correlation and continuity between adjacent frames are high. The album video identification model and identification method can be used for video screening in online travel agency scenarios as well as in many other scenarios, such as shopping platforms and cultural publicity, effectively improving video recommendation accuracy.
In this embodiment, step S100, building the video identification model based on deep learning, includes building the video identification model based on a long-term recurrent convolutional network (LRCN) model. The LRCN model applies a Convolutional Neural Network (CNN) to the input and then uses a Long Short-Term Memory (LSTM) network, a type of Recurrent Neural Network (RNN), for recursive sequence modeling and prediction. Fig. 2 is a schematic diagram of a video identification model constructed based on the LRCN model according to this embodiment. The video identification model includes a CNN-based feature extraction layer and an RNN-based LSTM feature mining layer. The CNN feature extraction layer includes an InceptionV3 network. The dense layer is the output layer, which outputs the probability that the input video data is predicted to be an album video.
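The architecture just described can be pictured, purely as an illustrative Keras sketch (the patent gives no source code, and the LSTM width of 256 units is an assumption), as a per-frame InceptionV3 feature extractor wrapped in a time-distributed layer, followed by an LSTM feature mining layer and a dense sigmoid output giving the album-video probability:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES = 80        # frames taken from the first 20 s of each video
FRAME_SIZE = 299       # InceptionV3's native input size

def build_lrcn_baseline():
    # Per-frame CNN feature extraction layer: InceptionV3 with the
    # classification head removed and global average pooling, so each
    # frame becomes a 2048-dimensional feature vector.
    cnn = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", pooling="avg",
        input_shape=(FRAME_SIZE, FRAME_SIZE, 3))
    cnn.trainable = False

    frames = layers.Input(shape=(NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 3))
    feats = layers.TimeDistributed(cnn)(frames)           # (batch, 80, 2048)
    hidden = layers.LSTM(256)(feats)                       # feature mining layer
    prob = layers.Dense(1, activation="sigmoid")(hidden)   # album-video probability
    return models.Model(frames, prob)

model = build_lrcn_baseline()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```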
In this embodiment, the architecture of the video identification model shown in fig. 2 can be further improved. Specifically, the network type of the feature extraction layer and/or of the feature mining layer may be replaced; the architecture of the improved video identification model is shown in fig. 3.
Specifically, in step S100, after the video identification model is built based on the long-term recurrent convolutional network model, the method further includes the following steps:
training a residual network (ResNet) layer for extracting image features;
and replacing the feature extraction layer in the long-term recurrent convolutional network model with the residual network layer.
In this embodiment, training the residual network layer for extracting image features includes training the residual network layer using image sets from a plurality of scenes. Specifically, a training data set is first constructed, comprising image sets from several different scenes. The data sets for the different scenes can be built from the pictures already in the online travel agency's picture library and their labels. For example, for the hotel scene, a training set on the order of tens of thousands of images is built from hotel-gallery labels such as restaurant, swimming pool, gym and lobby; for the travel photography scene, data on the order of tens of thousands of images is built from travel-gallery labels such as mountains, rivers, flowers and sea.
Corresponding deep convolutional neural networks are then constructed for the different scenes and a residual ResNet network is trained on these image sets. In implementing this scheme, image information was extracted with convolutional neural network methods, and the CNN-based InceptionV3 network that comes with the LRCN was compared against the self-trained ResNet network; the final results show that the ResNet outperforms the LRCN's built-in CNN. Therefore, replacing the LRCN's own CNN with the ResNet network further improves the feature extraction of the video identification model.
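As a hedged illustration of this scene-level pre-training (the directory layout, class names and training hyper-parameters below are assumptions for the sketch, not details from the patent), a ResNet50 classifier could be trained on the labelled image sets roughly as follows:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Scene images assumed to be organized as scene_images/train/<label>/*.jpg,
# where the labels come from gallery tags (restaurant, swimming pool, gym, lobby, ...).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "scene_images/train", image_size=(224, 224), batch_size=32)
num_classes = len(train_ds.class_names)

backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))

inputs = layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.resnet50.preprocess_input(inputs)
x = backbone(x)                                    # 2048-d global pooling vector
outputs = layers.Dense(num_classes, activation="softmax")(x)
scene_classifier = models.Model(inputs, outputs)

scene_classifier.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])
scene_classifier.fit(train_ds, epochs=10)
```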
In this embodiment, the input of the feature extraction layer is multiple frames of images within a preset time range of the video data; for example, multiple frames from the first 20 s of the input video may be used. The start point, end point and length of the preset time range can be adjusted as needed. Step S200, inputting the video data to be identified into the video identification model, includes the following step:
inputting the multiple frames of images within the preset time range of the video data to be identified into the residual network layer of the video identification model. Here, the length of the preset time range is the same as the input length required by the feature extraction layer.
In this embodiment, replacing the feature extraction layer in the long-term recurrent convolutional network model with the residual network layer includes the following steps:
modifying the input size of the feature extraction layer in the long-term recurrent convolutional network model to suit the input size of the residual network layer; first, the network input size is changed from the original 299 x 299 to 224 x 224 to match the ResNet input;
removing the output (dense) layer of the trained residual network layer, and adjusting the global pooling vector output by the residual network layer to the dimension expected by the feature mining layer, namely a 2048-dimensional global pooling vector;
and connecting the residual network layer and the feature mining layer in series, which completes the replacement of the CNN feature extraction layer.
Over the whole feature extraction process, the original model crops each frame of the input video to 299 x 299 and feeds it into the InceptionV3 network, taking 80 frames from the first 20 seconds of each video and outputting an 80 x 2048 two-dimensional feature matrix; after the replacement, each frame is cropped to 224 x 224 and fed into the ResNet network, again taking 80 frames from the first 20 seconds of each video and again outputting an 80 x 2048 two-dimensional feature matrix.
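Continuing the earlier sketches (and reusing the assumed scene_classifier backbone trained above), the replacement could look roughly like this: the trained ResNet minus its dense output layer serves as the per-frame extractor, the frame input becomes 224 x 224, and the resulting 80 x 2048 feature matrix is passed to the feature mining layer in series. This is a sketch under those assumptions, not the patent's own code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# The dense scene-classification head is dropped; because the backbone was built
# with pooling="avg", what remains outputs the 2048-d global pooling vector.
frame_in = layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.resnet50.preprocess_input(frame_in)
features = backbone(x)                                     # trained ResNet, 2048-d
resnet_extractor = models.Model(frame_in, features)

frames = layers.Input(shape=(80, 224, 224, 3))             # 224 x 224 ResNet input
feats = layers.TimeDistributed(resnet_extractor)(frames)   # (batch, 80, 2048)
hidden = layers.LSTM(256)(feats)                            # feature mining layer
prob = layers.Dense(1, activation="sigmoid")(hidden)
video_model = models.Model(frames, prob)
```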
In step S100, after the video identification model is built based on the long-term recurrent convolutional network model, the method further includes the following step:
replacing the long short-term memory network in the feature mining layer with a bidirectional long short-term memory network (BiLSTM). Replacing the original LSTM with a BiLSTM and adding an attention mechanism avoids the drawback that the feature sequence output by the feature extraction layer can only be propagated in one direction in an LSTM network: the LSTM model captures long-range dependencies well, and the BiLSTM, by propagating in both directions, also captures bidirectional dependencies.
In this embodiment, the video identification model is described with a feature extraction layer comprising a trained residual network layer and a feature mining layer comprising a bidirectional long short-term memory network layer. In another alternative embodiment, the video identification model may be left unmodified, that is, it still uses the feature extraction layer and feature mining layer of the original LRCN model. In yet another alternative embodiment, only the feature extraction layer may be modified, so that the feature extraction layer uses a residual network layer while the feature mining layer still uses a long short-term memory network layer. In a further alternative, only the feature mining layer may be modified, so that the feature extraction layer still uses an InceptionV3 network layer while the feature mining layer uses a bidirectional long short-term memory network layer.
In addition, in another alternative embodiment, the video identification model can also be obtained by directly connecting a residual network layer and a bidirectional long short-term memory network layer in series and then attaching the output layer, rather than by modifying the LRCN model. The residual network layer may also be trained with other types of data sets and training schemes, all of which fall within the protection scope of the invention.
In other alternative embodiments, the video identification model may also use other types of network models constructed based on deep learning; it is not limited to the examples described here, and such variants also fall within the scope of the invention.
In this embodiment, step S300, determining whether the video data to be identified is an album video according to the output of the output layer of the video identification model, can be performed according to the probability value output by the video identification model: if the probability value is greater than a preset probability threshold, the video data to be identified is determined to be an album video; otherwise, it is determined to be a normal video.
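In code, this decision step could be sketched as follows (the 0.5 value is illustrative; the preset probability threshold would be chosen in practice, and video_model and sample_frames refer to the assumed sketches above):

```python
import numpy as np

ALBUM_PROB_THRESHOLD = 0.5   # preset probability threshold (illustrative value)

def classify_video(video_model, frames):
    # frames: array of shape (80, 224, 224, 3), e.g. from sample_frames() above
    prob = float(video_model.predict(frames[np.newaxis, ...])[0, 0])
    label = "album video" if prob > ALBUM_PROB_THRESHOLD else "normal video"
    return label, prob
```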
In this embodiment, after determining whether the video data to be identified is an album video according to the output of the output layer of the video identification model, the method further includes the following steps:
acquiring a manual feedback result, where the manual feedback result includes a human judgment of whether the video data to be identified is an album video; specifically, an operator may spot-check the prediction results of the video identification model, or manually confirm the videos the model predicted to be album videos, and give a human judgment of whether each video is an album video or a normal video, a normal video being a video shot with normal video shooting means, as opposed to an album video;
comparing the manual feedback result with the output of the video identification model, and judging from the comparison whether the model prediction is accurate;
and if the model prediction is not accurate, further training the video identification model based on the manual feedback result.
In application, operator feedback can be collected continuously and the video identification model iteratively optimized and retrained, improving the deep-learning-based image information mining capability and the performance of the video identification model.
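A minimal sketch of this feedback-and-retraining loop, assuming feedback records are simple dictionaries (the field names are illustrative, not taken from the patent):

```python
import numpy as np

def collect_corrections(records):
    """records: dicts with 'frames', 'predicted_is_album', 'human_is_album'."""
    # Keep only the samples where the manual judgment contradicts the model.
    return [r for r in records
            if r["predicted_is_album"] != r["human_is_album"]]

def retrain_on_feedback(video_model, corrections, epochs=3):
    x = np.stack([r["frames"] for r in corrections])
    y = np.array([float(r["human_is_album"]) for r in corrections])
    # Fine-tune the identification model on the manually corrected samples.
    video_model.fit(x, y, epochs=epochs, batch_size=8)
```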
As shown in fig. 4, an embodiment of the present invention further provides an album video identification system, configured to implement the album video identification method, where the system includes:
the model building module M100 is used for building a video identification model based on deep learning, the video identification model comprises a feature extraction layer, a feature mining layer and an output layer which are sequentially connected in series, the input of the feature extraction layer is video data, and the output of the output layer is the probability that the input video data is album video;
the video input module M200 is used for inputting video data to be identified into the video identification model;
and the video identification module M300 is used for determining whether the video data to be identified is the album video according to the output of the output layer of the video identification model.
Working on the massive videos of the online travel agency scenario, and addressing the situation in which album videos in the existing video library are mixed with normally shot promotional videos, the album video identification system builds, through the model building module M100, a video identification model that predicts whether a video is an album video, and uses the built model, through the video input module M200 and the video identification module M300, to identify and predict the video data to be identified. By identifying album videos with a deep learning method in this way, album videos can be filtered quickly and accurately and inherent defects of video content found in time, which greatly reduces operation and maintenance costs, ensures accurate front-end video display and video recommendation, and effectively improves the service experience of users in the online travel agency scenario. The system can be used to screen the promotional videos recommended for hotels, scenic spots and the like by online travel agencies: album videos are identified quickly and accurately and kept out of the recommendation system, so that they do not harm user experience.
The album video identification model and identification system can be used for video screening in online travel agency scenarios as well as in many other scenarios, such as shopping platforms and cultural publicity, effectively improving video recommendation accuracy.
In this embodiment, the model building module M100 builds the video identification model based on deep learning by building it based on a long-term recurrent convolutional network model. The video identification model includes a CNN-based feature extraction layer and an RNN-based feature mining layer. In one embodiment, the CNN feature extraction layer includes an InceptionV3 network and the feature mining layer includes an LSTM network. In another alternative embodiment, the InceptionV3 network may be replaced with a residual network. In yet another alternative embodiment, the LSTM network may be replaced with a BiLSTM network. In other alternative embodiments, other network models constructed based on deep learning may also be used, all of which fall within the scope of the invention.
In this embodiment, the input of the feature extraction layer is multiple frames of images within a preset time range of the video data; for example, multiple frames from the first 20 s of the input video may be used. The start point, end point and length of the preset time range can be adjusted as needed. The video input module M200 inputs the video data to be identified into the video identification model by inputting the multiple frames of images within the preset time range of the video data to be identified into the residual network layer of the video identification model. Here, the length of the preset time range is the same as the input length required by the feature extraction layer.
The video identification module M300 determines whether the video data to be identified is an album video according to the output of the output layer of the video identification model. It may do so according to the probability value output by the model: if the probability value is greater than a preset probability threshold, the video data to be identified is determined to be an album video; otherwise, it is determined to be a normal video.
The embodiment of the invention also provides an album video identification device, which comprises a processor; a memory having stored therein executable instructions of the processor; wherein the processor is configured to perform the steps of the album video identification method via execution of the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 5. The electronic device 600 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 that connects the various system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
Wherein the storage unit stores program code executable by the processing unit 610 to cause the processing unit 610 to perform steps according to various exemplary embodiments of the present invention described in the above-mentioned album video identification method section of the present specification. For example, the processing unit 610 may perform the steps as shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In the album video identification device, the program stored in the memory, when executed by the processor, implements the steps of the album video identification method, so the device also achieves the technical effects of the album video identification method.
The embodiment of the invention also provides a computer readable storage medium for storing a program, and the program realizes the steps of the album video identification method when being executed by a processor. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the invention described in the above-mentioned album video identification method section of this specification, when the program product is executed on the terminal device.
Referring to fig. 6, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be executed on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The program in the computer storage medium realizes the steps of the album video identification method when being executed by a processor, so the computer storage medium can also obtain the technical effect of the album video identification method.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (12)

1. An album video identification method, characterized by comprising the following steps:
constructing a video identification model based on deep learning, wherein the video identification model comprises a feature extraction layer, a feature mining layer and an output layer which are sequentially connected in series, the input of the feature extraction layer is video data, and the output of the output layer is the probability that the input video data is predicted to be an album video;
inputting video data to be recognized into the video recognition model;
and determining whether the video data to be identified is an album video or not according to the output of the output layer of the video identification model.
2. The album video identification method according to claim 1, wherein building the video identification model based on deep learning comprises building the video identification model based on a long-term recurrent convolutional network model, and the video identification model comprises a convolutional neural network-based feature extraction layer and a recurrent neural network-based feature mining layer.
3. The album video identification method according to claim 2, wherein after the video identification model is built based on the long-term recurrent convolutional network model, the method further comprises the following steps:
training a residual network layer for extracting image features;
and replacing the feature extraction layer in the long-term recurrent convolutional network model with the residual network layer.
4. The album video identification method according to claim 3, wherein training the residual network layer for extracting image features comprises training the residual network layer using image sets from a plurality of scenes.
5. The album video identification method according to claim 4, wherein replacing the feature extraction layer in the long-term recurrent convolutional network model with the residual network layer comprises the following steps:
modifying the input size of the feature extraction layer in the long-term recurrent convolutional network model to suit the input size of the residual network layer;
removing the output layer of the trained residual network layer, and adjusting the global pooling vector output by the residual network layer to the dimension expected by the feature mining layer;
and connecting the residual network layer and the feature mining layer in series.
6. The album video identification method according to claim 3, wherein inputting the video data to be identified into the video identification model comprises the following step:
inputting multiple frames of images within a preset time range of the video data to be identified into the residual network layer of the video identification model.
7. The album video identification method according to claim 2 or 3, wherein after the video identification model is built based on the long-term recurrent convolutional network model, the method further comprises the following step:
replacing the long short-term memory network in the feature mining layer with a bidirectional long short-term memory network.
8. The album video identification method according to claim 1, wherein the feature extraction layer comprises a trained residual network layer, and/or the feature mining layer comprises a bidirectional long short-term memory network layer.
9. The album video identification method according to claim 1, wherein after determining whether the video data to be identified is an album video according to the output of the output layer of the video identification model, the method further comprises the following steps:
acquiring a manual feedback result, wherein the manual feedback result comprises a human judgment of whether the video data to be identified is an album video;
comparing the manual feedback result with the output of the video identification model, and judging from the comparison whether the model prediction is accurate;
and if the model prediction is not accurate, further training the video identification model based on the manual feedback result.
10. An album video recognition system for implementing the album video recognition method according to any one of claims 1 to 9, the system comprising:
the model building module is used for building a video recognition model based on deep learning, the video recognition model comprises a feature extraction layer, a feature mining layer and an output layer which are sequentially connected in series, the input of the feature extraction layer is video data, and the output of the output layer is the probability that the input video data is album video;
the video input module is used for inputting video data to be identified into the video identification model;
and the video identification module is used for determining whether the video data to be identified is the album video according to the output of the output layer of the video identification model.
11. An apparatus for video recognition of an album, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the album video identification method according to any of claims 1 to 9 via execution of the executable instructions.
12. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the steps of the album video recognition method according to any one of claims 1 to 9.
CN202110068139.XA 2021-01-19 2021-01-19 Photo album video identification method, system, equipment and storage medium Pending CN112749672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110068139.XA CN112749672A (en) 2021-01-19 2021-01-19 Photo album video identification method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110068139.XA CN112749672A (en) 2021-01-19 2021-01-19 Photo album video identification method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112749672A true CN112749672A (en) 2021-05-04

Family

ID=75652466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110068139.XA Pending CN112749672A (en) 2021-01-19 2021-01-19 Photo album video identification method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112749672A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787458A (en) * 2016-03-11 2016-07-20 重庆邮电大学 Infrared behavior identification method based on adaptive fusion of artificial design feature and depth learning feature
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN109919031A (en) * 2019-01-31 2019-06-21 厦门大学 A kind of Human bodys' response method based on deep neural network
CN111274995A (en) * 2020-02-13 2020-06-12 腾讯科技(深圳)有限公司 Video classification method, device, equipment and computer readable storage medium
US20200210708A1 (en) * 2019-01-02 2020-07-02 Boe Technology Group Co., Ltd. Method and device for video classification
CN111369299A (en) * 2020-03-11 2020-07-03 腾讯科技(深圳)有限公司 Method, device and equipment for identification and computer readable storage medium


Similar Documents

Publication Publication Date Title
CN113095434B (en) Target detection method and device, electronic equipment and storage medium
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN112488073A (en) Target detection method, system, device and storage medium
CN111612010B (en) Image processing method, device, equipment and computer readable storage medium
CN110751224A (en) Training method of video classification model, video classification method, device and equipment
CN113869138A (en) Multi-scale target detection method and device and computer readable storage medium
CN110674673A (en) Key video frame extraction method, device and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN112287144B (en) Picture retrieval method, equipment and storage medium
CN110909578A (en) Low-resolution image recognition method and device and storage medium
CN112883818A (en) Text image recognition method, system, device and storage medium
CN118229967A (en) Model construction method, image segmentation method, device, equipment and medium
CN113011320B (en) Video processing method, device, electronic equipment and storage medium
CN116980541B (en) Video editing method, device, electronic equipment and storage medium
CN112182281A (en) Audio recommendation method and device and storage medium
CN113239883A (en) Method and device for training classification model, electronic equipment and storage medium
CN115984302B (en) Multi-mode remote sensing image processing method based on sparse hybrid expert network pre-training
CN111797258A (en) Image pushing method, system, equipment and storage medium based on aesthetic evaluation
CN111818363A (en) Short video extraction method, system, device and storage medium
CN112749672A (en) Photo album video identification method, system, equipment and storage medium
CN114494971A (en) Video yellow-related detection method and device, electronic equipment and storage medium
CN114821396A (en) Normative detection method, device and storage medium for LNG unloading operation process
CN112734778A (en) Vehicle matting method, system, equipment and storage medium based on neural network
CN112488200A (en) Logistics address feature extraction method, system, equipment and storage medium
CN112651332B (en) Scene facility identification method, system, equipment and storage medium based on photo library

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination