CN113011254B - Video data processing method, computer equipment and readable storage medium

Video data processing method, computer equipment and readable storage medium

Info

Publication number
CN113011254B
CN113011254B (application CN202110159590.2A)
Authority
CN
China
Prior art keywords
character
image
video
data
frame image
Prior art date
Legal status
Active
Application number
CN202110159590.2A
Other languages
Chinese (zh)
Other versions
CN113011254A (en)
Inventor
尚焱
李松南
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110159590.2A
Publication of CN113011254A
Application granted
Publication of CN113011254B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a video data processing method, a computer device and a readable storage medium, relating to blockchain technology and to video processing technology in artificial intelligence. The method comprises the following steps: acquiring a key frame image from video data; identifying key image features of the key frame image based on a character detection model, performing character region feature matching on the key image features, and determining a character region in the key frame image; extracting features of the character region based on an image recognition model, recognizing character data of the key frame image from the character region according to the extracted features, and matching the character data against a character database to obtain a character detection result for the character region; and, if the character detection result is a result of the character data matching the character database, obtaining a target character string matched with the character data and determining the data category of the target character string as the video category of the video data. By adopting the embodiment of the application, the efficiency and accuracy of data detection can be improved.

Description

Video data processing method, computer equipment and readable storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video data processing method, a computer device, and a readable storage medium.
Background
A user may record video data and upload it to a social platform so that other users can view it and interact with it. To protect the copyright of original creators, some social platforms add a watermark to the video data uploaded by a user, preventing malicious users from stealing and re-uploading other people's original videos. Therefore, by detecting the video data uploaded by a user, it can be determined whether the video data includes a watermark, and thus whether the video data infringes the rights of others.
The existing video data detection method generally detects only specific positions in the video data, such as the upper left, lower left, upper right and lower right corners, to determine whether the video data contains a watermark; a watermark located anywhere else in the frame is therefore easily missed.
Disclosure of Invention
The embodiment of the application provides a video data processing method, computer equipment and a readable storage medium, which can improve the accuracy and efficiency of data detection.
An aspect of an embodiment of the present application provides a video data processing method, including:
acquiring key frame images from at least two video frame images constituting video data;
identifying key image features of the key frame image based on a character detection model, performing character region feature matching on the key image features, and determining character regions in the key frame image;
extracting features of the character region based on an image recognition model, recognizing character data of the key frame image from the character region according to the extracted features, and performing character matching on the character data of the key frame image and a character database to obtain a character detection result of the character region in the key frame image;
if the character detection result is a result of matching the character data with the character database, a target character string matched with the character data is obtained from the character database, and a data category corresponding to the target character string is determined as the video category to which the video data belongs.
Another aspect of the embodiment of the present application provides a video data processing method, including:
acquiring sample key frame images from at least two sample video frame images forming sample video data, and acquiring sample area labels in the sample key frame images;
identifying sample key image features of the sample key frame image based on an initial character detection model, performing character region feature matching on the sample key image features, and determining sample character regions in the sample key frame image;
generating a first loss function based on the sample character region and the sample region label, training the initial character detection model based on the first loss function, and generating a character detection model.
An aspect of an embodiment of the present application provides a video data processing apparatus, including:
the image acquisition module is used for acquiring key frame images from at least two video frame images forming video data;
the character recognition module is used for recognizing key image features of the key frame image based on the character detection model, carrying out character region feature matching on the key image features and determining character regions in the key frame image;
the character matching module is used for extracting the characteristics of the character area based on the image recognition model, recognizing the character data of the key frame image from the character area according to the extracted characteristics, and carrying out character matching on the character data of the key frame image and a character database to obtain a character detection result of the character area in the key frame image;
the category determining module is used for acquiring a target character string matched with the character data from the character database if the character detection result is a result of matching the character data with the character database, and determining a data category corresponding to the target character string as a video category to which the video data belongs.
Optionally, the image acquisition module includes:
the image matching unit is used for carrying out image feature matching on an ith video frame image and an (i+1) th video frame image in the at least two video frame images to obtain the similarity between the ith video frame image and the (i+1) th video frame image; i is a positive integer;
a first image determining unit, configured to determine the (i+1) th video frame image as a key frame image of the video data if the similarity between the i-th video frame image and the (i+1) th video frame image is smaller than the video similarity threshold, and perform image feature matching on the (i+1) th video frame image and the (i+2) th video frame image to obtain the similarity between the (i+1) th video frame image and the (i+2) th video frame image;
a second image determining unit, configured to perform image feature matching on the (i+1)th video frame image and the (i+2)th video frame image if the similarity between the i-th video frame image and the (i+1)th video frame image is greater than or equal to the video similarity threshold, so as to obtain the similarity between the (i+1)th video frame image and the (i+2)th video frame image; key frame images of the video data are obtained in this way until the (i+2)th video frame image is the last of the at least two video frame images.
The character recognition module includes:
the feature extraction unit is used for extracting features of the key frame image based on the convolution layer in the character detection model to obtain key image features of the key frame image;
the feature stitching unit is used for stitching the features of the key images to obtain stitched feature images corresponding to the key frame images; the pixel values of the pixel points in the spliced characteristic image are used for representing the probability that the pixel points in the corresponding key frame image are characters;
the image determining unit is used for acquiring the probability range of each pixel value in the spliced characteristic image and generating a probability image and a character frame image according to the probability range of each pixel value;
and the region determining unit is used for carrying out feature fusion on the probability image and the character frame image, generating a fused character image, and determining a character region in the key frame image based on the fused character image.
The character matching module comprises:
the sequence acquisition unit is used for extracting the characteristics of the character area based on the convolution layer in the image recognition model to obtain the convolution characteristics corresponding to the character area, and carrying out serialization processing on the convolution characteristics corresponding to the character area to obtain the characteristic sequence corresponding to the character area;
the circulation processing unit is used for carrying out recognition processing on the feature sequence based on a circulation layer in the image recognition model and determining sequence character features corresponding to the feature sequence;
and the feature conversion unit is used for carrying out feature conversion on the sequence character features based on the transcription layer in the image recognition model to obtain character data of the key frame image.
Optionally, the number of key frame images in the video data is N; n is a positive integer; the character matching module comprises:
the character combination unit is used for combining the character data of the N key frame images in the video data to obtain combined character data;
the word segmentation determining unit is used for carrying out word segmentation processing on the combined character data and determining M word segmentation character data corresponding to the video data; m is a positive integer;
the character matching unit is used for respectively carrying out character matching on M word segmentation character data corresponding to the video data and the character database to obtain k matching character strings and matching numbers respectively corresponding to the k matching character strings; the matching number is used for representing the number of character data matched with the matching character string in the M word segmentation character data; k is a positive integer;
the result determining unit is used for determining that the character detection result is a result of matching the character data with the character database if there is a matching character string whose matching number is greater than the matching threshold;
and a character determining unit for determining the matching character strings with the matching quantity larger than the matching threshold value as target character strings matched with the character data.
Optionally, the apparatus further comprises:
the data response module is used for responding to the uploading request of the user terminal for the video data;
the data prompt module is used for sending a data uploading abnormal prompt to the user terminal if the video category to which the video data belongs is a marked video category; the data uploading abnormal prompt comprises the video category to which the video data belongs;
and the data uploading module is used for uploading the video data to an application program if the video category to which the video data belongs does not belong to the marked video category.
In one aspect, an embodiment of the present application provides another video data processing apparatus, including:
the regional tag acquisition module is used for acquiring sample key frame images from at least two sample video frame images forming sample video data and acquiring sample regional tags in the sample key frame images;
the sample region determining module is used for identifying sample key image features of the sample key frame image based on the initial character detection model, carrying out character region feature matching on the sample key image features, and determining sample character regions in the sample key frame image;
the detection model generation module is used for generating a first loss function based on the sample character area and the sample area label, training the initial character detection model based on the first loss function, and generating a character detection model.
Optionally, the apparatus further comprises:
the character tag acquisition module is used for acquiring sample character tags in the sample key frame images;
the sample character acquisition module is used for extracting the characteristics of the sample character area based on the initial image recognition model, and recognizing sample character data of the sample key frame image from the sample character area according to the extracted sample characteristics;
and the recognition model generation module is used for generating a second loss function based on the sample character data and the sample character label, training the initial image recognition model based on the second loss function and generating an image recognition model.
In one aspect, the application provides a computer device comprising: a processor, a memory, a network interface;
The processor is connected to the memory and the network interface, wherein the network interface is used for providing data communication functions, the memory is used for storing a computer program, and the processor is used for calling the computer program to cause the computer device comprising the processor to perform the method.
In one aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored therein, the computer program being adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the above-described method.
In one aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various optional implementations of the above aspects of the embodiments of the application.
In the embodiment of the application, key frame images are acquired from the at least two video frame images constituting the video data. Because the key frame images are representative of the at least two video frame images contained in the video data, acquiring key frame images from the video data for recognition processing improves the efficiency of data detection. The character region in the key frame image is determined by identifying the image features in the key frame image, so that when the character data in the character region is recognized, only the character region needs to be recognized rather than the whole key frame image, which improves the efficiency of data recognition. Further, the video frame image is first detected to determine the character region in the key frame image, and the character region is then recognized to determine the character data it contains; this amounts to recognizing the key frame image twice, which improves the accuracy of data detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a video data processing system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario of a video data processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of a video data processing method according to an embodiment of the present application;
FIG. 4 is a schematic view of a scene of determining a character region in a key frame image based on a character detection model according to an embodiment of the present application;
FIG. 5 is a schematic view of a scene of determining character data of a key frame image based on an image recognition model according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for determining a key frame image according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a scenario in which a key frame sequence is extracted according to an embodiment of the present application;
FIG. 8 is a flowchart of another video data processing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the composition structure of a video data processing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram of the composition structure of another video data processing apparatus according to an embodiment of the present application;
FIG. 11 is a schematic diagram of the composition structure of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Artificial intelligence technology is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer vision (CV) technology is a science that studies how to make a machine "see"; more specifically, it uses cameras and computers to replace human eyes in machine-vision tasks such as recognizing, tracking and measuring targets, and further performs graphics processing so that images become more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technology typically includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point (P2P) transmission, consensus mechanisms and encryption algorithms. It is essentially a decentralised database: a series of data blocks generated in association using cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, and an application service layer. A blockchain may be composed of a plurality of serial transaction records (also known as blocks) that are cryptographically concatenated and content-protected, and the distributed ledger concatenated by the blockchain enables multiple parties to effectively record transactions and permanently verify them (they are non-tamperable). The consensus mechanism is a mathematical algorithm for establishing trust and acquiring rights among different nodes in the blockchain network; that is, it is a mathematical algorithm commonly recognized by the network nodes of the blockchain.
The application relates to blockchain technology and to video processing technology in artificial intelligence. Using blockchain technology, video data can be stored in a blockchain network; using video processing technology, image detection is performed on the video data to determine the character regions in the images, features are extracted from the character regions to determine the character data of the images, the recognized character data is matched against a character database, and the video category to which the video data belongs is determined on the basis of the character matching result. The application may also use blockchain technology to store the character data in the character database, the video category to which the video data belongs, and the like, in the blockchain network. By detecting the key frame image, determining the character region in the key frame image, and recognizing the character region to obtain a character detection result, thereby determining the video category to which the video data belongs, the efficiency and accuracy of data detection can be improved.
Referring to fig. 1, fig. 1 is a network architecture diagram of a video data processing system according to an embodiment of the present application, as shown in fig. 1, a computer device 101 may perform data interaction with a user terminal, and the number of user terminals may be one or more, for example, when the number of user terminals is a plurality, the user terminals may include the user terminal 102a, the user terminal 102b, and the user terminal 102c in fig. 1. Taking the user terminal 102a as an example, the computer device 101 may respond to an upload request of the user terminal 102a for video data, and acquire a key frame image from at least two video frame images constituting the video data based on the upload request. Further, the computer device 101 may identify key image features of the key frame image based on the character detection model, perform character region feature matching on the key image features, and determine character regions in the key frame image; and carrying out feature extraction on the character region based on the image recognition model, recognizing character data of the key frame image from the character region according to the extracted features, and carrying out character matching on the character data of the key frame image and a character database to obtain a character detection result of the character region in the key frame image. Further, if the character detection result is a result that the character data matches the character database, the computer device 101 may acquire a target character string matching the character data from the character database, and determine a data category corresponding to the target character string as a video category to which the video data belongs.
The character area in the key frame image is determined by identifying the image features in the key frame image, so that when character data in the character area is identified, only the character area in the key frame image is required to be identified, the whole key frame image is not required to be identified, and the data identification efficiency can be improved. Further, as the first detection and recognition are carried out on the video frame image, the character area in the key frame image is determined, the character area is recognized, and the character data in the character area is determined, which is equivalent to the two times of recognition on the key frame image, so that the accuracy of data detection can be improved.
It is understood that the computer devices mentioned in the embodiments of the present application include, but are not limited to, terminal devices or servers. In other words, the computer device or the user terminal may be a server or a terminal device, or may be a system formed by the server and the terminal device. The above-mentioned terminal device may be an electronic device, including, but not limited to, a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palm computer, a vehicle-mounted device, an augmented Reality/Virtual Reality (AR/VR) device, a head-mounted display, a wearable device, a smart speaker, a digital camera, a camera, and other mobile internet devices (mobile internet device, MID) with network access capability. The servers mentioned above may be independent physical servers, or may be server clusters or distributed systems formed by a plurality of physical servers, or may be cloud servers that provide cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, vehicle-road collaboration, content distribution networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Further, referring to fig. 2, fig. 2 is a schematic diagram of an application scenario of a video data processing method according to an embodiment of the present application. As shown in fig. 2, the user terminal 20 sends an upload request for video data to the computer device 22, where the upload request carries the video data. The computer device 22 obtains a key frame image 21 from at least two video frame images that constitute the video data, identifies key image features of the key frame image 21 based on a character detection model, performs character region feature matching on the key image features, and determines a character region 23 in the key frame image 21. The computer device 22 then performs feature extraction on the character region 23 based on the image recognition model, and recognizes character data 24 of the key frame image from the character region 23 based on the extracted features. For example, if the character data 24 of the identified key frame image is "Tencent video", the character data "Tencent video" is character-matched with the character database to obtain the character detection result of the character region in the key frame image 21. If the character detection result is a result of matching the character data with the character database, a target character string matched with the character data is obtained from the character database, and the data category corresponding to the target character string is determined as the video category to which the video data belongs. Optionally, if the video category belongs to a marked video category, the computer device 22 may also send a data upload exception prompt to the user terminal 20; for example, the prompt may read "uploading is prohibited to avoid risk because the uploaded video contains a 'Tencent video' mark", so that the user can view the prompt through the user terminal and make corresponding modifications.
Further, referring to fig. 3, fig. 3 is a flow chart of a video data processing method according to an embodiment of the present application; as shown in fig. 3, the method includes:
s101, acquiring key frame images from at least two video frame images forming video data.
In the embodiment of the application, the computer equipment can acquire video data from a local database; alternatively, video data may be acquired from other storage media; alternatively, the computer device may also obtain video data from the user terminal. The computer equipment obtains at least two video frame images by splitting the obtained video data, and obtains a key frame image by performing frame extraction processing on the at least two video frame images. Taking the example that the computer equipment acquires video data from the user terminal, when a user sends an uploading request for the video data through the user terminal, the computer equipment acquires the video data based on the uploading request, and if the video data is data composed of one video frame image, the video frame image is determined to be a key frame image. If the video data is data composed of at least two video frame images, the computer device may split the video data to obtain at least two video frame images composing the video data, and perform frame extraction processing on the at least two video frame images to obtain a key frame image. The key frame image may reflect a large amount of image information in the video data. The computer device processes the key frame images, and because the key frame images are representative images in at least two video frame images contained in the video data and the number of the key frame images is smaller than the total number of the video frame images in the video data, the efficiency of data processing can be improved by processing the key frame images, and the content of the video data can be accurately reflected by the data detection result. In the embodiment of the present application, if the video data includes a key frame image, the processing of steps S102 to S104 is performed for the key frame image. If the video data includes a plurality of key frame images, the processing of steps S102 to S104 is performed for each of the plurality of key frame images.
Optionally, the computer device may determine the key frame image based on a similarity between adjacent ones of the at least two video frame images; alternatively, the key frame images may be determined based on the number of video frame images in the video data, e.g., the key frame number is determined based on the number of video frame images in the video data, and the key frame images are extracted from at least two video frame images based on the key frame number; alternatively, the key frame image is determined based on the duration of the video data, for example, the number of key frames is determined based on the duration of the video data, the key frame position is determined based on the number of key frames, the video frame image located at the key frame position of at least two video frame images constituting the video data is determined as the key frame image, and the like, without limitation.
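To make the similarity-based strategy concrete, the following minimal sketch (not from the patent) shows one way to extract key frames by comparing adjacent video frame images. It assumes Python with OpenCV; the grey-level histogram feature, the correlation similarity measure, and the 0.9 threshold are illustrative stand-ins for the image feature matching and the video similarity threshold described above.

    import cv2

    def extract_key_frames(video_path, sim_threshold=0.9):
        """Keep the first frame, then keep any frame whose similarity to
        the previous frame falls below the threshold (a scene change)."""
        cap = cv2.VideoCapture(video_path)
        key_frames, prev_hist = [], None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
            hist = cv2.normalize(hist, None).flatten()
            if prev_hist is None:
                key_frames.append(frame)  # first frame is always kept
            elif cv2.compareHist(prev_hist, hist,
                                 cv2.HISTCMP_CORREL) < sim_threshold:
                key_frames.append(frame)  # similarity below threshold
            prev_hist = hist
        cap.release()
        return key_frames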
S102, identifying key image features of the key frame image based on the character detection model, performing character region feature matching on the key image features, and determining character regions in the key frame image.
In the embodiment of the application, the computer equipment identifies the key image features of the key frame image based on the character detection model, performs character region feature matching on the key image features, and determines the character region in the key frame image. The key image features may refer to the image features in step S101 described above. The computer device may extract features in the key frame image as key image features for reflecting image information in the key frame image, such as objects included in the key frame image, such as characters and object information other than characters. The computer device determines character areas in the key frame image by character area feature matching of the key image features.
Character region feature matching is used to determine, from the key image features, the probability that characters are indicated, and to determine, based on that probability, the regions of the key frame image in which characters are likely to be displayed; that is, character region feature matching specifically refers to a process of performing feature matching to determine character regions. Specifically, the computer device may perform feature extraction on the key frame image based on the convolution layer in the character detection model to obtain the key image features of the key frame image. Further, character region feature matching is performed on the key image features to determine the character regions in the key frame image. Specifically, the key image features may be feature-stitched to obtain a stitched feature image corresponding to the key frame image, where the pixel values of the pixel points in the stitched feature image represent the probability that the corresponding pixel points in the key frame image are characters; the probability range to which each pixel value in the stitched feature image belongs is acquired, and a probability image and a character frame image are generated according to the probability range to which each pixel value belongs; and the probability image and the character frame image are feature-fused to generate a fused character image, based on which the character region in the key frame image is determined.
The character detection model may contain multiple convolution layers, each with a different convolution kernel; a convolution kernel is simply an a×a matrix (such as 1×1, 3×3, etc.). In a specific implementation, the key frame image may be quantized to obtain a pixel matrix corresponding to the key frame image. The pixel matrix is an m×n matrix, where m×n equals the pixel dimensions of the key frame image, and the values in the pixel matrix are quantized values obtained by comprehensively quantizing luminance, chromaticity, and the like in the key frame image. For example, if the key frame image is a 1920×2040 picture, the pixel matrix corresponding to the key frame image is a 1920×2040 matrix, and each value in the matrix is the quantized value of the corresponding pixel. The pixel matrix of the key frame image is then convolved with the matrix corresponding to the convolution kernel to obtain the key image features. Because the convolution kernels of the convolution layers differ, the key image features obtained after feature extraction with different convolution layers differ, as does their number, and feature-stitching the obtained key image features reflects the feature information in the key frame image more completely.
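As a worked illustration of the kernel arithmetic above (not from the patent; plain NumPy, with a small 3×3 kernel, no padding, and stride 1 assumed):

    import numpy as np

    def conv2d_valid(pixels, kernel):
        """Slide the a-by-a kernel over the pixel matrix and sum the
        element-wise products at each position."""
        a = kernel.shape[0]
        m, n = pixels.shape
        out = np.zeros((m - a + 1, n - a + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(pixels[i:i + a, j:j + a] * kernel)
        return out

    # A 3x3 edge-like kernel applied to a toy 5x5 quantized pixel matrix.
    feature = conv2d_valid(np.arange(25.0).reshape(5, 5),
                           np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]]))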
Referring to fig. 4, fig. 4 is a schematic view of a scene of determining a character region in a key frame image based on a character detection model. The computer device inputs a key frame image 41 into the character detection model and performs feature extraction on the key frame image based on the convolution layers 42 in the character detection model to obtain the key image features of the key frame image, where the convolution layers 42 may include h convolution layers, h being a positive integer. For example, with h equal to 5, the convolution layers 42 include a first convolution layer f1, a second convolution layer f2, a third convolution layer f3, a fourth convolution layer f4, and a fifth convolution layer f5, and the key image features extracted by the 5 convolution layers differ. The first convolution layer f1 performs feature extraction on the key frame image 41 to obtain first key image features; the second convolution layer f2 performs feature extraction on the first key image features to obtain second key image features; the third convolution layer f3 performs feature extraction on the second key image features to obtain third key image features; the fourth convolution layer f4 performs feature extraction on the third key image features to obtain fourth key image features; and the fifth convolution layer f5 performs feature extraction on the fourth key image features to obtain fifth key image features. The fifth key image features are upsampled by a factor of two (up ×2) to obtain sampled fifth key image features, which are fused with the fourth key image features to obtain fused fourth key image features. The fused fourth key image features are upsampled by a factor of two (up ×2) to obtain sampled fourth key image features, which are fused with the third key image features to obtain fused third key image features. The fused third key image features are upsampled by a factor of two (up ×2) to obtain sampled third key image features, which are fused with the second key image features to obtain fused second key image features. The fifth key image features are upsampled by a factor of eight (up ×8) and convolved to obtain a first sampled image; the fused fourth key image features are upsampled by a factor of four (up ×4) and convolved to obtain a second sampled image; the fused third key image features are upsampled by a factor of two (up ×2) and convolved to obtain a third sampled image; and the fused second key image features are convolved to obtain a fourth sampled image. Feature stitching is performed on the first, second, third and fourth sampled images to obtain the stitched feature image 43 corresponding to the key frame image 41. The pixel values of the pixel points in the stitched feature image represent the probability that the corresponding pixel points in the key frame image are characters. The computer device obtains the probability range to which each pixel value in the stitched feature image 43 belongs, and generates a probability image 44 and a character frame image 45 according to the probability range to which each pixel value belongs.
Different probability ranges may be represented with different colors: the computer device may determine the color corresponding to each pixel value based on the probability range to which the pixel value belongs in the stitched feature image 43, and generate the probability image 44 from the colors corresponding to the pixel values. For example, the probability image may be represented as a heat map, i.e., a graphical representation in which regions of interest are displayed in particular highlighted forms; based on the heat map, in an embodiment of the present application, the character region and the position of the character region in the key frame image are displayed in different highlighted forms (i.e., different colors). The probability image 44 and the character frame image 45 are feature-fused to generate a fused character image 46, which may represent the position of the characters in the key frame image, and the character region 47 in the key frame image is determined based on the fused character image 46.
Extracting features of the key frame image with the convolution layers in the character detection model allows a plurality of key image features corresponding to the key frame image to be extracted; stitching the plurality of key image features yields stitched features that reflect the image information in the key frame image more completely. By generating the probability image and the character frame image corresponding to the key frame image, the probability image can represent the probability that a pixel point in the key frame image is a character, and the character frame image can represent the character frame positions in the key frame image, so that the character region in the key frame image can be determined from the characters and the character frame positions. This reflects the character region in the video frame image more accurately, facilitates recognition of the character region, and improves the accuracy of determining the video category to which the video data belongs.
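The fusion of the probability image and the character frame image can be made concrete with a short post-processing sketch. This is not the patent's implementation; it is a hedged illustration assuming Python with NumPy and OpenCV, where prob_map and box_map are hypothetical H×W arrays in [0, 1] standing for the probability image and the character frame image produced by the model:

    import cv2
    import numpy as np

    def regions_from_maps(prob_map, box_map, bin_thresh=0.3, min_area=10):
        """Fuse the two maps, binarize, and read character regions off the
        connected components as bounding boxes (x, y, w, h)."""
        fused = prob_map * box_map                       # feature fusion
        binary = (fused > bin_thresh).astype(np.uint8)   # fused character image
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        regions = []
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if w * h >= min_area:                        # drop speckle noise
                regions.append((x, y, w, h))
        return regions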
Optionally, when the key frame image is recognized using the character detection model, the computer device detects a plurality of candidate regions containing character features, together with the confidence of each candidate region, but the final character region still has to be determined from among them; for example, the computer device may detect candidate regions including "month", "skill", "Tencent video", "news video", and the like. Specifically, the computer device obtains the candidate region with the highest confidence among the plurality of candidate regions, calculates the region overlap (intersection over union, IoU) between the candidate region with the highest confidence and each other candidate region using a non-maximum suppression (NMS) algorithm, and compares the region overlap with an overlap threshold, thereby determining the final character region "Tencent video". The overlap threshold may be set to 0.7, 0.8, 0.9, or another value, which is not limited in this embodiment. Repeated candidate regions in the key frame image may thus be removed using the non-maximum suppression algorithm to determine the final character region.
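The IoU computation and the suppression loop can be sketched as follows. This is a generic NMS illustration, not code from the patent; boxes are assumed to be (x1, y1, x2, y2) tuples, and 0.7 is one of the example overlap thresholds mentioned above:

    def iou(a, b):
        """Region overlap (intersection over union) of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    def nms(boxes, scores, overlap_threshold=0.7):
        """Repeatedly keep the highest-confidence candidate region and drop
        every remaining candidate that overlaps it beyond the threshold."""
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        kept = []
        while order:
            best = order.pop(0)
            kept.append(best)
            order = [i for i in order
                     if iou(boxes[best], boxes[i]) <= overlap_threshold]
        return [boxes[i] for i in kept]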
S103, extracting features of the character region based on the image recognition model, recognizing character data of the key frame image from the character region according to the extracted features, and performing character matching on the character data of the key frame image with a character database to obtain a character detection result of the character region in the key frame image.
In the embodiment of the application, the computer device performs feature extraction on the character region based on the image recognition model, recognizes the character data of the key frame image from the character region according to the extracted features, and obtains the character detection result of the character region in the key frame image by character-matching the character data of the key frame image with the character database. The character data may refer to specific characters, for example "Tencent video"; the computer device character-matches "Tencent video" with the character database to obtain the character detection result of the character region in the key frame image. If the character database contains "Tencent video", the character detection result is determined to be a result of the character data matching the character database. If the character database does not contain "Tencent video", the character detection result is determined to be a result of the character data not matching the character database, and the computer device can output information indicating that the video data is legitimate data. Legitimate data may mean that the video data contains no watermark and therefore does not infringe the copyright of others, so that the video data can be uploaded to an application program for other users to view and interact with.
Optionally, the computer device performs feature extraction on the character region based on the image recognition model, and the method for recognizing the character data of the key frame image from the character region according to the extracted features may include: the computer equipment performs feature extraction on the character region based on the convolution layer in the image recognition model to obtain convolution features corresponding to the character region, and performs serialization processing on the convolution features corresponding to the character region to obtain a feature sequence corresponding to the character region; identifying the feature sequence based on a circulating layer in the image identification model, and determining the sequence character features corresponding to the feature sequence; and performing feature conversion on the sequence character features based on a transcription layer in the image recognition model to obtain character data of the key frame image.
In a specific implementation, the image recognition model may include a feature extraction network containing the convolution layer, and the computer device may perform feature extraction on the character region based on the convolution layer in the feature extraction network to obtain a plurality of convolution features corresponding to the character region. The feature extraction network may include, but is not limited to, deep learning networks such as convolutional neural networks (CNN), the YOLO (You Only Look Once) detector, and the Single Shot MultiBox Detector (SSD). The computer device serializes the plurality of convolution features corresponding to the character region to obtain the feature sequence corresponding to the character region, and determines the sequence character features corresponding to the feature sequence by performing recognition processing on the feature sequence based on the circulating layer in the image recognition model, where the circulating layer is used to recognize the feature sequence corresponding to the convolution features and determine the character features corresponding to each feature sequence. The computer device performs feature conversion on the sequence character features based on the transcription layer in the image recognition model, and integrates the empty characters and repeated characters in the character features obtained after the feature conversion to obtain the character data of the key frame image. The circulating layer may include a long short-term memory (LSTM) recurrent neural network or another deep learning network, and the transcription layer may include the connectionist temporal classification (CTC) algorithm or another algorithm.
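The convolution layer / circulating layer / transcription layer pipeline just described matches the well-known CRNN layout. The sketch below, assuming PyTorch, shows the shape of such a recognizer; the channel counts, the 32-pixel input height, and the class count are illustrative assumptions rather than values from the patent:

    import torch.nn as nn

    class CRNN(nn.Module):
        """Convolution layers -> feature sequence -> recurrent (LSTM) layer
        -> linear transcription head producing per-timestep character logits
        that a CTC decoder can integrate into character data."""
        def __init__(self, num_classes, hidden=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2, 2),                    # halve height and width
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d((2, 1), (2, 1)),          # halve height, keep width
            )
            self.rnn = nn.LSTM(128 * 8, hidden, bidirectional=True,
                               batch_first=True)
            self.fc = nn.Linear(hidden * 2, num_classes)

        def forward(self, x):                          # x: (B, 1, 32, W) region crop
            f = self.conv(x)                           # (B, C, H', W')
            b, c, h, w = f.shape
            seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # serialize width->time
            out, _ = self.rnn(seq)                     # sequence character features
            return self.fc(out)                        # logits for the CTC layer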
Referring to fig. 5, fig. 5 is a schematic view of a scene of determining the character data of a key frame image based on the image recognition model. The computer device inputs a character region 51 into the image recognition model and performs feature extraction on the character region 51 based on the convolution layer in the image recognition model, extracting the features corresponding to the characters in the character region to obtain the convolution features 52 corresponding to the character region 51, where the convolution features 52 indicate the character information of the character region. The computer device serializes the convolution features 52 corresponding to the character region: the convolution features 52 comprise a plurality of groups of data features, and the computer device combines each group of data features into a sequence sub-feature and combines the sequence sub-features corresponding to the groups of data features to generate a feature sequence 53. Further, the computer device performs recognition processing on the feature sequence 53 based on the circulating layer in the image recognition model, converts the feature sequence into character form, and determines the sequence character features 54 corresponding to the feature sequence 53. For example, having obtained the feature sequence 53 corresponding to the character region 51 shown in fig. 5, the computer device may perform feature fusion on the feature sequence 53 based on the circulating layer and perform feature migration between different sequence sub-features, thereby generating the sequence character features 54, which are "-S-T-AATTE". The computer device performs feature conversion on the sequence character features 54 based on the transcription layer in the image recognition model, and integrates the empty characters and repeated characters in the sequence character features 54 to obtain the character data 55 of the key frame image; integrating the empty character "-" and the repeated characters "AA" and "TT" in fig. 5 yields the character data 55, which is "STATE".
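The integration of empty and repeated characters performed by the transcription layer can be reproduced in a few lines. This greedy CTC-style decoding sketch (not from the patent) replays the "-S-T-AATTE" to "STATE" example above:

    def ctc_greedy_decode(sequence, blank="-"):
        """Keep a character only if it differs from its predecessor
        (merging repeats) and is not the blank symbol."""
        out, prev = [], None
        for ch in sequence:
            if ch != prev and ch != blank:
                out.append(ch)
            prev = ch
        return "".join(out)

    assert ctc_greedy_decode("-S-T-AATTE") == "STATE"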
Optionally, if the number of key frame images in the video data is 1 and the key frame image has a single piece of character data, the computer device may perform word segmentation processing on the character data of the key frame image, determine one or more word-segmentation character data corresponding to the video data, and character-match the one or more word-segmentation character data with the character database respectively to obtain one or more matching character strings and the matching number corresponding to each matching character string; if there is a matching character string whose matching number is greater than the matching threshold, the character detection result is determined to be a result of the character data matching the character database. Alternatively, if the number of key frame images in the video data is 1 and the key frame image has several pieces of character data, the computer device may combine the pieces of character data of the key frame image to obtain combined character data, perform word segmentation processing on the combined character data, determine one or more word-segmentation character data corresponding to the video data, and character-match the one or more word-segmentation character data with the character database respectively to obtain one or more matching character strings and the matching number corresponding to each matching character string; if there is a matching character string whose matching number is greater than the matching threshold, the character detection result is determined to be a result of the character data matching the character database.
Optionally, if the number of key frame images in the video data is N (N being a positive integer), the computer device may character-match the character data of the key frame images with the character database to obtain the character detection result of the character regions in the key frame images. Specifically, the computer device combines the character data of the N key frame images in the video data to obtain combined character data; performs word segmentation processing on the combined character data and determines M word-segmentation character data corresponding to the video data; and character-matches the M word-segmentation character data corresponding to the video data with the character database respectively to obtain k matching character strings and the matching number corresponding to each of the k matching character strings. If there is a matching character string whose matching number is greater than the matching threshold, the character detection result is determined to be a result of the character data matching the character database. The matching number represents the number of character data, among the M word-segmentation character data, that match the matching character string; M and k are positive integers. The combined character data comprises one or more character data, and word-segmentation character data refers to the data obtained by word segmentation of the combined character data. The character database may include a plurality of character strings, where a character string may be an enterprise name, a product name corresponding to an enterprise, a website name corresponding to an enterprise, or the like, used to indicate that multimedia data carrying the character string may involve infringement or other problems. For example, the character strings may include "Tencent video", "XX website", and the like.
In a specific implementation, since the number of key frame images in the video data is N, the computer device may combine the character data of the N key frame images in the video data to obtain the combined character data. For example, with N equal to 3, if the character data of the 1st key frame image is "Tencent", the character data of the 2nd key frame image is "video", and the character data of the 3rd key frame image is "Tencent video", combining the character data of the 3 key frame images may give the combined character data "Tencent video Tencent video". Further, the computer device may use a word segmentation tool, such as the jieba word segmentation tool or another word segmentation tool, to perform word segmentation processing on the combined character data and determine the M word-segmentation character data corresponding to the video data. For example, after word segmentation of the combined character data "Tencent video Tencent video", 2 word-segmentation character data are obtained, namely "Tencent video" and "Tencent video". Combining the character data determined in the video data and segmenting the combined character data prevents an incomplete watermark in a key frame image from affecting the accuracy of the final detection result, improving the accuracy of data detection.
Further, the computer device matches each of the 2 word-segmentation character data against the character database to obtain k matching character strings and the matching number of each. For example, if the 2 word-segmentation character data are both "Tencent Video" and the character database contains the string "Tencent Video", then matching produces the matching character string "Tencent Video" with a matching number of 2. If instead there are 3 word-segmentation character data, namely "Tencent Video", "Tencent News", and "Tencent Video", and the character database contains the strings "Tencent Video" and "Tencent News", then matching produces the matching character strings "Tencent Video" and "Tencent News", with a matching number of 2 for "Tencent Video" and a matching number of 1 for "Tencent News".
Still further, if any matching character string has a matching number greater than the matching threshold, the computer device determines that the character detection result is a result of the character data matching the character database. The matching threshold may be a default value, for example 1, 2, 3, or another value; it may also be determined empirically or derived from historical matching results, which is not limited in the embodiment of the present application. For example, if the matching threshold is 1, the matching character string "Tencent Video" has a matching number greater than the threshold, so the computer device may determine it as the target character string matching the character data, i.e., the target character string is "Tencent Video". It will be appreciated that if no matching character string has a matching number greater than the matching threshold, the computer device determines that the character detection result is a result of the character data not matching the character database.
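To make this matching stage concrete, the following is a minimal Python sketch of the combine, segment, and match steps just described. It assumes jieba as the word segmentation tool, as in the example above; the database contents, category names, and function names are illustrative assumptions, not part of the patent text.

```python
# Hedged sketch of the character matching stage: combine the character data of
# the N key frame images, segment the combined data, count how many segments
# match each string in the character database, and keep the strings whose
# matching number exceeds the matching threshold.
from collections import Counter

import jieba  # word segmentation tool named in the embodiment (for Chinese text)

# Assumed character database: character string -> data category.
CHARACTER_DATABASE = {"Tencent Video": "tencent_video", "Tencent News": "tencent_news"}
MATCHING_THRESHOLD = 1  # the patent leaves the concrete value open


def match_character_data(per_frame_character_data):
    combined = "".join(per_frame_character_data)   # combined character data
    segments = list(jieba.cut(combined))           # M word-segmentation character data
    counts = Counter(s for s in segments if s in CHARACTER_DATABASE)
    # Target character strings: matching number greater than the threshold.
    return {s: CHARACTER_DATABASE[s] for s, n in counts.items()
            if n > MATCHING_THRESHOLD}             # empty dict: no match
```

The returned mapping directly supplies step S104 below with both the target character string and its data category.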
S104, if the character detection result is a result of the character data matching the character database, acquiring a target character string matched with the character data from the character database, and determining the data category corresponding to the target character string as the video category to which the video data belongs.
In the embodiment of the present application, the computer device obtains the character detection result; if it is a result of the character data matching the character database, the computer device may obtain from the character database the target character string matching the character data and determine the data category corresponding to that target character string as the video category to which the video data belongs. For example, if the target character string obtained from the character database is "Tencent Video", the data category corresponding to "Tencent Video" is determined as the video category of the video data, i.e., the video data is categorized as "Tencent Video". Since watermarks in video data are temporally stable and category-invariant, setting a matching threshold makes it possible to determine both whether the video data contains a watermark and which specific watermark category it belongs to. When watermarks in the video data match watermark categories in the character database and their number exceeds the matching threshold, the video data is determined to contain a watermark of that category, and hence the video category of the video data is determined; this reduces the probability of watermark misjudgment and improves the accuracy of data detection.
Optionally, upon receiving an upload request for video data sent by the user terminal, the computer device identifies the video category to which the video data belongs; after determining that category, the computer device may respond to the upload request and, if the video category belongs to a tagged video category, send a data upload exception prompt to the user terminal. The data upload exception prompt includes the video category to which the video data belongs; the tagged video categories may indicate the data category corresponding to each business, for example, the data category corresponding to Tencent may be "Tencent Video". The prompt may read, for instance, "Because the uploaded video contains a Tencent Video mark, uploading is prohibited to avoid risk." By sending the data upload exception prompt to the user terminal, the user who uploaded the video data can view the prompt through the terminal, quickly modify the video data, and avoid infringing the copyright of others.
Optionally, if the video category to which the video data belongs does not belong to a tagged video category, the computer device uploads the video data to the application. This case may mean that the video data contains a watermark but the watermark does not belong to any tagged video category, so the uploaded video data can be considered free of copyright-infringement problems; the computer device may then upload the video data to the application corresponding to it, for example a social, educational, sports, or other application. For instance, if the application is a social application, the user may send it an upload request for the video data; having determined that the video data does not belong to a tagged video category, the computer device may upload the video data to the social application, where other users can view it and interact with the uploader.
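As a rough illustration of this upload gating, the sketch below rejects an upload with a data upload exception prompt when the detected category is tagged, and otherwise uploads normally; the set of tagged categories, the handler name, and the prompt wording are all assumptions.

```python
# Illustrative upload gating: tagged categories are refused with a prompt.
from typing import Callable, Optional

TAGGED_VIDEO_CATEGORIES = {"tencent_video", "tencent_news"}  # assumed examples


def handle_upload_request(video_category: Optional[str],
                          upload_to_app: Callable[[], None]) -> str:
    if video_category in TAGGED_VIDEO_CATEGORIES:
        # Data upload exception prompt carrying the detected video category.
        return (f"Upload prohibited: the uploaded video contains a "
                f"'{video_category}' mark; please modify it to avoid risk.")
    upload_to_app()  # no tagged watermark detected: upload normally
    return "Upload accepted."
```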
In step S101, the computer device determines key frame images based on the similarity between adjacent video frame images among the at least two video frame images. Fig. 6 is a schematic flowchart of a method for determining a key frame image according to an embodiment of the present application; as shown in fig. 6, the method includes:
S201, performing image feature matching on the i-th video frame image and the (i+1)-th video frame image among the at least two video frame images to obtain the similarity between the i-th video frame image and the (i+1)-th video frame image.
In the embodiment of the present application, the computer device may extract image features from each of the at least two video frame images in the video data to obtain the image features of each video frame image; these features reflect the image information, image details, and so on of each frame. The at least two video frame images include an i-th video frame image, where i is a positive integer, and the computer device may compute the similarity between the i-th and (i+1)-th video frame images based on the image features of each video frame image. Optionally, the similarity may be obtained by computing the Euclidean distance between the image features of the i-th video frame image and those of the (i+1)-th video frame image; other applicable methods include, but are not limited to, the Pearson correlation coefficient method and the cosine similarity method.
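As an illustration, the similarity of step S201 can be computed as cosine similarity over per-frame feature vectors; the sketch below assumes the feature extractor already exists and the function name is an assumption.

```python
# Minimal sketch: cosine similarity between the image features of two
# adjacent video frame images. The feature extraction step is assumed.
import numpy as np


def frame_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    denom = float(np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    if denom == 0.0:
        return 0.0  # degenerate (all-zero) features: treat as dissimilar
    # Cosine similarity in [-1, 1]; higher means the frames look more alike.
    return float(np.dot(feat_a, feat_b) / denom)
```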
S202, it is determined whether the similarity between the i-th video frame image and the (i+1) -th video frame image is smaller than a video similarity threshold.
In the embodiment of the present application, if so, that is, if the similarity between the i-th video frame image and the (i+1)-th video frame image is smaller than the video similarity threshold, the computer device executes step S203 and determines the (i+1)-th video frame image as a key frame image of the video data; if not, that is, if the similarity is greater than or equal to the video similarity threshold, the computer device executes step S204. The video similarity threshold may be 0.7, 0.8, 0.9, or another value, which is not limited in the embodiment of the present application.
S203, the (i+1) th video frame image is determined as a key frame image of the video data.
S204, performing image feature matching on the (i+1)-th video frame image and the (i+2)-th video frame image to obtain the similarity between the (i+1)-th video frame image and the (i+2)-th video frame image, and continuing in this way until the (i+2)-th video frame image is the last of the at least two video frame images, thereby obtaining the key frame images of the video data.
In the embodiment of the present application, if the similarity between the i-th and (i+1)-th video frame images is greater than or equal to the video similarity threshold, the computer device performs image feature matching on the (i+1)-th and (i+2)-th video frame images to obtain their similarity, and repeats this until the (i+2)-th video frame image is the last of the at least two video frame images, at which point the key frame images of the video data have been obtained. In other words, the computer device computes the similarity between each video frame image and its predecessor; if that similarity is smaller than the video similarity threshold, the frame is determined to be a key frame image. If the similarity is greater than or equal to the threshold, the computer device moves on to the next adjacent pair, determining the later frame as a key frame image whenever the similarity falls below the threshold, and so obtains the key frame images of the video data.
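Putting S201 through S204 together, the selection loop might be sketched as below, reusing the frame_similarity helper above; 0.8 is one of the example threshold values mentioned, and all names are illustrative.

```python
# Sketch of the key frame selection loop: walk adjacent frame pairs and keep
# the later frame as a key frame whenever the similarity drops below the
# video similarity threshold (a likely content change).
def select_key_frames(frame_features: list, threshold: float = 0.8) -> list:
    key_indices = []
    for i in range(len(frame_features) - 1):
        sim = frame_similarity(frame_features[i], frame_features[i + 1])
        if sim < threshold:
            key_indices.append(i + 1)  # the (i+1)-th frame is a key frame
    return key_indices
```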
The higher the similarity of two video frame images, the more alike their image information and image details. When the similarity between two video frame images is greater than the video similarity threshold, they can be considered to belong to the same Group of Pictures (GOP), a GOP being a run of consecutive video frame images. By applying the similarity computation above to adjacent video frame images throughout the video data, the one or more GOPs contained in the video data can be determined, and the 1st, (j/2)-th, and j-th video frame images of each GOP can be determined as key frame images, where j is the number of video frame images in the GOP (a positive integer). These key frame images carry the complete video information of the GOP, and their picture quality is higher than that of the other video frame images in the GOP.
Optionally, as shown in fig. 7, which is a schematic diagram of a scene of extracting a key frame sequence according to an embodiment of the present application, the computer device decodes the video data with a video processing tool to obtain the video frame data stream it contains, and then extracts key frame images from the GOPs in that stream, for example the 1st, (j/2)-th, and j-th video frame image of each GOP, to obtain a key frame image sequence comprising a plurality of key frame images. Since the watermark position in video data is generally fixed, for example in the upper-left, lower-left, upper-right, or lower-right corner, and key frame images have high picture quality and complete picture information, performing subsequent detection on the extracted key frame images reduces data detection redundancy, improves data detection efficiency, and improves the accuracy of the data detection result.
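A hedged sketch of this GOP-based extraction follows, assuming the GOP boundaries are already known from the decoder of a video processing tool; function and variable names are assumptions.

```python
# Hedged sketch of the GOP-based variant of fig. 7: take the 1st, (j/2)-th,
# and j-th frame of each GOP as key frames.
def key_frames_from_gops(gops: list) -> list:
    key_frames = []
    for gop in gops:  # each gop: a list of j consecutive decoded frames
        j = len(gop)
        if j == 0:
            continue
        picks = sorted({0, j // 2, j - 1})  # 0-based indices of the 3 picks
        key_frames.extend(gop[p] for p in picks)
    return key_frames
```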
In the embodiment of the present application, key frame images are acquired from the at least two video frame images constituting the video data. Because the key frame images are representative of the image data contained in the video data, acquiring key frame images from the video data for recognition improves the efficiency of data detection. The character area in a key frame image is determined by recognizing the image features of the key frame image, so that when character data is recognized, only the character area rather than the whole key frame image needs to be processed, which improves data recognition efficiency. Further, since the video frame image is first detected to determine the character area and the character area is then recognized to determine the character data, the key frame image effectively undergoes two rounds of recognition, which improves the accuracy of data detection.
Optionally, in order to improve the accuracy with which the character detection model recognizes key image features and the image recognition model extracts features from the character area, and thereby the accuracy of determining the video category to which the video data belongs, the computer device may train and tune the models on a large amount of sample video data before using them. The trained models can then recognize key image features and extract features from the character area more accurately. Referring to fig. 8, fig. 8 is a flowchart of another video data processing method according to an embodiment of the present application. The method may be applied to a computer device; as shown in fig. 8, the method includes:
S301, acquiring sample key frame images from at least two sample video frame images forming sample video data, and acquiring sample area labels in the sample key frame images.
In the embodiment of the present application, the computer device may obtain sample video data from a local database or, alternatively, from another storage medium. It splits the obtained sample video data into at least two sample video frame images and performs frame extraction on them to obtain sample key frame images; the method of obtaining sample key frame images from sample video data may refer to the method of obtaining key frame images from video data in step S101 and is not repeated here. Sample video data is video data prepared for training the initial character detection model. If the sample video data consists of a single sample video frame image, that image is determined to be the sample key frame image; if it consists of at least two sample video frame images, the computer device splits it into those images and performs frame extraction to obtain the sample key frame images. A sample region label is a preset label; the purpose of training the character detection model is to make the sample character region it identifies in a sample key frame image agree as closely as possible with the preset sample region label, so that the resulting character detection model is more accurate.
S302, identifying sample key image features of the sample key frame image based on the initial character detection model, performing character region feature matching on the sample key image features, and determining sample character regions in the sample key frame image.
In the embodiment of the present application, the computer device recognizes the sample key image features of the sample key frame image with the initial character detection model, performs character region feature matching on them, determines the probability that each sample key image feature indicates a sample character, and from those probabilities determines the regions of the sample key frame image where sample characters are likely to appear, thereby determining the sample character region. The method of determining the sample character region may refer to the method of determining the character region in the key frame image in step S102 and is not repeated here.
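As an illustration of turning per-pixel character probabilities into character regions, the sketch below thresholds a probability map and takes connected components as candidate regions; this is a deliberate simplification of the probability image and character frame image fusion described in the claims, and all names and thresholds are assumptions.

```python
# Simplified sketch: binarize a character probability map and extract
# connected components as candidate character regions (OpenCV-based).
import cv2
import numpy as np


def character_regions(prob_map: np.ndarray, prob_threshold: float = 0.5):
    binary = (prob_map > prob_threshold).astype(np.uint8)  # likely-character pixels
    num, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for k in range(1, num):  # label 0 is the background
        x, y, w, h, area = stats[k]
        if area > 10:  # drop tiny noise blobs (heuristic)
            boxes.append((x, y, w, h))
    return boxes
```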
S303, generating a first loss function based on the sample character area and the sample area label, and training an initial character detection model based on the first loss function to generate a character detection model.
In the embodiment of the present application, the initial character detection model is used to determine the sample character region in the sample key frame image, and the first loss function can be determined from the degree of coincidence between the sample character region and the preset sample region label. While the loss value of the first loss function is greater than the first loss threshold, training continues and the parameters of the initial character detection model are adjusted; once the loss value is smaller than or equal to the first loss threshold, the trained model is saved as the character detection model. Training the character detection model on a large amount of sample video data improves its accuracy, so that the sample character regions it determines reflect the information of the key frame images more faithfully.
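The patent specifies only that the first loss function is derived from the degree of coincidence between the predicted sample character region and the sample region label. One common way to realize such an overlap loss is a Dice-style loss; the PyTorch sketch below is therefore an assumption, not the patent's prescribed loss, and the model, loader, and hyperparameters are placeholders.

```python
# Hedged training sketch for the detection model: a Dice-style overlap loss
# between predicted region masks and label masks, trained until the loss
# value falls to the first loss threshold or below.
import torch


def region_overlap_loss(pred_mask: torch.Tensor, label_mask: torch.Tensor) -> torch.Tensor:
    # Dice loss: 1 - 2|P∩L| / (|P| + |L|); smaller when the regions coincide more.
    inter = (pred_mask * label_mask).sum()
    return 1.0 - 2.0 * inter / (pred_mask.sum() + label_mask.sum() + 1e-6)


def train_detector(model, loader, first_loss_threshold=0.05, lr=1e-3, max_epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for image, region_label in loader:
            loss = region_overlap_loss(model(image), region_label)
            opt.zero_grad()
            loss.backward()
            opt.step()
            epoch_loss += loss.item()
        # Stop once the loss value is no greater than the first loss threshold.
        if epoch_loss / len(loader) <= first_loss_threshold:
            break
    return model
```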
S304, acquiring sample character labels in the sample key frame images.
In the embodiment of the present application, a sample character label is a preset label; the purpose of training the image recognition model is to make the sample character data recognized by the model agree as closely as possible with the preset sample character label, so that the resulting image recognition model is more accurate.
S305, carrying out feature extraction on the sample character area based on the initial image recognition model, and recognizing sample character data of the sample key frame image from the sample character area according to the extracted sample features.
In the embodiment of the present application, the computer device performs feature extraction on the sample character region with the initial image recognition model; the method of recognizing the sample character data of the sample key frame image from the sample character region according to the extracted sample features may refer to the method of recognizing character data with the image recognition model in step S103 and is not repeated here.
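For orientation, the convolution / serialization / recurrent / transcription structure the image recognition model is described with can be sketched as a compact CRNN-style module; the layer sizes and dimensions below are illustrative assumptions.

```python
# Minimal CRNN-style recognizer sketch: convolution layer -> feature sequence
# (serialized along image width) -> recurrent layer -> transcription layer.
import torch
import torch.nn as nn


class TinyCRNN(nn.Module):
    def __init__(self, num_classes: int, height: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # After two 2x poolings the feature height is height // 4.
        self.rnn = nn.LSTM(128 * (height // 4), 128,
                           bidirectional=True, batch_first=True)
        self.transcribe = nn.Linear(256, num_classes)

    def forward(self, x):  # x: (batch, 1, height, width) character-region crop
        f = self.conv(x)                                   # (batch, C, H', W')
        b, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # serialize along width
        out, _ = self.rnn(seq)                             # sequence character features
        return self.transcribe(out)                        # (batch, W', num_classes)
```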
S306, generating a second loss function based on the sample character data and the sample character labels, and training an initial image recognition model based on the second loss function to generate an image recognition model.
In the embodiment of the present application, the sample character data of the sample key frame image is determined with the initial image recognition model, and a second loss function can be determined from the degree of coincidence between the sample character data and the preset sample character label. While the loss value of the second loss function is greater than the second loss threshold, training continues and the parameters of the initial image recognition model are adjusted; once the loss value is smaller than or equal to the second loss threshold, the trained model is saved as the image recognition model. Training the image recognition model on a large amount of sample video data improves its accuracy, so that the sample character data it determines reflects the character information in the video data more faithfully.
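The conv-serialize-recurrent-transcription pipeline above matches a CRNN, which is typically trained with CTC loss; since the patent does not name the second loss function, the following PyTorch sketch using CTC is an assumption.

```python
# Hedged sketch of a second loss function realized as CTC loss over the
# transcription-layer outputs (e.g., the TinyCRNN sketched above).
import torch
import torch.nn.functional as F
from torch import nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)


def second_loss(logits: torch.Tensor, targets: torch.Tensor,
                input_lengths: torch.Tensor, target_lengths: torch.Tensor):
    # logits: (batch, T, num_classes); targets: concatenated character labels.
    log_probs = F.log_softmax(logits, dim=2).permute(1, 0, 2)  # (T, batch, C)
    return ctc(log_probs, targets, input_lengths, target_lengths)
```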
In the embodiment of the present application, the computer device trains and tunes the models on a large amount of sample video data, so that the trained character detection model can recognize key image features more accurately and the image recognition model can extract features from the character area more accurately, thereby improving the accuracy of determining the video category to which the video data belongs.
The method of the embodiment of the application is described above, and the device of the embodiment of the application is described below.
Referring to fig. 9, fig. 9 is a schematic diagram of the composition structure of a video data processing apparatus according to an embodiment of the present application. The video data processing apparatus may be a computer program (including program code), for example application software, running on a computer device, and can be used to execute the corresponding steps of the method provided by the embodiments of the present application. The apparatus 90 includes:
an image acquisition module 91 for acquiring key frame images from at least two video frame images constituting video data;
the character recognition module 92 is configured to recognize key image features of the key frame image based on the character detection model, perform character region feature matching on the key image features, and determine a character region in the key frame image;
the character matching module 93 is configured to perform feature extraction on the character region based on the image recognition model, recognize character data of the key frame image from the character region according to the extracted feature, perform character matching on the character data of the key frame image and the character database, and obtain a character detection result of the character region in the key frame image;
The category determining module 94 is configured to, if the character detection result is a result of matching the character data with the character database, obtain a target character string matching the character data from the character database, and determine a data category corresponding to the target character string as a video category to which the video data belongs.
Optionally, the image acquisition module 91 includes:
an image matching unit 911, configured to perform image feature matching on an ith video frame image and an (i+1) th video frame image in the at least two video frame images, so as to obtain a similarity between the ith video frame image and the (i+1) th video frame image; i is a positive integer;
a first image determining unit 912, configured to determine the (i+1) th video frame image as a key frame image of the video data if the similarity between the i-th video frame image and the (i+1) th video frame image is smaller than the video similarity threshold, and perform image feature matching on the (i+1) th video frame image and the (i+2) th video frame image to obtain the similarity between the (i+1) th video frame image and the (i+2) th video frame image;
a second image determining unit 913, configured to perform image feature matching on the (i+1)-th video frame image and the (i+2)-th video frame image if the similarity between the i-th video frame image and the (i+1)-th video frame image is greater than or equal to the video similarity threshold, so as to obtain the similarity between the (i+1)-th video frame image and the (i+2)-th video frame image; and obtaining a key frame image of the video data until the (i+2)-th video frame image is the last video frame image of the at least two video frame images.
The character recognition module 92 includes:
a feature extraction unit 921, configured to perform feature extraction on the key frame image based on the convolution layer in the character detection model, to obtain key image features of the key frame image;
a feature stitching unit 922, configured to perform feature stitching on the key image features to obtain a stitched feature image corresponding to the key frame image; the pixel values of the pixel points in the spliced characteristic image are used for representing the probability that the pixel points in the corresponding key frame image are characters;
an image determining unit 923, configured to obtain a probability range to which each pixel value in the stitched feature image belongs, and generate a probability image and a character frame image according to the probability range to which each pixel value belongs;
the region determining unit 924 is configured to perform feature fusion on the probability image and the character frame image, generate a fused character image, and determine a character region in the key frame image based on the fused character image.
The character matching module 93 includes:
a sequence obtaining unit 931, configured to perform feature extraction on the character region based on the convolution layer in the image recognition model, obtain a convolution feature corresponding to the character region, and perform serialization processing on the convolution feature corresponding to the character region, so as to obtain a feature sequence corresponding to the character region;
A loop processing unit 932, configured to perform recognition processing on the feature sequence based on a loop layer in the image recognition model, and determine a sequence character feature corresponding to the feature sequence;
and a feature conversion unit 933, configured to perform feature conversion on the sequence character features based on the transcription layer in the image recognition model, so as to obtain character data of the key frame image.
Optionally, the number of key frame images in the video data is N; n is a positive integer; the character matching module 93 includes:
a character combining unit 934, configured to combine character data of N key frame images in the video data to obtain combined character data;
the word segmentation determining unit 935 is configured to perform word segmentation processing on the combined character data, and determine M word segmentation character data corresponding to the video data; m is a positive integer;
a character matching unit 936, configured to perform character matching on M word segmentation character data corresponding to the video data and the character database, to obtain k matching character strings and matching numbers corresponding to the k matching character strings respectively; the matching number is used for representing the number of character data matched with the matching character string in the M word segmentation character data; k is a positive integer;
A result determining unit 937, configured to determine that the character detection result is a result of matching the character data with the character database if there are matching character strings whose matching number is greater than a matching threshold;
a character determining unit 938 for determining the matching character string whose matching number is greater than the matching threshold value as a target character string matching the character data.
Optionally, the apparatus 90 further includes:
a data response module 95, configured to respond to an upload request of the video data from the user terminal;
the data prompt module 96 is configured to send a data upload exception prompt to the user terminal if the video category to which the video data belongs to a tagged video category; the data uploading abnormal prompt comprises a video category to which the video data belongs;
the data uploading module 97 is configured to upload the video data to an application program if the video category to which the video data belongs does not belong to the tagged video category.
It should be noted that, in the embodiment corresponding to fig. 9, the content not mentioned may be referred to the description of the method embodiment, and will not be repeated here.
In the embodiment of the present application, key frame images are acquired from the at least two video frame images constituting the video data. Because the key frame images are representative of the image data contained in the video data, acquiring key frame images from the video data for recognition improves the efficiency of data detection. The character area in a key frame image is determined by recognizing the image features of the key frame image, so that when character data is recognized, only the character area rather than the whole key frame image needs to be processed, which improves data recognition efficiency. Further, since the video frame image is first detected to determine the character area and the character area is then recognized to determine the character data, the key frame image effectively undergoes two rounds of recognition, which improves the accuracy of data detection.
Referring to fig. 10, fig. 10 is a schematic diagram of the composition structure of another video data processing apparatus according to an embodiment of the present application. The video data processing apparatus may be a computer program (including program code), for example application software, running on a computer device, and can be used to execute the corresponding steps of the method provided by the embodiments of the present application. The apparatus 100 includes:
the region tag obtaining module 1001 is configured to obtain a sample key frame image from at least two sample video frame images that form sample video data, and obtain a sample region tag in the sample key frame image;
a sample region determining module 1002, configured to identify sample key image features of the sample key frame image based on an initial character detection model, perform character region feature matching on the sample key image features, and determine a sample character region in the sample key frame image;
the detection model generating module 1003 is configured to generate a first loss function based on the sample character region and the sample region label, train the initial character detection model based on the first loss function, and generate a character detection model.
Optionally, the apparatus 100 further includes:
a character tag obtaining module 1004, configured to obtain a sample character tag in the sample key frame image;
a sample character acquisition module 1005, configured to perform feature extraction on the sample character region based on an initial image recognition model, and recognize sample character data of the sample key frame image from the sample character region according to the extracted sample feature;
the recognition model generating module 1006 is configured to generate a second loss function based on the sample character data and the sample character label, train the initial image recognition model based on the second loss function, and generate an image recognition model.
It should be noted that, in the embodiment corresponding to fig. 10, the content not mentioned may refer to the description of the method embodiment, and will not be repeated here.
In the embodiment of the present application, the computer device trains and tunes the models on a large amount of sample video data, so that the trained character detection model can recognize key image features more accurately and the image recognition model can extract features from the character area more accurately, thereby improving the accuracy of determining the video category to which the video data belongs.
Referring to fig. 11, fig. 11 is a schematic diagram of the composition structure of a computer device according to an embodiment of the present application. As shown in fig. 11, the computer device 110 may include a processor 1101, a network interface 1104, and a memory 1105, and may further include a user interface 1103 and at least one communication bus 1102. The communication bus 1102 is used to enable connection and communication among these components. The user interface 1103 may include a display and a keyboard, and optionally also a standard wired interface and a wireless interface. The network interface 1104 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1105 may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory, and may optionally be at least one storage device located remotely from the processor 1101. As shown in fig. 11, the memory 1105, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the computer device 110 shown in fig. 11, the network interface 1104 may provide network communication functions, the user interface 1103 mainly provides an input interface for the user, and the processor 1101 may be configured to invoke the device control application stored in the memory 1105 to implement:
acquiring key frame images from at least two video frame images constituting video data;
identifying key image features of the key frame image based on a character detection model, performing character region feature matching on the key image features, and determining character regions in the key frame image;
extracting features of the character region based on an image recognition model, recognizing character data of the key frame image from the character region according to the extracted features, and performing character matching on the character data of the key frame image and a character database to obtain a character detection result of the character region in the key frame image;
if the character detection result is a result of matching the character data with the character database, a target character string matched with the character data is obtained from the character database, and a data type corresponding to the target character string is determined as a video type to which the video data belongs.
It should be understood that the computer device 110 described in the embodiments of the present application may perform the description of the above-mentioned video data processing method in the embodiments corresponding to fig. 3, 6 and 8, and may also perform the description of the above-mentioned video data processing apparatus in the embodiments corresponding to fig. 9 and 10, which are not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
In the embodiment of the application, the key frame images are acquired from at least two video frame images forming the video data, and the key frame images can represent the image data contained in the video data, so that the efficiency of data detection can be improved by acquiring the key frame images from the video data for identification processing. The character area in the key frame image is determined by identifying the image features in the key frame image, so that when character data in the character area is identified, only the character area in the key frame image is required to be identified, the whole key frame image is not required to be identified, and the data identification efficiency can be improved. Further, as the first detection and recognition are carried out on the video frame image, the character area in the key frame image is determined, the character area is recognized, and the character data in the character area is determined, which is equivalent to the two times of recognition on the key frame image, so that the accuracy of data detection can be improved.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program. The computer program comprises program instructions which, when executed by a processor such as the processor 1101 of the computer device described above, cause the method of the previous embodiments to be performed. As an example, the program instructions may be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; such distributed computer devices may constitute a blockchain network.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (9)

1. A method of video data processing, comprising:
acquiring key frame images in video data based on the similarity between any two adjacent video frame images in at least two video frame images forming the video data;
performing feature extraction on the key frame image based on a convolution layer in the character detection model to obtain key image features of the key frame image;
performing feature stitching on the key image features to obtain stitched feature images corresponding to the key frame images; the pixel values of the pixel points in the spliced characteristic images are used for representing the probability that the pixel points in the corresponding key frame images are characters;
acquiring a probability range of each pixel value in the spliced characteristic image, and generating a probability image and a character frame image according to the probability range of each pixel value;
performing feature fusion on the probability image and the character frame image to generate a fused character image, and determining a character area in the key frame image based on the fused character image;
Extracting features of the character areas based on an image recognition model, recognizing character data of the key frame images from the character areas according to the extracted features, and performing character matching on the character data of the key frame images and a character database to obtain character detection results of the character areas in the key frame images;
and if the character detection result is a result of matching the character data with the character database, acquiring a target character string matched with the character data from the character database, and determining a data category corresponding to the target character string as a video category to which the video data belongs.
2. The method of claim 1, wherein the acquiring key frame images in the video data based on a similarity between any two adjacent video frame images of at least two video frame images that make up the video data comprises:
performing image feature matching on an ith video frame image and an (i+1) th video frame image in the at least two video frame images to obtain similarity between the ith video frame image and the (i+1) th video frame image; i is a positive integer;
If the similarity between the ith video frame image and the (i+1) th video frame image is smaller than a video similarity threshold, determining the (i+1) th video frame image as a key frame image of the video data, and performing image feature matching on the (i+1) th video frame image and the (i+2) th video frame image to obtain the similarity between the (i+1) th video frame image and the (i+2) th video frame image;
if the similarity between the ith video frame image and the (i+1) th video frame image is greater than or equal to the video similarity threshold, performing image feature matching on the (i+1) th video frame image and the (i+2) th video frame image to obtain the similarity between the (i+1) th video frame image and the (i+2) th video frame image;
and obtaining a key frame image of the video data until the (i+2) th video frame image is the last video frame image of the at least two video frame images.
3. The method according to claim 1, wherein the feature extraction of the character region based on the image recognition model, recognizing character data of the key frame image from the character region based on the extracted features, comprises:
Performing feature extraction on the character region based on a convolution layer in the image recognition model to obtain convolution features corresponding to the character region, and performing serialization processing on the convolution features corresponding to the character region to obtain a feature sequence corresponding to the character region;
identifying the characteristic sequence based on a circulating layer in the image identification model, and determining sequence character characteristics corresponding to the characteristic sequence;
and performing feature conversion on the sequence character features based on a transcription layer in the image recognition model to obtain character data of the key frame image.
4. A method according to any one of claims 1-3, wherein the number of key frame images in the video data is N; n is a positive integer;
the step of performing character matching on the character data of the key frame image and a character database to obtain a character detection result of a character area in the key frame image comprises the following steps:
combining the character data of N key frame images in the video data to obtain combined character data;
performing word segmentation processing on the combined character data to determine M word segmentation character data corresponding to the video data; m is a positive integer;
Performing character matching on M word segmentation character data corresponding to the video data and the character database respectively to obtain k matching character strings and matching numbers corresponding to the k matching character strings respectively; the matching quantity is used for representing the quantity of character data matched with the matching character strings in the M word segmentation character data; k is a positive integer;
if the matching number of the matching character strings is larger than a matching threshold value, determining that the character detection result is a result of matching the character data with the character database;
the obtaining the target character string matched with the character data from the character database comprises the following steps:
and determining the matching character strings with the matching quantity larger than a matching threshold value as target character strings matched with the character data.
5. The method according to claim 1, wherein the method further comprises:
responding to an uploading request of a user terminal for the video data;
if the video category to which the video data belongs to the marked video category, sending a data uploading abnormal prompt to the user terminal; the data uploading abnormal prompt comprises a video category to which the video data belongs;
And if the video category to which the video data belongs does not belong to the marked video category, uploading the video data to an application program.
6. A method of video data processing, comprising:
acquiring sample key frame images in sample video data based on the similarity between any two adjacent sample video frame images in at least two sample video frame images forming the sample video data, and acquiring sample region labels in the sample key frame images;
performing feature extraction on the sample key frame image based on a convolution layer in an initial character detection model to obtain sample key image features of the sample key frame image;
performing feature stitching on the sample key image features to obtain sample stitching feature images corresponding to the sample key frame images; the pixel values of the pixel points in the sample spliced characteristic images are used for representing the probability that the pixel points in the corresponding sample key frame images are characters;
acquiring a probability range of each pixel value in the sample mosaic feature image, and generating a sample probability image and a sample character frame image according to the probability range of each pixel value;
Performing feature fusion on the sample probability image and the sample character frame image to generate a sample fusion character image, and determining a sample character area in the sample key frame image based on the sample fusion character image;
generating a first loss function based on the sample character region and the sample region label, training the initial character detection model based on the first loss function, and generating a character detection model.
7. The method of claim 6, wherein the method further comprises:
acquiring a sample character label in the sample key frame image;
extracting features of the sample character area based on an initial image recognition model, and recognizing sample character data of the sample key frame image from the sample character area according to the extracted sample features;
generating a second loss function based on the sample character data and the sample character label, and training the initial image recognition model based on the second loss function to generate an image recognition model.
8. A computer device, comprising: a processor, a memory, and a network interface;
The processor is connected to the memory, the network interface for providing data communication functions, the memory for storing program code, the processor for invoking the program code to cause the computer device to perform the method of any of claims 1-5 or to perform the method of any of claims 6-7.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-5 or to perform the method of any of claims 6-7.
CN202110159590.2A 2021-02-04 2021-02-04 Video data processing method, computer equipment and readable storage medium Active CN113011254B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110159590.2A | 2021-02-04 | 2021-02-04 | Video data processing method, computer equipment and readable storage medium

Publications (2)

Publication Number | Publication Date
CN113011254A (en) | 2021-06-22
CN113011254B | 2023-11-07

Family

ID=76383920

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110159590.2A (Active) | Video data processing method, computer equipment and readable storage medium | 2021-02-04 | 2021-02-04

Country Status (1)

Country Link
CN (1) CN113011254B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114037909B (en) * 2021-11-16 2023-04-28 中国电子科技集团公司第二十八研究所 Automatic video labeling method and system for ship name identification characters
CN117132936A (en) * 2023-08-31 2023-11-28 北京中电拓方科技股份有限公司 Data carding and data access system of coal plate self-building system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781881A (en) * 2019-09-10 2020-02-11 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for identifying match scores in video
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN111160335A (en) * 2020-01-02 2020-05-15 腾讯科技(深圳)有限公司 Image watermarking processing method and device based on artificial intelligence and electronic equipment
CN111209909A (en) * 2020-01-13 2020-05-29 百度在线网络技术(北京)有限公司 Qualification identification template construction method, device, equipment and storage medium
CN111582241A (en) * 2020-06-01 2020-08-25 腾讯科技(深圳)有限公司 Video subtitle recognition method, device, equipment and storage medium
CN111680688A (en) * 2020-06-10 2020-09-18 创新奇智(成都)科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111695439A (en) * 2020-05-20 2020-09-22 平安科技(深圳)有限公司 Image structured data extraction method, electronic device and storage medium
CN111950424A (en) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 Video data processing method and device, computer and readable storage medium
CN111967302A (en) * 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Video tag generation method and device and electronic equipment


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
REG | Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40046828)
GR01 | Patent grant