CN111405360A - Video processing method and device, electronic equipment and storage medium


Info

Publication number
CN111405360A
Authority
CN
China
Prior art keywords
text
group
type
texts
preset
Prior art date
Legal status
Granted
Application number
CN202010217386.7A
Other languages
Chinese (zh)
Other versions
CN111405360B (en)
Inventor
张春焰
姚圣源
安涵
岑杰鹏
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010217386.7A priority Critical patent/CN111405360B/en
Publication of CN111405360A publication Critical patent/CN111405360A/en
Application granted granted Critical
Publication of CN111405360B publication Critical patent/CN111405360B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/464: Salient features using a plurality of salient features, e.g. bag-of-words [BoW] representations

Abstract

The embodiment of the invention discloses a video processing method and device, an electronic device, and a storage medium. The method can acquire a target video; perform text recognition on the target video to obtain the texts appearing in the target video and their text positions; group the texts to obtain a text set; classify the types of the texts in the text set based on the text positions and the text set, determining the text type of each text; generate video detail information according to the texts and the text types; and display the video detail information. The embodiment of the invention can efficiently and automatically identify all the texts in the target video and their positions, and can generate corresponding video detail information by combining the texts with their positions. The scheme can therefore improve the effectiveness of video processing.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of image processing, in particular to a video processing method, a video processing device, electronic equipment and a storage medium.
Background
One commonly used character recognition technique is OCR (Optical Character Recognition), which recognizes the characters in an image from their black-and-white dot matrix and converts them into a text format for further editing and processing.
Compared with text in a static picture, text in a dynamic video may change in size, angle, position, content, and so on, so it is difficult for current character recognition techniques to extract text from a video, which in turn makes further editing and processing of that text difficult.
In particular, for the text information appearing in a target video (for example, a live game video or a recorded game video), such as the information on the scoreboard of a basketball game or on the scoring panel of an e-sports match, current video playing platforms need to extract and recognize the text information and then summarize, count, and display it according to the rules of the competition.
However, it is difficult for conventional methods to extract the text information in a target video accurately, automatically, and efficiently, and then to perform a series of processing steps on it, so conventional video processing methods perform poorly.
Disclosure of Invention
The embodiment of the invention provides a video processing method and device, an electronic device, and a storage medium, and aims to improve the effectiveness of video processing.
The embodiment of the invention provides a video processing method, which comprises the following steps:
acquiring a target video;
performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text;
grouping the texts to obtain a text set;
based on the text position and the text set, performing type classification on the texts in the text set, and determining the text type of the texts;
generating video detail information according to the text and the text type;
and displaying the video detail information.
An embodiment of the present invention provides a video processing apparatus, including:
the acquisition module is used for acquiring a target video;
the recognition module is used for performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text;
the collection module is used for grouping the texts to obtain a text collection;
the classification module is used for classifying the types of the texts based on the text positions and the text set and determining the text types of the texts;
the generating module is used for generating video detail information according to the text and the text type;
and the display module is used for displaying the video detail information.
In some embodiments, the aggregation module comprises:
the matching sub-module is used for performing type matching on the text in a preset text library, and if the preset text matched with the text exists in the preset text library, classifying the text into a first type group;
the first grouping submodule is used for grouping the texts into a second type group if the preset texts matched with the texts do not exist in the preset text library and the texts consist of numbers;
and the second grouping sub-module is used for grouping the text into a time type group if no preset text matching the text exists in the preset text library and the text comprises a preset time symbol.
In some embodiments, the target video is a video recorded with antagonistic game content, the first type group includes a team name group and a game stage group, and the matching sub-module includes:
the matching subunit is used for performing type matching on the text in a preset text library;
the team name subunit is used for classifying the text into a team name group if the team name matched with the text exists in the preset text library;
the stage subunit is used for dividing the text into a match stage group if a match stage matched with the text exists in the preset text library.
in some embodiments, the predetermined text includes a team name and a game stage, the second category group includes a game score group, and the second grouping sub-module is for:
and if the preset text base does not have the team names and the competition stages matched with the text, and the text consists of numbers, dividing the text into competition groups.
In some embodiments, the preset text comprises a team name and a game stage, the preset time symbols comprise a first preset time symbol and a second preset time symbol, the time category groups comprise a first time category group and a second time category group, and the second grouping sub-module is configured to:
if the team names and the competition stages matched with the texts do not exist in the preset text library and the texts comprise first preset time symbols, dividing the texts into a first time type group;
and if the team names and the competition stages matched with the texts do not exist in the preset text library and the texts comprise second preset time symbols, dividing the texts into a second time type group.
In some embodiments, the target video is a video recorded with antagonistic game content, the text set includes a team name group, a game stage group, a game score group, a first time type group, and a second time type group, the text type includes a game stage type, a first time type, a second time type, a main team name type, and a game score type, and the classification module includes:
the stage type submodule is used for determining the text type of the text in the competition stage group as a competition stage type;
the first time type submodule is used for determining the text type of the text in the first time type group as a first time type;
the second time type submodule is used for determining the text type of the text in the second time type group as a second time type;
the main-guest judgment sub-module is used for judging, according to the text position of each text in the team name group, whether the text is the main (home) team or the guest team, and determining the main or guest team name type of the texts in the team name group;
and the score type judgment sub-module is used for judging the score type of the texts in the match score group according to the text position of the texts in the match score group and determining the match score type of the texts in the match score group.
In some embodiments, the score type determination sub-module includes:
the quantity subunit is used for determining the quantity of the texts in the first time type group, the second time type group and the match score group;
the score type judging subunit is used for, when the number of texts in the game score group is 3 and the sum of the numbers of texts in the first time type group and the second time type group is 1, determining the text type of that single text as the second time type, and performing score type judgment on the texts in the game score group according to their text positions to determine the game score type of the texts in the game score group;
the first relative position relation subunit is used for determining the relative position relation among the texts in the first time type group according to the text positions when the number of the texts in the first time type group is 2, determining the text type of the texts belonging to the first relative position relation as a first time type, and determining the text type of the texts belonging to the second relative position relation as a second time type;
and the second relative position relation subunit is configured to, when the number of the texts in the second time type group is 2, determine the relative position relation between the texts in the second time type group according to the text position, determine the text type of the text belonging to the first relative position relation as the first time type, and determine the text type of the text belonging to the second relative position relation as the second time type.
In some embodiments, the score type determination subunit is to:
when the game score group has a single text belonging to a preset text range, determining the text type of the text belonging to the preset text range as the second time type;
when the game score group has a plurality of texts belonging to the preset text range, counting, according to the text positions of the texts in the second time type group, the number of texts in the game score group that have a preset position relationship with the texts in the second time type group;
if the number of texts in the game score group having the preset position relationship with the texts in the second time type group is 1, determining the text type of that text as the second time type, and determining the text types of the texts in the game score group not having the preset position relationship with the texts in the second time type group as the game score type;
if the number of texts in the game score group having the preset position relationship with the texts in the second time type group is 2, calculating the relative distances between the texts in the game score group and a preset coordinate axis;
and if the relative distances are greater than a preset distance threshold, determining the text type of the text in the game score group with the minimum relative distance as the second time type, and determining the text types of the texts in the game score group whose relative distance is not the minimum as the game score type.
In some embodiments, the identification module comprises:
the characteristic extraction submodule is used for extracting the image characteristics of the target video to obtain the image characteristics of the target video;
the region detection submodule is used for carrying out text region detection based on the image features to obtain text region features in the target video;
the text recognition sub-module is used for performing text recognition based on the image characteristics to obtain text characteristics in the target video;
the region trimming sub-module is used for performing region trimming processing on the text region characteristics to obtain the processed text region characteristics;
the region prediction sub-module is used for performing text region prediction based on the processed text region characteristics and determining a text region appearing in the target video;
the text prediction sub-module is used for performing text prediction based on the processed text characteristics and determining a text appearing in the target video;
and the determining submodule is used for determining the text position of the text according to the text region and the text.
In some embodiments, the region detection sub-module is configured to:
performing multi-size feature extraction by using a feature extraction layer according to the image features to obtain a plurality of image features with different sizes;
performing feature fusion processing on the plurality of image features with different sizes by adopting a multi-level fusion layer to obtain shared fusion features;
and determining text region characteristics in the target video according to the shared fusion characteristics by adopting a multi-channel output layer.
In some embodiments, the text prediction sub-module is to:
performing multi-size feature extraction by using a feature extraction layer according to the image features to obtain a plurality of image features with different sizes;
performing feature merging processing on the plurality of image features with different sizes by adopting a multi-level fusion layer to obtain shared fusion features;
and determining texts appearing in the target video according to the shared fusion characteristics by adopting a multi-channel output layer.
In some embodiments, the text recognition sub-module is used for:
extracting high-dimensional features based on the image features to obtain high-dimensional image features;
extracting text time sequence characteristics according to the high-dimensional image characteristics;
and determining the text characteristics appearing in the target video according to the text time sequence characteristics.
In some embodiments, the video detail information includes a game detail information table and a game trend graph, and the generating module is configured to:
generate the game detail information table and the game trend graph according to the texts and the text types.
In some embodiments, the target video is a video recorded with antagonistic game content, and the display module is configured to:
display a game detail page;
and display the game detail information table and the game trend graph on the game detail page.
The embodiment of the invention provides electronic equipment, which comprises a processor and a memory, wherein the memory stores a plurality of instructions; the processor loads instructions from the memory to perform any of the steps in the video processing method.
An embodiment of the present invention provides a storage medium, where a plurality of instructions are stored, where the instructions are suitable for being loaded by a processor to perform any of the steps in the video processing method.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a schematic view of a scene of a video processing method according to an embodiment of the present invention;
fig. 1b is a schematic flow chart of a video processing method according to an embodiment of the present invention;
fig. 1c is a schematic view of a live video frame of a basketball game according to the video processing method provided in the embodiment of the present invention;
fig. 1d is a schematic diagram of a text recognition network structure of a video processing method according to an embodiment of the present invention;
fig. 1e is a schematic diagram of a regional detection network structure of a video processing method according to an embodiment of the present invention;
fig. 2a is a schematic diagram of a live basketball game picture according to the video processing method provided in the embodiment of the present invention;
fig. 2b is a schematic flow chart of a video processing method according to an embodiment of the present invention;
fig. 2c is a schematic flowchart of a video processing method for classifying types of texts in a text set according to an embodiment of the present invention;
fig. 2d is a schematic view of a video detail page of a video processing method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a video processing method, a video processing device, electronic equipment and a storage medium.
The video processing apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal, a server, or the like. The terminal can be a mobile phone, a tablet computer, an intelligent bluetooth device, a notebook computer, or a Personal Computer (PC); the server may be a single server or a server cluster composed of a plurality of servers.
In some embodiments, the video processing apparatus may also be integrated into a plurality of electronic devices, for example, the video processing apparatus may be integrated into a plurality of servers, and the video processing method of the present invention is implemented by the plurality of servers.
In some embodiments, the server may also be implemented in the form of a terminal.
For example, referring to fig. 1a, the electronic device may be a smart phone that may obtain a target video when a user watches a live game using the smart phone; performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text; then, grouping the texts to obtain a text set; classifying the types of the texts in the text set based on the text positions and the text set, and determining the text types of the texts; generating match detail information according to the text and the text type; finally, the smartphone may display a game details page on the screen, which may include game details information.
The following are detailed below. The numbers in the following examples are not intended to limit the order of preference of the examples.
Artificial Intelligence (AI) is a technique that uses digital computers to simulate how humans perceive the environment, acquire knowledge, and use that knowledge, enabling machines to perform functions similar to human perception, reasoning, and decision making. AI technology mainly includes computer vision, speech processing, natural language processing, machine learning, deep learning, and the like.
Among them, Computer Vision (CV) is a technology that uses a computer instead of human eyes to recognize and measure a target image and perform further processing. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, virtual reality, augmented reality, simultaneous localization and mapping, and other techniques, such as image rendering and image edge extraction.
In this embodiment, a video processing method based on computer vision is provided, and as shown in fig. 1b, a specific flow of the video processing method may be as follows:
101. and acquiring a target video.
The target video may be a video in which antagonistic game contents are recorded, such as a basketball game video, a soccer game video, and the like.
In addition, the target video may be other videos with visual text content, such as news videos, advertisement videos, stock market videos, and the like.
The target video may be presented in various forms, for example, the target video may be a live video, a recorded video, or the like.
Taking a live basketball game video as an example, referring to fig. 1c, fig. 1c is a video frame of a live basketball game video, which includes the picture content of the basketball game and a game scoreboard; the scoreboard may include team name information, team score information, the remaining time of the current quarter, and so on.
In some embodiments, the target video may be received from the network via Blockchain (Blockchain) techniques.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. The blockchain is essentially a decentralized database, which is a string of data blocks associated by using cryptography, each data block contains information of a batch of network transactions, and the information is used for verifying the validity (anti-counterfeiting) of the information and generating the next block.
In this embodiment, a blockchain technology may be used as a platform product service layer and an application service layer to obtain a target video, the platform product service layer may provide basic capabilities and an implementation framework of typical applications, and a technician may superimpose characteristics of a service based on the basic capabilities to complete blockchain implementation of service logic. The application service layer provides the application service based on the block chain scheme for the business participants to use.
102. And performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text.
Text characters often appear in the video picture of the target video.
For example, the video frames of a basketball game video may include game score text, video source text, news text scrolling under the video, and the like; the video frames of an e-sports match video may include player kill text, team score text, and the like.
In the scheme, all texts appearing in the target video can be subjected to text recognition, and the content and the position of the texts are recognized.
For example, referring to fig. 1c, fig. 1c is a video frame of a live video of a basketball game, when the target video is the live video of the basketball game of fig. 1c, a text in a match score board appearing in the target video may be identified to obtain a text content of the text and a coordinate point of a text position where the text content is located, where the coordinate point may be a coordinate point where a center of the text is located, or coordinates of an upper left corner and a lower right corner of a minimum bounding box of the text.
For example, in fig. 1c, the text "team a", "22", "team B", "26", and "3: 12". Wherein the text positions are [ (548, 917), (732, 951) ], [ (748, 916), (796, 951) ], [ (968, 915), (1117, 949) ], [ (1130, 916), (1177, 952) ] and [ (1209, 915), (1270, 953) ], respectively.
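For illustration, the recognition result in this example could be represented as follows; this is a minimal sketch, and the structure and field names are assumptions rather than a format defined by the patent:

```python
# Hedged sketch of the recognition output: each recognized text carries its
# content and the top-left / bottom-right corners of its minimum bounding box.
from dataclasses import dataclass

@dataclass
class RecognizedText:
    content: str                   # recognized characters, e.g. "22"
    top_left: tuple                # (x, y) of the bounding box's top-left corner
    bottom_right: tuple            # (x, y) of the bounding box's bottom-right corner

texts = [
    RecognizedText("team A", (548, 917), (732, 951)),
    RecognizedText("22",     (748, 916), (796, 951)),
    RecognizedText("team B", (968, 915), (1117, 949)),
    RecognizedText("26",     (1130, 916), (1177, 952)),
    RecognizedText("3:12",   (1209, 915), (1270, 953)),
]
```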
Referring to fig. 1d, fig. 1d shows a text recognition network, which may be any artificial network model for end-to-end text detection, such as the FOTS network (Fast Oriented Text Spotting with a Unified Network), the EAST network (An Efficient and Accurate Scene Text Detector), and so on; the FOTS network model may include a shared convolutional network, an area detection network, a text recognition network, and a RoIRotate network, among others.
For example, in some embodiments, step 102 may be performed using a FOTS network model, as follows:
(1) and extracting image features of the target video to obtain the image features of the target video.
In the FOTS network model, a shared convolution network can be used to extract image features of a target video to obtain the image features of the target video.
The backbone network of the shared convolutional network may be any image feature extraction network, for example, a VGG network (Visual Geometry Group network), an AlexNet network, a deep residual network (ResNet), and the like.
For example, in some embodiments, in order to counter the drop in training accuracy that occurs as an artificial neural network deepens, a deep residual network such as ResNet-18 or ResNet-50 may be used; it allows the network to be deepened as much as possible while shortcutting the input information to the output, which simplifies the learning objective and its difficulty while protecting the integrity of the information.
(2) And detecting the text region based on the image features to obtain the text region features in the target video.
In the FOTS network model, a region detection network may be used to perform text region detection based on image features, so as to obtain text region features in the target video.
The area detection network can be any text detection model based on deep learning.
Most existing region detection networks are multi-stage text detection network models, and the multiple stages need to be optimized separately during training, which degrades the model. Therefore, in some embodiments, in order to eliminate the intermediate stages (such as candidate region aggregation, text segmentation, and post-processing) and predict text lines directly, the region detection network may be an EAST network.
The network structure of the EAST network can refer to fig. 1e, and includes a feature extraction layer, a multi-level fusion layer, and a multi-channel output layer.
The feature extraction layer comprises convolution kernels of several different sizes, which can extract feature maps of different sizes, so as to address the inaccurate text recognition caused by changes in the size of the text in the target video.
The feature extraction layer and the multi-level fusion layer can fuse features in the manner of U-Net (a CNN-based image segmentation network): the features produced by the shared convolutional network are merged from top to bottom using pooling, concatenation, convolution, and similar operations to obtain fused features, which are then convolved with 3 × 3 kernels and 32 channels to obtain the final features.
A multi-channel output layer may include multiple channels. For example, in some embodiments, the multi-channel output layer may include 6 channels: the 1st channel computes the text score feature; for each text, the next 4 channels compute the text region features (the distances from a pixel to the four edges of its text box), and the last channel computes the text angle feature.
In some embodiments, the problem of overlapping candidate text regions may also be resolved by a Locality-Aware Non-Maximum Suppression (Locality-Aware NMS) algorithm, finally yielding the final text region features and text features.
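The following is a simplified sketch of the locality-aware NMS idea, under the simplifying assumption of axis-aligned boxes; it is an illustration, not the patent's implementation:

```python
# Locality-aware NMS sketch: adjacent candidates are first merged row by row by
# score-weighted averaging, then standard NMS is applied to the survivors.
# A box is (x1, y1, x2, y2, score).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def weighted_merge(a, b):
    wa, wb = a[4], b[4]
    coords = [(ca * wa + cb * wb) / (wa + wb) for ca, cb in zip(a[:4], b[:4])]
    return (*coords, wa + wb)   # scores accumulate and act as merge weights

def locality_aware_nms(boxes, merge_thresh=0.5, nms_thresh=0.3):
    merged, last = [], None
    for box in boxes:           # boxes assumed sorted in row-major (reading) order
        if last is not None and iou(last, box) > merge_thresh:
            last = weighted_merge(last, box)
        else:
            if last is not None:
                merged.append(last)
            last = box
    if last is not None:
        merged.append(last)
    merged.sort(key=lambda b: b[4], reverse=True)
    kept = []                   # standard NMS on the merged candidates
    for box in merged:
        if all(iou(box, k) <= nms_thresh for k in kept):
            kept.append(box)
    return kept
```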
For example, in some embodiments, when the EAST network is used to perform the step "text region detection based on image features to obtain text region features in the target video", the method may include the following steps:
A. performing multi-size feature extraction by adopting a feature extraction layer according to the image features to obtain a plurality of image features with different sizes;
B. performing feature fusion processing on the plurality of image features with different sizes by adopting a multi-level fusion layer to obtain shared fusion features;
C. and determining text region characteristics in the target video according to the shared fusion characteristics by adopting a multi-channel output layer.
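To make steps A to C concrete, the following is a minimal PyTorch sketch of such a multi-channel output head; the layer sizes and the angle mapping are assumptions, not the patent's implementation:

```python
# EAST-style multi-channel output: from the shared fused feature map, 1 channel
# predicts the text score, 4 channels predict the distances from each pixel to
# the four edges of its text box, and 1 channel predicts the rotation angle.
import math
import torch
import torch.nn as nn

class EastOutputHead(nn.Module):
    def __init__(self, in_channels: int = 32):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)   # text / non-text
        self.geo = nn.Conv2d(in_channels, 4, kernel_size=1)     # distances to 4 edges
        self.angle = nn.Conv2d(in_channels, 1, kernel_size=1)   # rotation angle

    def forward(self, fused: torch.Tensor):
        score = torch.sigmoid(self.score(fused))
        geo = torch.relu(self.geo(fused))                        # distances are non-negative
        # map the angle prediction to a bounded range (an assumed parameterization)
        angle = (torch.sigmoid(self.angle(fused)) - 0.5) * math.pi / 2
        return score, geo, angle

# usage: `fused` stands for the 32-channel shared fusion feature
head = EastOutputHead(32)
score, geo, angle = head(torch.randn(1, 32, 128, 128))
```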
(3) And performing text recognition based on the image characteristics to obtain text characteristics in the target video.
The text recognition network may be any of a variety of recognition networks, such as the CRNN network (Convolutional Recurrent Neural Network), where the CRNN network may include a convolutional layer, a recurrent layer, and a transcription layer.
In some embodiments, the convolutional layer may be any of various CNN networks, such as the VGG network (Visual Geometry Group network), the GoogLeNet network, etc., so as to reduce the number of CNN convolution kernels and increase the depth of the convolutional layers, thereby extracting high-dimensional features.
In some embodiments, the recurrent layer may be any of various RNN networks, such as a Long Short-Term Memory network (LSTM), a Bidirectional Long Short-Term Memory network (Bi-LSTM), and so on.
For example, in some embodiments, the step of "performing text recognition based on image features to obtain text features in the target video" may include the following steps:
A. extracting high-dimensional features based on the image features to obtain high-dimensional image features;
B. extracting text time sequence characteristics according to the high-dimensional image characteristics;
C. and determining the text characteristics appearing in the target video according to the text time sequence characteristics.
That is, the convolutional layer of the CRNN extracts high-dimensional features on the basis of the image features to obtain the high-dimensional image features, the recurrent layer extracts text time-sequence features from the high-dimensional image features, and finally the transcription layer determines the text features appearing in the target video according to the text time-sequence features.
Among them, in order to ensure that the obtained text time-sequence features are continuous in time, the recurrent layer may be a Bi-LSTM network; and in order to convert the text time-sequence features into a label sequence, the transcription layer may be a CTC (Connectionist Temporal Classification) network.
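A compact sketch of this convolutional/recurrent/transcription pipeline might look as follows; it is an illustrative assumption, not the patent's network definition:

```python
# CRNN sketch: a small CNN extracts high-dimensional features, a Bi-LSTM models
# the text time sequence, and a linear layer emits per-step character logits in
# a form that a CTC loss (nn.CTCLoss) can consume.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(                        # convolutional layer
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_h = img_height // 4
        self.rnn = nn.LSTM(128 * feat_h, 256,            # recurrent layer (Bi-LSTM)
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)            # transcription: per-step logits

    def forward(self, x):                                # x: (N, 1, H, W)
        f = self.cnn(x)                                  # (N, C, H/4, W/4)
        n, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h) # one time step per image column
        out, _ = self.rnn(seq)
        return self.fc(out).log_softmax(-1)              # feed to nn.CTCLoss

logits = CRNN(num_classes=37)(torch.randn(2, 1, 32, 100))  # 36 characters + CTC blank
```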
(4) And performing area trimming processing on the text area characteristics to obtain the processed text area characteristics.
The trimming processing refers to mapping the text region features and inputting the mapped text region features into the text recognition network.
The feature mapping can be performed by a RoIRotate network, which converts an angled text region into a normal, axis-aligned text region through an affine transformation.
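The following OpenCV sketch illustrates the underlying idea of such an affine alignment; the function and its parameters are illustrative assumptions, not the RoIRotate implementation:

```python
# Rotate the frame so that a tilted text region becomes axis-aligned, then crop
# it, so a standard recognizer can read it.
import cv2
import numpy as np

def axis_align_region(image, center, angle_deg, size):
    """Undo a tilt of `angle_deg` around `center` and crop a (w, h) patch."""
    w, h = size
    m = cv2.getRotationMatrix2D(center, angle_deg, 1.0)   # rotation that undoes the tilt
    rotated = cv2.warpAffine(image, m, (image.shape[1], image.shape[0]))
    x, y = int(center[0] - w / 2), int(center[1] - h / 2)
    return rotated[max(y, 0):y + h, max(x, 0):x + w]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)         # stand-in for a video frame
patch = axis_align_region(frame, (600.0, 930.0), 5.0, (200, 40))
```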
(5) And predicting the text region based on the processed text region characteristics, and determining the text region appearing in the target video.
In the present embodiment, the boundaries of the text regions can be predicted; for example, for a rectangular text region, the coordinates of its upper left corner and lower right corner can be predicted.
(6) And performing text prediction based on the processed text characteristics, and determining the text appearing in the target video.
In some embodiments, the step of "performing text prediction based on text features, determining text present in the target video" may comprise the steps of:
A. performing multi-size feature extraction by adopting a feature extraction layer according to the image features to obtain a plurality of image features with different sizes;
B. performing feature merging processing on a plurality of image features with different sizes by adopting a multi-level fusion layer to obtain shared fusion features;
C. and determining texts appearing in the target video according to the shared fusion characteristics by adopting a multi-channel output layer.
For example, referring to the multi-channel output layer of fig. 1e, after 1 × 1 convolution, text prediction may be performed based on text features, and text features, text region features, text angle features, and the like of a text appearing in the target video may be determined.
(7) And determining the text position of the text according to the text area and the text.
103. And grouping the texts to obtain a text set.
After the text is obtained, the text can be grouped according to the text content.
For example, texts composed of numbers are grouped into a score group, texts composed of the symbol ":" together with numbers are grouped into a time group, texts composed of Chinese characters are grouped into a team name group or a player name group, and so on.
For example, in some embodiments, the target video is a video recorded with antagonistic game content, and texts composed of Chinese characters may be divided into team name groups. Since the team names of the participating teams are usually fixed, a team name text library can be established in advance, in which every entry is a team name. Step 103 may therefore include the following steps:
(1) and performing type matching on the text in a preset text library, and if the preset text matched with the text exists in the preset text library, classifying the text into a first type group.
In some embodiments, the first type group may include a team name group and a game stage group, where the game stage refers to the period the game is in at the current moment; for example, in some e-sports games, 0 to 10 minutes after the game starts may be called the first stage, 10 to 30 minutes after the game starts the second stage, and so on.
Similarly, the basketball game may include a first stage (1st), a second stage (2nd), a third stage (3rd), and so on. The preset text may include a team name and a competition stage, so the step of "performing type matching on the text in the preset text library, and if the preset text matched with the text exists in the preset text library, classifying the text into a first type group" may include the steps of:
A. performing type matching on the text in a preset text library;
B. if the name of the team matched with the text exists in the preset text base, dividing the text into a team name group;
C. and if the preset text base has the match stage matched with the text, dividing the text into match stage groups.
The preset text library may include pre-stored and recorded team names, player names, and the like. To judge whether a text is a team name or a player name, the text is type-matched in the preset text library; if the same text is matched in the preset text library, the text can be divided into the team name group.
(2) And if the preset text matched with the text does not exist in the preset text library and the text consists of numbers, classifying the text into a second type group.
In some embodiments, the preset text may include a team name and a game stage, the second type group may include a game score group, and the step "if there is no preset text matching the text in the preset text library and the text is composed of numbers, grouping the text into a second type group" may include the following step:
if no team name or game stage matching the text exists in the preset text library, and the text consists of numbers, the text is divided into the game score group.
The game score refers to the score obtained by each participating team.
(3) And if the preset text matched with the text does not exist in the preset text library and the text comprises the preset time symbol, dividing the text into time type groups.
The preset time symbol may refer to symbols such as ":".
For example, if the text is "3:01", it cannot be matched with any text in the preset text library, and it includes the preset time symbol ":", so the text may be divided into a time type group.
In some embodiments, the preset text may include a team name and a game stage, the preset time symbol may include a first preset time symbol and a second preset time symbol, the time type group may include a first time type group and a second time type group, and the step of "if there is no preset text matching the text in the preset text library and the text includes the preset time symbol, the step of dividing the text into the time type group" may include the steps of:
if the team names and the competition stages matched with the texts do not exist in the preset text base and the texts comprise first preset time symbols, dividing the texts into a first time type group;
and if the team names and the competition stages matched with the texts do not exist in the preset text base and the texts comprise second preset time symbols, classifying the texts into a second time type group.
On the scoreboard of a basketball game, a quarter remaining time (a game stage time) and a remaining attack time may be displayed; for example, under the rules of basketball, the quarter remaining time may be 10 minutes and the remaining attack time may be 25 seconds. Therefore, the first time type group may include texts expressing the quarter remaining time, and the second time type group may include texts expressing the remaining attack time.
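Putting the grouping rules of step 103 together, a minimal sketch could look as follows; the library contents and the symbol assumed for the second time type are illustrative assumptions:

```python
# Grouping sketch: match against a preset text library first, then fall back to
# "all digits" and "contains a preset time symbol" rules.
TEAM_NAMES = {"team A", "team B"}            # assumed preset team name library
GAME_STAGES = {"1st", "2nd", "3rd", "4th"}   # assumed preset game stage library
FIRST_TIME_SYMBOL = ":"                      # e.g. quarter remaining time "3:17"
SECOND_TIME_SYMBOL = "."                     # assumed symbol for attack-time text

def group_text(text: str) -> str:
    if text in TEAM_NAMES:
        return "team name group"             # first type group
    if text in GAME_STAGES:
        return "game stage group"            # first type group
    if text.isdigit():
        return "game score group"            # second type group
    if FIRST_TIME_SYMBOL in text:
        return "first time type group"
    if SECOND_TIME_SYMBOL in text:
        return "second time type group"
    return "ungrouped"

for t in ["team A", "26", "3:17", "21", "1st"]:
    print(t, "->", group_text(t))            # note: "21" (attack time) lands in the
                                             # score group and is disambiguated later
```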
104. And classifying the types of the texts in the text set based on the text positions and the text set, and determining the text types of the texts.
In some embodiments, the target video is a video recorded with antagonistic game content, the text set may include a team name group, a game stage group, a game score group, a first time type group, a second time type group, and the text type may include a game stage type, a first time type, a second time type, a main team name type, a game score type, and the like.
For example, the group of team names may include a plurality of texts indicating the names of the teams participating in the game, the group of game stages may include a plurality of texts indicating the stage of the game at the current time, the group of game scores may include a plurality of texts indicating the score of each team, the group of first time types may include a plurality of texts indicating the remaining time of the game measure, and the group of second time types may include a plurality of texts indicating the remaining attack time.
In some embodiments, the master-guest team name type may include a master team name type and a guest team name type.
Since, after the initial grouping, the texts in each group are not necessarily of the type corresponding to that group, in order to further determine the text types, the step "classifying the types of the texts in the text set based on the text positions and the text set, and determining the text types of the texts" may include the following steps:
(1) the text type of the text in the match stage group is determined as the match stage type.
(2) The text type of the text in the first time type group is determined as a first time type.
(3) The text type of the text in the second time type group is determined as the second time type.
(4) The texts in the team name group are judged as main or guest according to their text positions, and the main or guest team name type of the texts in the team name group is determined.
(5) And according to the text position of the text in the match score group, performing score type judgment on the text in the match score group, and determining the match score type of the text in the match score group.
In some embodiments, the step "performing score type judgment on the texts in the game score group according to their text positions, and determining the game score type of the texts in the game score group" may include the following steps:
A. and determining the number of texts in the first time type group, the second time type group and the match score group.
B. When the number of texts in the game score group is 3 and the sum of the numbers of texts in the first time type group and the second time type group is 1, the text type of that single text is determined as the second time type, score type judgment is performed on the texts in the game score group according to their text positions, and the game score type of the texts in the game score group is determined.
In some embodiments, the step "performing score type judgment on the texts in the game score group according to their text positions" may include the following steps:
a. when the game score group has a single text belonging to a preset text range, the text type of the text belonging to the preset text range is determined as the second time type;
b. when the game score group has a plurality of texts belonging to the preset text range, the number of texts in the game score group that have a preset position relationship with the texts in the second time type group is counted according to the text positions of the texts in the second time type group;
c. if the number of texts in the game score group having the preset position relationship with the texts in the second time type group is 1, the text type of that text is determined as the second time type, and the text types of the texts in the game score group not having the preset position relationship with the texts in the second time type group are determined as the game score type;
d. if the number of texts in the game score group having the preset position relationship with the texts in the second time type group is 2, the relative distances between the texts in the game score group and a preset coordinate axis are calculated;
e. if the relative distances are greater than the preset distance threshold, the text type of the text in the game score group with the minimum relative distance is determined as the second time type, and the text types of the texts in the game score group whose relative distance is not the minimum are determined as the game score type.
For example, in a basketball game scene, the preset text range may be any integer from 0 to 25; a text conforming to the preset text range may be either the score of a team or a remaining time in the basketball game.
The preset positional relationship may refer to a left-right relationship, an up-down relationship, and the like.
For example, in a basketball game scenario, when the number of texts in the game score group is 3, one of the 3 texts is the score of the guest team, one is the score of the main team, and the other is the remaining attack time.
In the basketball game, the remaining attack time is less than 25 seconds, and the scores of the main and guest teams can exceed 25 points, so that if only one text is less than 25 in the 3 texts, the text can be determined as the remaining attack time, and the other two texts are the scores of the main and guest teams.
Since the number on the left side of the basketball scoreboard is usually the score of the main team, the number in the middle is usually the score of the guest team, and the number on the right side is usually the remaining attack time, when all 3 texts are less than 25 it can be judged from the relative position relationship between the 3 texts which is the score of the main team, which is the score of the guest team, and which is the remaining attack time.
The preset distance threshold refers to a preset threshold on the relative distance between a text and the preset coordinate axis.
In the live scene of the basketball game, since the basketball game scoreboard may have changes such as rotation, zooming, displacement and the like, in order to make the recognition more accurate, the calculation of the relative distance can be performed by taking the lowermost part of the basketball game scoreboard as an x axis and the leftmost part of the basketball game scoreboard as a y axis.
The preset coordinate axis may refer to the x-axis or the y-axis.
C. When the number of texts in the first time type group is 2, determining the relative position relation between the texts in the first time type group according to the text positions, determining the text type of the texts belonging to the first relative position relation as a first time type, and determining the text type of the texts belonging to the second relative position relation as a second time type.
For example, the number of texts in the first time type group (the quarter remaining time) is generally 1; when the number of texts in this group is 2, one text in the group is the quarter remaining time and the other is the remaining attack time.
Therefore, which text is the quarter remaining time and which is the remaining attack time can be determined according to the relative position relationship between the two texts.
D. When the number of texts in the second time type group is 2, determining the relative position relationship between the texts in the second time type group according to the text positions, determining the text type of the text belonging to the first relative position relationship as the first time type, and determining the text type of the text belonging to the second relative position relationship as the second time type.
For example, the number of texts in the second time type group (the remaining attack time) is generally 1; when the number of texts in this group is 2, one of the texts is the quarter remaining time and the other is the remaining attack time.
Therefore, which text is the quarter remaining time and which is the remaining attack time can be determined according to the relative position relationship between the two texts.
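As an illustration of the disambiguation described in steps a to e and B to D above, the following is a hedged sketch for the case where the score group holds three numbers; the preset range, the axis position, and the fallback are assumptions:

```python
# Split a three-text score group into the two team scores and the remaining
# attack time, using the value range first and text positions as a fallback.
# Texts are (value, (x, y)) pairs.
PRESET_RANGE = range(0, 26)        # 0..25: values that could be the attack time

def split_score_group(score_texts, second_time_texts):
    in_range = [t for t in score_texts if t[0] in PRESET_RANGE]
    if len(in_range) == 1:                       # step a: a single candidate
        attack = in_range[0]
    else:
        # steps b-c: prefer the candidate roughly level with a known
        # second-time-type text (the preset position relationship)
        ref_y = second_time_texts[0][1][1] if second_time_texts else None
        near = [t for t in in_range if ref_y is not None and abs(t[1][1] - ref_y) < 10]
        if len(near) == 1:
            attack = near[0]
        else:
            # steps d-e (threshold check omitted for brevity): the candidate
            # closest to a preset coordinate axis, assumed here to be a vertical
            # axis near the scoreboard's right edge, is the attack time
            axis_x = 1200.0
            attack = min(in_range, key=lambda t: abs(t[1][0] - axis_x))
    scores = [t for t in score_texts if t is not attack]
    return scores, attack

scores, attack = split_score_group(
    [(26, (300, 400)), (22, (700, 400)), (21, (1100, 400))], [])
print("team scores:", scores, "remaining attack time:", attack)
```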
105. And generating video detail information according to the text and the text type.
In some embodiments, the target video is a video recorded with antagonistic game content, so the video detail information may include a game detail information table and a game trend graph, and step 105 may generate the game detail information table and the game trend graph according to the texts and the text types.
For example, referring to Table 1, Table 1 is a game detail information table:
[Table 1 is reproduced only as an image in the original document; its contents are not recoverable from the text.]
TABLE 1
In some embodiments, a game trend graph may also be generated based on team scores.
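For example, a game trend graph could be drawn from the recognized team scores as follows; this is a matplotlib sketch with made-up sample values, not the patent's rendering code:

```python
# Plot each team's recognized score against the game time of the sampled frames.
import matplotlib.pyplot as plt

timestamps = [0, 5, 10, 15, 20]      # minutes into the game (sampled frames)
team_a = [0, 8, 15, 22, 26]          # recognized "team A" scores (sample values)
team_b = [0, 6, 12, 18, 22]          # recognized "team B" scores (sample values)

plt.plot(timestamps, team_a, label="team A")
plt.plot(timestamps, team_b, label="team B")
plt.xlabel("game time (min)")
plt.ylabel("score")
plt.title("game trend graph")
plt.legend()
plt.savefig("trend.png")
```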
106. And displaying the video detail information.
In some embodiments, video detail information may be displayed directly; in other embodiments, the video detail information may be further processed for display, and so on.
In some embodiments, the target video may be a video in which antagonistic game content is recorded, so in step 106 a game detail page, which may include the video detail information, may be displayed; in other embodiments, the video detail information may be further processed and the game detail page may be displayed based on the processed video detail information, and so on.
For example, in some embodiments, step 106 may display a game detail page, and display the game detail information table and the game trend graph on the game detail page.
In some embodiments, the video detail information, the game detail information table, the game trend graph, and the like can be sent over the network to a mobile terminal by means of blockchain technology, so that the mobile terminal displays a game detail page and shows the game detail information table and the game trend graph on it.
For example, referring to fig. 1a, the live game video and the game detail page may be displayed simultaneously on the screen of the smartphone, with the game detail information table and the game trend graph displayed on the game detail page.
As can be seen from the above, the embodiment of the present invention can obtain a target video; performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text; grouping the texts to obtain a text set; classifying the types of the texts in the text set based on the text positions and the text set, and determining the text types of the texts; generating video detail information according to the text and the text type; and displaying the video detail information.
According to the scheme, all texts in the target video and their positions can be automatically identified, overcoming the low recognition efficiency caused by text rotation, zooming, displacement, and the like in a dynamic video; and by combining the texts with their positions, the text type to which each text belongs can be correctly identified, so that correct video detail information is generated. The scheme can therefore improve the effectiveness of video processing.
The method described in the above embodiments is further described in detail below.
In this embodiment, the method of the embodiment of the present invention will be described in detail by taking a live basketball game as an example.
Referring to fig. 2a, fig. 2a is a schematic diagram of a live basketball game picture, which includes two parts, the game picture and the scoreboard. In fig. 2a, the scoreboard contains the team names team A and team B, team A's score 26, team B's score 22, the basketball game stage 1st (the first quarter), the remaining time of the first quarter 3:17, and the remaining attack time 21.
As shown in fig. 2b, a specific flow of a video processing method is as follows:
201. and obtaining a training sample and an initial text recognition model, and training the initial text recognition model by adopting the training sample until the initial text recognition model is converged to obtain the text recognition model.
In this embodiment, the target video is a live basketball video, and the training samples can be obtained by extracting frames from the live basketball video and then performing annotation.
For example, frames are extracted from live basketball game videos across four seasons, 30 frames per game, about 80,000 pictures in total; after the pictures are obtained, the texts in the game pictures and their position information can be labeled manually or by machine.
For example, seven items on the game scoreboard need to be labeled with their coordinates and content: the two team names, the two team scores, the game stage, the quarter remaining time, and the remaining attack time.
For example, fig. 2a also shows the labeling rule for texts in the game picture: when team A's score 26 needs to be labeled, the content "26" and the coordinates of the upper-left and lower-right corners of its minimum bounding box (indicated by the dashed rectangle in the figure) are labeled.
For example, the label for team A's score 26 in fig. 2a may be {26, [(260, 399), (310, 449)]}.
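As a concrete illustration, such a label might be represented in code as follows; this is a minimal sketch, and the class and field names are assumptions for illustration rather than a format defined by this application:

```python
# Minimal sketch of one annotation record for the scoreboard labels above;
# the class and field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TextLabel:
    content: str                   # the labeled text itself, e.g. "26"
    top_left: Tuple[int, int]      # upper-left corner of the minimum bounding box
    bottom_right: Tuple[int, int]  # lower-right corner of the minimum bounding box

# The label {26, [(260, 399), (310, 449)]} for team A's score becomes:
score_label = TextLabel("26", (260, 399), (310, 449))
print(score_label)
```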
In this embodiment, the initial text recognition model is a FOTS model, which includes a shared convolutional network, a region detection network, a rotation (RoIRotate) network, and a text recognition network.
The backbone of the shared convolutional network is a ResNet-50 network, the region detection network is an EAST network, and the text recognition network is a CRNN network based on VGG and Bi-LSTM.
The specific network structure of the FOTS model can refer to fig. 1d, and is not described herein.
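For orientation, the composition of this model can be sketched at a high level as follows; every name in the sketch is an illustrative assumption, and the real design is described in the FOTS paper:

```python
# High-level sketch of a FOTS-style pipeline: shared features, an EAST-style
# detection branch, RoIRotate, and a CRNN-style recognition branch.
# All names are illustrative assumptions, not the paper's actual API.

def roi_rotate(features, box):
    """Stub standing in for RoIRotate: crop the shared feature map at `box`
    and rotate it into a horizontal strip for the recognition branch."""
    return ("crop", box)

class FOTSSketch:
    def __init__(self, backbone, detector, recognizer):
        self.backbone = backbone      # shared conv network (ResNet-50 trunk)
        self.detector = detector      # EAST-style region detection branch
        self.recognizer = recognizer  # CRNN branch (VGG conv + Bi-LSTM + CTC)

    def forward(self, frame):
        features = self.backbone(frame)   # shared feature extraction
        boxes = self.detector(features)   # oriented text regions
        crops = [roi_rotate(features, b) for b in boxes]
        return [(b, self.recognizer(c)) for b, c in zip(boxes, crops)]

# Toy usage with stand-in callables:
model = FOTSSketch(backbone=lambda x: x,
                   detector=lambda f: ["box1", "box2"],
                   recognizer=lambda c: "text")
print(model.forward("frame"))
```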
Wherein the loss function L_d of the region detection network includes a text loss term L_s and a region loss term L_g, which represent the text classification loss and the predicted text region loss, respectively:
L_d = λL_s + L_g
where λ represents a weight that balances the two losses.
For the text classification loss term L_s, the dice loss may be used for the calculation in this embodiment:
L_s = 1 − (2·Σ_i p_i g_i) / (Σ_i p_i + Σ_i g_i)
where i indexes the pixel points of the feature map output by the model, p_i represents the probability that feature-map pixel i is predicted as text, and g_i represents the true label of that feature-map pixel.
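A minimal numpy sketch of this dice loss, assuming p and g are per-pixel score and label maps:

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    """L_s = 1 - 2*sum(p*g) / (sum(p) + sum(g)).
    p: predicted per-pixel text probabilities; g: 0/1 ground-truth map."""
    p, g = p.ravel(), g.ravel()
    return 1.0 - (2.0 * np.sum(p * g)) / (np.sum(p) + np.sum(g) + eps)

# Toy check: a perfect binary prediction gives (near) zero loss.
g = np.array([0.0, 1.0, 1.0, 0.0])
print(dice_loss(g, g))  # ~0.0
```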
For the region loss term L_g, a rotation angle loss L_θ and an Intersection-over-Union (IoU) loss L_AABB may be adopted for the calculation in this embodiment:
L_g = L_AABB + λL_θ
L_AABB = −log IoU(R̂, R)
L_θ = 1 − cos(θ̂ − θ)
where λ represents a weight that balances the two losses, R̂ represents the predicted coordinates of the text region and R is the true coordinates of the text region, and θ̂ represents the predicted angle of the text region and θ is the true angle of the text region.
Here L_AABB is the IoU loss between the predicted text region and the true text region, where IoU is the overlap ratio of the two regions, i.e. the intersection of the prediction (detection result) and the ground truth divided by their union:
IoU(R̂, R) = |R̂ ∩ R| / |R̂ ∪ R|
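These region terms can be sketched for axis-aligned boxes as follows; this is a simplification (a real EAST-style detector predicts per-pixel distances to the four box sides), and the weight value lam = 10 is an assumption for illustration:

```python
import math

def iou_loss(pred, true):
    """L_AABB = -log IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], true[0]), max(pred[1], true[1])
    ix2, iy2 = min(pred[2], true[2]), min(pred[3], true[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred) + area(true) - inter
    return -math.log(max(inter / union, 1e-6))

def angle_loss(theta_pred, theta_true):
    """L_theta = 1 - cos(theta_pred - theta_true)."""
    return 1.0 - math.cos(theta_pred - theta_true)

def region_loss(pred, true, theta_pred, theta_true, lam=10.0):
    # L_g = L_AABB + lambda * L_theta; lam = 10 is an assumed weight.
    return iou_loss(pred, true) + lam * angle_loss(theta_pred, theta_true)

print(region_loss((0, 0, 10, 10), (0, 0, 10, 12), 0.05, 0.0))
```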
the text recognition network can input the extracted high-dimensional feature map into a Bi-L STM (synchronous text transfer model) to train and recognize in the RNN part so as to capture sequence features of an input text, and then the CTC network is adopted to convert the fully-connected character classification into a label sequence, and the recognition loss of the part can be classified log loss:
L=Ld+λLr
where λ represents the weight that balances the two task losses.
The initial text recognition model is trained until L converges stably, obtaining the trained text recognition model.
202. A live basketball video is obtained, and text recognition is performed on the live basketball video using the text recognition model to obtain the texts appearing in the game video and the text positions of the texts.
Specifically, the step of performing text recognition on the live basketball video by using a text recognition model to obtain a text appearing in the video of the basketball game and a text position of the text may refer to step 102, which is not described herein again.
203. The texts are grouped to obtain a text set, the types of the texts in the text set are classified based on the text positions and the text set, and the text types of the texts are determined.
In this embodiment, in order to guard against recognition errors, if the number of texts appearing in the game video obtained in step 202 is less than 4 or greater than 8, empty is directly returned.
In this embodiment, type matching may be performed on the text in a preset text library, and if a preset text matching the text exists in the preset text library, the text is classified into a first type group, where the first type group may include a team name group and a competition phase group;
if the preset text matched with the text does not exist in the preset text library and the text consists of numbers, classifying the text into a second type group;
if the preset text matched with the text does not exist in the preset text library and the text comprises the preset time symbol, dividing the text into time type groups, wherein the time type groups comprise a section remaining time group (namely a first time type group) and a remaining attack time group (namely a second time type group).
The preset text library may include a team name library and a competition stage library.
For example, the team name library may include a plurality of team name texts: the Lakers, the Warriors, the Clippers, the Grizzlies, and so on. The competition stage library may include a plurality of competition stage names: 1st (first stage), 2nd (second stage), 3rd (third stage), 4th (fourth stage), and so on.
When the text matches the text in the team name library, the text may be assigned to a team name group.
When the text matches the text in the game stage library, the text may be assigned to a game stage group.
In some embodiments, the second type group may include a match score group; if no preset text matching the text exists in the preset text library and the text is composed entirely of digits, the text may be assigned to the match score group.
The preset time symbols may include ":" and "-": if the text contains ":", the text is added to the section remaining time group, and if the text contains "-" but does not contain ":", the text is added to the remaining attack time group.
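Taken together, the grouping rules of this step can be sketched as follows; this is a minimal sketch in which the team and stage libraries are toy examples, the function name is an assumption, and the 4-to-8 text check follows the description above:

```python
TEAM_NAMES = {"Lakers", "Warriors", "Clippers"}   # toy preset team name library
GAME_STAGES = {"1st", "2nd", "3rd", "4th"}        # toy preset stage library

def group_texts(texts):
    """Assign each recognized text to a group; return None (empty) when
    fewer than 4 or more than 8 texts were recognized."""
    if not 4 <= len(texts) <= 8:
        return None
    groups = {"team": [], "stage": [], "score": [],
              "section_time": [], "attack_time": []}
    for t in texts:
        if t in TEAM_NAMES:
            groups["team"].append(t)
        elif t in GAME_STAGES:
            groups["stage"].append(t)
        elif t.isdigit():
            # Candidate match score; a pure number may in fact be the
            # remaining attack time (resolved by case 1 below).
            groups["score"].append(t)
        elif ":" in t:
            groups["section_time"].append(t)  # section remaining time
        elif "-" in t:
            groups["attack_time"].append(t)   # remaining attack time
    return groups

print(group_texts(["Lakers", "Warriors", "26", "22", "1st", "3:17", "21"]))
```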
Since the style of the game scoreboard is not fixed, the order of the texts in it is not fixed either: the match scores are difficult to determine from position alone, and the remaining attack time may also be a pure number, making it hard to tell which of three pure numbers is the remaining attack time and which are the match scores. Referring to fig. 2c, which is a flow chart of classifying the types of the texts in the text set based on the text positions and the text set and determining the text types of the texts, when the texts in the match score group are all numbers and the number of texts in the match score group is 3, the determination can be made through the following six cases:
case 1: and when the number of the characters in the match group is 3 and the sum of the numbers of the characters in the first time type group and the second time type group is 1, determining the character types of the characters in the first time type group and the second time type group as the second time type, performing score type judgment on the characters in the match group according to the character positions of the characters in the name group of the participating team, and determining the match score type of the characters in the name group of the participating team.
That is, when there are 3 numbers in the match score group and only one of the section remaining time group and the remaining attack time group has a value (the total number of values being 1), that single value can be determined as the section remaining time, denoted A below.
Then it is determined whether exactly one number in the match score group is less than 25. If only one number is less than 25, that number can be determined as the remaining attack time and the other two numbers as the match scores; the two match scores can then be assigned as the home team score and the guest team score according to their relative positions. For example, of the two match scores, the one on the relative left or upper side can be determined as the home team score and the other as the guest team score.
If more than one of the three numbers is less than 25, the number of the three numbers located to the right of the section remaining time A is determined.
When only one number is located to the right of the section remaining time A, that number can be determined as the remaining attack time and the remaining two numbers as the match scores; similarly, the home team score and the guest team score can be determined based on their relative positions.
When more than one number is located to the right of the section remaining time A, these numbers can be denoted as number B and number C. In this embodiment, the distances Dy of number B and number C from the section remaining time A on the Y axis can be calculated. If the larger of the two Dy values is greater than a preset threshold (for example, 30), the corresponding number can be determined as the remaining attack time and the remaining two numbers as the match scores. If the larger Dy is not greater than the preset threshold, the distances Dx of number B and number C from the section remaining time A on the X axis are calculated, and the number with the smaller Dx can be determined as the remaining attack time, the remaining two numbers being the match scores.
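Case 1 can be sketched as follows; coordinates are (x, y) positions, the threshold 30 follows the text above, and the function and variable names are assumptions for illustration:

```python
def resolve_case_1(scores, a_pos, threshold=30):
    """scores: list of (value, (x, y)) for the three pure numbers;
    a_pos: (x, y) of the single time text A (the section remaining time).
    Returns (remaining_attack_time, [two match scores])."""
    small = [s for s in scores if s[0] < 25]
    if len(small) == 1:          # only one candidate below 25: the shot clock
        attack = small[0]
    else:
        ax, ay = a_pos
        right = [s for s in scores if s[1][0] > ax]  # numbers right of A
        if len(right) == 1:
            attack = right[0]
        else:
            # The number farther from A on the Y axis wins if beyond the
            # threshold; otherwise the one closer to A on the X axis wins.
            by_dy = max(right, key=lambda s: abs(s[1][1] - ay))
            if abs(by_dy[1][1] - ay) > threshold:
                attack = by_dy
            else:
                attack = min(right, key=lambda s: abs(s[1][0] - ax))
    return attack[0], [s[0] for s in scores if s is not attack]

# Three pure numbers (two scores and the attack time) and A's position:
print(resolve_case_1([(26, (260, 399)), (22, (520, 399)), (21, (700, 430))],
                     (600, 400)))
```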
Case 2: when the number of texts in the first time type group is 2, the relative positional relationship between the texts in the first time type group can be determined according to their text positions; the text type of the text in the first relative position is determined as the first time type, and the text type of the text in the second relative position is determined as the second time type.
That is, when the number of texts in the section remaining time group is 2, the text on the relative left or upper side of the two is determined as the section remaining time, and the text on the relative right or lower side is determined as the remaining attack time.
Case 3: this case is similar to case 2. When the number of texts in the second time type group is 2, the relative positional relationship between the texts in the second time type group is determined according to their text positions; the text type of the text in the first relative position is determined as the first time type, and the text type of the text in the second relative position is determined as the second time type.
That is, when the number of texts in the remaining attack time group is 2, the text on the relative left or upper side of the two is determined as the section remaining time, and the text on the relative right or lower side is determined as the remaining attack time.
Case 4: if only one text exists in the section remaining time group, that text is the section remaining time; and if only one text exists in the remaining attack time group, that text is the remaining attack time.
Case 5: if only one text exists in the section remaining time group, that text is the section remaining time; and if no text exists in the remaining attack time group, the remaining attack time is empty.
Case 6: if only one text exists in the remaining attack time group, that text is the remaining attack time; and if no text exists in the section remaining time group, the section remaining time is empty.
Finally, the returned content is checked: the recognition result is returned if and only if the home and guest team scores and the section remaining time are normal values; otherwise, empty is returned.
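The remaining cases reduce to presence and relative-position checks; a compact sketch follows, in which the helper names and the result layout used for the final check are assumptions:

```python
def resolve_times(section_group, attack_group):
    """Cases 2-6: each group entry is (text, (x, y));
    returns (section_remaining_time, remaining_attack_time)."""
    def split_pair(pair):
        # Left/upper text is the section remaining time,
        # right/lower text is the remaining attack time.
        a, b = sorted(pair, key=lambda t: (t[1][0], t[1][1]))
        return a[0], b[0]

    if len(section_group) == 2:   # case 2
        return split_pair(section_group)
    if len(attack_group) == 2:    # case 3
        return split_pair(attack_group)
    section = section_group[0][0] if section_group else None  # cases 4 and 5
    attack = attack_group[0][0] if attack_group else None     # cases 4 and 6
    return section, attack

def validate(result):
    # Final check: return the recognition result only when the match scores
    # and the section remaining time are present; otherwise return empty.
    return result if result.get("score") and result.get("section_time") else None

print(resolve_times([("3:17", (580, 100)), ("0:21", (700, 100))], []))
print(validate({"score": [26, 22], "section_time": "3:17"}))
```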
204. Generating game detail information according to the text and the text type, and displaying a game detail page, wherein the game detail page comprises the game detail information.
Referring to fig. 2d, after the text and text type are recognized, game detail information may be generated and a game detail page may be displayed, the game detail page including the game detail information.
The invention can identify the content in the basketball game video scoreboard in real time, and correctly map the content of the scoreboard to the names and scores of the two competing teams, the game section number, the section remaining time, and the remaining attack time.
Through this scheme, the recognition accuracy on the scoreboard reaches more than 92%, and recognizing one frame of image takes about 1 second, which effectively improves the effect of the video processing method.
Therefore, all texts in the target video and their positions can be recognized efficiently and automatically, and the corresponding match detail information can be generated by combining the texts with their positions. Compared with prior-art schemes that can extract texts only by recognizing the scoreboard, this scheme solves the difficulty of recognizing scoreboard texts under displacement, rotation, scaling, deformation and the like. In addition, since the order of the texts in the scoreboard may change, this scheme analyzes which text type each text belongs to after recognition, which facilitates generating further match detail information; therefore, the effect of the video processing method can be improved.
In order to better implement the above method, an embodiment of the present invention further provides a video processing apparatus, where the video processing apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal, a server, or the like. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.
For example, in the present embodiment, the method according to the embodiment of the present invention will be described in detail by taking an example in which a video processing apparatus is specifically integrated in a server.
For example, as shown in fig. 3, the video processing apparatus may include an acquisition module 301, a recognition module 302, an aggregation module 303, a classification module 304, a generation module 305, and a display module 306, as follows:
(I) An acquisition module 301.
The acquisition module 301 may be used to acquire a target video.
(II) An identification module 302.
The recognition module 302 may be configured to perform text recognition on the target video, so as to obtain a text appearing in the target video and a text position of the text.
In some embodiments, the recognition module 302 may include a feature extraction sub-module, a region detection sub-module, a text recognition sub-module, a region trimming sub-module, a region prediction sub-module, a text prediction sub-module, and a determination sub-module, as follows:
(1) A feature extraction sub-module.
The feature extraction submodule can be used for extracting image features of the target video to obtain the image features of the target video.
(2) A region detection sub-module.
The region detection submodule can be used for carrying out text region detection based on the image features to obtain the text region features in the target video.
(3) A text recognition sub-module.
The text recognition submodule can be used for performing text recognition based on the image features to obtain the text features in the target video.
(4) A region trimming sub-module.
The region trimming sub-module may be configured to perform region trimming processing on the text region feature to obtain a processed text region feature.
(5) A region prediction sub-module.
The region prediction sub-module may be configured to perform text region prediction based on the processed text region features, and determine a text region appearing in the target video.
(6) A text prediction sub-module.
The text prediction sub-module can be used for performing text prediction based on the processed text features and determining texts appearing in the target video.
(7) A determination sub-module.
The determining sub-module may be configured to determine a text position of the text based on the text region and the text.
In some embodiments, the region detection sub-module may be configured to:
performing multi-size feature extraction by adopting a feature extraction layer according to the image features to obtain a plurality of image features with different sizes;
performing feature fusion processing on the plurality of image features with different sizes by adopting a multi-level fusion layer to obtain shared fusion features;
and determining text region characteristics in the target video according to the shared fusion characteristics by adopting a multi-channel output layer.
In some embodiments, the text prediction sub-module may be configured to:
performing multi-size feature extraction by adopting a feature extraction layer according to the image features to obtain a plurality of image features with different sizes;
performing feature fusion processing on the plurality of image features with different sizes by adopting a multi-level fusion layer to obtain shared fusion features;
and determining texts appearing in the target video according to the shared fusion characteristics by adopting a multi-channel output layer.
In some embodiments, the text recognition sub-module may be configured to:
extracting high-dimensional features based on the image features to obtain high-dimensional image features;
extracting text time sequence characteristics according to the high-dimensional image characteristics;
and determining the text characteristics appearing in the target video according to the text time sequence characteristics.
(III) An aggregation module 303.
The aggregation module 303 may be configured to perform grouping processing on the texts to obtain a text set.
In some embodiments, the aggregation module 303 may include a matching sub-module, a first grouping sub-module, and a second grouping sub-module, as follows:
(1) A matching sub-module.
The matching sub-module can be used for performing type matching on the text in a preset text library, and if the preset text matched with the text exists in the preset text library, the text is classified into a first type group.
(2) A first grouping submodule.
The first grouping submodule may be configured to, if there is no preset text matching the text in the preset text library and the text is composed of numbers, group the text into a second type group.
(3) A second grouping sub-module.
The second grouping submodule may be configured to, if a preset text matching the text does not exist in the preset text library and the text may include a preset time symbol, group the text into a time type group.
In some embodiments, the target video may be a video in which antagonistic game content is recorded, the first type group may include a team name group and a competition stage group, and the matching sub-module may include a matching subunit, a team name subunit, and a stage subunit, as follows:
A. A matching subunit.
The matching subunit can be used for performing type matching on the text in a preset text library.
B. A team name subunit.
The team name subunit may be configured to, if a team name matching the text exists in the preset text library, divide the text into a team name group.
C. A stage subunit.
The stage subunit may be configured to, if there is a match stage matching the text in the preset text library, divide the text into a match stage group.
In some embodiments, the preset text may include a team name and a competition stage, and the second type group may include a match score group; the first grouping sub-module is configured to:
divide the text into the match score group if no team name and no competition stage matching the text exist in the preset text library and the text consists of numbers.
In some embodiments, the preset text may include a team name and a competition stage, the preset time symbols may include a first preset time symbol and a second preset time symbol, the time type group may include a first time type group and a second time type group, and the second grouping sub-module is configured to:
if the team names and the competition stages matched with the texts do not exist in the preset text base and the texts can include first preset time symbols, dividing the texts into a first time type group;
and if the preset text base does not have the team names and the competition stages matched with the text and the text can comprise a second preset time symbol, classifying the text into a second time type group.
(IV) A classification module 304.
The classification module 304 may be configured to classify the text by type based on the text location and the text set, and determine the text type of the text.
In some embodiments, the target video may be a video recording antagonistic game content, the text set may include a team name group, a competition stage group, a match score group, a first time type group, and a second time type group, and the text type may include a competition stage type, a first time type, a second time type, a host and guest name type, and a match score type. The classification module 304 may include a stage type sub-module, a first time type sub-module, a second time type sub-module, a host-guest judgment sub-module, and a score type judgment sub-module, as follows:
(1) A stage type sub-module.
The stage type sub-module may be configured to determine a text type of text in the set of game stages as the game stage type.
(2) A first time type sub-module.
The first time type sub-module may be for determining a text type of text in the first time type group as the first time type.
(3) A second time type sub-module.
The second time type sub-module may be for determining a text type of text in the second group of time types as the second time type.
(4) A host-guest judgment sub-module.
The host-guest judgment sub-module can be used to perform host-guest judgment on the texts in the participating team name group according to their text positions, and determine the host and guest name type of the texts in the participating team name group.
(5) A score type judgment sub-module.
The score type judging submodule can be used for judging the score types of the texts in the competition score group according to the text positions of the texts in the competition score group and determining the competition score types of the texts in the competition score group.
In some embodiments, the score type determination sub-module may include a number sub-unit, a score type determination sub-unit, a first relative position relationship sub-unit, and a second relative position relationship sub-unit, as follows:
A. A number subunit.
The number subunit may be used to determine the number of texts in the first time type group, the second time type group, and the match score group.
B. A score type judgment subunit.
The score type judging subunit may be configured to: when the number of texts in the match score group is 3 and the sum of the numbers of texts in the first time type group and the second time type group is 1, determine the text type of the text in the first time type group or the second time type group as the second time type, perform score type judgment on the texts in the match score group according to their text positions, and determine the match score type of the texts in the match score group.
C. A first relative positional relationship subunit.
The first relative positional relationship subunit may be configured to, when the number of texts in the first time type group is 2, determine a relative positional relationship between the texts in the first time type group according to the text positions, determine a text type of the text belonging to the first relative positional relationship as a first time type, and determine a text type of the text belonging to the second relative positional relationship as a second time type.
D. A second relative positional relationship subunit.
The second relative positional relationship subunit may be configured to determine, when the number of texts in the second time type group is 2, a relative positional relationship between the texts in the second time type group according to the text positions, determine a text type of the text belonging to the first relative positional relationship as the first time type, and determine a text type of the text belonging to the second relative positional relationship as the second time type.
In some embodiments, the score type determination subunit may be to:
when the match score group has a text belonging to a preset text range, determining the text type of the text belonging to the preset text range as a second time type;
when the match score group has a plurality of texts belonging to the preset text range, counting, according to the text positions of the texts in the second time type group, the number of texts in the match score group that have a preset positional relation with the texts in the second time type group;
if the number of texts in the match score group having the preset positional relation with the texts in the second time type group is 1, determining the text type of that text as the second time type, and determining the text type of the texts in the match score group that do not have the preset positional relation as the match score type;
if the number of texts in the match score group having the preset positional relation with the texts in the second time type group is 2, calculating the relative distances of those texts from the texts in the second time type group on a preset coordinate axis;
and if the relative distance is greater than a preset distance threshold, determining the text type of the text with the smallest relative distance in the match score group as the second time type, and determining the text type of the text whose relative distance is not the smallest as the match score type.
(V) A generation module 305.
The generation module 305 may be used to generate video detail information from text and text type.
In some embodiments, the target video may be a video recorded with antagonistic game content, the video detail information may include a game detail information table and a game trend chart, and the generating module 305 may be configured to:
generate a game detail information table and a game trend chart according to the text and the text type.
(VI) A display module 306.
The display module 306 may be used to display video detail information.
In some embodiments, the target video may be a video recorded with antagonistic game content, the video detail information may include a game detail report and a game trend graph, and the display module 306 may be configured to:
displaying a match detail page;
and displaying a game detail report and a game trend chart on a game detail page.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, the video processing apparatus of the present embodiment obtains the target video by the obtaining module; the identification module performs text identification on the target video to obtain a text appearing in the target video and a text position of the text; the set module is used for grouping the texts to obtain a text set; the classification module classifies the types of the texts based on the text positions and the text sets to determine the text types of the texts; generating video detail information according to the text and the text type by a generating module; and displaying the video detail information by the display module. Therefore, the embodiment of the invention can improve the effect of the video processing method.
The embodiment of the invention also provides the electronic equipment which can be equipment such as a terminal, a server and the like. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like; the server may be a single server, a server cluster composed of a plurality of servers, or the like.
In some embodiments, the video processing apparatus may also be integrated into a plurality of electronic devices, for example, the video processing apparatus may be integrated into a plurality of servers, and the video processing method of the present invention is implemented by the plurality of servers.
In this embodiment, the electronic device of this embodiment is described in detail as an example, for example, as shown in fig. 4, it shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, an input module 404, and a communication module 405. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. In some embodiments, processor 401 may include one or more processing cores; in some embodiments, processor 401 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The electronic device also includes a power supply 403 for supplying power to the various components, and in some embodiments, the power supply 403 may be logically coupled to the processor 401 via a power management system, such that the power management system may manage charging, discharging, and power consumption. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may also include an input module 404, the input module 404 operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The electronic device may also include a communication module 405, and in some embodiments the communication module 405 may include a wireless module, through which the electronic device may wirelessly transmit over short distances, thereby providing wireless broadband internet access to the user. For example, the communication module 405 may be used to assist a user in sending and receiving e-mails, browsing web pages, accessing streaming media, and the like.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring a target video;
performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text;
grouping the texts to obtain a text set;
classifying the types of the texts in the text set based on the text positions and the text set, and determining the text types of the texts;
generating video detail information according to the text and the text type;
and displaying the video detail information.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Therefore, this scheme can improve the effect of the video processing method.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the embodiment of the present invention provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the video processing methods provided by the embodiment of the present invention. For example, the instructions may perform the steps of:
acquiring a target video;
performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text;
grouping the texts to obtain a text set;
classifying the types of the texts in the text set based on the text positions and the text set, and determining the text types of the texts;
generating video detail information according to the text and the text type;
and displaying the video detail information.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any video processing method provided in the embodiments of the present invention, beneficial effects that can be achieved by any video processing method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The video processing method, the video processing apparatus, the electronic device, and the computer-readable storage medium according to the embodiments of the present invention are described in detail above, and a specific example is applied in the description to explain the principles and embodiments of the present invention, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (15)

1. A video processing method, comprising:
acquiring a target video;
performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text;
grouping the texts to obtain a text set;
based on the text position and the text set, performing type classification on the texts in the text set, and determining the text type of the texts;
generating video detail information according to the text and the text type;
and displaying the video detail information.
2. The video processing method of claim 1, wherein the grouping the text to obtain a text set comprises:
performing type matching on the text in a preset text library, and if the preset text matched with the text exists in the preset text library, classifying the text into a first type group;
if a preset text matched with the text does not exist in the preset text library and the text consists of numbers, classifying the text into a second type group;
and if the preset text matched with the text does not exist in the preset text library and the text comprises a preset time symbol, dividing the text into time type groups.
3. The video processing method according to claim 2, wherein the target video is a video in which antagonistic game contents are recorded, the first type group includes a team name group and a game stage group, the preset text includes a team name and a game stage, type matching is performed on the text in a preset text library, and if there is a preset text matching the text in the preset text library, the classifying the text into the first type group includes:
performing type matching on the text in a preset text library;
if the name of the team matched with the text exists in the preset text base, dividing the text into a team name group;
if a match stage matched with the text exists in the preset text library, dividing the text into a match stage group;
the preset text comprises a team name and a competition stage, the second type group comprises a match score group, and if the preset text matched with the text does not exist in the preset text library and the text consists of numbers, the classifying the text into the second type group comprises the following step:
and if the team names and the competition stages matched with the text do not exist in the preset text library, and the text consists of numbers, dividing the text into the match score group.
4. The video processing method of claim 3, wherein the preset text includes a team name and a game stage, the preset time symbols include a first preset time symbol and a second preset time symbol, and the time type groups include a first time type group and a second time type group;
if the preset text matched with the text does not exist in the preset text library and the text comprises a preset time symbol, dividing the text into time type groups, including:
if the team names and the competition stages matched with the texts do not exist in the preset text library and the texts comprise first preset time symbols, dividing the texts into a first time type group;
and if the team names and the competition stages matched with the texts do not exist in the preset text library and the texts comprise second preset time symbols, dividing the texts into a second time type group.
5. The video processing method according to claim 1, wherein the target video is a video in which antagonistic game contents are recorded, the text set includes a team name group, a competition stage group, a match score group, a first time type group, and a second time type group, and the text type includes a competition stage type, a first time type, a second time type, a host and guest name type, and a match score type;
the type classification of the text based on the text position and the text set, and the determination of the text type of the text comprise:
determining the text type of the text in the competition stage group as a competition stage type;
determining the text type of the text in the first time type group as a first time type;
determining the text type of the text in the second time type group as a second time type;
according to the text position of the text in the name group of the participating team, carrying out host and guest judgment on the text in the name group of the participating team, and determining the host and guest name type of the text in the name group of the participating team;
and judging the score type of the text in the match score group according to the text position of the text in the match score group, and determining the match score type of the text in the match score group.
6. The video processing method of claim 5, wherein the performing score type judgment on the text in the match score group according to the text position of the text in the match score group and determining the match score type of the text in the match score group comprises:
determining the number of texts in the first time type group, the second time type group and the match score group;
when the number of texts in the match score group is 3 and the sum of the numbers of texts in the first time type group and the second time type group is 1, determining the text types of the texts in the first time type group and the second time type group as a second time type, and performing score type judgment on the texts in the match score group according to the text positions to determine the match score type of the texts in the match score group;
when the number of texts in the first time type group is 2, determining the relative position relationship among the texts in the first time type group according to the text position, determining the text type of the text belonging to the first relative position relationship as a first time type, and determining the text type of the text belonging to the second relative position relationship as a second time type;
when the number of the texts in the second time type group is 2, determining the relative position relationship between the texts in the second time type group according to the text position, determining the text type of the text belonging to the first relative position relationship as a first time type, and determining the text type of the text belonging to the second relative position relationship as a second time type.
7. The video processing method of claim 6, wherein the performing score type judgment on the text in the match score group according to the text position and determining the match score type of the text in the match score group comprises:
when the match score group has a text belonging to a preset text range, determining the text type of the text belonging to the preset text range as a second time type;
when the match score group has a plurality of texts belonging to a preset text range, counting, according to the text positions of the texts in the second time type group, the number of texts in the match score group that have a preset positional relation with the texts in the second time type group;
if the number of texts in the match score group having the preset positional relation with the texts in the second time type group is 1, determining the text type of that text as a second time type, and determining the text type of the texts in the match score group that do not have the preset positional relation with the texts in the second time type group as a match score type;
if the number of texts in the match score group having the preset positional relation with the texts in the second time type group is 2, calculating the relative distances of those texts from the texts in the second time type group on a preset coordinate axis;
and if the relative distance is greater than a preset distance threshold value, determining the text type of the text with the minimum relative distance in the match score group as a second time type, and determining the text type of the text with the non-minimum relative distance in the match score group as the match score type.
8. The video processing method of claim 1, wherein the performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text comprises:
extracting image features of the target video to obtain the image features of the target video;
text region detection is carried out based on the image features, and text region features in the target video are obtained;
performing text recognition based on the image features to obtain text features in the target video;
performing region trimming processing on the text region characteristics to obtain processed text region characteristics;
performing text region prediction based on the processed text region characteristics, and determining a text region appearing in the target video;
performing text prediction based on the processed text characteristics, and determining a text appearing in the target video;
and determining the text position of the text according to the text area and the text.
9. The video processing method according to claim 8, wherein the performing text region detection based on the image features to obtain text region features in the target video comprises:
performing multi-size feature extraction by using a feature extraction layer according to the image features to obtain a plurality of image features with different sizes;
performing feature fusion processing on the plurality of image features with different sizes by adopting a multi-level fusion layer to obtain shared fusion features;
and determining text region characteristics in the target video according to the shared fusion characteristics by adopting a multi-channel output layer.
10. The video processing method of claim 8, wherein the performing text prediction based on the processed text features to determine the text appearing in the target video comprises:
performing multi-size feature extraction by using a feature extraction layer according to the image features to obtain a plurality of image features with different sizes;
performing feature fusion processing on the plurality of image features with different sizes by adopting a multi-level fusion layer to obtain shared fusion features;
and determining texts appearing in the target video according to the shared fusion characteristics by adopting a multi-channel output layer.
11. The video processing method of claim 8, wherein the performing text recognition based on the image features to obtain text features in the target video comprises:
extracting high-dimensional features based on the image features to obtain high-dimensional image features;
extracting text time sequence characteristics according to the high-dimensional image characteristics;
and determining the text characteristics appearing in the target video according to the text time sequence characteristics.
12. The video processing method according to claim 1, wherein the target video is a video in which contents of antagonistic games are recorded, and the video detail information includes a game detail information table and a game trend graph;
the generating of the video detail information according to the text and the text type comprises:
generating a game detail information table and a game trend chart according to the text and the text type;
the displaying the video detail information comprises:
displaying a match detail page;
and displaying the game detail information table and the game trend chart on a game detail page.
13. A video processing apparatus, comprising:
the acquisition module is used for acquiring a target video;
the recognition module is used for performing text recognition on the target video to obtain a text appearing in the target video and a text position of the text;
the collection module is used for grouping the texts to obtain a text collection;
the classification module is used for classifying the types of the texts based on the text positions and the text set and determining the text types of the texts;
the generating module is used for generating video detail information according to the text and the text type;
and the display module is used for displaying the video detail information.
14. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps of the video processing method according to any one of claims 1 to 12.
15. A storage medium storing instructions adapted to be loaded by a processor to perform the steps of the video processing method according to any of claims 1 to 12.
CN202010217386.7A 2020-03-25 2020-03-25 Video processing method and device, electronic equipment and storage medium Active CN111405360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010217386.7A CN111405360B (en) 2020-03-25 2020-03-25 Video processing method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111405360A true CN111405360A (en) 2020-07-10
CN111405360B CN111405360B (en) 2021-09-28




Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090060342A1 (en) * 2007-08-29 2009-03-05 Yueh-Hsuan Chiang Method and Apparatus for Determining Highlight Segments of Sport Video
US20160059106A1 (en) * 2014-08-27 2016-03-03 Fujitsu Limited Information processing method and device
US20160110877A1 (en) * 2014-10-15 2016-04-21 Comcast Cable Communications, Llc Identifying Content of Interest
US20170278479A1 (en) * 2016-03-24 2017-09-28 JVC Kenwood Corporation Display device, display method, and non-transitory storage medium
CN107707931A (en) * 2016-08-08 2018-02-16 阿里巴巴集团控股有限公司 Generated according to video data and explain data, data synthesis method and device, electronic equipment
CN107146620A (en) * 2017-03-22 2017-09-08 北京晓数聚传媒科技有限公司 A kind of game situation data exhibiting process and device
CN109409357A (en) * 2018-08-29 2019-03-01 无锡天脉聚源传媒科技有限公司 A kind of method and device of numeric area in intelligent recognition video
CN109344292A (en) * 2018-09-28 2019-02-15 百度在线网络技术(北京)有限公司 Generation method, device, server and the storage medium of race score segment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
X. Liu et al.: "FOTS: Fast Oriented Text Spotting with a Unified Network", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220147758A1 (en) * 2020-11-10 2022-05-12 Fujitsu Limited Computer-readable recording medium storing inference program and method of inferring
WO2022134700A1 (en) * 2020-12-22 2022-06-30 上海幻电信息科技有限公司 Method and apparatus for identifying target object
CN112633422A (en) * 2021-03-10 2021-04-09 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN112633422B (en) * 2021-03-10 2021-06-22 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN114913534A (en) * 2022-07-19 2022-08-16 北京嘉沐安科技有限公司 Block chain-based network security abnormal image big data detection method and system

Also Published As

Publication number Publication date
CN111405360B (en) 2021-09-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant