CN112163577A - Character recognition method and device in game picture, electronic equipment and storage medium - Google Patents

Character recognition method and device in game picture, electronic equipment and storage medium

Info

Publication number
CN112163577A
Authority
CN
China
Prior art keywords
information
text region
text
target
game picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011003615.1A
Other languages
Chinese (zh)
Other versions
CN112163577B (en)
Inventor
晋博
李秋实
孙智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Boguan Information Technology Co Ltd
Original Assignee
Guangzhou Boguan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Boguan Information Technology Co Ltd filed Critical Guangzhou Boguan Information Technology Co Ltd
Priority to CN202011003615.1A priority Critical patent/CN112163577B/en
Publication of CN112163577A publication Critical patent/CN112163577A/en
Application granted granted Critical
Publication of CN112163577B publication Critical patent/CN112163577B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment discloses a character recognition method and device in a game picture, an electronic device, and a storage medium. Feature information is extracted from a game picture to be detected, the game picture to be detected being a game picture of a target virtual scene; text region detection is performed on the game picture to be detected based on the feature information to obtain candidate text regions; a target text region in which target information is located is identified from the candidate text regions, the target information being the character information that needs to be extracted from the target virtual scene; and the text in the target text region is recognized to obtain the target information of the game picture to be detected. In this scheme, text region detection and character recognition are carried out in two stages, and the text region detection is not limited to the text region where the target information is located, so the text region detection scheme can be shared by different virtual scenes. This improves the universality of the character recognition method across virtual scenes, while the per-scene character recognition scheme ensures recognition accuracy.

Description

Character recognition method and device in game picture, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for recognizing characters in a game picture, electronic equipment and a storage medium.
Background
The character detection and identification technology is a technology for accurately positioning and identifying characters in videos or images. In many business scenarios, some key information in the video or image may be needed for the business to be implemented.
Taking game live broadcast as an example, a massive number of live game videos are generated every day. Accurately acquiring key character information from these videos or images can provide more key information for managing live content and the like, thereby enriching the user experience.
However, the key information that needs to be acquired may differ from game to game, and game live broadcast scenes are complex. In the related art for detecting and recognizing text content in game live broadcasts, a separate character detection scheme is often designed for each game. In another related art, a general character detection model based on deep learning is designed for character recognition. However, the above solutions have the following disadvantages:
1. A separately designed character detection scheme has poor universality: a different model needs to be trained for character recognition in each game, training these models requires a large number of samples from each game, and when the number of games is large, model training consumes a great deal of time and resources.
2. For separately designed character detection schemes, when many games need to be recognized, deploying and running the models occupies a large amount of the server's storage and computing resources, which is not conducive to improving character recognition efficiency.
3. For the scheme using a general character detection model, when character detection is performed in complex game scenes, the complex network structure prevents the algorithm from running fast enough to meet the game operation requirement.
Disclosure of Invention
The embodiment of the application provides a character recognition method and device in a game picture, electronic equipment and a storage medium, which can improve the universality of character recognition on various virtual scenes and are beneficial to improving the character recognition efficiency and accuracy.
The embodiment of the application provides a character recognition method, which comprises the following steps:
extracting characteristic information of a game picture to be detected, wherein the game picture to be detected is a game picture of a target virtual scene;
performing text region detection on the game picture to be detected based on the characteristic information to obtain a candidate text region;
identifying a target text region where target information is located from the candidate text regions, wherein the target information is character information needing to be extracted from the target virtual scene;
and identifying the text in the target text area to obtain the target information of the game picture to be detected.
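For illustration only, the two-stage flow above can be sketched as follows, assuming a generic region detector and a per-scene recognizer are available as callables (all function names here are hypothetical, not part of the claimed method):

    # Illustrative sketch of the two-stage flow; names are hypothetical.
    def recognize_target_information(game_frame, detect_text_regions, is_target_region, read_text):
        candidate_regions = detect_text_regions(game_frame)          # generic text region detection
        target_regions = [region for region in candidate_regions
                          if is_target_region(game_frame, region)]   # per-scene selection of target regions
        return [read_text(game_frame, region) for region in target_regions]  # per-scene text recognition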
Correspondingly, the embodiment of the present application further provides a text recognition apparatus, including:
the device comprises a characteristic information extraction unit, a detection unit and a control unit, wherein the characteristic information extraction unit is used for extracting characteristic information of a game picture to be detected, and the game picture to be detected is a game picture of a target virtual scene;
a text region detection unit, configured to perform text region detection on the game picture to be detected based on the feature information to obtain a candidate text region;
a text region selection unit, configured to identify a target text region where target information is located from the candidate text regions, where the target information is text information that needs to be extracted from the target virtual scene;
and the information identification unit is used for identifying the text in the target text area to obtain the target information of the game picture to be detected.
In some optional embodiments, the feature information extraction unit is configured to extract feature information from the game picture to be detected through a general text region detection model;
a text region detection unit, configured to perform text region detection on the game picture to be detected based on the feature information through a general text region detection model, so as to obtain a candidate text region;
a text region selection unit, configured to identify a target text region where target information is located from the candidate text regions based on a character recognition model corresponding to the target virtual scene;
and the information identification unit is used for identifying the text in the target text area based on the character identification model corresponding to the target virtual scene to obtain the target information of the game picture to be detected.
In some optional embodiments, the universal text region detection model is trained based on sample game pictures of at least two virtual scenes, wherein the at least two virtual scenes include the target virtual scene.
In some optional embodiments, the generic text region detection model includes at least two connected feature extraction layers, and corresponding feature merging layers, and the feature information extraction unit is configured to:
extracting feature graphs of multiple scales from the game picture to be detected through each feature extraction layer;
and fusing the characteristic graphs of the multiple scales through the characteristic merging layer to obtain the characteristic information of the game picture to be detected.
In some optional embodiments, the number of the feature merging layers is one layer less than the number of the feature extraction layers, and the feature information extraction unit, through the feature merging layers, is configured to:
taking the feature map extracted by the last feature extraction layer as a feature map to be merged, and merging the feature map to be merged with the feature map extracted by the adjacent upper feature extraction layer after carrying out scale transformation on the feature map to be merged by the feature merging layer to obtain a merged feature map;
performing convolution operation on the merged feature map to obtain a feature map after convolution;
and taking the feature graph after convolution as a new feature graph to be merged, returning to execute the step of merging the feature graph after the feature graph to be merged is subjected to scale transformation through the feature merging layer and the feature graph extracted by the adjacent upper feature extraction layer until the feature graphs of all scales are merged, and taking the finally obtained feature graph after convolution as the feature information of the game picture to be detected.
In some optional embodiments, the apparatus further comprises: a text region detection module training unit to:
obtaining sample game pictures from a plurality of virtual scenes, wherein the virtual scenes comprise the target virtual scene, the sample game pictures comprise texts formed by characters in various styles, and the sample game pictures are marked with position information of an actual text area where the texts are located;
extracting characteristic information from the sample game picture based on a general text area detection model to be trained;
performing text region detection on the sample game picture based on the characteristic information through the general text region detection model to obtain information of a predicted text region of the sample game picture;
and adjusting parameters of the universal text region detection model based on the position information of the actual text region and the information of the predicted text region.
In some alternative embodiments, the sample game screens include a first sample game screen and a second sample game screen, wherein,
the first sample game picture is from a plurality of virtual scenes, and position information of an actual text area where a text is located is marked in the first sample game picture; the second sample game picture is a sample game picture obtained by generating a plurality of sections of texts by adopting a plurality of preset character styles and setting at least one section of generated texts in the game picture derived from the virtual scene.
In some optional embodiments, the information of the predicted text region includes: an offset and an offset angle of the predicted text region;
a text region detection module training unit to:
calculating a first loss for measuring similarity of image contents of the actual text region and the predicted text region based on the image contents of the actual text region and the predicted text region in the sample game picture;
calculating the shape loss corresponding to the predicted text region based on the position information of the actual text region and the offset of the predicted text region;
calculating the angle loss corresponding to the predicted text region based on the position information of the actual text region and the offset angle of the predicted text region;
obtaining a second loss of the predicted text region based on the shape loss and the angle loss;
adjusting parameters of the generic text region detection model based on the first and second losses.
In some optional embodiments, the generated text set in the second sample game screen is generated based on at least one text style selected from a plurality of preset text styles and a text model of information required by each virtual scene, wherein the information required by each virtual scene is text information required to be extracted from each virtual scene.
In some optional embodiments, the apparatus further comprises: a character recognition model training unit for:
acquiring a predicted text area of the sample game picture;
based on the predicted text region, acquiring a training sample of a character recognition model corresponding to the target virtual scene;
predicting the probability that the predicted text region of the training sample comprises the target information through a character recognition model corresponding to the target virtual scene to obtain a classification result of each predicted text region;
and adjusting parameters of a character recognition model corresponding to the target virtual scene based on the classification result.
In some optional embodiments, the training sample comprises type identification information of a predicted text region, the type identification information being used to identify whether the target information is included in the predicted text region;
a character recognition model training unit for:
and adjusting parameters of a character recognition model corresponding to the target virtual scene based on the classification result and the type identification information.
In some optional embodiments, the text recognition model training unit is configured to:
determining at least two first predicted text regions from the predicted text regions;
and synthesizing the at least two first prediction text regions to obtain a second prediction text region, and taking the second prediction text region as a training sample of a character recognition model of the target virtual scene.
In some optional embodiments, the text region selection unit is to:
predicting the probability that the candidate text region comprises target information based on a character recognition model corresponding to the target virtual scene;
and selecting the candidate text region with the probability not lower than a preset probability threshold value from the candidate text regions as a target text region.
In some optional embodiments, the apparatus further comprises: the first game picture acquisition unit is used for extracting the game picture to be detected from the video of the target virtual scene before extracting the characteristic information of the game picture to be detected;
the device also includes: and the video classification information acquisition unit is used for identifying the text in the target text region to obtain the target information of the game picture to be detected, and then determining the video classification information of the video of the target virtual scene based on the preset corresponding relation between the target information of the target virtual scene and the video classification information and the target information of the game picture to be detected.
In some optional embodiments, the apparatus further comprises: the second game picture acquisition unit is used for extracting the game picture to be detected from the video of the target virtual scene before extracting the characteristic information of the game picture to be detected;
the device also includes: the video recommending unit is used for acquiring a video recommending strategy of the target virtual scene based on the target information of the game picture to be detected after the text in the target text area is identified and the target information of the game picture to be detected is obtained;
and recommending the video of the target virtual scene based on the recommendation strategy.
Correspondingly, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps in any of the character recognition methods provided in the embodiments of the present application.
In addition, a storage medium is further provided, on which a computer program is stored, where the computer program is executed by a processor to implement the steps in any one of the character recognition methods provided in the embodiments of the present application.
By adopting the embodiment of the application, the characteristic information can be extracted from the game picture to be detected, wherein the game picture to be detected is the game picture of the target virtual scene; performing text region detection on the game picture to be detected based on the characteristic information to obtain a candidate text region; identifying a target text region where target information is located from the candidate text regions, wherein the target information is character information needing to be extracted from a target virtual scene; and identifying the text in the target text area to obtain the target information of the game picture to be detected. The scheme adopts a two-segment character recognition method, firstly, candidate text regions are recognized from a game picture, a target text region is selected, then character recognition is carried out based on the target text region, the text region recognition and the character recognition are carried out in two stages, and the text region recognition is not limited to the text region where the target information of a target virtual scene is located, so that the recognition scheme of the text region can be shared by different virtual scenes, and the universality of the character recognition method of the embodiment under the virtual scene is improved.
Furthermore, candidate text regions in game pictures of different virtual scenes can be identified with the same universal text region detection model, which reduces the number of sample resources required to train it. When the number of games is large, sharing the universal text region detection model across multiple virtual scenes effectively controls the storage and computing resources that character recognition in game pictures occupies on the server, and dividing text region detection and character recognition into two stages effectively improves the running speed of the algorithm.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1a is a scene schematic diagram of a method for recognizing characters in a game screen according to an embodiment of the present application;
FIG. 1b is a schematic flow chart illustrating a method for recognizing characters in a game screen according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a generic text region detection model according to an embodiment of the present application;
FIG. 3 is a diagram of a second sample game screen provided by an embodiment of the present application;
FIG. 4 is a graph of loss for a generic text region detection model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a character recognition device in a game screen according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides a character recognition method and device in a game picture, electronic equipment and a computer readable storage medium. Specifically, the character recognition method according to the embodiment of the present application may be executed by an electronic device, where the electronic device may be a terminal or a server.
The terminal may be a mobile terminal device such as a mobile phone, a tablet computer, a notebook computer, or a wearable smart device, or a fixed terminal device such as a smart television or a personal computer (PC).
The terminal may include a client, such as a video client or a live broadcast client. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms, but is not limited thereto.
For example, referring to fig. 1a, taking the character recognition method in the game screen as an example, the electronic device may extract feature information from a game screen to be detected, where the game screen to be detected is a game screen of a target virtual scene; performing text region detection on the game picture to be detected based on the characteristic information to obtain a candidate text region; identifying a target text region where target information is located from the candidate text regions, wherein the target information is character information needing to be extracted from the target virtual scene; and identifying the text in the target text area to obtain the target information of the game picture to be detected.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment of the application provides a character recognition method in a game picture, which can be executed by a terminal or a server, or can be executed by the terminal and the server together; the embodiment of the present application is described by taking an example in which a character recognition method in a game screen is executed by a terminal.
As shown in fig. 1b, the specific flow of the character recognition method in the game screen may be as follows:
101. extracting characteristic information of a game picture to be detected, wherein the game picture to be detected is a game picture of a target virtual scene;
the virtual scene described in this embodiment may be any type of virtual scene, including but not limited to a game scene, an animation scene, and the like, and the game picture to be detected may be a picture image in a game video, or a picture image in a live game video, and the like, which is not limited in this embodiment.
Before step 101, the method may further include: and acquiring a game picture to be detected from the target virtual scene.
In this embodiment, image extraction may be performed from a video of a target virtual scene, so as to obtain a game picture to be detected. The number of the game frames to be detected may be one or more, which is not limited in this embodiment. Optionally, the step of "acquiring the game picture to be detected from the target virtual scene" may include: and extracting the game picture to be detected from the video of the target virtual scene.
Specifically, the video of the target virtual scene may be converted into an image sequence, and the image may be extracted from the image sequence according to a preset image extraction rule to serve as the game picture to be detected.
Wherein the image extraction rule may include: and extracting images in the image sequence according to a preset frame number interval, or determining an image subsequence corresponding to a preset extraction time period in the image sequence, and extracting the images from the image subsequence. The preset extraction time period can be set according to the target virtual scene to which the game picture to be detected belongs.
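As a minimal sketch of such extraction rules (assuming OpenCV is used for decoding; the interval and time-window values are illustrative only):

    import cv2

    def sample_frames(video_path, frame_interval=30, time_window=None):
        # Extract frames either every `frame_interval` frames, or only within an
        # optional (start_seconds, end_seconds) window; values are illustrative.
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            seconds = index / fps
            in_window = time_window is None or (time_window[0] <= seconds <= time_window[1])
            if in_window and index % frame_interval == 0:
                frames.append(frame)
            index += 1
        cap.release()
        return frames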
In another example, the video of the target virtual scene may be a live video, which may be a video in a live state or a live video saved after the live state is finished. Optionally, the step of "extracting the game picture to be detected from the video of the target virtual scene" may include:
in the live broadcast process of the target virtual scene, acquiring a live broadcast video of the target virtual scene;
and acquiring a video image from the live video as a game picture to be detected.
Optionally, the step of extracting feature information from the game picture to be detected may include:
and extracting characteristic information of the game picture to be detected through the universal text area detection model.
In this embodiment, the function of the universal text region detection model includes detecting regions that may contain text in an image as candidate text regions. The universal text region detection model is obtained by training on sample game pictures from at least two virtual scenes, so it can perform accurate text region recognition on images of multiple virtual scenes. Therefore, this embodiment can realize text region detection for images of these virtual scenes with one universal text region detection model, and the character recognition scheme of this embodiment has a degree of universality across virtual scenes.
For example, if the generic text region detection model is trained based on sample game pictures from games A, B and C, it can accurately identify candidate text regions in pictures from any of these games; that is, feature information extraction and candidate text region detection for a screen image of game A, B or C can all use the same generic detection model.
In this embodiment, the general text region detection model may include a feature extraction module, and step 101 may specifically include extracting feature information from the game picture to be detected by the feature extraction module. Wherein the feature information may be a feature map.
In one example, the feature extraction module may extract feature maps of multiple scales from the game picture to be detected, and fuse the feature maps of the multiple scales to obtain feature information of the game picture to be detected. Therefore, the characteristic information can comprise information in characteristic diagrams of various scales, corresponding text regions can be accurately detected for text blocks with different sizes, and the problem of severe text line scale conversion is solved.
Optionally, the feature extraction module in this embodiment may include at least two connected feature extraction layers and corresponding feature merging layers, where the shallow feature map extracted by the feature extraction layer in this embodiment may be used to predict small text lines, and the deep feature map may be used to predict large text lines.
The general text region detection model of this embodiment may be implemented with a lightweight network; for example, its feature extraction module may be implemented with the lightweight network MobileNetV2. This keeps the general text region detection model lightweight without reducing detection accuracy and facilitates its deployment, for example on a server.
Optionally, the step of "extracting feature information from the game picture to be detected through the universal text area detection model" may include:
extracting feature graphs of multiple scales from a game picture to be detected through each feature extraction layer;
and fusing the characteristic graphs of various scales through the characteristic merging layer to obtain the characteristic information of the game picture to be detected.
In this embodiment, the feature extraction layers may be connected in series, and each feature extraction layer may perform corresponding feature extraction operations, such as convolution operation, on the feature map extracted by the previous feature extraction layer to obtain a new feature map. The scales of the feature maps from the first feature extraction layer to the last feature extraction layer may be sequentially decreased or sequentially increased, and the change rule of the scales of the feature maps is determined by the design of the universal text region detection model, which is not limited in this embodiment.
In this embodiment, the number of the feature merging layers is not limited, and may be set according to actual needs, and in one example, the step "fusing feature maps of multiple scales through the feature merging layers to obtain feature information of the game picture to be detected" may include:
through the feature merging layer, carrying out scale conversion on feature maps of various scales of a game picture to be detected, and converting all the feature maps into preset scales;
and fusing the converted characteristic graphs to obtain the characteristic information of the game picture to be detected.
The preset scale can be set according to actual needs, for example, set to 1/4 or the like of the game picture to be detected. Optionally, fusing the converted feature maps may include splicing the converted feature maps.
In another example, the feature maps can be merged from the top feature map of the feature extraction module (the feature map output by the last layer of feature extraction network) according to a corresponding rule by using a U-net method. Optionally, the number of the feature merging layers is one layer less than that of the feature extraction layers, and feature graphs of multiple scales are fused through the feature merging layers to obtain feature information of the game picture to be detected, where the feature information includes:
taking the feature map extracted by the last feature extraction layer as a feature map to be merged, and merging the feature map to be merged with the feature map extracted by the adjacent upper feature extraction layer after carrying out scale transformation on the feature map to be merged by the feature merging layer to obtain a merged feature map;
performing convolution operation on the combined feature map to obtain a feature map after convolution;
and taking the feature graph after convolution as a new feature graph to be merged, returning to the step of executing scale transformation on the feature graph to be merged through the feature merging layer and merging the feature graph with the feature graph extracted by the adjacent upper feature extraction layer until merging of all the scale feature graphs is completed, and taking the finally obtained feature graph after convolution as the feature information of the game picture to be detected.
After a new feature map to be merged is obtained each time, the feature merging layer for merging the feature map to be merged is not the same as the feature merging layer used for merging the features in the previous round.
In this embodiment, performing scale transformation on the feature graph to be merged through the feature merging layer may include: and transforming the dimension of the feature map to be merged into the dimension of the feature map extracted by the adjacent upper feature extraction layer through the feature merging layer.
For example, if the scales of the feature maps from the first feature extraction layer to the last feature extraction layer are sequentially reduced, which are 1/2, 1/4, 1/8 and 1/16 of the game picture to be detected, the feature map to be merged is subjected to scale transformation by the feature merging layer, that is, the scale of the feature map to be merged is enlarged by 2 times.
102. Performing text region detection on the game picture to be detected based on the characteristic information to obtain a candidate text region;
the text area in this embodiment may be an area containing any text, for example, an area where a control with text is displayed, an area where an icon with text is displayed, an area where a text line or a text column is located, or the like.
Wherein, step 102 may specifically include: and carrying out text region detection on the game picture to be detected based on the characteristic information through a general text region detection model to obtain a candidate text region.
In this embodiment, the universal text area detection model further includes an output layer configured to implement step 102. The information output by the output layer may include a score map of the candidate text regions and information of the candidate text regions, the latter consisting of two parts, an AABB bounding box and a rotation angle, where the AABB represents four offsets of the candidate text region (upper, lower, left and right) and the rotation angle is the offset angle of the candidate text region.
The following describes an example of the text region detection process of the universal text region detection model in this embodiment with reference to the network structure shown in fig. 2. Referring to fig. 2, the universal text region detection model of this embodiment includes a feature extraction network (Feature extractor) containing four feature extraction layers (Conv stages), feature merging layers (Feature merging), and an output layer. Optionally, the number of feature extraction layers is 4 and the number of feature merging layers is 3. Each feature merging layer comprises a scale conversion layer (an unpooling layer), a merging layer (a concat layer), and convolution layers. Optionally, from the first feature extraction layer (Conv stage 1) to the last feature extraction layer (Conv stage 4), the scales of the feature maps are 1/2, 1/4, 1/8 and 1/16 of the game picture to be detected, respectively.
Referring to fig. 2, Image in fig. 2 represents a game picture to be detected, and the game picture to be detected passes through 4 feature extraction layers to obtain feature maps of 4 scales, namely f1-f4 in fig. 2.
After or during feature extraction, merging may be performed starting from the last feature map f1 through the feature merging layers. First, the last feature map f1 extracted by the feature extraction network is taken as the feature map h1 to be merged and input into the last feature merging layer, where the scale conversion layer (unpooling layer) enlarges the scale of h1 by a factor of 2; h1 is then merged with the feature map f2 of the previous layer to obtain a merged feature map, and convolution operations with kernel sizes of 1x1 and 3x3 are performed on the merged feature map to obtain the convolved feature map h2. The convolved feature map h2 is then taken as the new feature map to be merged and input into the previous feature merging layer, where it is scale-transformed and merged with the feature map f3 of the previous layer, and the merged feature map is again convolved with 1x1 and 3x3 kernels to obtain the convolved feature map h3. A similar merging operation is performed on h3 and f4 to obtain the final convolved feature map h4, which is the feature information of the game picture to be detected.
The convolution layers in the feature merging layer may be composed of convolution kernels of sizes 1x1 and 3x3 as described above, or of convolution kernels of other sizes, which is not limited in this embodiment. From the last feature merging layer to the top feature merging layer, the number of convolution kernels decreases layer by layer, in the order 128, 64 and 32. The output layer in this embodiment may include 32 convolution kernels of size 3x3.
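A minimal PyTorch-style sketch of one such feature merging layer is shown below; the framework choice, the use of 2x upsampling for the unpooling, and the channel parameters are assumptions for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureMergingLayer(nn.Module):
        # One merging layer: scale conversion (2x unpooling), concatenation with the
        # shallower feature map, then 1x1 and 3x3 convolutions.
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.conv1x1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)
            self.conv3x3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

        def forward(self, to_merge, shallower):
            up = F.interpolate(to_merge, scale_factor=2, mode="nearest")  # unpooling layer
            merged = torch.cat([up, shallower], dim=1)                    # concat layer
            return self.conv3x3(self.conv1x1(merged))                     # convolution layers

Three such layers applied to (f1, f2), then (h2, f3), then (h3, f4), with output channel counts of 128, 64 and 32 respectively, would produce the final feature map h4 described above.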
In this embodiment, the feature maps of the multiple scales are merged according to the following formulas:

h_1 = f_1
h_i = conv_3x3( conv_1x1( [ unpool(h_(i-1)); f_i ] ) ),  i = 2, 3, 4

where f_i denotes the feature map output by the i-th feature extraction layer, h_i denotes the i-th feature map to be merged, [ · ; · ] denotes concatenation along the channel dimension, and unpool(·) denotes the 2x scale conversion performed by the unpooling layer.
The feature information of this embodiment may be processed by the output layer to obtain a text region score map and text shape information of the candidate text regions. The text region score map may be used to determine the probabilities that a candidate text region belongs to the foreground or background. The text shape information consists of two parts, an AABB bounding box and a rotation angle, where the AABB represents four offsets of the text region (upper, lower, left and right) and is used to determine the size of the text region, and the rotation angle is the offset angle of the text region.
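Continuing the PyTorch-style sketch above, such an output layer could be written as follows; the channel layout and the mapping used for the angle output are assumptions for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OutputLayer(nn.Module):
        # Produces a 1-channel text-region score map, four AABB offsets
        # (upper, lower, left, right) and a rotation angle per spatial position.
        def __init__(self, in_channels=32):
            super().__init__()
            self.score = nn.Conv2d(in_channels, 1, kernel_size=1)
            self.offsets = nn.Conv2d(in_channels, 4, kernel_size=1)
            self.angle = nn.Conv2d(in_channels, 1, kernel_size=1)

        def forward(self, feature_map):
            score_map = torch.sigmoid(self.score(feature_map))     # foreground/background probability
            aabb = F.relu(self.offsets(feature_map))                # non-negative offsets
            angle = (torch.sigmoid(self.angle(feature_map)) - 0.5) * 3.1415926 / 2  # assumed angle range
            return score_map, aabb, angle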
103. Identifying a target text region where target information is located from the candidate text regions, wherein the target information is character information needing to be extracted from the target virtual scene;
optionally, the step "identifying the target text region where the target information is located from the candidate text regions" may include: and identifying a target text region where target information is located from the candidate text regions based on the character identification model corresponding to the target virtual scene.
104. And identifying the text in the target text area to obtain the target information of the game picture to be detected.
Optionally, the step of "recognizing the text in the target text region to obtain the target information of the game picture to be detected" may include: and identifying the text in the target text area based on the character identification model corresponding to the target virtual scene to obtain the target information of the game picture to be detected.
In this embodiment, each virtual scene has a corresponding (dedicated) character recognition model, and the model is obtained by training with a training sample corresponding to the virtual scene. The character recognition model can recognize a target text region where target information required by a target virtual scene is located from the candidate text regions, and then recognize texts in the target text region to obtain the required target information.
In this embodiment, the character recognition model may also be designed to be lightweight, for example, the character recognition model may include 3 convolutional layers, 2 pooling layers, and 2 full-link layers.
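A sketch of such a lightweight per-scene model is given below; the channel widths, the 32x128 input crop size, and its use here as a classifier over candidate regions are assumptions for illustration:

    import torch.nn as nn

    class SceneCharacterModel(nn.Module):
        # 3 convolutional layers, 2 pooling layers and 2 fully connected layers,
        # used here to score whether a cropped text region contains the target information.
        def __init__(self, num_classes=2):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 8 * 32, 128), nn.ReLU(),
                nn.Linear(128, num_classes),
            )

        def forward(self, region_crop):      # region_crop: (N, 3, 32, 128)
            return self.classifier(self.features(region_crop))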
In this embodiment, the generic text region detection model is trained in advance to obtain the generic text region detection model used in step 101.
Optionally, before step 101, the method may further include:
obtaining sample game pictures from a plurality of virtual scenes, wherein the virtual scenes comprise the target virtual scene, the sample game pictures comprise texts formed by characters in various styles, and the sample game pictures are marked with position information of an actual text area where the texts are located;
extracting characteristic information from the sample game picture based on a general text area detection model to be trained;
performing text region detection on the sample game picture based on the characteristic information through the general text region detection model to obtain information of a predicted text region of the sample game picture;
and adjusting parameters of the universal text region detection model based on the position information of the actual text region and the information of the predicted text region.
In this embodiment, the specific extraction process of the feature information of the sample game picture by the universal text area detection model may refer to the aforementioned extraction process of the feature information of the game picture to be detected.
For example, the step "extracting feature information from a sample game screen based on a universal text area detection model to be trained" may specifically include:
extracting feature graphs of multiple scales from a sample game picture through each feature extraction layer;
taking the feature map extracted by the last feature extraction layer as a feature map to be merged, and merging the feature map to be merged with the feature map extracted by the adjacent upper feature extraction layer after carrying out scale transformation on the feature map to be merged by the feature merging layer to obtain a merged feature map;
performing convolution operation on the combined feature map to obtain a feature map after convolution;
and taking the feature graph after convolution as a new feature graph to be merged, returning to the step of executing scale transformation on the feature graph to be merged through the feature merging layer and merging the feature graph with the feature graph extracted by the adjacent upper-layer feature extraction layer until merging of the feature graphs of all scales is completed, and taking the finally obtained feature graph after convolution as the feature information of the sample game picture.
As is known, model training requires a large amount of labeled text data. To improve efficiency, this embodiment produces the required sample game pictures by mixing a small amount of manually labeled data with a large amount of generated data. Meanwhile, to improve the generalization capability of the model, the generated sample game pictures may contain texts composed of characters in different styles.
In this embodiment, the sample game screen may include a first sample game screen and a second sample game screen, wherein,
the first sample game picture is from a plurality of virtual scenes, and position information of an actual text area where a text is located is marked in the first sample game picture; the second sample game picture is a sample game picture obtained by generating a plurality of sections of texts by adopting a plurality of preset character styles and setting at least one section of generated texts in the game picture derived from the virtual scene.
The generating process of the second game sample picture may specifically include:
acquiring game picture images from a plurality of videos derived from a plurality of virtual scenes, respectively;
and generating a plurality of sections of texts by adopting a plurality of preset character styles, and setting at least one section of generated text in each game picture to obtain a second sample game picture.
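A minimal sketch of this generation step is shown below, assuming Pillow (8.0 or later, for textbbox) is used to paste generated text onto real game screenshots; the fonts, colours and positions are randomized for illustration:

    import random
    from PIL import Image, ImageDraw, ImageFont

    def make_second_sample(screenshot_path, generated_texts, font_paths, output_path):
        # Paste generated text segments onto a game screenshot and return their
        # bounding boxes, which can serve as automatically labelled actual text areas.
        image = Image.open(screenshot_path).convert("RGB")
        draw = ImageDraw.Draw(image)
        boxes = []
        for text in generated_texts:
            font = ImageFont.truetype(random.choice(font_paths), size=random.randint(16, 40))
            x = random.randint(0, max(1, image.width - 200))
            y = random.randint(0, max(1, image.height - 50))
            colour = tuple(random.randint(0, 255) for _ in range(3))
            draw.text((x, y), text, font=font, fill=colour)
            boxes.append(draw.textbbox((x, y), text, font=font))
        image.save(output_path)
        return boxes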
The style of the text in this embodiment includes, but is not limited to, the language type (e.g., Chinese, English, Korean, etc.), the font, the color of the text, the size of the text, and so on.
Optionally, in this embodiment, the specific process of acquiring the first sample game screen may include: obtaining video images from multiple videos of multiple virtual scenes, and labeling the actual text areas where the texts in the video images are located to obtain first sample game pictures. The actual text area where a text is located can be labeled with a network model that has a complex network structure and excellent text area recognition capability; for example, the predicted text areas of a sample game picture are identified with a structurally complex text area recognition network, and the predicted text areas that meet a certain condition, such as a probability of belonging to the foreground greater than a preset threshold, are used as the actual text areas where the texts in the sample game picture are located.
Optionally, the annotation of the first sample game image may also be performed manually, the obtained video image may be sent to the annotation platform, the video image is manually annotated in the actual text region on the annotation platform, and the annotated video image is received by the annotation platform as the first sample game image.
Alternatively, in this embodiment, the first sample game screen may also be generated in a similar manner to the second sample game screen.
Optionally, the second sample game screen may also be labeled with an actual text region, and in view of the fact that the second sample game screen includes the generated text, the position of the text generated in the second sample game screen may be obtained based on the set position of the generated text in the second sample game screen, and then the actual text region is automatically labeled on the second sample game screen based on the position.
In this embodiment, the position information of the actual text region may include coordinates (x, y) of a certain vertex of the actual text region, as well as a height h and a width w of the actual text region, and a rotation angle of the actual text region.
Alternatively, referring to fig. 3, fig. 3 shows a schematic diagram of a generated second sample game screen, which may include text composed of fonts of different styles.
Optionally, in this embodiment, the information of the predicted text region includes: predicting the offset and offset angle of the text region; the step of adjusting parameters of the universal text region detection model based on the position information of the actual text region and the information of the predicted text region may include:
calculating a first loss for measuring similarity of image contents of the actual text region and the predicted text region based on image contents of the actual text region and the predicted text region in the sample game picture;
calculating the shape loss corresponding to the predicted text region based on the position information of the actual text region and the offset of the predicted text region;
calculating the angle loss corresponding to the predicted text region based on the position information of the actual text region and the offset angle of the predicted text region;
obtaining a second loss of the predicted text region based on the shape loss and the angle loss;
based on the first loss and the second loss, parameters of the universal text region detection model are adjusted.
The first loss in this embodiment may be understood as a score map loss of the predicted text region, and the second loss may be understood as a text region shape loss of the predicted text region.
The total loss L of the generic text region detection model of the present embodiment can be expressed as follows:
L = L_s + β_g · L_g

where L_s is the score map loss of the predicted text region, L_g is the text region shape loss (geometry loss) of the predicted text region, and β_g is the weighting factor of L_g.
In this embodiment, the score map loss may adopt a class-balanced cross-entropy loss or a dice loss, where using the dice loss can improve the convergence speed during training. In the example using the dice loss, the first loss L_s is calculated as follows:

L_s = 1 - (2 |X ∩ Y|) / (|X| + |Y|)

where X represents the predicted text regions (score map) output by the universal text region detection model and Y represents the actual text regions labeled in the sample game picture; X ∩ Y is approximated by the dot product of the image content in the predicted text regions and the actual text regions.
The second loss L_g in this embodiment, i.e. the text region shape loss, may include a shape loss L_AABB and an angle loss L_θ, where L_g is calculated as follows:

L_g = L_AABB + β_θ · L_θ

where β_θ is the coefficient of the angle loss L_θ and can be set according to actual needs, for example to 20.
In this embodiment, the shape loss L_AABB may be calculated with an IoU (Intersection over Union) loss. When the IoU loss is adopted for the AABB part of the text region regression, the calculation formula is as follows:

L_AABB = -log( |R̂ ∩ R| / |R̂ ∪ R| )

where R̂ represents the AABB geometry of the predicted text region, which can be obtained from the information of the predicted text region output by the universal text region detection model, for example based on its four offsets, and R is the geometry of the corresponding actual text region.
In this embodiment, the angle loss L_θ may be calculated as follows:

L_θ = 1 - cos(θ̂ - θ*)

where θ̂ is the predicted rotation angle of the predicted text region and θ* represents the true rotation angle of the actual text region.
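Putting the three terms together, a sketch of the loss computation is shown below, assuming a PyTorch implementation in which the score and angle tensors have shape (N, H, W) and the geometry tensors hold the four offsets as (N, 4, H, W); the averaging over text positions is simplified:

    import torch

    def detection_loss(score_pred, score_gt, geo_pred, geo_gt, angle_pred, angle_gt,
                       beta_g=1.0, beta_theta=20.0, eps=1e-6):
        # First loss: dice loss on the text-region score map.
        intersection = (score_pred * score_gt).sum()
        loss_score = 1.0 - 2.0 * intersection / (score_pred.sum() + score_gt.sum() + eps)

        # Shape loss: IoU loss over the four AABB offsets (upper, lower, left, right).
        up_p, lo_p, le_p, ri_p = geo_pred.unbind(dim=1)
        up_g, lo_g, le_g, ri_g = geo_gt.unbind(dim=1)
        area_pred = (up_p + lo_p) * (le_p + ri_p)
        area_gt = (up_g + lo_g) * (le_g + ri_g)
        inter_h = torch.min(up_p, up_g) + torch.min(lo_p, lo_g)
        inter_w = torch.min(le_p, le_g) + torch.min(ri_p, ri_g)
        inter_area = inter_h * inter_w
        loss_aabb = -torch.log((inter_area + eps) / (area_pred + area_gt - inter_area + eps))

        # Angle loss.
        loss_angle = 1.0 - torch.cos(angle_pred - angle_gt)

        # Second loss and total loss; in practice the geometry terms are averaged over
        # text positions only, a plain mean is used here for brevity.
        loss_geo = (loss_aabb + beta_theta * loss_angle).mean()
        return loss_score + beta_g * loss_geo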
In this embodiment, a training set and a test set may be further separated from the sample game screen, and the universal text region detection model may be trained based on the training set and the test set.
The training loss curve of the universal text region detection model of this embodiment is shown in fig. 4, with the number of training iterations on the horizontal axis and the loss value on the vertical axis. As can be seen from the figure, the training set loss and the test set loss both decrease gradually and converge as the number of iterations increases, which shows that the designed lightweight network structure can converge normally and achieve the expected result.
In this embodiment, the character recognition model may be trained based on the predicted text region recognized by the universal text region detection model, and optionally, when the second sample game screen is produced, the generation of the information required by each virtual scene may be performed based on the recognition effect of the character recognition model.
Further, the generated text set in the second sample game picture is generated based on at least one character style selected from a plurality of preset character styles and a text model of information required by each virtual scene, wherein the information required by each virtual scene is character information required to be extracted from each virtual scene.
For example, a text model may be set for the target information required by each virtual scene, and the text model may include features of the target information required by the virtual scene, for example, for a certain game, the target information is the remaining number of virtual bullets displayed in the upper right corner of the game page, such as "60 remaining", the text model may be set to "XX remaining", and XX is a numerical value.
Optionally, the generating process of the second sample game screen may include:
acquiring a text model of target information required by each virtual scene;
selecting at least one character style from a plurality of preset character styles for each video image of each virtual scene; and generating at least one text segment based on the selected character style and the text model of the virtual scene, and setting the generated text in the video image corresponding to the virtual scene.
In the video image of a certain virtual scene, at least one section of text generated based on the text model of the information required by the virtual scene can be set.
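For illustration, such text models could be represented as simple templates that are filled with random values when generating text for the second sample game pictures; the scene names and templates below are purely hypothetical:

    import random

    # Hypothetical text models: each virtual scene maps its required target
    # information to a template whose placeholder is filled at generation time.
    TEXT_MODELS = {
        "shooting_game": "{value} remaining",   # e.g. "60 remaining"
        "battle_game": "{value}-kill",          # e.g. "4-kill"
    }

    def generate_texts(scene, count=100):
        template = TEXT_MODELS[scene]
        return [template.format(value=random.randint(0, 99)) for _ in range(count)]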
Optionally, in this embodiment, before extracting the feature information from the game picture to be detected, the method may further include:
acquiring a prediction text area of a sample game picture;
acquiring a training sample of a character recognition model of the target virtual scene based on the predicted text region, wherein a label of the training sample comprises type identification information of the predicted text region, and the type identification information is used for identifying whether the predicted text region comprises target information required by the target virtual scene;
predicting the probability that a predicted text region of a training sample comprises target information required by a target virtual scene through a character recognition model to obtain a classification result of each predicted text region;
and adjusting parameters of the character recognition model based on the classification result and the type identification information.
The step of obtaining the predicted text region of the sample game picture may be to recognize the text region of the sample game picture through a trained universal text region detection model to obtain the predicted text region of the sample game picture.
In this embodiment, a part of the training samples may also be obtained by synthesizing the predicted text regions, for example, for the predicted text regions detected to be relatively adjacent in the training samples, new predicted text regions may be obtained by combining, for example, stitching. For example, for a text of "4-kill", two text regions are identified, which are "4" and "kill", respectively, and may be merged into one. Thus, training samples with rich varieties can be formed.
Optionally, obtaining a training sample of the character recognition model of the target virtual scene based on the predicted text region may include:
determining at least two first predicted text regions from the predicted text regions;
and synthesizing the at least two first prediction text regions to obtain a second prediction text region, and taking the second prediction text region as a training sample of a character recognition model of the target virtual scene.
Synthesizing the at least two first predicted text regions may mean splicing them together to obtain the second predicted text region, as sketched below.
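A minimal sketch of such splicing, assuming the predicted text regions are axis-aligned boxes in (x0, y0, x1, y1) form; the pixel-gap threshold is an illustrative assumption:

```python
def splice_regions(box_a, box_b, max_gap=10):
    """Splice two nearby first predicted text regions (e.g. '4' and 'kill')
    into one second predicted text region when their horizontal gap is small
    and they overlap vertically; return None when they are not adjacent."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    horizontal_gap = max(ax0, bx0) - min(ax1, bx1)
    vertical_overlap = min(ay1, by1) - max(ay0, by0)
    if horizontal_gap <= max_gap and vertical_overlap > 0:
        return (min(ax0, bx0), min(ay0, by0), max(ax1, bx1), max(ay1, by1))
    return None
```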
In this embodiment, after the predicted text regions are obtained, they may be labeled to obtain training samples: if the text in a predicted text region is the target information of the target virtual scene, its type identification information may be set to 1; if it is not, its type identification information may be set to 0. That is, the type identification information identifies whether the predicted text region contains the target information of the target virtual scene.
The step of adjusting the parameters of the character recognition model corresponding to the target virtual scene based on the classification result may include:
and adjusting parameters of a character recognition model corresponding to the target virtual scene based on the classification result and the type identification information.
The classification loss of the character recognition model corresponding to the target virtual scene can be determined from the difference between the classification result and the type identification information, and the parameters of the model can then be adjusted based on this classification loss.
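The following PyTorch-style sketch illustrates one such parameter update, under the assumption that the character recognition model scores cropped predicted text regions and that the labels are the 0/1 type identification values described above; the model itself and the batch layout are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(model: nn.Module, region_crops: torch.Tensor,
               type_labels: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """One parameter update: the model predicts, for each cropped predicted
    text region, the probability that it contains the target information,
    and the classification loss is measured against the 0/1 type
    identification labels."""
    logits = model(region_crops).squeeze(-1)          # raw scores, shape (batch,)
    loss = F.binary_cross_entropy_with_logits(logits, type_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```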
From these training samples, the character recognition model of the target virtual scene can learn the characteristic information of the required target text region, such as the shape of the text region and the number of characters it contains.
It is conceivable that, when the sample game picture is a second sample game picture, whether the text in a predicted text region is the target information of the target virtual scene can be determined from the placement position of the generated text in that picture and from the text model used to generate it. The labeling of the type identification information of the predicted text regions of second sample game pictures can therefore be performed automatically by the computer.
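A minimal sketch of this automatic labeling, under the assumption that it is done by checking the overlap (IoU) between a predicted text region and the known placement boxes of the generated texts; the 0.5 threshold is an illustrative assumption:

```python
def auto_label(pred_box, generated_texts, target_models, iou_threshold=0.5):
    """Automatically assign type identification information: label 1 when the
    predicted text region overlaps a generated text whose text model is one of
    the target-information models of the scene, otherwise label 0.
    generated_texts: list of (text_model, (x0, y0, x1, y1)) placement records."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: max(0, r[2] - r[0]) * max(0, r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0
    for text_model, box in generated_texts:
        if iou(pred_box, box) >= iou_threshold:
            return 1 if text_model in target_models else 0
    return 0
```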
Optionally, the step "identifying a target text region where target information of the target virtual scene is located from the candidate text regions based on the character recognition model of the target virtual scene" may include:
predicting the probability that the candidate text region comprises target information based on a character recognition model corresponding to the target virtual scene;
and selecting the candidate text region with the probability not lower than a preset probability threshold value from the candidate text regions as a target text region.
The preset probability threshold may be set according to actual needs, for example to 0.8.
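For illustration, a sketch of this selection step, assuming `recognition_model` is a callable that returns the predicted probability for a candidate text region:

```python
def select_target_regions(candidate_regions, recognition_model, threshold=0.8):
    """Keep candidate text regions whose predicted probability of containing
    the target information is not lower than the preset threshold."""
    return [region for region in candidate_regions
            if recognition_model(region) >= threshold]
```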
In this embodiment, management operations such as classification and recommendation may be performed on the video to which the game picture to be detected belongs, based on the target information acquired from that picture. Optionally, in this embodiment, after the text in the target text region is recognized to obtain the target information of the game picture to be detected, the method may further include:
and determining the video classification information of the video of the target virtual scene based on the preset corresponding relation between the target information and the video classification information of the target virtual scene and the target information of the game picture to be detected.
For example, suppose the target information of a certain game A takes the form "XX continuous killing", and the video classification information corresponding to "4 continuous killing" is "silver player video". If "4 continuous killing" is detected in a frame of one video of game A, the video classification information of that video may be set to "silver player".
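A minimal sketch of such a preset correspondence; the entries are illustrative and are not prescribed by this embodiment:

```python
# Preset correspondence between target information and video classification
# information; the entries below are illustrative assumptions.
CLASSIFICATION_MAP = {
    "4 continuous killing": "silver player",
    "5 continuous killing": "gold player",
}

def classify_video(target_info: str, default: str = "unclassified") -> str:
    """Look up the video classification information of the video from the
    target information detected in the game picture to be detected."""
    return CLASSIFICATION_MAP.get(target_info, default)
```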
Optionally, after the step "recognizing the text in the target text region to obtain the target information of the game picture to be detected", the method may further include:
acquiring a recommendation strategy of a video of a target virtual scene based on target information of a game picture to be detected;
and recommending the video of the target virtual scene based on the recommendation strategy.
The recommendation policy in this embodiment may include information such as recommendation time, recommendation object, recommendation frequency, recommendation platform, and the like, which is not limited in this embodiment.
Wherein, for better recommendation of videos, the target information can be set according to video recommendation requirements.
In this embodiment, the correspondence between target information and recommendation policies may be preset. After the target information of the game picture to be detected is obtained, the recommendation policy corresponding to that target information can be looked up in the preset correspondence, and the video to which the game picture belongs can then be recommended accordingly.
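A sketch of one possible way to hold such a preset correspondence; the policy fields follow the examples above (recommendation time, recommendation object, recommendation frequency, recommendation platform), but the concrete schema and entries are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecommendationPolicy:
    # Illustrative fields only; the embodiment does not fix an exact schema.
    recommend_time: str
    recommend_object: str
    recommend_frequency: int
    recommend_platform: str

# Preset correspondence between target information and recommendation policy.
POLICY_MAP = {
    "5 continuous killing": RecommendationPolicy("prime time", "FPS viewers", 3, "live portal"),
}

def policy_for(target_info: str) -> Optional[RecommendationPolicy]:
    """Return the preset recommendation policy for the detected target
    information, or None when no recommendation is configured for it."""
    return POLICY_MAP.get(target_info)
```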
By adopting this embodiment, in which the universal text region detection model and the character recognition model are trained on a massive data set generated from virtual scenes, the required target information can be accurately detected and recognized in scenarios such as game live streaming, which helps reduce the load on the server resources that deploy the models. Meanwhile, the models of this embodiment can be designed as lightweight network structures, which improves the efficiency of target information detection, meets the recognition requirements of online game content, and provides more target information for classifying and recommending live broadcast content.
In order to better implement the method, the embodiment of the application also provides a character recognition device, and the character recognition device can be integrated in a terminal or a server.
For example, as shown in fig. 5, the character recognition apparatus may include a feature information extraction unit 501, a text region detection unit 502, a text region selection unit 503, and an information recognition unit 504, as follows:
the device comprises a characteristic information extraction unit, a detection unit and a control unit, wherein the characteristic information extraction unit is used for extracting characteristic information of a game picture to be detected, and the game picture to be detected is a game picture of a target virtual scene;
a text region detection unit, configured to perform text region detection on the game picture to be detected based on the feature information to obtain a candidate text region;
a text region selection unit, configured to identify a target text region where target information is located from the candidate text regions, where the target information is text information that needs to be extracted from the target virtual scene;
and the information identification unit is used for identifying the text in the target text area to obtain the target information of the game picture to be detected.
In some optional embodiments, the feature information extraction unit is configured to extract feature information from the game picture to be detected through a general text region detection model;
a text region detection unit, configured to perform text region detection on the game picture to be detected based on the feature information through a general text region detection model, so as to obtain a candidate text region;
a text region selection unit, configured to identify a target text region where target information is located from the candidate text regions based on a character recognition model corresponding to the target virtual scene;
and the information identification unit is used for identifying the text in the target text area based on the character identification model corresponding to the target virtual scene to obtain the target information of the game picture to be detected.
In some optional embodiments, the universal text region detection model is trained based on sample game pictures of at least two virtual scenes, wherein the at least two virtual scenes include at least the target virtual scene.
In some optional embodiments, the generic text region detection model includes at least two connected feature extraction layers, and corresponding feature merging layers, and the feature information extraction unit is configured to:
extracting feature graphs of multiple scales from the game picture to be detected through each feature extraction layer;
and fusing the characteristic graphs of the multiple scales through the characteristic merging layer to obtain the characteristic information of the game picture to be detected.
In some optional embodiments, the number of the feature merging layers is one layer less than the number of the feature extraction layers, and the feature information extraction unit, through the feature merging layers, is configured to:
taking the feature map extracted by the last feature extraction layer as the feature map to be merged, performing a scale transformation on it through the feature merging layer, and merging it with the feature map extracted by the adjacent higher feature extraction layer to obtain a merged feature map;
performing a convolution operation on the merged feature map to obtain a convolved feature map;
and taking the convolved feature map as the new feature map to be merged and returning to the merging step above until the feature maps of all scales have been merged, with the finally obtained convolved feature map being used as the feature information of the game picture to be detected.
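The following PyTorch sketch illustrates this merging scheme under the assumption of four feature extraction scales and illustrative channel widths; it is not the exact network structure of this embodiment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMergeBranch(nn.Module):
    """Sketch of the described merging: the deepest feature map is taken as the
    map to be merged, rescaled, merged with the map from the adjacent higher
    extraction layer, convolved, and the result is reused as the new map to
    merge, until all scales are fused. There is one merging layer fewer than
    the number of extraction layers; channel widths are illustrative."""
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.merge_convs = nn.ModuleList(
            nn.Conv2d(channels[i] + channels[i + 1], channels[i],
                      kernel_size=3, padding=1)
            for i in range(len(channels) - 1)
        )

    def forward(self, feature_maps):
        # feature_maps are ordered from shallow (largest scale) to deep (smallest).
        merged = feature_maps[-1]
        for i in range(len(feature_maps) - 2, -1, -1):
            upsampled = F.interpolate(merged, size=feature_maps[i].shape[-2:],
                                      mode="bilinear", align_corners=False)
            merged = self.merge_convs[i](torch.cat([upsampled, feature_maps[i]], dim=1))
        # The finally obtained convolved feature map is the feature information.
        return merged
```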
In some optional embodiments, the apparatus further comprises: a text region detection module training unit to:
obtaining sample game pictures from a plurality of virtual scenes, wherein the virtual scenes comprise the target virtual scene, the sample game pictures comprise texts formed by characters in various styles, and the sample game pictures are marked with position information of an actual text area where the texts are located;
extracting characteristic information from the sample game picture based on a general text area detection model to be trained;
performing text region detection on the sample game picture based on the characteristic information through the general text region detection model to obtain information of a predicted text region of the sample game picture;
and adjusting parameters of the universal text region detection model based on the position information of the actual text region and the information of the predicted text region.
In some alternative embodiments, the sample game screens include a first sample game screen and a second sample game screen, wherein,
the first sample game picture is from a plurality of virtual scenes, and position information of an actual text area where a text is located is marked in the first sample game picture; the second sample game picture is a sample game picture obtained by generating a plurality of sections of texts by adopting a plurality of preset character styles and setting at least one section of generated texts in the game picture derived from the virtual scene.
In some optional embodiments, the information of the predicted text region includes: an offset and an offset angle of the predicted text region;
a text region detection module training unit to:
calculating a first loss for measuring similarity of image contents of the actual text region and the predicted text region based on the image contents of the actual text region and the predicted text region in the sample game picture;
calculating the shape loss corresponding to the predicted text region based on the position information of the actual text region and the offset of the predicted text region;
calculating the angle loss corresponding to the predicted text region based on the position information of the actual text region and the offset angle of the predicted text region;
obtaining a second loss of the predicted text region based on the shape loss and the angle loss;
adjusting parameters of the generic text region detection model based on the first and second losses.
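One common way to realize these losses, given here only as a hedged sketch: the first loss is taken as a dice-style similarity on the text-region score maps, the shape loss as an IoU term over the offsets to the four box edges, and the angle loss as a cosine term, in the spirit of EAST-style detectors; the weighting factors are illustrative assumptions:

```python
import torch

def detection_loss(score_pred, score_gt, geo_pred, geo_gt, angle_pred, angle_gt,
                   lambda_geo=1.0, lambda_angle=10.0):
    """First loss: dice-style similarity between predicted and actual
    text-region score maps. Second loss: shape loss (IoU over the distances
    to the four box edges) plus angle loss. Returns the combined scalar."""
    eps = 1e-6
    inter = (score_pred * score_gt).sum()
    dice_loss = 1.0 - (2.0 * inter + eps) / (score_pred.sum() + score_gt.sum() + eps)

    # geo tensors hold per-pixel distances to the top/bottom/left/right box edges
    d1_p, d2_p, d3_p, d4_p = geo_pred.unbind(dim=1)
    d1_g, d2_g, d3_g, d4_g = geo_gt.unbind(dim=1)
    area_p = (d1_p + d2_p) * (d3_p + d4_p)
    area_g = (d1_g + d2_g) * (d3_g + d4_g)
    h_inter = torch.min(d1_p, d1_g) + torch.min(d2_p, d2_g)
    w_inter = torch.min(d3_p, d3_g) + torch.min(d4_p, d4_g)
    area_inter = h_inter * w_inter
    iou_loss = -torch.log((area_inter + eps) / (area_p + area_g - area_inter + eps))

    angle_loss = 1.0 - torch.cos(angle_pred - angle_gt)

    geo_loss = ((iou_loss + lambda_angle * angle_loss) * score_gt).mean()
    return dice_loss + lambda_geo * geo_loss
```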
In some optional embodiments, the generated text placed in the second sample game picture is produced from at least one character style selected from a plurality of preset character styles and from a text model of the information required by each virtual scene, where the information required by a virtual scene is the character information that needs to be extracted from that scene.
In some optional embodiments, the apparatus further comprises: a character recognition model training unit for:
acquiring a predicted text area of the sample game picture;
based on the predicted text region, acquiring a training sample of a character recognition model corresponding to the target virtual scene;
predicting the probability that the predicted text region of the training sample comprises the target information through a character recognition model corresponding to the target virtual scene to obtain a classification result of each predicted text region;
and adjusting parameters of a character recognition model corresponding to the target virtual scene based on the classification result.
In some optional embodiments, the training sample comprises type identification information of a predicted text region, the type identification information being used to identify whether the target information is included in the predicted text region;
a character recognition model training unit for:
and adjusting parameters of a character recognition model corresponding to the target virtual scene based on the classification result and the type identification information.
In some optional embodiments, the text recognition model training unit is configured to:
determining at least two first predicted text regions from the predicted text regions;
and synthesizing the at least two first prediction text regions to obtain a second prediction text region, and taking the second prediction text region as a training sample of a character recognition model of the target virtual scene.
In some optional embodiments, the text region selection unit is to:
predicting the probability that the candidate text region comprises target information based on a character recognition model corresponding to the target virtual scene;
and selecting the candidate text region with the probability not lower than a preset probability threshold value from the candidate text regions as a target text region.
In some optional embodiments, the apparatus further comprises: the first game picture acquisition unit is used for extracting the game picture to be detected from the video of the target virtual scene before extracting the characteristic information of the game picture to be detected;
the device also includes: and the video classification information acquisition unit is used for identifying the text in the target text region to obtain the target information of the game picture to be detected, and then determining the video classification information of the video of the target virtual scene based on the preset corresponding relation between the target information of the target virtual scene and the video classification information and the target information of the game picture to be detected.
In some optional embodiments, the apparatus further comprises: the second game picture acquisition unit is used for extracting the game picture to be detected from the video of the target virtual scene before extracting the characteristic information of the game picture to be detected;
the device also includes: the video recommending unit is used for acquiring a video recommending strategy of the target virtual scene based on the target information of the game picture to be detected after the text in the target text area is identified and the target information of the game picture to be detected is obtained;
and recommending the video of the target virtual scene based on the recommendation strategy.
By adopting this embodiment, the target information needed in scenarios such as game live streaming can be accurately detected and recognized, and the load on the server resources that deploy the models is reduced. Meanwhile, the models of this embodiment can be designed as lightweight network structures, which improves the efficiency of target information detection, meets the recognition requirements of online game content, and provides more effective information for classifying and recommending live broadcast content.
In addition, the embodiment of the present application further provides an electronic device, where the electronic device may be a terminal or a server, and the terminal may be a terminal device such as a smart phone, a tablet Computer, a notebook Computer, a touch screen, a game machine, a Personal Computer (PC), a Personal Digital Assistant (PDA), and the like. As shown in fig. 6, fig. 6 is a schematic structural diagram of an electronic device provided in the embodiment of the present application. The electronic device 1000 includes a processor 601 with one or more processing cores, a memory 602 with one or more computer-readable storage media, and a computer program stored on the memory 602 and executable on the processor. The processor 601 is electrically connected to the memory 602. Those skilled in the art will appreciate that the electronic device configurations shown in the figures do not constitute limitations of the electronic device, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The processor 601 is a control center of the electronic device 1000, connects various parts of the whole electronic device 1000 by using various interfaces and lines, and performs various functions of the electronic device 1000 and processes data by running or loading software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby performing overall monitoring of the electronic device 1000.
In the embodiment of the present application, the processor 601 in the electronic device 1000 loads instructions corresponding to processes of one or more application programs into the memory 602, and the processor 601 executes the application programs stored in the memory 602 according to the following steps, so as to implement various functions:
extracting characteristic information of a game picture to be detected, wherein the game picture to be detected is a game picture of a target virtual scene;
performing text region detection on the game picture to be detected based on the characteristic information to obtain a candidate text region;
identifying a target text region where target information is located from the candidate text regions, wherein the target information is character information needing to be extracted from the target virtual scene;
and identifying the text in the target text area to obtain the target information of the game picture to be detected.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
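For readability, a minimal end-to-end sketch of these four steps is given below; `detector`, `scene_recognizer` and `reader` are hypothetical wrappers standing in for the universal text region detection model, the scene-specific character recognition model and the final text reader, and the 0.8 threshold is illustrative:

```python
def recognize_target_info(frame, detector, scene_recognizer, reader, threshold=0.8):
    """Run the four steps on one game picture of the target virtual scene:
    feature extraction and candidate text region detection (detector),
    target text region selection (scene_recognizer), and text recognition
    (reader) to obtain the target information."""
    candidate_regions = detector.detect(frame)                    # steps 1-2
    target_regions = [region for region in candidate_regions
                      if scene_recognizer.probability(frame, region) >= threshold]  # step 3
    return [reader.read(frame, region) for region in target_regions]                # step 4
```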
Optionally, as shown in fig. 6, the electronic device 1000 further includes: a touch display screen 603, a radio frequency circuit 604, an audio circuit 605, an input unit 606, and a power supply 607. The processor 601 is electrically connected to the touch display screen 603, the radio frequency circuit 604, the audio circuit 605, the input unit 606, and the power supply 607. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 6 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The touch display screen 603 can be used to display a graphical user interface and to receive operation instructions generated by a user acting on the graphical user interface. The touch display screen 603 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to the user, as well as the various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, video, and any combination thereof. Optionally, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel may be used to collect touch operations of the user on or near it (for example, operations performed on or near the touch panel with a finger, a stylus, or any other suitable object or accessory) and to generate corresponding operation instructions that trigger the corresponding programs. Optionally, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 601, and can receive and execute commands sent by the processor 601. The touch panel may overlay the display panel; when the touch panel detects a touch operation on or near it, it transmits the operation to the processor 601 to determine the type of the touch event, and the processor 601 then provides a corresponding visual output on the display panel according to the type of the touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 603 to implement input and output functions. However, in some embodiments, the touch panel and the display panel can be implemented as two separate components to perform the input and output functions. That is, the touch display screen 603 can also be used as a part of the input unit 606 to implement an input function.
In the present embodiment, a game application is executed by the processor 601 to generate a graphical user interface on the touch sensitive display screen 603.
The radio frequency circuit 604 may be used to transmit and receive radio frequency signals so as to establish wireless communication with a network device or other electronic devices, and to exchange signals with the network device or other electronic devices.
The audio circuit 605 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone. On the one hand, the audio circuit 605 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts collected sound signals into electrical signals, which the audio circuit 605 receives and converts into audio data. The audio data is then processed by the processor 601 and either transmitted to another electronic device via the radio frequency circuit 604 or output to the memory 602 for further processing. The audio circuit 605 may also include an earphone jack to provide communication between peripheral headphones and the electronic device.
The input unit 606 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 607 is used to supply power to the various components of the electronic device 1000. Optionally, the power supply 607 may be logically connected to the processor 601 through a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The power supply 607 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
Although not shown in fig. 6, the electronic device 1000 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute the steps in any of the methods for recognizing characters in a game screen provided in embodiments of the present application. For example, the computer program may perform the steps of:
extracting characteristic information of a game picture to be detected, wherein the game picture to be detected is a game picture of a target virtual scene;
performing text region detection on the game picture to be detected based on the characteristic information to obtain a candidate text region;
identifying a target text region where target information is located from the candidate text regions, wherein the target information is character information needing to be extracted from the target virtual scene;
and identifying the text in the target text area to obtain the target information of the game picture to be detected.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the storage medium can execute the steps in any of the methods for recognizing characters in game images provided by the embodiments of the present application, the beneficial effects that can be achieved by any of the methods for recognizing characters in game images provided by the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The method, apparatus, storage medium, and electronic device for recognizing characters in a game picture provided in the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (18)

1. A method for recognizing characters in a game screen, comprising:
extracting characteristic information of a game picture to be detected, wherein the game picture to be detected is a game picture of a target virtual scene;
performing text region detection on the game picture to be detected based on the characteristic information to obtain a candidate text region;
identifying a target text region where target information is located from the candidate text regions, wherein the target information is character information needing to be extracted from the target virtual scene;
and identifying the text in the target text area to obtain the target information of the game picture to be detected.
2. The method according to claim 1, wherein the extracting feature information from the game screen to be detected comprises:
extracting characteristic information of a game picture to be detected through a universal text area detection model;
the detecting the text area of the game picture to be detected based on the characteristic information to obtain a candidate text area comprises the following steps:
performing text region detection on the game picture to be detected based on the characteristic information through a general text region detection model to obtain a candidate text region;
the identifying the target text region where the target information is located from the candidate text regions includes:
identifying a target text region where target information is located from the candidate text regions based on a character identification model corresponding to the target virtual scene;
the identifying the text in the target text area to obtain the target information of the game picture to be detected comprises the following steps:
and identifying the text in the target text area based on the character identification model corresponding to the target virtual scene to obtain the target information of the game picture to be detected.
3. The method of claim 1, wherein the general text region detection model is trained based on sample game pictures of at least two virtual scenes, wherein the at least two virtual scenes comprise at least the target virtual scene.
4. The method according to claim 2, wherein the generic text region detection model includes at least two connected feature extraction layers and corresponding feature merging layers, and the extracting feature information of the game picture to be detected by the generic text region detection model includes:
extracting feature graphs of multiple scales from the game picture to be detected through each feature extraction layer;
and fusing the characteristic graphs of the multiple scales through the characteristic merging layer to obtain the characteristic information of the game picture to be detected.
5. The method as claimed in claim 4, wherein the number of the feature merging layers is one layer less than the number of the feature extraction layers, and the obtaining of the feature information of the game screen to be detected by fusing the feature maps of multiple scales through the feature merging layers comprises:
taking the feature map extracted by the last feature extraction layer as a feature map to be merged, and merging the feature map to be merged with the feature map extracted by the adjacent upper feature extraction layer after carrying out scale transformation on the feature map to be merged by the feature merging layer to obtain a merged feature map;
performing convolution operation on the merged feature map to obtain a feature map after convolution;
and taking the feature graph after convolution as a new feature graph to be merged, returning to execute the step of merging the feature graph after the feature graph to be merged is subjected to scale transformation through the feature merging layer and the feature graph extracted by the adjacent upper feature extraction layer until the feature graphs of all scales are merged, and taking the finally obtained feature graph after convolution as the feature information of the game picture to be detected.
6. The method of claim 2, wherein before extracting feature information from the game picture to be detected by the generic text region detection model, the method further comprises:
obtaining sample game pictures from a plurality of virtual scenes, wherein the virtual scenes comprise the target virtual scene, the sample game pictures comprise texts formed by characters in various styles, and the sample game pictures are marked with position information of an actual text area where the texts are located;
extracting characteristic information from the sample game picture based on a general text area detection model to be trained;
performing text region detection on the sample game picture based on the characteristic information through the general text region detection model to obtain information of a predicted text region of the sample game picture;
and adjusting parameters of the universal text region detection model based on the position information of the actual text region and the information of the predicted text region.
7. The character recognition method in a game screen according to any one of claims 1 to 6, wherein the sample game screen includes a first sample game screen and a second sample game screen, wherein,
the first sample game picture is from a plurality of virtual scenes, and position information of an actual text area where a text is located is marked in the first sample game picture; the second sample game picture is a sample game picture obtained by generating a plurality of sections of texts by adopting a plurality of preset character styles and setting at least one section of generated texts in the game picture derived from the virtual scene.
8. The method of claim 6, wherein the information of the predicted text region comprises: an offset and an offset angle of the predicted text region;
the adjusting the parameters of the general text region detection model based on the position information of the actual text region and the information of the predicted text region includes:
calculating a first loss for measuring similarity of image contents of the actual text region and the predicted text region based on the image contents of the actual text region and the predicted text region in the sample game picture;
calculating the shape loss corresponding to the predicted text region based on the position information of the actual text region and the offset of the predicted text region;
calculating the angle loss corresponding to the predicted text region based on the position information of the actual text region and the offset angle of the predicted text region;
obtaining a second loss of the predicted text region based on the shape loss and the angle loss;
adjusting parameters of the generic text region detection model based on the first and second losses.
9. The method of recognizing characters in a game screen according to claim 7, wherein the generated text set in the second sample game screen is generated based on at least one character style selected from a plurality of preset character styles and a text model of information required for each virtual scene, wherein the information required for each virtual scene is character information required to be extracted from each virtual scene.
10. The method according to claim 6, wherein before extracting the feature information from the game screen to be detected, the method further comprises:
acquiring a predicted text area of the sample game picture;
based on the predicted text region, acquiring a training sample of a character recognition model corresponding to the target virtual scene;
predicting the probability that the predicted text region of the training sample comprises the target information through a character recognition model corresponding to the target virtual scene to obtain a classification result of each predicted text region;
and adjusting parameters of a character recognition model corresponding to the target virtual scene based on the classification result.
11. The method of claim 10, wherein the training samples comprise type identification information of a predicted text region, the type identification information being used to identify whether the target information is included in the predicted text region;
the adjusting the parameters of the character recognition model corresponding to the target virtual scene based on the classification result comprises:
and adjusting parameters of a character recognition model corresponding to the target virtual scene based on the classification result and the type identification information.
12. The method of claim 10, wherein obtaining training samples of a character recognition model of the target virtual scene based on the predicted text region comprises:
determining at least two first predicted text regions from the predicted text regions;
and synthesizing the at least two first prediction text regions to obtain a second prediction text region, and taking the second prediction text region as a training sample of a character recognition model of the target virtual scene.
13. The method of claim 8, wherein the identifying a target text region in which target information is located from the candidate text regions based on the character recognition model corresponding to the target virtual scene comprises:
predicting the probability that the candidate text region comprises target information based on a character recognition model corresponding to the target virtual scene;
and selecting the candidate text region with the probability not lower than a preset probability threshold value from the candidate text regions as a target text region.
14. The method for recognizing characters in a game screen according to any one of claims 1 to 13, wherein before extracting feature information from the game screen to be detected, the method further comprises:
extracting a game picture to be detected from a video of a target virtual scene;
after the text in the target text region is identified to obtain the target information of the game picture to be detected, the method further comprises the following steps:
and determining the video classification information of the video of the target virtual scene based on the preset corresponding relation between the target information and the video classification information of the target virtual scene and the target information of the game picture to be detected.
15. The method for recognizing characters in a game screen according to any one of claims 1 to 13, wherein before extracting feature information from the game screen to be detected, the method further comprises:
extracting a game picture to be detected from a video of a target virtual scene;
after the text in the target text region is identified to obtain the target information of the game picture to be detected, the method further comprises the following steps:
acquiring a recommendation strategy of the video of the target virtual scene based on the target information of the game picture to be detected;
and recommending the video of the target virtual scene based on the recommendation strategy.
16. A character recognition apparatus in a game screen, comprising:
the device comprises a characteristic information extraction unit, a detection unit and a control unit, wherein the characteristic information extraction unit is used for extracting characteristic information of a game picture to be detected, and the game picture to be detected is a game picture of a target virtual scene;
a text region detection unit, configured to perform text region detection on the game picture to be detected based on the feature information to obtain a candidate text region;
a text region selection unit, configured to identify a target text region where target information is located from the candidate text regions, where the target information is text information that needs to be extracted from the target virtual scene;
and the information identification unit is used for identifying the text in the target text area to obtain the target information of the game picture to be detected.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method according to any of claims 1-15 are implemented when the program is executed by the processor.
18. A storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the steps of the method according to any of claims 1-15.
CN202011003615.1A 2020-09-22 2020-09-22 Character recognition method and device in game picture, electronic equipment and storage medium Active CN112163577B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011003615.1A CN112163577B (en) 2020-09-22 2020-09-22 Character recognition method and device in game picture, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112163577A true CN112163577A (en) 2021-01-01
CN112163577B CN112163577B (en) 2022-10-11


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190278843A1 (en) * 2017-02-27 2019-09-12 Tencent Technology (Shenzhen) Company Ltd Text entity extraction method, apparatus, and device, and storage medium
US20190272438A1 (en) * 2018-01-30 2019-09-05 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for detecting text
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN111291629A (en) * 2020-01-17 2020-06-16 平安医疗健康管理股份有限公司 Method and device for recognizing text in image, computer equipment and computer storage medium
CN111666919A (en) * 2020-06-24 2020-09-15 腾讯科技(深圳)有限公司 Object identification method and device, computer equipment and storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926420A (en) * 2021-02-09 2021-06-08 海信视像科技股份有限公司 Display device and menu character recognition method
CN112926420B (en) * 2021-02-09 2022-11-08 海信视像科技股份有限公司 Display device and menu character recognition method
CN112949653A (en) * 2021-02-23 2021-06-11 科大讯飞股份有限公司 Text recognition method, electronic device and storage device
CN112949653B (en) * 2021-02-23 2024-04-16 科大讯飞股份有限公司 Text recognition method, electronic equipment and storage device
CN113516114A (en) * 2021-05-19 2021-10-19 西安建筑科技大学 Efficient and accurate natural scene text detection method, device and medium
CN113516114B (en) * 2021-05-19 2023-09-29 西安建筑科技大学 Natural scene text detection method, equipment and medium
CN113591829A (en) * 2021-05-25 2021-11-02 上海一谈网络科技有限公司 Character recognition method, device, equipment and storage medium
CN113591829B (en) * 2021-05-25 2024-02-13 上海一谈网络科技有限公司 Character recognition method, device, equipment and storage medium
CN114973225A (en) * 2022-05-07 2022-08-30 中移互联网有限公司 Number plate identification method, device and equipment
CN114973225B (en) * 2022-05-07 2023-10-27 中移互联网有限公司 License plate identification method, device and equipment

Also Published As

Publication number Publication date
CN112163577B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN112163577B (en) Character recognition method and device in game picture, electronic equipment and storage medium
US10984295B2 (en) Font recognition using text localization
CN114155543B (en) Neural network training method, document image understanding method, device and equipment
US10699166B2 (en) Font attributes for font recognition and similarity
CN112232425B (en) Image processing method, device, storage medium and electronic equipment
CN111027563A (en) Text detection method, device and recognition system
TW202139183A (en) Method of detecting object based on artificial intelligence, device, equipment and computer-readable storage medium
CN112052186B (en) Target detection method, device, equipment and storage medium
US10902053B2 (en) Shape-based graphics search
CN109919077B (en) Gesture recognition method, device, medium and computing equipment
CN111241340A (en) Video tag determination method, device, terminal and storage medium
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113050860B (en) Control identification method and related device
EP4276754A1 (en) Image processing method and apparatus, device, storage medium, and computer program product
CN114402369A (en) Human body posture recognition method and device, storage medium and electronic equipment
US20230035366A1 (en) Image classification model training method and apparatus, computer device, and storage medium
CN111199169A (en) Image processing method and device
US11734799B2 (en) Point cloud feature enhancement and apparatus, computer device and storage medium
CN113313066A (en) Image recognition method, image recognition device, storage medium and terminal
CN113869371A (en) Model training method, clothing fine-grained segmentation method and related device
CN110807452A (en) Prediction model construction method, device and system and bank card number identification method
CN113538537B (en) Image registration and model training method, device, equipment, server and medium
CN116452702B (en) Information chart rapid design method, device, computer equipment and storage medium
CN113449559A (en) Table identification method and device, computer equipment and storage medium
CN116416639A (en) Table structure analysis method, apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant