WO2020221298A1 - Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus - Google Patents

Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus

Info

Publication number
WO2020221298A1
WO2020221298A1 · PCT/CN2020/087809 · CN2020087809W
Authority
WO
WIPO (PCT)
Prior art keywords
text
area
image
candidate
feature map
Prior art date
Application number
PCT/CN2020/087809
Other languages
French (fr)
Chinese (zh)
Inventor
苏驰
李凯
刘弘也
赵志明
Original Assignee
北京金山云网络技术有限公司
北京金山云科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京金山云网络技术有限公司 (Beijing Kingsoft Cloud Network Technology Co., Ltd.) and 北京金山云科技有限公司 (Beijing Kingsoft Cloud Technology Co., Ltd.)
Publication of WO2020221298A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Definitions

  • Step 34: Determine whether the updated parameters have all converged; if they have all converged, proceed to step 36; otherwise, proceed to step 38;
  • Step S1014: Determine the text content in the text area according to the arranged characters.
  • the feature extraction module 112 is configured to extract multiple initial feature maps of the target training image through the first feature extraction network, the multiple initial feature maps having different scales;
  • the detection module 122 is configured to input the image to be detected into the pre-trained text detection model and output multiple candidate regions of the text region in the image to be detected, together with the probability value of each candidate region, the text detection model being trained using the above text detection model training method;
  • the above-mentioned device further includes a region elimination module, configured to eliminate, from the multiple candidate regions, candidate regions whose probability value is lower than a preset probability threshold, to obtain the final multiple candidate regions;
  • the above-mentioned device further includes a text recognition model training module, configured to train the text recognition model in the following manner: determining a target training text image; inputting the target training text image into a second initial model, the model including a second feature extraction network, a second output network, and a classification function; extracting a feature map of the target training text image through the second feature extraction network; splitting the feature map into at least one sub-feature map through the second initial model; inputting the feature map into the second output network, which outputs an output matrix corresponding to each sub-feature map; inputting each output matrix into the classification function, which outputs a probability matrix corresponding to each sub-feature map; determining a second loss value of the probability matrices through a preset recognition loss function; and training the second initial model according to the second loss value until the parameters in the second initial model converge, to obtain the text recognition model.
  • the memory 100 may include high-speed random access memory (RAM), and may also include non-volatile memory, such as at least one disk storage.
  • the communication connection between this system network element and at least one other network element is implemented through at least one communication interface 103 (wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, etc.
  • the bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, etc.; for ease of presentation, only one bidirectional arrow is shown in FIG. 14, but this does not mean that there is only one bus or only one type of bus.
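The recognition-model fragments above describe splitting a feature map into sub-feature maps, producing an output matrix per sub-feature map, and turning each output matrix into a probability matrix through a classification function. A minimal NumPy sketch of that split-and-classify step, assuming one sub-feature map per horizontal position and a softmax as the classification function (names such as `recognise_slices` and `w_out` are illustrative, not from the patent):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def recognise_slices(feature_map, w_out):
    """feature_map: (W, C), one feature column per horizontal position;
    w_out: (C, num_classes), the second output network reduced to one linear map."""
    # split the feature map into W sub-feature maps, one per position
    sub_maps = np.split(feature_map, feature_map.shape[0], axis=0)
    rows = []
    for sub in sub_maps:
        logits = sub @ w_out          # output matrix for this sub-feature map
        rows.append(softmax(logits))  # probability matrix via the classification function
    return np.vstack(rows)            # (W, num_classes)

rng = np.random.default_rng(0)
probs = recognise_slices(rng.standard_normal((12, 32)),
                         rng.standard_normal((32, 100)))
print(probs.shape)  # (12, 100); each row sums to 1
```

In a real recognition model the per-position probability matrices would then be decoded into a character sequence; here only the split-and-classify structure is shown.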

Abstract

The present application provides a text detection model training method and apparatus, a text region determination method and apparatus, and a text content determination method and apparatus. The text detection model training method comprises: extracting a plurality of initial feature maps of a target training image by means of a first feature extraction network; fusing the plurality of initial feature maps by means of a feature fusion network to obtain a fused feature map; inputting the fused feature map into a first output network, and outputting candidate regions of the text region in the target training image and the probability value of each candidate region; determining a first loss value by means of a preset detection loss function; and training the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain a text detection model. According to the present application, all kinds of text in an image can be detected quickly, comprehensively, and accurately across a variety of font sizes, fonts, shapes, and directions, thereby contributing to the accuracy of subsequent text recognition and improving the text recognition effect.

Description

Text detection model training method, and text region and content determination method and device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 30, 2019, with application number 201910367675.2 and invention title "Text detection model training method, text region and content determination method and device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of image processing, and in particular to a text detection model training method, and a text region and content determination method and device.
Background
In the related art, text detection and recognition can be realized by character segmentation. However, such methods are usually only suitable for simple scenes with a single font and font size, a simple background, and a single text arrangement direction; in complex scenes, such as those with multiple font sizes, multiple fonts, multiple shapes, multiple directions, and changing backgrounds, these text detection and recognition methods perform poorly.
Summary of the Invention
In view of this, the purpose of this application is to provide a text detection model training method, and a text region and content determination method and device, so as to improve the accuracy of text recognition.
In a first aspect, an embodiment of the present application provides a text detection model training method. The method includes: determining a target training image based on a preset training set; inputting the target training image into a first initial model, the first initial model including a first feature extraction network, a feature fusion network, and a first output network; extracting multiple initial feature maps of the target training image through the first feature extraction network, the multiple initial feature maps having different scales; fusing the multiple initial feature maps through the feature fusion network to obtain a fused feature map; inputting the fused feature map into the first output network, and outputting candidate regions of the text region in the target training image and the probability value of each candidate region; determining a first loss value of the candidate regions and their probability values through a preset detection loss function; and training the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain the text detection model.
In a second aspect, an embodiment of the present application provides a text region determination method. The method includes: obtaining an image to be detected; inputting the image to be detected into a pre-trained text detection model, and outputting multiple candidate regions of the text region in the image to be detected, together with the probability value of each candidate region, the text detection model being trained using the above text detection model training method; and determining the text region in the image to be detected from the multiple candidate regions according to the probability values of the candidate regions and the degree of overlap between the candidate regions.
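The second aspect selects the final text region from the candidates using their probability values and their mutual overlap. A minimal sketch of one such selection, assuming axis-aligned `(x1, y1, x2, y2)` candidate boxes and greedy suppression by intersection-over-union; the thresholds and function names are illustrative, the patent does not fix them:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_text_regions(boxes, probs, prob_thresh=0.5, iou_thresh=0.3):
    """Keep high-probability candidates, greedily dropping heavily overlapping ones."""
    kept = []
    for i in np.argsort(probs)[::-1]:   # highest probability first
        if probs[i] < prob_thresh:
            break                        # remaining candidates score even lower
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]

candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
regions = select_text_regions(candidates, np.array([0.9, 0.8, 0.7]))
print(regions)  # the two non-overlapping high-probability boxes survive
```

For non-rectangular candidate regions (the patent also covers other polygons), the overlap measure would be computed on the polygons instead, but the selection logic stays the same.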
In a third aspect, an embodiment of the present application provides a text content determination method. The method includes: obtaining a text region in an image through the above text region determination method; inputting the text region into a pre-trained text recognition model, and outputting a recognition result for the text region; and determining the text content in the text region according to the recognition result.
In a fourth aspect, an embodiment of the present application provides a text detection model training device. The device includes: a training image determination module, configured to determine a target training image based on a preset training set; a training image input module, configured to input the target training image into a first initial model, the first initial model including a first feature extraction network, a feature fusion network, and a first output network; a feature extraction module, configured to extract multiple initial feature maps of the target training image through the first feature extraction network, the multiple initial feature maps having different scales; a feature fusion module, configured to fuse the multiple initial feature maps through the feature fusion network to obtain a fused feature map; an output module, configured to input the fused feature map into the first output network and output candidate regions of the text region in the target training image and the probability value of each candidate region; and a loss value determination and training module, configured to determine a first loss value of the candidate regions and their probability values through a preset detection loss function, and to train the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain the text detection model.
In a fifth aspect, an embodiment of the present application provides a text region determination device. The device includes: an image acquisition module, configured to acquire an image to be detected; a detection module, configured to input the image to be detected into a pre-trained text detection model and output multiple candidate regions of the text region in the image to be detected, together with the probability value of each candidate region, the text detection model being trained using the above text detection model training method; and a text region determination module, configured to determine the text region in the image to be detected from the multiple candidate regions according to the probability values of the candidate regions and the degree of overlap between the candidate regions.
In a sixth aspect, an embodiment of the present application provides a text content determination device. The device includes: a region acquisition module, configured to acquire a text region in an image through the above text region determination method; a recognition module, configured to input the text region into a pre-trained text recognition model and output a recognition result for the text region; and a text content determination module, configured to determine the text content in the text region according to the recognition result.
In a seventh aspect, an embodiment of the present application provides an electronic device, including a processor and a memory. The memory stores machine-executable instructions that can be executed by the processor, and the processor executes the machine-executable instructions to implement the steps of the above text detection model training method, the above text region determination method, or the above text content determination method.
In an eighth aspect, an embodiment of the present application provides a machine-readable storage medium storing machine-executable instructions. When called and executed by a processor, the machine-executable instructions cause the processor to implement the steps of the above text detection model training method, the above text region determination method, or the above text content determination method.
In a ninth aspect, an embodiment of the present application provides executable program code, the executable program code being configured to be run so as to execute the steps of the above text detection model training method, the above text region determination method, or the above text content determination method.
The embodiments of the present application provide the following beneficial effects:
In the text detection model training method provided by the embodiments of this application, the feature extraction network can automatically extract features of different scales. Applying the resulting text detection model, a single input image yields candidate regions for text regions of various scales in that image, with no need to manually transform the image scale. The operation is convenient; in particular, in scenes with multiple font sizes, fonts, shapes, and directions, all kinds of text in an image can be detected quickly, comprehensively, and accurately, which also benefits the accuracy of subsequent text recognition and improves the text recognition effect.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application and the related art more clearly, the following briefly introduces the drawings used in the embodiments and the related art. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.
Fig. 1 is a flowchart of a text detection model training method provided by an embodiment of the application;
Fig. 2 is a schematic structural diagram of a first feature extraction network provided by an embodiment of the application;
Fig. 3 is a schematic diagram of fusion processing of multiple initial feature maps provided by an embodiment of the application;
Fig. 4 is a flowchart of a text region determination method provided by an embodiment of the application;
Fig. 5 is a flowchart of another text region determination method provided by an embodiment of the application;
Fig. 6 is a flowchart of a text content determination method provided by an embodiment of the application;
Fig. 7 is a flowchart of a text recognition model training method provided by an embodiment of the application;
Fig. 8 is a schematic structural diagram of a second feature extraction network provided by an embodiment of the application;
Fig. 9 is a flowchart of another text content determination method provided by an embodiment of the application;
Fig. 10 is a flowchart of another text content determination method provided by an embodiment of the application;
Fig. 11 is a schematic structural diagram of a text detection model training device provided by an embodiment of the application;
Fig. 12 is a schematic structural diagram of a text region determination device provided by an embodiment of the application;
Fig. 13 is a schematic structural diagram of a text content determination device provided by an embodiment of the application;
Fig. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the drawings and embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
In related text recognition technology, text areas that may contain text are detected in a picture through manually defined rules; the detected text areas are then segmented into characters to obtain an image block for each character, and each image block is recognized by a pre-trained classifier to obtain the final text recognition result. The defects of this technology are mainly as follows. First, because the number of manually defined rules is limited, most of the detected text areas are regular-shaped regions; this limits the applicable range of the technology and makes it difficult to apply to text detection and recognition in complex scenes, such as those with multiple font sizes, multiple fonts, multiple shapes, multiple directions, and changing backgrounds. Second, the technology recognizes single characters without considering the correlation between characters, resulting in poor detection and recognition in complex scenes.
Another related technology realizes text recognition through deep learning: a recognition model is first trained using a recurrent neural network; the picture to be detected is then transformed into multiple scales and input into the recognition model one scale at a time to detect text areas and recognize text. The defects of this technology are mainly as follows. First, the image scale must be transformed manually and images of multiple scales input into the recognition model separately so that the model can recognize text of different sizes; the operation is cumbersome and hard to meet real-time recognition requirements. Second, because a recurrent neural network performs recursive operations over a time sequence, it is difficult to parallelize and the computation is slow. Third, the recognition model usually uses a rectangular detection frame to detect text areas, so it can only detect and recognize horizontal text; the recognition effect for text at arbitrary angles is poor, making it difficult to apply to text detection and recognition in complex scenes.
In summary, the text detection and recognition methods in the related art perform poorly in complex scenes. On this basis, the embodiments of the present application provide a text detection model training method, and a text region and content determination method and device. This technology can be widely applied to text detection and text recognition in various scenarios, and in particular to complex scenarios such as live webcasts, cable television broadcasts, games, and videos.
First, the text detection model training method disclosed in an embodiment of the present application is introduced in detail. The text detection model can be used for text detection, which can be understood as locating, within an image, the image regions that contain text. As shown in Figure 1, the method includes the following steps:
Step S102: Determine a target training image based on a preset training set.
In some cases, training images need to be determined multiple times while training the first initial model. In one embodiment, a target training image can be determined from the preset training set each time; in other embodiments, a new training image can also be obtained each time.
Taking determining a target training image from a preset training set as an example, the training set can contain multiple images. To broaden the applicability of the detection model, the images in the training set can cover various scenes, for example, live-broadcast scenes, game scenes, outdoor scenes, and indoor scenes; the images in the training set can also contain text lines of multiple font sizes, shapes, fonts, and languages, so that the trained detection model can detect all kinds of text lines.
Each image contains the text areas of manually annotated text lines. A text area can be annotated with a quadrilateral box such as a rectangle, or with another polygonal box; an annotated text area can usually cover the entire text line completely and fit it closely. In addition, the multiple images in the training set can be divided into a training subset and a test subset according to a preset ratio. During training, target training images can be obtained from the training subset. After training is completed, target test images can be obtained from the test subset to test the performance of the detection model.
Step S104: Input the target training image into a first initial model; the first initial model includes a first feature extraction network, a feature fusion network, and a first output network.
Before being input into the first initial model, the target training image can be resized to a preset size, such as 512×512 pixels.
Step S106: Extract multiple initial feature maps of the target training image through the first feature extraction network; the multiple initial feature maps have different scales.
The first feature extraction network can be implemented with multiple convolutional layers. Usually, the convolutional layers are connected in sequence (connected meaning that the input of one convolutional layer is the output of another), and each convolutional layer is configured with different convolution kernels to extract feature maps of different scales. Among the multiple initial feature maps of the target training image, each initial feature map can be obtained by the convolution computation of a corresponding convolutional layer. Taking four convolutional layers as an example, each convolutional layer can output one initial feature map; each convolutional layer can be configured with convolution kernels of different sizes, so that the initial feature maps output by the layers have different scales. For example, the convolutional layer that receives the target training image can be set to output the largest-scale initial feature map, with the scale of the initial feature map output by each subsequent convolutional layer gradually decreasing.
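The chain of convolutional layers described above can be sketched minimally in NumPy. This is a single-channel toy with a fixed averaging kernel, purely to show how successive layers yield initial feature maps of decreasing scale; a real first feature extraction network would use many learned multi-channel kernels:

```python
import numpy as np

def conv2d(x, kernel, stride=2):
    """Plain valid cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * kernel)
    return out

def extract_initial_feature_maps(image, num_layers=4):
    """Each 'layer' feeds the next; every layer output is kept as one initial map."""
    kernel = np.full((3, 3), 1 / 9.0)  # placeholder weights standing in for learned ones
    maps, x = [], image
    for _ in range(num_layers):
        x = conv2d(x, kernel)
        maps.append(x)
    return maps

maps = extract_initial_feature_maps(np.random.default_rng(0).random((64, 64)))
print([m.shape for m in maps])  # [(31, 31), (15, 15), (7, 7), (3, 3)]
```

The shrinking shapes mirror the description: the first layer produces the largest-scale initial feature map, and each subsequent layer a smaller one.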
Step S108: Perform fusion processing on the multiple initial feature maps through the feature fusion network to obtain a fused feature map.
Usually, a smaller convolution kernel senses high-frequency features in an image, so the initial feature map output by a convolutional layer with a smaller kernel carries small-scale text line features; a larger convolution kernel senses low-frequency features, so the initial feature map output by a convolutional layer with a larger kernel carries large-scale text line features. On this basis, the multiple initial feature maps of different scales together carry text line features of various scales, and the fused feature map obtained by fusing them also carries text line features of various scales. In this way, the detection model can detect text lines of various scales without any manual image scale transformation before detection.
In one case, because the scales of the multiple initial feature maps differ, the smaller-scale initial feature maps can be interpolated before fusion, expanding them to match the larger-scale initial feature maps. During fusion, feature points at the same position in different initial feature maps can be multiplied or added to obtain the final fused feature map.
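A minimal sketch of the fusion step just described: each smaller-scale map is expanded by nearest-neighbour interpolation to the size of the largest map, and feature points at the same position are then added (the description also allows multiplication; function names here are illustrative):

```python
import numpy as np

def upsample_nearest(fmap, target_shape):
    """Nearest-neighbour interpolation of a 2-D map to target_shape."""
    th, tw = target_shape
    rows = np.arange(th) * fmap.shape[0] // th
    cols = np.arange(tw) * fmap.shape[1] // tw
    return fmap[np.ix_(rows, cols)]

def fuse_feature_maps(initial_maps):
    """Expand every map to the largest scale, then add position-wise."""
    target = initial_maps[0].shape  # assume the first map has the largest scale
    fused = np.zeros(target)
    for m in initial_maps:
        fused += upsample_nearest(m, target)
    return fused

maps = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
fused = fuse_feature_maps(maps)
print(fused.shape, fused[0, 0])  # (8, 8) 3.0
```

With three all-ones maps, every fused position receives one contribution per scale, which is why each entry equals 3.0.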
Step S110: Input the fused feature map into the first output network, and output candidate regions of the text region in the target training image and the probability value of each candidate region.
The first output network is configured to extract the required features from the fused feature map to obtain the output results. If the detection model outputs one kind of result, the first output network usually contains a single group of networks; if the detection model outputs several kinds of results, the first output network usually contains several groups of networks arranged in parallel, each group outputting one kind of result. The first output network can be composed of convolutional layers or fully connected layers. In the above step, the first output network needs to output two kinds of results, the candidate regions and their probability values, so it can contain two groups of networks, each of which can be a convolutional network or a fully connected network.
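A minimal sketch of such a two-group first output network, realised as per-position 1×1 "convolutions" (plain matrix products over the channel axis). The choice of 8 geometry values per position, i.e. four vertex offsets of a quadrilateral candidate region, is an assumption for illustration; the description does not fix the parameterisation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def first_output_network(fused, w_geo, w_prob):
    """fused: (H, W, C) fused feature map.
    Branch 1 outputs 8 geometry values per position (candidate-region coordinates);
    branch 2 outputs one probability value per position (text / non-text)."""
    geometry = fused @ w_geo               # (H, W, 8) via per-position matmul
    probability = sigmoid(fused @ w_prob)  # (H, W, 1), squashed into [0, 1]
    return geometry, probability

rng = np.random.default_rng(0)
C = 16
fused = rng.standard_normal((32, 32, C))
geometry, probability = first_output_network(
    fused, rng.standard_normal((C, 8)), rng.standard_normal((C, 1)))
print(geometry.shape, probability.shape)  # (32, 32, 8) (32, 32, 1)
```

The two weight matrices stand in for the two parallel groups of networks; in practice each branch would be a stack of convolutional or fully connected layers.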
Step S112: Determine a first loss value for the candidate regions and the probability value of each candidate region through a preset detection loss function; train the first initial model according to the first loss value until the parameters of the first initial model converge, thereby obtaining the text detection model.
The target training image is pre-annotated with ground-truth text regions. Based on the positions of the annotated text regions, a coordinate matrix and a probability matrix of the text regions can be generated. The coordinate matrix may contain the vertex coordinates of the ground-truth text regions; the probability matrix contains the probability value of each text region, which may, for example, be 1.
The detection loss function compares the coordinate matrix of the candidate regions with that of the ground-truth text regions, and the probability values of the candidate regions with those of the ground-truth text regions; in general, the larger the difference, the larger the first loss value. Based on the first loss value, the parameters of each part of the first initial model can be adjusted to achieve training. When the parameters of the model converge, training ends and the detection model is obtained.
In the text detection model training method provided by the embodiments of the present application, multiple initial feature maps of the target training image, differing from one another in scale, are first extracted; the multiple initial feature maps are then fused to obtain a fused feature map; the fused feature map is input to the first output network, which outputs candidate regions for the text regions in the target training image and a probability value for each candidate region; after a first loss value is determined through the preset detection loss function, the first initial model is trained according to the first loss value to obtain the detection model. In this method, the feature extraction network automatically extracts features of different scales, so when the text detection model is applied, a single input image yields candidate regions for text regions of various scales in that image, with no need to manually rescale the image. The operation is convenient; in scenes with multiple font sizes, typefaces, shapes, and orientations in particular, all kinds of text in the image can be detected quickly, comprehensively, and accurately, which in turn benefits the accuracy of subsequent text recognition and improves the recognition effect.
The embodiments of the present application further provide another text detection model training method, implemented on the basis of the text detection model training method described in the above embodiment; this method focuses on the specific implementation of each step of the above training method, and includes the following steps:
Step 202: Determine a target training image based on a preset training set.
Step 204: Input the target training image to a first initial model; the first initial model includes a first feature extraction network, a feature fusion network, and a first output network.
Step 206: Extract multiple initial feature maps of the target training image through the first feature extraction network; the multiple initial feature maps differ from one another in scale.
In one embodiment, the first feature extraction network may include multiple groups of first convolutional networks connected in sequence, where each group includes a convolutional layer, a batch normalization layer, and an activation function layer connected in sequence. Fig. 2 shows a schematic structural diagram of a first feature extraction network, using four groups of first convolutional networks as an example. The convolutional layer of each subsequent group is connected to the activation function layer of the preceding group; the activation function layer of each group outputs an initial feature map, and the initial feature map output by the activation function layer of a preceding group is also input to the convolutional layer of the following group. The first feature extraction network may also include more or fewer groups of first convolutional networks.
The batch normalization layer in the first convolutional network is configured to normalize the feature map output by the convolutional layer. This speeds up the convergence of the first feature extraction network and the detection model, and alleviates the gradient vanishing problem in multi-layer convolutional networks, making the first feature extraction network more stable. The activation function layer applies a function transformation to the normalized feature map; this transformation breaks the linearity of the convolutional layer's output and improves the expressive power of the first convolutional network. The activation function may specifically be a Sigmoid function, a tanh function, a ReLU function, or the like.
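As an illustration, the three activation functions named above can be written as follows (scalar sketches; in a real network they are applied element-wise to the feature map):

```python
import math

def sigmoid(x):
    """Sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """ReLU: passes positive inputs through, zeroes out negative ones."""
    return max(0.0, x)
```

Each of these is non-linear, which is what "breaks the linear combination" of the preceding convolutional layer's output.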
Step 208: Fuse the multiple initial feature maps through the feature fusion network to obtain a fused feature map.
Steps 02-08 below provide one specific implementation of step 208. This implementation uses pyramid features as an example, that is, the scales of the initial feature maps output by the successive convolutional layers decrease in turn:
Step 02: Arrange the multiple initial feature maps in order of scale, where the top-level initial feature map has the smallest scale and the bottom-level initial feature map has the largest scale.
Step 04: Take the top-level initial feature map as the top-level fused feature map.
Step 06: For each level other than the top level, fuse the initial feature map of the current level with the fused feature map of the level above it to obtain the fused feature map of the current level.
Because the fused feature map of the level above the current level is smaller in scale than the initial feature map of the current level, before fusing the two, the fusion result of the level above may be enlarged by interpolation to the same scale as the initial feature map of the current level, after which point-wise addition or point-wise multiplication is performed to obtain the fused feature map of the current level.
Step 08: Take the bottom-level fused feature map as the final fused feature map.
The fusion result of each level is essentially the fused feature map of that level; to distinguish it from the final fused feature map, the fused feature map of each level is referred to as a fusion result. Steps 04-08 can be expressed as: in the arranged order, for each level below the top level in turn, fuse the initial feature map of that level with the fusion result of the level above it to obtain the fusion result of that level, where the fusion result of the top level is the initial feature map of the top level; and determine the fusion result of the lowest level as the fused feature map of the initial feature maps.
Fig. 3 shows a schematic diagram of fusing multiple initial feature maps. The target training image is convolved by the first feature extraction network to obtain four levels of initial feature maps. The top-level initial feature map serves as the top-level fused feature map; the top-level fused feature map is fused with the second-level initial feature map to obtain the second-level fused feature map; the second-level fused feature map is fused with the third-level initial feature map to obtain the third-level fused feature map; and the third-level fused feature map is fused with the fourth-level initial feature map to obtain the fourth-level fused feature map, which is the final fused feature map.
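The top-down loop of steps 02-08 can be sketched generically as follows. This is a minimal sketch: `upsample` stands for the interpolation that matches the previous fusion result to the current level's scale, `combine` for the point-wise fusion (e.g. addition), and both names are illustrative:

```python
def fuse_pyramid(initial_maps, upsample, combine):
    """Top-down fusion of pyramid feature maps (steps 02-08).

    initial_maps: list ordered from the top level (smallest scale)
                  to the bottom level (largest scale).
    upsample(f, ref): enlarges map f to the scale of reference map ref.
    combine(a, b):    point-wise fusion of two same-scale maps.
    """
    fused = initial_maps[0]            # step 04: top-level fusion result
    for level in initial_maps[1:]:     # step 06: every level below the top
        fused = combine(upsample(fused, level), level)
    return fused                       # step 08: bottom-level result is final
```

With four levels, the loop reproduces exactly the cascade shown in Fig. 3.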
Step 210: Input the fused feature map to the first output network, which outputs candidate regions for the text regions in the target training image and a probability value for each candidate region.
Taking a convolutional network as an example, the first output network includes a first convolutional layer and a second convolutional layer arranged in parallel, configured to output the vertex coordinates of the candidate regions and the probability values of the candidate regions, respectively. Step 210 may be implemented through the following steps 12-16:
Step 12: Input the fused feature map to the first convolutional layer and to the second convolutional layer, respectively.
Step 14: Perform a first convolution operation on the fused feature map through the first convolutional layer and output a coordinate matrix; the coordinate matrix includes the vertex coordinates of the candidate regions of the text regions in the target training image.
For example, the coordinate matrix can be expressed as n*H*W, where H and W are the height and width of the coordinate matrix and n is its dimension. When a candidate region is a quadrilateral, it is determined by four vertex coordinates, so n is 8; when a candidate region is another polygon, n is usually twice the number of sides of the region.
Step 16: Perform a second convolution operation on the fused feature map through the second convolutional layer and output a probability matrix; the probability matrix includes the probability value of each candidate region.
The probability value of each candidate region may also be called its score; it characterizes the probability that the candidate region completely contains a text line.
Step 212: Determine a first loss value for the candidate regions and the probability value of each candidate region through a preset detection loss function; train the first initial model according to the first loss value until the parameters of the first initial model converge, thereby obtaining the text detection model.
In one case, the detection loss function includes a first function and a second function, used to compute the losses on the vertex coordinates of the candidate regions and on the probability value of each candidate region, respectively. The first function is L1 = |G* - G|, where G* is the coordinate matrix of the text regions pre-annotated in the target training image, and G is the coordinate matrix of the candidate regions of the text regions output by the first output network. The second function is L2 = -Y*·log(Y) - (1 - Y*)·log(1 - Y), where Y* is the probability matrix of the text regions pre-annotated in the target training image, Y is the probability matrix of the candidate regions output by the first output network, and log denotes the logarithm. The first loss value for the vertex coordinates and the probability value of each candidate region is the sum of the first function and the second function, that is, L = L1 + L2.
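A scalar sketch of this loss, assuming the coordinate and probability matrices have been flattened into sequences of numbers (the function name is illustrative, and this is not the patent's actual implementation):

```python
import math

def detection_loss(g_true, g_pred, y_true, y_pred):
    """First loss value L = L1 + L2 over flattened matrices.

    L1: absolute coordinate regression loss |G* - G|, summed element-wise.
    L2: cross-entropy -Y*·log(Y) - (1 - Y*)·log(1 - Y), summed element-wise.
    """
    l1 = sum(abs(gt - gp) for gt, gp in zip(g_true, g_pred))
    l2 = sum(-yt * math.log(yp) - (1.0 - yt) * math.log(1.0 - yp)
             for yt, yp in zip(y_true, y_pred))
    return l1 + l2
```

Both terms shrink as the predicted coordinates and probabilities approach the annotations, which is what drives the parameter updates during training.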
Based on the above description of the first loss value, the process of training the first initial model according to the first loss value may be implemented through the following steps 22-28:
Step 22: Update the parameters of the first initial model according to the first loss value.
In one case, a functional mapping can be preset; the original parameters and the first loss value are input into this mapping to compute the updated parameters. The mappings for different parameters may be the same or different.
Specifically, the parameters to be updated can first be determined according to a preset rule; they may be all parameters of the first initial model, or a subset of parameters randomly selected from it. The derivative of the first loss value with respect to each parameter to be updated, ∂L/∂W, is then computed, where L is the first loss value, W is a parameter to be updated, and ∂ denotes the partial derivative. The parameters to be updated may also be called the weights of the neurons. This process may also be called the back-propagation algorithm: if the first loss value is large, the output of the current first initial model does not match the expected output, so the derivative of the first loss value with respect to each parameter to be updated is computed and serves as the basis for adjusting that parameter.
After the derivative of each parameter to be updated is obtained, each parameter is updated as W = W - α·∂L/∂W, where α is a preset coefficient. This process may also be called the stochastic gradient descent algorithm: the derivative with respect to each parameter indicates the direction in which the first loss value decreases fastest relative to the current parameter, so adjusting the parameter along this direction reduces the first loss value quickly and makes the parameter converge. In addition, after one round of training of the first initial model, a first loss value is obtained; at this point, one or more parameters may be randomly selected from the parameters of the first initial model for the above update, which shortens training time and speeds up the algorithm; alternatively, the above update may be performed on all parameters of the first initial model, which makes training more accurate.
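The update rule W = W - α·∂L/∂W can be sketched as follows (a minimal illustration over a flat list of parameters; a real framework applies the same rule to every weight tensor in the model):

```python
def sgd_update(params, grads, alpha=0.1):
    """One stochastic-gradient-descent step: W <- W - alpha * dL/dW.

    params: current parameter values.
    grads:  derivatives of the first loss value w.r.t. each parameter.
    alpha:  the preset coefficient (step size).
    """
    return [w - alpha * g for w, g in zip(params, grads)]
```

Repeating this step until the parameter values stop changing appreciably is what the text calls parameter convergence.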
Step 24: Judge whether all the updated parameters have converged; if so, execute step 26; otherwise, execute step 28.
Step 26: Determine the first initial model with the updated parameters as the detection model; end.
Step 28: Continue to execute the step of determining a target training image based on the preset training set, until all the updated parameters converge.
Specifically, a new image may be retrieved from the training set as the target training image, or the current target training image may continue to be used for training.
In the above manner, the feature extraction network automatically extracts feature maps of different scales, the feature maps of different scales are then fused, and candidate regions for text regions of various scales in the image are obtained based on the resulting fused feature map. With this detection model, a single input image yields candidate regions for text regions of various scales, with no need to manually rescale the image. The operation is convenient; in scenes with multiple font sizes, typefaces, shapes, and orientations in particular, all kinds of text in the image can be detected quickly, comprehensively, and accurately, which in turn benefits the accuracy of subsequent text recognition and improves the recognition effect.
The embodiments of the present application further provide a text region determination method, implemented on the basis of the text detection model training method described in the above embodiments; as shown in Fig. 4, the method includes the following steps:
Step S402: Acquire an image to be detected; the image to be detected may be a picture, or a video frame captured from a video file or a live video stream.
Step S404: Input the image to be detected into a text detection model pre-trained with the text detection model training method provided in the above embodiments, and output multiple candidate regions for the text regions in the image to be detected, together with the probability value of each candidate region.
Step S406: Determine the text regions in the image to be detected from the multiple candidate regions according to the probability values of the candidate regions and the degree of overlap among the multiple candidate regions.
Among the candidate regions output by the text detection model, multiple candidate regions may correspond to the same text line; to find the region that best matches a text line from among them, the candidate regions need to be filtered. In most cases, candidate regions that overlap each other heavily correspond to the same text line, so the text region for that line can be determined from the probability values of those heavily overlapping candidate regions; for example, among the heavily overlapping candidate regions, the one with the largest probability value is determined as the text region. If the image contains multiple text lines, multiple text regions can be determined.
In the text region determination method provided by the embodiments of the present application, the acquired image to be detected is input to the text detection model, which outputs multiple candidate regions for the text regions in the image and the probability value of each candidate region; the text regions in the image are then determined from the candidate regions according to their probability values and the degree of overlap among them. In this method, the text detection model automatically extracts features of different scales, so inputting a single image into the model yields candidate regions for text regions of various scales, with no need to manually rescale the image. The operation is convenient; in scenes with multiple font sizes, typefaces, shapes, and orientations in particular, all kinds of text in the image can be detected quickly, comprehensively, and accurately, which in turn benefits the accuracy of subsequent text recognition and improves the recognition effect.
The embodiments of the present application further provide another text region determination method, implemented on the basis of the text region determination method in the above embodiment; this method focuses on the specific process of determining the text regions in the image to be detected according to the vertex coordinates and probability values of the candidate regions output by the detection network. As shown in Fig. 5, the method includes the following steps:
Step S502: Acquire an image to be detected.
Step S504: Input the image to be detected into the pre-trained text detection model, and output multiple candidate regions for the text regions in the image to be detected, together with the probability value of each candidate region.
Step S506: Among the multiple candidate regions, eliminate the candidate regions whose probability value is lower than a preset probability threshold to obtain the final multiple candidate regions.
Step S506 is optional: in step S508 below, either every candidate region output by the detection model may be arranged, or the candidate regions whose probability value is below the preset probability threshold may first be eliminated and only the remaining candidate regions arranged. The preset probability threshold may be set in advance, e.g., 0.2 or 0.1; eliminating candidate regions whose probability value is below the threshold reduces the amount of computation in the subsequent determination of the text regions and increases processing speed.
Step S508: Arrange the multiple candidate regions in order of probability value, so that the first candidate region has the largest probability value and the last candidate region has the smallest.
Step S510: Take the first candidate region as the current candidate region, and compute one by one the degree of overlap between the current candidate region and each candidate region other than the current one.
The candidate regions other than the current candidate region may be referred to simply as other candidate regions. When computing the degree of overlap between the current candidate region and another candidate region, the intersection-over-union (IoU) of the two regions may be computed, i.e., the area of the intersection of the two candidate regions divided by the area of their union. It can be understood that the larger the IoU, the greater the overlap between the two regions. For the current candidate region, another candidate region that overlaps it heavily usually represents the same text line as the current candidate region; since the probability value of the other candidate region is smaller than that of the current one, the other candidate region can be eliminated so that the text line is represented by the current candidate region.
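The intersection-over-union computation can be sketched as follows. This is a simplified illustration using axis-aligned rectangles (x1, y1, x2, y2); the patent's candidate regions are general quadrilaterals, but the ratio of intersection area to union area is the same idea:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                 # union area
    return inter / union if union else 0.0
```

The result lies in [0, 1]: 0 for disjoint regions, 1 for identical ones.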
Step S512: Among the candidate regions other than the current candidate region, eliminate those whose degree of overlap is greater than a preset overlap threshold; the overlap threshold may be set in advance, e.g., 0.5 or 0.6.
Step S514: Take the candidate region following the current candidate region as the new current candidate region, and continue the step of computing one by one the degree of overlap between the current candidate region and the candidate regions other than it, until the last candidate region is reached.
Steps S510-S514 form a loop; in each round, some candidate regions are eliminated. When the last candidate region has been traversed, the loop ends, and the finally remaining candidate regions are determined as the text regions in the image to be detected. If multiple candidate regions remain at the end, it can be determined that the image to be detected contains multiple text regions.
Steps S510-S514 can also be expressed as: in the arranged order, for each candidate region in turn, compute one by one the degree of overlap between that candidate region and each candidate region other than it; and, among the candidate regions other than that candidate region, eliminate those whose degree of overlap is greater than the preset overlap threshold.
Step S516: Determine the remaining candidate regions after elimination as the text regions in the image to be detected.
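Taken together, steps S508-S516 form a non-maximum-suppression loop, which can be sketched generically as follows (`overlap` stands for the IoU computation described above; the function and parameter names are illustrative):

```python
def nms(regions, overlap, threshold=0.5):
    """Steps S508-S516: keep one highest-probability region per text line.

    regions:   list of (region, probability) pairs.
    overlap:   function giving the degree of overlap of two regions.
    threshold: preset overlap threshold (e.g. 0.5 or 0.6).
    """
    ordered = sorted(regions, key=lambda rp: rp[1], reverse=True)  # step S508
    kept = []
    while ordered:
        current, prob = ordered.pop(0)     # steps S510/S514: current candidate
        kept.append((current, prob))       # survives elimination
        # step S512: drop remaining regions overlapping the current one heavily
        ordered = [(r, p) for r, p in ordered if overlap(current, r) <= threshold]
    return kept                            # step S516: remaining text regions
```

Because the list is processed in descending probability order, each text line ends up represented by its highest-scoring candidate region.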
In the above manner, multiple candidate regions and the probability value of each candidate region are obtained through the text detection model, and the text regions are then determined from the multiple candidate regions by non-maximum suppression. In this method, the text detection model automatically extracts features of different scales, so inputting a single image into the model yields candidate regions for text regions of various scales, with no need to manually rescale the image. The operation is convenient; in scenes with multiple font sizes, typefaces, shapes, and orientations in particular, all kinds of text in the image can be detected quickly, comprehensively, and accurately, which in turn benefits the accuracy of subsequent text recognition and improves the recognition effect.
The embodiments of the present application further provide a text content determination method, implemented on the basis of the text region determination method described in the above embodiments; as shown in Fig. 6, the method includes the following steps:
Step S602: Acquire the text regions in an image through the above text region determination method.
Step S604: Input the text regions into a pre-trained text recognition model, and output the recognition result of each text region.
Step S606: Determine the text content in the text regions according to the recognition results.
The text recognition model can be trained in a variety of ways, e.g., as a recurrent neural network or a convolutional neural network; the recognition result of a text region can of course also be obtained by optical character recognition. The recognition result output by the text recognition model may be determined directly as the text content of the text region, or the recognition result may first be post-processed, e.g., by deleting repeated characters and blank characters, and the processed result then determined as the text content of the text region.
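The post-processing mentioned above (deleting repeated characters and blank characters) resembles a CTC-style collapse and can be sketched as follows. The blank symbol "-" and the function name are assumptions for illustration, not details specified in the text:

```python
def collapse_recognition(raw, blank="-"):
    """Merge consecutive repeated characters, then delete blank characters."""
    out = []
    prev = None
    for ch in raw:
        if ch != prev:          # keep a character only when it changes
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)  # drop the blanks
```

For example, a raw sequence "hh-ee--ll-lo" collapses to "hello"; the blank between the two "l" runs is what preserves genuinely doubled letters.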
In the text content determination method provided by the embodiment of the present application, the text region in the image is first obtained through the above text region determination method; the text region is then input into a pre-trained text recognition model, which outputs the recognition result of the text region; finally, the text content in the text region is determined according to the recognition result. In this manner, since the above text region determination method can obtain text regions of various scales through the text detection model, all kinds of text in the image can be detected quickly, comprehensively, and accurately in scenes with multiple font sizes, fonts, shapes, and orientations, which in turn benefits the accuracy of text recognition and improves the recognition effect.
An embodiment of the present application further provides another text content determination method, which is implemented on the basis of the method described in the foregoing embodiment. This method focuses on the training method of the text recognition model. The text recognition model can be used for text recognition, which can be understood as follows: the text region in a picture is detected to locate the picture region containing text, and the language meaning expressed by the text is then determined from that picture region. As shown in FIG. 7, the model is trained in the following manner:
Step S702: determine a target training text image based on a preset training set;

In the subsequent content, in some cases the training text image needs to be determined multiple times during the training of the second initial model. In one implementation, the target training text image can be determined from the preset training set each time; in other implementations, a new training text image can also be obtained anew each time.
Taking determining the target training text image from a preset training set as an example, the target training text image may be a separate image, or an image region annotated on an image. The training set may contain multiple images. To broaden the applicability of the text recognition model, the images in the training set may cover various scenes, for example live-streaming scenes, game scenes, outdoor scenes, and indoor scenes; the images may also contain text lines of multiple font sizes, shapes, fonts, and languages, so that the trained text recognition model can detect all kinds of text lines. Each target training text image corresponds to the manually annotated text content of a text line, such as "hello" or "awesome"; each target training text image corresponds to one annotated text content.
After the annotation is completed, a character library can also be built from the text content of all text lines corresponding to all images in the training set. Specifically, the text content of all text lines corresponding to all images in the training set is obtained, the distinct characters are extracted from it, and those mutually distinct characters form the character library. In addition, the multiple images in the training set may be divided into a training subset and a test subset according to a preset ratio. During training, target training images can be obtained from the training subset; after training is completed, target test images can be obtained from the test subset to test the performance of the text recognition model.
Step S704: input the target training text image into a second initial model, where the second initial model includes a second feature extraction network, a feature splitting network, a second output network, and a classification function;

Step S706: extract the feature map of the target training text image through the second feature extraction network;
The second feature extraction network can be implemented with multiple convolutional layers. Usually the convolutional layers are connected in sequence, each convolutional layer performs a convolution on its input data with a correspondingly configured convolution kernel, and the data output by the last convolutional layer serves as the feature map of the target training text image.
Step S708: split the feature map into at least one sub-feature map through the feature splitting network;

Since the purpose is to recognize text content, the text recognition model needs to split the feature map corresponding to a text line so that each sub-feature map contains one character or symbol, or a small number of them, which facilitates recognition of the text content. During splitting, the scale of the sub-feature maps can be preset and the feature map split based on that scale; alternatively, the number of sub-feature maps can be preset and the feature map split based on that number. Of course, if the text line is inherently short, for example containing only one character, the feature map may be split into only one sub-feature map.
Step S710: input the above sub-feature maps into the second output network respectively, and output the output matrix corresponding to each sub-feature map;

The second output network is configured to perform a further computation on the sub-feature maps. In the output matrix corresponding to each sub-feature map, each position corresponds to a preset character, and the value at that position can represent how well the sub-feature map matches the character corresponding to that position. The second output network may be a convolutional network or a fully connected network.
Step S712: input the output matrix corresponding to each sub-feature map into the classification function, and output the probability matrix corresponding to each sub-feature map;

The classification function can map each value in the output matrix to a probability value, thereby obtaining a probability matrix. The probability value at each position of the probability matrix can be used to represent the probability that the sub-feature map matches the character corresponding to that position.
Step S714: determine the second loss value of the probability matrices through a preset recognition loss function, and train the second initial model according to the second loss value until the parameters in the second initial model converge, obtaining the text recognition model.

For example, the target training text image can be pre-annotated with standard text content, which may consist of one or more standard characters; a probability matrix can be generated based on that text content. In this probability matrix, the probability value at the position corresponding to a sub-feature map's standard character can be 1, and the probability values at the other positions can be 0. The recognition loss function compares the probability matrices output by the classification function with the probability matrices of the standard text content; generally, the larger the difference, the larger the second loss value. Based on the second loss value, the parameters of each part of the second initial model can be adjusted to achieve the purpose of training. When the parameters of the model converge, training ends and the text recognition model is obtained.
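As a minimal sketch of the target probability matrix just described, the row for each sub-feature map has probability 1 at the annotated standard character's position and 0 elsewhere. The character-to-index table below is a hypothetical example, not taken from the source:

```python
def one_hot_target(char_index, num_classes):
    # Target probability row for one sub-feature map: 1 at the annotated
    # standard character's position, 0 at every other position.
    row = [0.0] * num_classes
    row[char_index] = 1.0
    return row

# Hypothetical character library; index K (here 4) is left for the empty character.
char_to_index = {"h": 0, "e": 1, "l": 2, "o": 3}
K = len(char_to_index)
targets = [one_hot_target(char_to_index[c], K + 1) for c in "hello"]
```

A loss function can then compare each predicted probability matrix against these one-hot targets.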
In the above training method for the text recognition model, the feature map of the target training text image is first extracted; the feature map is then split into at least one sub-feature map; the sub-feature maps are input into the second output network respectively, which outputs the output matrix corresponding to each sub-feature map; the probability matrix corresponding to each sub-feature map is then obtained through the classification function; and after the second loss value of the probability matrices is determined through the preset recognition loss function, the second initial model is trained according to the second loss value to obtain the text recognition model. In this manner, the model can automatically segment the feature map of an image, so the text recognition model only needs an image containing a text line as input to obtain the text content in that image. There is no need to segment the text line first: the text content of the text line is obtained directly, the operation is convenient, the computation is fast, and the recognition accuracy is high.
The following focuses on the specific implementation of each step of the above training method:
Step S702: determine a target training text image based on a preset training set;

Step S704: input the target training text image into a second initial model, where the second initial model includes a second feature extraction network, a feature splitting network, a second output network, and a classification function;

Step S706: extract the feature map of the target training text image through the second feature extraction network;
The second feature extraction network may include multiple groups of second convolutional networks connected in sequence, each group including a convolutional layer, a pooling layer, and an activation function layer connected in sequence. FIG. 8 shows a schematic structural diagram of such a second feature extraction network, taking four groups of second convolutional networks as an example, in which the convolutional layer of each subsequent group is connected to the activation function layer of the preceding group. The second feature extraction network may also contain more or fewer groups of second convolutional networks.

It can be understood that the convolutional layer in a second convolutional network is used to extract features and generate a feature map. The pooling layer may be an average pooling (mean-pooling) layer, a global average pooling layer, a max-pooling layer, etc. It can be used to compress the feature map output by the convolutional layer, retaining the main features and discarding the non-main ones to reduce the dimension of the feature map. Taking average pooling as an example, the average pooling layer averages the feature point values within a neighborhood of preset size around the current feature point and uses the average as the new value of that feature point. In addition, the pooling layer helps the feature map maintain certain invariances, such as rotation invariance, translation invariance, and scale invariance. The activation function layer applies a function transformation to the feature map processed by the pooling layer; this transformation breaks the linear combination of the convolutional layer's input and improves the feature expression capability of the second convolutional network. The activation function may specifically be the Sigmoid function, the tanh function, the ReLU function, etc.
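The average pooling just described can be sketched as follows, assuming a k×k neighborhood applied with stride k (the 4×4 map and k=2 are illustrative values, not from the source):

```python
import numpy as np

def average_pool(feature_map, k):
    # Replace each k x k neighbourhood with the mean of its feature values,
    # compressing the feature map while keeping its main features.
    h, w = feature_map.shape
    blocks = feature_map[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k)
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1., 3., 2., 4.],
                 [5., 7., 6., 8.],
                 [0., 2., 1., 3.],
                 [4., 6., 5., 7.]])
pooled = average_pool(fmap, 2)  # 4x4 map compressed to a 2x2 map
```

Each output value is the mean of one 2×2 neighborhood, so the spatial dimension halves while the dominant feature values are preserved.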
Step S708: split the feature map into at least one sub-feature map through the feature splitting network;

Considering that most text lines are arranged horizontally, in order for each split sub-feature map to contain the features corresponding to one character or a small number of characters, the feature map can be split into at least one sub-feature map along its column direction; the column direction of the feature map can be understood as the direction perpendicular to the text line direction. In one case, the width of the sub-feature maps can be set according to the width of most characters, and the feature map split according to that width: for example, if the feature map is H*W*C and the preset sub-feature-map width is k, the feature map is split into W/k sub-feature maps of size H*k*C each. Alternatively, the number of sub-feature maps can be preset, for example T, in which case each sub-feature map is H*(W/T)*C.
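The column-direction split above can be sketched as follows (the values of H, W, C, and T are illustrative, not from the source):

```python
import numpy as np

H, W, C, T = 8, 32, 16, 4                     # illustrative sizes
feature_map = np.arange(H * W * C, dtype=float).reshape(H, W, C)

# Split along the column (width) direction into T sub-feature maps,
# each of shape H x (W/T) x C.
sub_maps = np.split(feature_map, T, axis=1)
```

Each sub-feature map then covers a vertical slice of the text line, roughly one or a few characters wide.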
Step S710: input the above sub-feature maps into the second output network respectively, and output the output matrix corresponding to each sub-feature map;

Taking a fully connected network as an example, the second output network includes multiple fully connected layers arranged in parallel, and the number of fully connected layers corresponds to the number of sub-feature maps. Each sub-feature map is input into its corresponding fully connected layer, and each fully connected layer outputs the output matrix corresponding to its sub-feature map.
Step S712: input the output matrix corresponding to each sub-feature map into the classification function, and output the probability matrix corresponding to each sub-feature map;
The classification function may be a Softmax function, which can be written as:

p_t^i = e^{o_t^i} / Σ_{m=1}^{K+1} e^{o_t^m}

where e is the natural constant; t indexes the t-th probability matrix; K is the number of distinct characters contained in the target training text images of the training set, with m ranging from 1 to K+1; Σ denotes the summation operation; o_t^i is the i-th element of the output matrix; and p_t^i is the i-th element of the probability matrix p_t.
Compared with an element o_t^i of the output matrix itself, its exponential value e^{o_t^i} enlarges the differences between elements. For example, for the output matrix [3, 1, -3], the matrix of exponential values is approximately [20, 2.7, 0.05]. Computing each element's probability from its exponential value therefore widens the probability gaps, giving the correct recognition result a higher probability and benefiting the accuracy of the recognition result.
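A minimal sketch of the Softmax computation and of the amplification effect of the exponential on the example output matrix [3, 1, -3]:

```python
import numpy as np

def softmax(outputs):
    # p_i = e^{o_i} / sum_m e^{o_m}: the exponential widens the gaps
    # between elements before they are normalised into probabilities.
    exp = np.exp(outputs)
    return exp / exp.sum()

o = np.array([3.0, 1.0, -3.0])
exp_values = np.exp(o)       # ≈ [20.09, 2.72, 0.05], as in the example above
probs = softmax(o)           # the first element clearly dominates
```

The probability assigned to the largest output element ends up close to 0.88, illustrating how the exponential sharpens the distribution toward the most likely character.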
Step S714: determine the second loss value of the probability matrices through a preset recognition loss function, and train the second initial model according to the second loss value until the parameters in the second initial model converge, obtaining the text recognition model.

The recognition loss function includes L = -log p(y | {p_t}_{t=1...T}), where y is the pre-annotated probability matrix of the target training text image; t indexes the t-th probability matrix; p_t is the probability matrix corresponding to each sub-feature map output by the classification function; T is the total number of probability matrices; p denotes the computed probability; and log denotes the logarithm operation. Based on this recognition loss function, the process of training the second initial model according to the second loss value in the above step can also be implemented through the following steps 32-38:
Step 32: update the parameters in the second initial model according to the second loss value;

For example, a function mapping can be preset; inputting the original parameter and the second loss value into this mapping yields the updated parameter. The function mappings for different parameters may be the same or different.
Specifically, the parameters to be updated can be determined from the second initial model according to a preset rule; they may be all parameters of the second initial model, or a subset of parameters selected at random from it. The derivative of the second loss value with respect to each parameter to be updated, ∂L'/∂W', is then computed, where L' is the loss value of the probability matrices, W' is the parameter to be updated, and ∂ denotes the partial derivative operation; the parameters to be updated may also be called the weights of the neurons. This process is also known as the back-propagation algorithm: if the second loss value is large, the output of the current second initial model deviates from the expected output, and the derivative of the second loss value with respect to each parameter to be updated serves as the basis for adjusting that parameter.
After the derivatives of the parameters to be updated are obtained, each parameter is updated as W' ← W' - α'·∂L'/∂W', where α' is a preset coefficient. This process is also known as the stochastic gradient descent algorithm: the derivative of each parameter to be updated can be understood as the direction in which the second loss value decreases fastest from the current parameter value, so adjusting the parameter along this direction reduces the second loss value quickly and makes the parameter converge. In addition, after one round of training of the second initial model yields a second loss value, one or more parameters can be randomly selected from the second initial model for the above update, which shortens the training time and speeds up the algorithm; of course, the above update can also be applied to all parameters of the second initial model, which makes the training more accurate.
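The update rule above, W' ← W' - α'·∂L'/∂W', can be sketched as a single gradient-descent step (the parameter and gradient values are illustrative):

```python
def sgd_update(params, grads, lr):
    # One stochastic-gradient-descent step: move each parameter against its
    # derivative, scaled by the preset coefficient lr (the learning rate).
    return [w - lr * g for w, g in zip(params, grads)]

weights   = [0.5, -0.2, 1.0]   # current parameters W' (illustrative)
gradients = [0.1, -0.4, 0.0]   # derivatives dL'/dW' (illustrative)
updated = sgd_update(weights, gradients, lr=0.1)   # ≈ [0.49, -0.16, 1.0]
```

Updating only a random subset of parameters corresponds to passing shorter `params`/`grads` lists; updating all parameters corresponds to passing the full lists.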
Step 34: judge whether the updated parameters have all converged; if they have all converged, execute step 36; if they have not all converged, execute step 38;

Step 36: determine the second initial model with the updated parameters as the recognition model;

Step 38: continue to execute the step of determining a target training text image based on the preset training set, until all updated parameters converge.
Specifically, a new image can be obtained from the training set as the target training text image, or the current target training text image can continue to be used for training.

In the above manner, the model can automatically segment the feature map of an image. Applying this text recognition model, an image containing a text line can be input to obtain the text content in that image. There is no need to segment the text line first: the text content of the text line is obtained directly, the operation is convenient, the computation is fast, and the recognition accuracy is high.
Based on the text content determination method provided by the foregoing embodiment, an embodiment of the present application further provides another text content determination method, implemented on the basis of the text content determination method or the text recognition model training method described above. This method focuses on the process of obtaining the text content of the text region from the recognition result after the text recognition model outputs it. As shown in FIG. 9, the method includes the following steps:
Step S902: obtain the text region in the image through the above text region determination method;

Step S904: normalize the text region according to a preset size.

The preset size may include a preset length and width. If the text region does not satisfy the preset size, it can be scaled, cropped, or padded with blank space so that the processed text region satisfies the preset size.
Step S906: input the processed text region into a pre-trained text recognition model, and output the recognition result of the text region, where the recognition result includes multiple probability matrices corresponding to the text region;

During recognition, the text recognition model needs to segment the feature map corresponding to the text region; each segmented sub-feature map is passed through its corresponding output network to produce an output matrix, and the probability matrix corresponding to each output matrix is then obtained through the classification function. The recognition result of the text region therefore includes multiple probability matrices, and each probability matrix usually corresponds to one character or a small number of characters.
Step S908: determine the position of the maximum probability value in each probability matrix;

Step S910: obtain the character corresponding to the position of the maximum probability value from the preset correspondence between positions in the probability matrix and characters; for convenience of description, the obtained characters may be called the characters to be arranged.

As described in the foregoing embodiment, the probability value at each position of a probability matrix can be used to represent the probability that the sub-feature map matches the character corresponding to that position. The character corresponding to the position of the maximum probability value can therefore be determined as the recognition result of the corresponding sub-feature map; in most cases this is one character, but it may also be multiple characters. The correspondence between positions and characters can be established as follows: characters are first collected, including text in multiple languages, punctuation marks, mathematical symbols, network emoticons, etc.; specifically, the characters can be collected while building the training set, or gathered from dictionaries, character libraries, symbol libraries, and so on.
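A minimal sketch of steps S908-S910, using a hypothetical position-to-character table (index 4 standing for the empty character "-"; the table and probability values are illustrative, not from the source):

```python
index_to_char = {0: "h", 1: "e", 2: "l", 3: "o", 4: "-"}  # hypothetical table

def best_char(prob_row):
    # Find the position of the maximum probability value in one probability
    # matrix row, then look up the character corresponding to that position.
    best = max(range(len(prob_row)), key=lambda i: prob_row[i])
    return index_to_char[best]

prob_matrices = [
    [0.10, 0.10, 0.10, 0.10, 0.60],   # empty character wins
    [0.70, 0.10, 0.10, 0.05, 0.05],   # "h" wins
    [0.10, 0.60, 0.10, 0.10, 0.10],   # "e" wins
]
chars = [best_char(row) for row in prob_matrices]
```

The resulting characters are then arranged in the order of their probability matrices, as step S912 describes.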
Step S912: arrange the obtained characters (the above characters to be arranged) according to the arrangement order of the multiple probability matrices;

The multiple probability matrices output by the text recognition model are usually ordered according to the positions, within the feature map, of the sub-feature maps they correspond to, so the order of the probability matrices is usually consistent with the order of the characters contained in those sub-feature maps. Based on this, arranging the obtained characters according to the order of the probability matrices makes the arranged characters consistent with the character order of the original text line, so the text content of the text region can be determined from the arranged characters.
Step S914: determine the text content in the text region according to the arranged characters.

For example, the arranged characters can be taken directly as the text content of the text region. However, since the characters in a text have different font sizes, the text recognition model may not split the feature map exactly one character per sub-feature map, so the final arranged characters may contain duplicates. To further optimize the recognition effect, repeated characters and empty characters can be deleted from the arranged characters according to preset rules to obtain the text content of the text region.
Specifically, a lexicon of reduplicated words can be built in advance. If the arranged characters contain a repeated character, the lexicon is consulted to check whether that repetition is legitimate; if it is not found there, the repeated character is deleted, keeping only one of the repeated characters. The semantics of the surrounding characters can also be used to judge whether the current context should contain a repeated character. For empty characters, the current context can likewise be used to decide whether to delete them: if an empty character lies between two English words, it need not be deleted and can be kept. For example, if the arranged characters are "--hh-e-l-ll-oo-", where "-" represents the empty character, the text content obtained after deleting repeated characters and empty characters is "hello".
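The worked example above can be sketched as a simple collapse step (merge consecutive duplicate characters, then drop empty characters); the fuller rule described in the text additionally consults the reduplicated-word lexicon and the surrounding context:

```python
def collapse(chars, blank="-"):
    # Merge consecutive duplicate characters, then delete empty characters.
    merged = []
    prev = None
    for c in chars:
        if c != prev:
            merged.append(c)
        prev = c
    return "".join(c for c in merged if c != blank)

text = collapse("--hh-e-l-ll-oo-")  # -> "hello"
```

Note that a repeated character separated by an empty character (such as the two "l"s in the example) survives the merge, which is what lets genuine double letters through.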
In the above manner, the obtained text region is first normalized; the recognition result of the text region is then obtained through the text recognition model; the recognized characters are determined through the probability matrices in the recognition result; and the text content of the text region is thereby obtained. Since the text recognition model can automatically segment the feature map of an image, in this manner an image containing a text line can be input to obtain its recognition result and hence its text content. There is no need to segment the text line first: the text content of the text line is obtained directly, the operation is convenient, the computation is fast, and the recognition accuracy is high.
基于上述实施例提供的文本内容确定方法,本申请实施例还提供另一种文本内容确定方法,该方法在上述方法的基础上实现;该方法重点描述得到文本区域的文本内容后,基于该文本内容判断图像中是否包含敏感词的过程。Based on the method for determining text content provided by the foregoing embodiment, the embodiment of the present application also provides another method for determining text content, which is implemented on the basis of the foregoing method; the method focuses on obtaining the text content of the text area and then based on the text The content determines whether the image contains sensitive words.
通常，需要预先建立一个敏感词库，通过该敏感词库确定图像对应的文本内容中是否包含有敏感信息；该敏感词库中包含有敏感词，如涉及色情、反动、恐怖主义的敏感词；可以逐一将文本内容中的词语与该敏感词库进行匹配，如果匹配成功，则说明当前词语为敏感词。基于此，本实施例的文本内容确定方法包括如下步骤，如图10所示：Usually, a sensitive-word lexicon needs to be established in advance, and this lexicon is used to determine whether the text content corresponding to the image contains sensitive information. The lexicon contains sensitive words, such as words involving pornography, subversion, or terrorism. The words in the text content can be matched against the lexicon one by one; if a match succeeds, the current word is a sensitive word. Based on this, the method for determining text content in this embodiment includes the following steps, as shown in FIG. 10:
步骤S1002,通过上述文本区域确定方法,获取图像中的文本区域;Step S1002: Obtain the text area in the image by the above-mentioned method for determining the text area;
步骤S1004,按照预设尺寸,对文本区域进行归一化处理。Step S1004: Normalize the text area according to the preset size.
该预设尺寸可以包含预设的长度和宽度，如果文本区域不满足该预设尺寸，可以对该文本区域进行缩放处理，也可以通过对该文本区域进行剪切或填补空白区域的方式，使处理后的文本区域满足上述预设尺寸。The preset size may include a preset length and width. If the text area does not satisfy the preset size, the text area can be scaled, or it can be cropped or padded with blank regions, so that the processed text area satisfies the preset size.
步骤S1006,将处理后的文本区域输入至预先训练完成的文本识别模型,输出文本区域的识别结果;该文本区域的识别结果包括文本区域对应的多个概率矩阵;Step S1006, input the processed text area into the pre-trained text recognition model, and output the recognition result of the text area; the recognition result of the text area includes multiple probability matrices corresponding to the text area;
步骤S1008,确定每个概率矩阵中的最大概率值的位置;Step S1008: Determine the position of the maximum probability value in each probability matrix;
步骤S1010,从预先设置的概率矩阵中各个位置与字符的对应关系中,获取最大概率值的位置对应的字符;Step S1010, obtaining the character corresponding to the position with the maximum probability value from the correspondence between each position and the character in the preset probability matrix;
步骤S1012,按照多个概率矩阵的排列顺序,排列获取到的字符;Step S1012: Arrange the acquired characters according to the arrangement order of the multiple probability matrices;
步骤S1014,根据排列后的字符确定文本区域中的文本内容。Step S1014: Determine the text content in the text area according to the arranged characters.
步骤S1016,如果图像中包含有多个文本区域,获取每个文本区域中的文本内容;Step S1016, if the image contains multiple text areas, obtain the text content in each text area;
步骤S1018,对获取到的文本内容进行分词操作;Step S1018, perform word segmentation operation on the obtained text content;
分词操作也可以称为切词操作；举例来说，可以建立一个词库，基于该词库进行分词操作；具体而言，可以从文本内容中的第一个字符开始，将该第一个字符和第二个字符作为一个组合，从词库中查找，如果找不到包含该组合对应的词，则将第一个字符划分为一个单独的词；如果可以找到包含该组合对应的词，再将第三个字符加入至该组合中，继续从词库中查找；直至找不到包含该组合对应的词，将该组合中除最后一个字符以外的字符划分为一个词，依此类推，直至完成文本内容的切词操作。The word segmentation operation can also be called a word-cutting operation. For example, a lexicon can be established and segmentation performed based on it. Specifically, starting from the first character of the text content, the first and second characters are taken as a combination and looked up in the lexicon. If no word containing the combination is found, the first character is split off as a single word; if a word containing the combination can be found, the third character is added to the combination and the lexicon is searched again, and so on, until no word containing the combination is found, at which point the characters of the combination except the last one are split off as a word. This continues until the segmentation of the text content is complete.
步骤S1020,逐一将分词操作后得到的分词与预先建立的敏感词库进行匹配;Step S1020, matching the word segmentation obtained after the word segmentation operation with the pre-established sensitive vocabulary one by one;
步骤S1022,如果至少一个分词匹配成功,确定图像对应的文本内容中包含有敏感信息。Step S1022, if at least one word segmentation is successfully matched, it is determined that the text content corresponding to the image contains sensitive information.
步骤S1024,获取匹配成功的分词所属的文本区域,在图像中标识出获取到的文本区域,或者匹配成功的分词。Step S1024: Obtain the text area to which the successfully matched word segment belongs, and identify the acquired text area or the successfully matched word segment in the image.
在实际实现时，可以以标识框的方式标识获取到的文本区域，或者匹配成功的分词；如果是在视频播放或实时直播场景下的实时检测，可以使用马赛克或模糊化的方式标识获取到的文本区域，或者匹配成功的分词，以达到过滤敏感词的目的。In actual implementation, the acquired text area, or the successfully matched word, can be marked with an identification box; for real-time detection in video playback or live-streaming scenarios, the acquired text area or the successfully matched word can instead be marked with a mosaic or blurring, so as to filter out the sensitive words.
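Steps S1018 to S1022 above can be sketched as follows. The greedy lexicon-driven segmentation and the exact-match lookup against the sensitive-word lexicon are simplifying assumptions for illustration; the patent does not prescribe a particular segmentation algorithm or lexicon format.

```python
def segment(text, lexicon):
    """Greedy forward matching per the description above: grow a combination
    while some lexicon word still starts with it; emit the longest prefix
    that is itself a word, or a single character if nothing matches."""
    words, i = [], 0
    while i < len(text):
        best, j = i + 1, i + 1
        while j <= len(text) and any(w.startswith(text[i:j]) for w in lexicon):
            if text[i:j] in lexicon:
                best = j
            j += 1
        words.append(text[i:best])
        i = best
    return words

def matched_sensitive(words, sensitive_lexicon):
    """Match each segmented word against the sensitive-word lexicon;
    a non-empty result means the text contains sensitive information."""
    return [w for w in words if w in sensitive_lexicon]

lexicon = {"free", "freedom"}
print(segment("freedoms", lexicon))                      # ['freedom', 's']
print(matched_sensitive(["hello", "spam"], {"spam"}))    # ['spam']
```

A production system would use a trie or an Aho–Corasick automaton instead of the linear lexicon scan shown here, but the control flow is the same.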
上述方式中，获取到文本区域的文本内容后，再通过敏感词库从文本内容中识别敏感词，以实现言论监管的目的；该方式可以实时获取内容并识别敏感词，有利于实现在网络直播、视频直播等场景下的言论监管，并限制敏感词传播的目的。In the above method, after the text content of the text area is obtained, sensitive words are identified in the text content through the sensitive-word lexicon, so as to achieve speech supervision. This method can obtain content and identify sensitive words in real time, which facilitates speech supervision in scenarios such as live streaming and live video, and restricts the spread of sensitive words.
需要说明的是,上述各方法实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。It should be noted that the foregoing method embodiments are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments can be referred to each other.
对应于上述方法实施例,参见图11所示的一种文本检测模型训练装置的结构示意图,该装置包括:Corresponding to the foregoing method embodiment, refer to the schematic structural diagram of a text detection model training device shown in FIG. 11, which includes:
训练图像确定模块110,设置为确定目标训练图像;The training image determining module 110 is configured to determine the target training image;
训练图像输入模块111,设置为将目标训练图像输入至第一初始模型;第一初始模型包括第一特征提取网络、特征融合网络和第一输出网络;The training image input module 111 is configured to input the target training image into a first initial model; the first initial model includes a first feature extraction network, a feature fusion network, and a first output network;
特征提取模块112,设置为通过第一特征提取网络提取目标训练图像的多个初始特征图;多个初始特征图之间的尺度不同;The feature extraction module 112 is configured to extract multiple initial feature maps of the target training image through the first feature extraction network; the multiple initial feature maps have different scales;
特征融合模块113,设置为通过特征融合网络对多个初始特征图进行融合处理,得到融合特征图;The feature fusion module 113 is configured to perform fusion processing on multiple initial feature maps through a feature fusion network to obtain a fusion feature map;
输出模块114,设置为将融合特征图输入至第一输出网络,输出目标训练图像中文本区域的候选区域以及每个候选区域的概率值;The output module 114 is configured to input the fusion feature map to the first output network, and output the candidate regions of the text region in the target training image and the probability value of each candidate region;
损失值确定和训练模块115，设置为通过预设的检测损失函数确定候选区域以及每个候选区域的概率值的第一损失值；根据第一损失值对第一初始模型进行训练，直至第一初始模型中的参数收敛，得到文本检测模型。The loss value determination and training module 115 is configured to determine a first loss value for the candidate regions and the probability value of each candidate region through a preset detection loss function, and to train the first initial model according to the first loss value until the parameters of the first initial model converge, obtaining the text detection model.
本申请实施例提供的文本检测模型训练装置，首先提取目标训练图像的尺度相互不同的多个初始特征图；再对多个初始特征图进行融合处理，得到融合特征图；进而将融合特征图输入至第一输出网络，输出目标训练图像中文本区域的候选区域以及每个候选区域的概率值；通过预设的检测损失函数确定第一损失值后，根据该第一损失值对第一初始模型进行训练，得到检测模型。该方式中，特征提取网络可以自动提取不同尺度的特征，因而该文本检测模型，只需要输入一张图像即可得到该图像中各种尺度的文本区域的候选区域，无需再人工变换图像尺度，操作便捷，尤其在多种字号、多种字体、多种形状、多种方向场景下，可以快速全面准确地检测出图像中的各类文本，进而也有利于后续文本识别的准确性，提高了文本识别的效果。The text detection model training device provided by the embodiment of the present application first extracts multiple initial feature maps of different scales from the target training image; it then fuses the multiple initial feature maps to obtain a fused feature map; the fused feature map is input to the first output network, which outputs the candidate regions of the text area in the target training image and the probability value of each candidate region; after the first loss value is determined by the preset detection loss function, the first initial model is trained according to that loss value to obtain the detection model. In this method, the feature extraction network can automatically extract features of different scales, so the text detection model only needs a single input image to obtain candidate regions for text areas of various scales in that image, without manually rescaling the image. The operation is convenient, and especially in scenes with multiple font sizes, fonts, shapes, and orientations, all kinds of text in the image can be detected quickly, comprehensively, and accurately, which also benefits the accuracy of subsequent text recognition and improves the recognition effect.
在一些实施例中,上述第一特征提取网络包括依次连接的多组第一卷积网络;每组第一卷积网络包括依次连接的卷积层、批归一化层和激活函数层。In some embodiments, the aforementioned first feature extraction network includes multiple groups of first convolutional networks connected in sequence; each group of first convolutional networks includes a convolution layer, batch normalization layer, and activation function layer connected in sequence.
在一些实施例中，上述特征融合模块还设置为：根据初始特征图的尺度，将多个初始特征图依次排列；其中，最顶层级的初始特征图的尺度最小；最底层级的初始特征图的尺度最大；按照排列顺序，依次针对所述最顶层级以下的每一层级，将该层级的初始特征图和该层级的上一层级的融合结果进行融合，得到该层级的融合结果；其中，所述最顶层级的融合结果为所述最顶层级的初始特征图；将最低层级的融合结果确定为所述初始特征图的融合特征图。In some embodiments, the feature fusion module is further configured to: arrange the multiple initial feature maps in order according to their scales, where the top-level initial feature map has the smallest scale and the bottom-level initial feature map has the largest scale; following the arrangement order, for each level below the top level, fuse the initial feature map of that level with the fusion result of the level above it to obtain the fusion result of that level, where the fusion result of the top level is its initial feature map; and determine the fusion result of the lowest level as the fused feature map of the initial feature maps.
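A minimal sketch of the top-down fusion described above, under the illustrative assumptions of single-channel feature maps stored as nested lists, nearest-neighbour 2x upsampling between levels, and element-wise addition as the fusion operation (the patent does not fix any of these choices):

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2D feature map (a list of rows):
    duplicate every value horizontally and every row vertically."""
    return [[v for v in row for _ in (0, 1)] for row in fm for _ in (0, 1)]

def fuse_pyramid(pyramid):
    """pyramid[0] is the top (smallest-scale) map; each lower level is fused
    with the upsampled fusion result of the level above it, and the
    bottom-level result is taken as the fused feature map."""
    fused = pyramid[0]                      # top-level fusion result
    for fm in pyramid[1:]:
        up = upsample2x(fused)
        fused = [[a + b for a, b in zip(r_fm, r_up)]
                 for r_fm, r_up in zip(fm, up)]
    return fused

print(fuse_pyramid([[[1]], [[1, 1], [1, 1]]]))  # [[2, 2], [2, 2]]
```

Each level's scale is assumed to be exactly twice that of the level above it, so the upsampled result aligns element-for-element with the next initial feature map.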
在一些实施例中，上述第一输出网络包括第一卷积层和第二卷积层；上述输出模块还设置为：将融合特征图分别输入至第一卷积层和第二卷积层；通过第一卷积层对融合特征图进行第一卷积运算，输出坐标矩阵；坐标矩阵包括目标训练图像中文本区域的候选区域的顶点坐标；通过第二卷积层对融合特征图进行第二卷积运算，输出概率矩阵；概率矩阵包括每个候选区域的概率值。In some embodiments, the first output network includes a first convolutional layer and a second convolutional layer, and the output module is further configured to: input the fused feature map into the first convolutional layer and the second convolutional layer respectively; perform a first convolution operation on the fused feature map through the first convolutional layer and output a coordinate matrix, where the coordinate matrix includes the vertex coordinates of the candidate regions of the text area in the target training image; and perform a second convolution operation on the fused feature map through the second convolutional layer and output a probability matrix, where the probability matrix includes the probability value of each candidate region.
在一些实施例中，上述检测损失函数包括第一函数和第二函数；第一函数为L1 = |G* − G|；其中，G*为预先标注的目标训练图像中文本区域的坐标矩阵；G为第一输出网络输出的目标训练图像中文本区域的候选区域的坐标矩阵；第二函数为L2 = −Y*·log(Y) − (1 − Y*)·log(1 − Y)；其中，Y*为预先标注的目标训练图像中文本区域的概率矩阵；Y为第一输出网络输出的目标训练图像中文本区域的候选区域的概率矩阵；候选区域以及每个候选区域的概率值的第一损失值L = L1 + L2。In some embodiments, the detection loss function includes a first function and a second function. The first function is L1 = |G* − G|, where G* is the pre-labeled coordinate matrix of the text area in the target training image and G is the coordinate matrix of the candidate regions of the text area output by the first output network. The second function is L2 = −Y*·log(Y) − (1 − Y*)·log(1 − Y), where Y* is the pre-labeled probability matrix of the text area in the target training image and Y is the probability matrix of the candidate regions output by the first output network. The first loss value of the candidate regions and their probability values is L = L1 + L2.
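A numerical sketch of the loss above, treating G*, G, Y*, Y as flat lists and summing element-wise; the reduction (sum vs. mean) is an assumption the patent leaves unspecified, and the predicted probabilities must lie strictly inside (0, 1) for the logarithms to be defined:

```python
import math

def detection_loss(g_star, g, y_star, y):
    """L = L1 + L2, with L1 = |G* - G| (absolute coordinate error) and
    L2 = -Y*log(Y) - (1 - Y*)log(1 - Y) (binary cross-entropy)."""
    l1 = sum(abs(a - b) for a, b in zip(g_star, g))
    l2 = sum(-ys * math.log(p) - (1.0 - ys) * math.log(1.0 - p)
             for ys, p in zip(y_star, y))
    return l1 + l2
```

For example, with a perfect coordinate prediction and a predicted probability of 0.5 for a positive label, the loss reduces to the cross-entropy term log 2.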
在一些实施例中，上述损失值确定和训练模块还设置为：根据第一损失值更新第一初始模型中的参数；判断更新后的参数是否均收敛；如果更新后的参数均收敛，将参数更新后的第一初始模型确定为检测模型；如果更新后的参数没有均收敛，继续执行基于预设的训练集合确定目标训练图像的步骤，直至更新后的参数均收敛。In some embodiments, the loss value determination and training module is further configured to: update the parameters in the first initial model according to the first loss value; determine whether the updated parameters have all converged; if so, determine the first initial model with the updated parameters as the detection model; if not, continue to perform the step of determining a target training image based on the preset training set until the updated parameters all converge.
在一些实施例中，上述损失值确定和训练模块还设置为：按照预设规则，从第一初始模型确定待更新参数；计算第一损失值对第一初始模型中待更新参数的导数 ∂L/∂W，其中，L为第一损失值，W为待更新参数；更新待更新参数，得到更新后的待更新参数 W ← W − α·∂L/∂W，其中，α为预设系数。In some embodiments, the loss value determination and training module is further configured to: determine the parameter to be updated from the first initial model according to a preset rule; compute the derivative ∂L/∂W of the first loss value with respect to the parameter to be updated, where L is the first loss value and W is the parameter to be updated; and update the parameter to obtain the updated parameter W ← W − α·∂L/∂W, where α is a preset coefficient.
参见图12所示的一种文本区域确定装置的结构示意图;该装置包括:See FIG. 12 for a schematic structural diagram of a text area determining device; the device includes:
图像获取模块120,设置为获取待检测图像;The image acquisition module 120 is configured to acquire the image to be detected;
检测模块122，设置为将待检测图像输入至预先训练完成的文本检测模型，输出待检测图像中文本区域的多个候选区域，以及每个候选区域的概率值；文本检测模型通过上述文本检测模型的训练方法训练得到；The detection module 122 is configured to input the image to be detected into the pre-trained text detection model and output multiple candidate regions of the text area in the image to be detected, together with the probability value of each candidate region; the text detection model is obtained by training with the above text detection model training method.
文本区域确定模块124,设置为根据候选区域的概率值以及多个候选区域之间的重叠程度,从多个候选区域中确定待检测图像中的文本区域。The text area determination module 124 is configured to determine the text area in the image to be detected from the multiple candidate areas according to the probability value of the candidate area and the degree of overlap between the multiple candidate areas.
本申请实施例提供的上述文本区域确定装置，将获取到的待检测图像输入至文本检测模型，输出待检测图像中文本区域的多个候选区域以及每个候选区域的概率值；进而根据候选区域的概率值以及多个候选区域之间的重叠程度，从多个候选区域中确定待检测图像中的文本区域。该方式中，文本检测模型可以自动提取不同尺度的特征，因而只需要输入一张图像至该模型即可得到该图像中各种尺度的文本区域的候选区域，无需再人工变换图像尺度，操作便捷，尤其在多种字号、多种字体、多种形状、多种方向场景下，可以快速全面准确地检测出图像中的各类文本，进而也有利于后续文本识别的准确性，提高了文本识别的效果。In the text area determination device provided by the embodiment of the present application, the acquired image to be detected is input into the text detection model, which outputs multiple candidate regions of the text area in the image and the probability value of each candidate region; the text area in the image is then determined from the multiple candidate regions according to the probability values and the degree of overlap between them. In this method, the text detection model can automatically extract features of different scales, so only one image needs to be input to obtain candidate regions of text areas of various scales, without manually rescaling the image. The operation is convenient, and especially in scenes with multiple font sizes, fonts, shapes, and orientations, all kinds of text in the image can be detected quickly, comprehensively, and accurately, which also benefits the accuracy of subsequent text recognition and improves the recognition effect.
在一些实施例中，上述文本区域确定模块还设置为：根据候选区域的概率值，将多个候选区域依次排列；其中，第一个候选区域的概率值最大，最后一个候选区域的概率值最小；按照排列顺序，依次针对每个候选区域，逐一计算该候选区域与除该候选区域以外的候选区域的重叠程度；将除该候选区域以外的候选区域中，所述重叠程度大于预设的重叠阈值的候选区域剔除；将剔除后的剩余的候选区域确定为待检测图像中的文本区域。In some embodiments, the text area determination module is further configured to: arrange the multiple candidate regions in order according to their probability values, with the first candidate region having the largest probability value and the last the smallest; following this order, compute for each candidate region its degree of overlap with each of the other candidate regions one by one; remove, from the other candidate regions, those whose degree of overlap exceeds the preset overlap threshold; and determine the remaining candidate regions as the text areas in the image to be detected.
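The candidate-filtering procedure above is essentially non-maximum suppression. A sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2) and intersection-over-union as the overlap measure; the patent's candidate regions are given by vertex coordinates and may be rotated, so a real implementation would compute polygon overlap instead:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, thr=0.5):
    """Keep the highest-scoring box, drop the others that overlap it beyond
    thr, and repeat down the score-sorted list; returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thr for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```

The second box overlaps the first with IoU 0.81, so it is removed; the third box is disjoint and survives.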
在一些实施例中,上述装置还包括:区域剔除模块,设置为将多个候选区域中,概率值低于预设的概率阈值的候选区域剔除,得到最终的多个候选区域。In some embodiments, the above-mentioned device further includes: a region elimination module, configured to eliminate candidate regions whose probability value is lower than a preset probability threshold among the multiple candidate regions to obtain the final multiple candidate regions.
参见图13所示的一种文本内容确定装置的结构示意图;该装置包括:See FIG. 13 for a schematic structural diagram of a text content determination device; the device includes:
区域获取模块130,设置为通过上述任一种文本区域确定方法,获取图像中的文本区域;The area obtaining module 130 is configured to obtain the text area in the image by using any of the foregoing text area determination methods;
识别模块132,设置为将文本区域输入至预先训练完成的文本识别模型,输出文本区域的识别结果;The recognition module 132 is configured to input the text area into the pre-trained text recognition model, and output the recognition result of the text area;
文本内容确定模块134,设置为根据识别结果确定文本区域中的文本内容。The text content determination module 134 is configured to determine the text content in the text area according to the recognition result.
本申请实施例提供的文本内容确定装置，首先通过上述文本区域确定方法获取图像中的文本区域；再将该文本区域输入至预先训练完成的文本识别模型，输出文本区域的识别结果；最后根据该识别结果确定文本区域中的文本信息。该方式中，由于上述文本区域确定方法可以通过文本检测模型获取到各种尺度的文本区域，在多种字号、多种字体、多种形状、多种方向场景下，可以快速全面准确地检测出图像中的各类文本，进而也有利于文本识别的准确性，提高了文本识别的效果。The text content determination device provided by the embodiment of the present application first obtains the text area in the image through the above text area determination method; it then inputs the text area into the pre-trained text recognition model and outputs the recognition result of the text area; finally, it determines the text content in the text area according to the recognition result. In this method, since the text area determination method can obtain text areas of various scales through the text detection model, all kinds of text in the image can be detected quickly, comprehensively, and accurately in scenes with multiple font sizes, fonts, shapes, and orientations, which also benefits the accuracy of text recognition and improves the recognition effect.
在一些实施例中,上述装置还包括:归一化模块,设置为按照预设尺寸,对文本区域进行归一化处理,得到处理后的文本区域;In some embodiments, the above-mentioned apparatus further includes: a normalization module, configured to perform normalization processing on the text area according to a preset size to obtain a processed text area;
识别模块132具体设置为:将所述处理后的文本区域输入至预先训练完成的识别模型。The recognition module 132 is specifically configured to input the processed text area into the pre-trained recognition model.
在一些实施例中，上述装置还包括文本识别模型训练模块，设置为使文本识别模型通过下述方式训练完成：确定目标训练文本图像；将目标训练文本图像输入至第二初始模型；第二初始模型包括第二特征提取网络、第二输出网络和分类函数；通过第二特征提取网络提取目标训练文本图像的特征图；通过第二初始模型将特征图拆分成至少一个子特征图；将子特征图分别输入至第二输出网络，输出每个子特征图对应的输出矩阵；将每个子特征图对应的输出矩阵分别输入至分类函数，输出每个子特征图对应的概率矩阵；通过预设的识别损失函数确定概率矩阵的第二损失值；根据第二损失值对第二初始模型进行训练，直至第二初始模型中的参数收敛，得到文本识别模型。In some embodiments, the device further includes a text recognition model training module configured to train the text recognition model as follows: determine a target training text image; input the target training text image into a second initial model, where the second initial model includes a second feature extraction network, a second output network, and a classification function; extract the feature map of the target training text image through the second feature extraction network; split the feature map into at least one sub-feature map through the second initial model; input the sub-feature maps into the second output network to output the output matrix corresponding to each sub-feature map; input each output matrix into the classification function to output the probability matrix corresponding to each sub-feature map; determine the second loss value of the probability matrices through a preset recognition loss function; and train the second initial model according to the second loss value until the parameters of the second initial model converge, obtaining the text recognition model.
在一些实施例中,上述第二特征提取网络包括依次连接的多组第二卷积网络;每组第二卷积网络包括依次连接的卷积层、池化层和激活函数层。In some embodiments, the above-mentioned second feature extraction network includes multiple groups of second convolutional networks connected in sequence; each group of second convolutional networks includes a convolution layer, a pooling layer, and an activation function layer connected in sequence.
在一些实施例中,上述文本识别模型训练模块还设置为:沿着特征图的列方向,将特征图拆分成至少一个子特征图;特征图的列方向为文本行方向的垂直方向。In some embodiments, the text recognition model training module described above is further configured to split the feature map into at least one sub-feature map along the column direction of the feature map; the column direction of the feature map is the vertical direction of the text row direction.
在一些实施例中，上述第二输出网络包括多个全连接层；全连接层的数量与子特征图的数量对应；识别模型训练模块还设置为：将每个子特征图分别输入至对应的全连接层中，得到每个全连接层分别输出的子特征图对应的输出矩阵。In some embodiments, the second output network includes multiple fully connected layers, the number of which corresponds to the number of sub-feature maps; the recognition model training module is further configured to input each sub-feature map into its corresponding fully connected layer to obtain the output matrix corresponding to the sub-feature map output by each fully connected layer.
在一些实施例中，上述分类函数包括Softmax函数；Softmax函数为 p_t(i) = e^{o_t(i)} / Σ_{m=1…K+1} e^{o_t(m)}；其中，e表示自然常数；t表示第t个概率矩阵；K表示所述训练集合的目标训练文本图像所包含的不同字符的个数；m从1取到K+1；Σ表示求和运算；o_t(i)为所述输出矩阵中的第i个元素；p_t(i)为所述概率矩阵p_t中的第i个元素。In some embodiments, the classification function includes a Softmax function, given by p_t(i) = e^{o_t(i)} / Σ_{m=1…K+1} e^{o_t(m)}, where e is the natural constant, t denotes the t-th probability matrix, K is the number of distinct characters contained in the target training text images of the training set, m ranges from 1 to K+1, Σ denotes summation, o_t(i) is the i-th element of the output matrix, and p_t(i) is the i-th element of the probability matrix p_t.
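The Softmax function above can be sketched numerically as follows; subtracting the maximum before exponentiation is a standard numerical-stability trick and not part of the formula itself:

```python
import math

def softmax(o_t):
    """p_t(i) = exp(o_t(i)) / sum_{m=1..K+1} exp(o_t(m)) for an output
    vector o_t of length K+1 (K distinct characters plus one extra class)."""
    mx = max(o_t)                         # stability shift, cancels in the ratio
    exps = [math.exp(v - mx) for v in o_t]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```

The output is a valid probability distribution: all entries are positive and sum to 1, with the largest input mapped to the largest probability.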
在一些实施例中，上述识别损失函数包括 L = −log p(y | {p_t}_{t=1…T})；其中，y为预先标注的所述目标训练文本图像的概率矩阵；t表示第t个概率矩阵；p_t为所述分类函数输出的每个所述子特征图对应的概率矩阵；T为所述概率矩阵的总数量；p表示计算概率；log表示对数运算。In some embodiments, the recognition loss function includes L = −log p(y | {p_t}_{t=1…T}), where y is the pre-labeled probability matrix of the target training text image, t denotes the t-th probability matrix, p_t is the probability matrix corresponding to each sub-feature map output by the classification function, T is the total number of probability matrices, p denotes computing a probability, and log denotes the logarithm operation.
在一些实施例中，上述识别模型训练模块还设置为：根据第二损失值更新第二初始模型中的参数；判断更新后的各个参数是否均收敛；如果更新后的各个参数均收敛，将参数更新后的第二初始模型确定为文本识别模型；如果更新后的各个参数没有均收敛，继续执行基于预设的训练集合确定目标训练文本图像的步骤，直至更新后的各个参数均收敛。In some embodiments, the recognition model training module is further configured to: update the parameters in the second initial model according to the second loss value; determine whether the updated parameters have all converged; if so, determine the second initial model with the updated parameters as the text recognition model; if not, continue to perform the step of determining a target training text image based on the preset training set until the updated parameters all converge.
在一些实施例中，上述识别模型训练模块还设置为：按照预设规则，从第二初始模型确定待更新参数；计算第二损失值对待更新参数的导数 ∂L′/∂W′，其中，L′为概率矩阵的损失值，W′为待更新参数；更新待更新参数，得到更新后的待更新参数 W′ ← W′ − α′·∂L′/∂W′，其中，α′为预设系数。In some embodiments, the recognition model training module is further configured to: determine the parameter to be updated from the second initial model according to a preset rule; compute the derivative ∂L′/∂W′ of the second loss value with respect to the parameter to be updated, where L′ is the loss value of the probability matrices and W′ is the parameter to be updated; and update the parameter to obtain the updated parameter W′ ← W′ − α′·∂L′/∂W′, where α′ is a preset coefficient.
在一些实施例中，上述文本区域的识别结果包括文本区域对应的多个概率矩阵；文本内容确定模块还设置为：确定每个概率矩阵中的最大概率值的位置；从预先设置的概率矩阵中各个位置与字符的对应关系中，获取最大概率值的位置对应的字符，作为待排列字符；按照多个概率矩阵的排列顺序，排列所述待排列字符，得到排列后的字符；根据排列后的字符确定文本区域中的文本内容。In some embodiments, the recognition result of the text area includes multiple probability matrices corresponding to the text area, and the text content determination module is further configured to: determine the position of the maximum probability value in each probability matrix; obtain, from the preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the maximum probability value as a character to be arranged; arrange the characters to be arranged according to the arrangement order of the multiple probability matrices to obtain the arranged characters; and determine the text content in the text area according to the arranged characters.
在一些实施例中,上述文本内容确定模块还设置为:按照预设规则,删除排列后的字符中的重复字符和空字符,得到文本区域中的文本内容。In some embodiments, the above-mentioned text content determination module is further configured to delete repeated characters and empty characters in the arranged characters according to a preset rule to obtain the text content in the text area.
在一些实施例中,上述装置还包括:敏感信息确定模块,设置为通过预先建立的敏感词库确定文本内容中是否包含有敏感信息。In some embodiments, the above-mentioned apparatus further includes: a sensitive information determining module configured to determine whether the text content contains sensitive information through a pre-established sensitive vocabulary.
在一些实施例中，上述敏感信息确定模块还设置为：对获取到的文本内容进行分词操作；逐一将分词操作后得到的分词与预先建立的敏感词库进行匹配；如果至少一个分词匹配成功，确定文本内容中包含有敏感信息。In some embodiments, the sensitive information determination module is further configured to: perform a word segmentation operation on the acquired text content; match the words obtained by the segmentation operation against the pre-established sensitive-word lexicon one by one; and if at least one word is successfully matched, determine that the text content contains sensitive information.
在一些实施例中,上述装置还包括:区域标识模块,设置为确定匹配成功的分词所属的文本区域,作为待标识区域;在所述图像中标识出所述待标识区域。In some embodiments, the above-mentioned apparatus further includes: an area identification module configured to determine a text area to which the successfully matched word segment belongs as the area to be identified; and identify the area to be identified in the image.
本申请实施例所提供的装置,其实现原理及产生的技术效果和前述方法实施例相同,为简要描述,装置实施例部分未提及之处,可参考前述方法实施例中相应内容。The implementation principles and technical effects of the device provided in the embodiment of the application are the same as those of the foregoing method embodiment. For a brief description, for the parts not mentioned in the device embodiment, please refer to the corresponding content in the foregoing method embodiment.
本申请实施例还提供了一种电子设备，参见图14所示，该电子设备包括存储器100和处理器101，其中，存储器100设置为存储一条或多条计算机指令，一条或多条计算机指令被处理器101执行，以实现上述文本检测模型训练方法，文本区域确定方法，或者文本内容确定方法的步骤。An embodiment of the present application also provides an electronic device. As shown in FIG. 14, the electronic device includes a memory 100 and a processor 101, where the memory 100 is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor 101 to implement the steps of the above text detection model training method, text area determination method, or text content determination method.
进一步地,图14所示的电子设备还包括总线102和通信接口103,处理器101、通信接口103和存储器100通过总线102连接。Further, the electronic device shown in FIG. 14 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
其中，存储器100可能包含高速随机存取存储器（RAM，Random Access Memory），也可能还包括非不稳定的存储器（non-volatile memory），例如至少一个磁盘存储器。通过至少一个通信接口103（可以是有线或者无线）实现该系统网元与至少一个其他网元之间的通信连接，可以使用互联网，广域网，本地网，城域网等。总线102可以是ISA总线、PCI总线或EISA总线等。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示，图14中仅用一个双向箭头表示，但并不表示仅有一根总线或一种类型的总线。The memory 100 may include a high-speed random access memory (RAM, Random Access Memory), and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this system network element and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, etc. may be used. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only one bidirectional arrow is shown in FIG. 14, but this does not mean that there is only one bus or one type of bus.
处理器101可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器101中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器101可以是通用处理器,包括中央处理器(Central Processing Unit,简称CPU)、网络处理器(Network Processor,简称NP)等;还可以是数字信号处理器(Digital Signal Processing,简称DSP)、专用集成电路(Application Specific Integrated Circuit,简称ASIC)、现成可编程门阵列(Field-Programmable Gate Array,简称FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器100,处理器101读取存储器100中的信息,结合其硬件完成前述实施例的方法的步骤。The processor 101 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 101 or instructions in the form of software. The aforementioned processor 101 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processing, DSP for short). ), Application Specific Integrated Circuit (ASIC for short), Field-Programmable Gate Array (FPGA for short) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components. The methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. 
The software modules may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 100; the processor 101 reads the information in the memory 100 and completes the steps of the methods of the foregoing embodiments in combination with its hardware.
An embodiment of the present application further provides a machine-readable storage medium storing machine-executable instructions. When called and executed by a processor, the machine-executable instructions cause the processor to implement the steps of the above text detection model training method, text region determination method, or text content determination method. For the specific implementation, reference may be made to the method embodiments, which will not be repeated here.
The computer program products of the text detection model training method, the text region and text content determination methods, the apparatuses, and the electronic device provided by the embodiments of the present application include a computer-readable storage medium storing program code. The instructions included in the program code may be configured to execute the methods described in the foregoing method embodiments; for the specific implementation, reference may be made to the method embodiments, which will not be repeated here.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes beyond the related art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
An embodiment of the present application provides executable program code configured to be run to execute the steps of the above text detection model training method, text region determination method, or text content determination method.
The above are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (53)

  1. A text detection model training method, comprising:
    determining a target training image;
    inputting the target training image into a first initial model, the first initial model comprising a first feature extraction network, a feature fusion network, and a first output network;
    extracting a plurality of initial feature maps of the target training image through the first feature extraction network, the plurality of initial feature maps differing from one another in scale;
    performing fusion processing on the plurality of initial feature maps through the feature fusion network to obtain a fused feature map;
    inputting the fused feature map into the first output network, and outputting candidate regions of a text region in the target training image and a probability value of each candidate region; and
    determining, through a preset detection loss function, a first loss value of the candidate regions and of the probability value of each candidate region, and training the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain a text detection model.
  2. The method according to claim 1, wherein the first feature extraction network comprises multiple groups of first convolutional networks connected in sequence, each group of first convolutional networks comprising a convolutional layer, a batch normalization layer, and an activation function layer connected in sequence.
  3. The method according to claim 1, wherein the step of performing fusion processing on the plurality of initial feature maps through the feature fusion network to obtain a fused feature map comprises:
    arranging the plurality of initial feature maps in sequence according to their scales, wherein the initial feature map at the topmost level has the smallest scale and the initial feature map at the bottommost level has the largest scale;
    for each level below the topmost level in turn, according to the arrangement order, fusing the initial feature map of that level with the fusion result of the level above it to obtain the fusion result of that level, wherein the fusion result of the topmost level is the initial feature map of the topmost level; and
    determining the fusion result of the lowest level as the fused feature map of the initial feature maps.
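The top-down fusion described in claim 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: feature maps are plain 2D lists, each level is assumed to be exactly twice the scale of the level above it, the upsampling is nearest-neighbour, and the fusion operation is stood in for by element-wise addition.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def add_maps(a, b):
    """Element-wise addition, standing in for the fusion operation."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse_top_down(initial_maps):
    """initial_maps: ordered from topmost (smallest scale) to bottommost (largest)."""
    fused = initial_maps[0]            # topmost level's fusion result is its own map
    for level in initial_maps[1:]:     # each level below the topmost, in order
        fused = add_maps(level, upsample2x(fused))
    return fused                       # lowest level's result is the fused feature map

top = [[1, 2], [3, 4]]                 # 2x2 map, smallest scale
bottom = [[1] * 4 for _ in range(4)]   # 4x4 map, largest scale
fused = fuse_top_down([top, bottom])
```

In a real network the fusion result would also pass through convolutions after each addition; the sketch keeps only the ordering and propagation logic of the claim.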
  4. The method according to claim 1, wherein the first output network comprises a first convolutional layer and a second convolutional layer; and
    the step of inputting the fused feature map into the first output network and outputting the candidate regions of the text region in the target training image and the probability value of each candidate region comprises:
    inputting the fused feature map into the first convolutional layer and the second convolutional layer respectively;
    performing a first convolution operation on the fused feature map through the first convolutional layer to output a coordinate matrix, the coordinate matrix comprising vertex coordinates of the candidate regions of the text region in the target training image; and
    performing a second convolution operation on the fused feature map through the second convolutional layer to output a probability matrix, the probability matrix comprising the probability value of each candidate region.
  5. The method according to claim 1, wherein the detection loss function comprises a first function and a second function;
    the first function is L1 = |G* − G|, where G* is the pre-annotated coordinate matrix of the text region in the target training image, and G is the coordinate matrix of the candidate regions of the text region in the target training image output by the first output network;
    the second function is L2 = −Y*·log Y − (1 − Y*)·log(1 − Y), where Y* is the pre-annotated probability matrix of the text region in the target training image, Y is the probability matrix of the candidate regions of the text region in the target training image output by the first output network, and log denotes the logarithm operation; and
    the first loss value of the candidate regions and of the probability value of each candidate region is L = L1 + L2.
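The detection loss of claim 5 combines an L1 regression term on coordinates with a cross-entropy term on probabilities. A minimal numeric sketch, assuming the matrices are flattened into plain lists and the matrix operations reduce to element-wise sums (function and variable names are ours, not from the patent):

```python
import math

def detection_loss(G_star, G, Y_star, Y):
    """G_star/G: annotated vs. predicted coordinates; Y_star/Y: annotated vs. predicted probabilities."""
    # First function: L1 = |G* - G|, summed element-wise
    L1 = sum(abs(gs - g) for gs, g in zip(G_star, G))
    # Second function: L2 = -Y*·log(Y) - (1 - Y*)·log(1 - Y), summed element-wise
    L2 = sum(-ys * math.log(y) - (1 - ys) * math.log(1 - y)
             for ys, y in zip(Y_star, Y))
    return L1 + L2          # L = L1 + L2

loss = detection_loss([10.0, 20.0], [9.0, 22.0], [1.0, 0.0], [0.9, 0.2])
```

Here L1 = 1 + 2 = 3 and L2 = −log 0.9 − log 0.8 ≈ 0.3285, so the combined loss is ≈ 3.3285.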
  6. The method according to claim 1, wherein the step of training the first initial model according to the first loss value until the parameters in the first initial model converge to obtain the text detection model comprises:
    updating the parameters in the first initial model according to the first loss value;
    judging whether the updated parameters have all converged;
    if the updated parameters have all converged, determining the first initial model with the updated parameters as the detection model; and
    if the updated parameters have not all converged, continuing to perform the step of determining a target training image until the updated parameters have all converged.
  7. The method according to claim 6, wherein the step of updating the parameters in the first initial model according to the first loss value comprises:
    determining, according to a preset rule, a parameter to be updated from the first initial model;
    calculating the derivative of the first loss value with respect to the parameter to be updated in the first initial model, ∂L/∂W, where L is the first loss value and W is the parameter to be updated; and
    updating the parameter to be updated to obtain the updated parameter W − α·(∂L/∂W), where α is a preset coefficient.
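The update rule in claim 7 is the standard gradient-descent step W ← W − α·∂L/∂W. A toy sketch on a single scalar parameter, with the derivative taken numerically purely for illustration (in practice it would come from backpropagation):

```python
def numerical_grad(loss_fn, w, eps=1e-6):
    """Central-difference approximation of dL/dW at w."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

def update(loss_fn, w, alpha):
    """One step of W <- W - alpha * dL/dW (claim 7's update rule)."""
    return w - alpha * numerical_grad(loss_fn, w)

loss = lambda w: (w - 3.0) ** 2   # toy loss with its minimum at w = 3
w = update(loss, 0.0, alpha=0.1)  # gradient at w=0 is -6, so w moves to 0.6
```

Repeating the step drives w toward the loss minimum, which is the convergence condition the claim checks for.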
  8. A text region determination method, comprising:
    acquiring an image to be detected;
    inputting the image to be detected into a pre-trained text detection model, and outputting a plurality of candidate regions of a text region in the image to be detected and a probability value of each candidate region, the text detection model being trained by the text detection model training method according to any one of claims 1-7; and
    determining the text region in the image to be detected from the plurality of candidate regions according to the probability values of the candidate regions and the degree of overlap among the plurality of candidate regions.
  9. The method according to claim 8, wherein the step of determining the text region in the image to be detected from the plurality of candidate regions according to the probability values of the candidate regions and the degree of overlap among the plurality of candidate regions comprises:
    arranging the plurality of candidate regions in sequence according to their probability values, wherein the first candidate region has the largest probability value and the last candidate region has the smallest probability value;
    for each candidate region in turn, according to the arrangement order, calculating one by one the degree of overlap between that candidate region and the candidate regions other than it, and eliminating, from the candidate regions other than it, those whose degree of overlap is greater than a preset overlap threshold; and
    determining the candidate regions remaining after the elimination as the text region in the image to be detected.
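The procedure in claim 9 follows the familiar non-maximum-suppression pattern. A minimal sketch under our own assumptions: boxes are axis-aligned `(x1, y1, x2, y2)` tuples, and intersection-over-union serves as the "degree of overlap" (the claim itself does not fix the overlap measure):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def select_text_regions(candidates, overlap_thresh=0.5):
    """candidates: list of (box, prob). Keep high-probability boxes, eliminate
    any later box overlapping a kept one beyond the preset threshold."""
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept = []
    for box, prob in ordered:
        if all(iou(box, k) <= overlap_thresh for k, _ in kept):
            kept.append((box, prob))
    return kept

regions = select_text_regions([((0, 0, 10, 10), 0.9),
                               ((1, 1, 10, 10), 0.8),    # overlaps the first, eliminated
                               ((20, 20, 30, 30), 0.7)])
```

The second box overlaps the first with IoU 0.81 > 0.5 and is removed; the two remaining boxes are the determined text regions.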
  10. The method according to claim 9, wherein, before the step of arranging the plurality of candidate regions in sequence according to their probability values, the method further comprises:
    eliminating, from the plurality of candidate regions, candidate regions whose probability value is lower than a preset probability threshold.
  11. A text content determination method, comprising:
    acquiring a text region in an image through the text region determination method according to any one of claims 8-10;
    inputting the text region into a pre-trained text recognition model, and outputting a recognition result of the text region; and
    determining the text content in the text region according to the recognition result.
  12. The method according to claim 11, wherein the step of inputting the text region into the pre-trained recognition model comprises: normalizing the text region according to a preset size to obtain a processed text region; and inputting the processed text region into the pre-trained recognition model.
  13. The method according to claim 11, wherein the text recognition model is trained in the following manner:
    determining a target training text image;
    inputting the target training text image into a second initial model, the second initial model comprising a second feature extraction network, a feature splitting network, a second output network, and a classification function;
    extracting a feature map of the target training text image through the second feature extraction network;
    splitting the feature map into at least one sub feature map through the feature splitting network;
    inputting the sub feature maps into the second output network respectively, and outputting an output matrix corresponding to each sub feature map;
    inputting the output matrix corresponding to each sub feature map into the classification function respectively, and outputting a probability matrix corresponding to each sub feature map; and
    determining a second loss value of the probability matrices through a preset recognition loss function, and training the second initial model according to the second loss value until the parameters in the second initial model converge, to obtain a text recognition model.
  14. The method according to claim 13, wherein the second feature extraction network comprises multiple groups of second convolutional networks connected in sequence, each group of second convolutional networks comprising a convolutional layer, a pooling layer, and an activation function layer connected in sequence.
  15. The method according to claim 13, wherein the step of splitting the feature map into at least one sub feature map through the feature splitting network comprises:
    splitting the feature map into at least one sub feature map along the column direction of the feature map, the column direction of the feature map being perpendicular to the text line direction.
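The column-wise split in claim 15 can be pictured as slicing a feature map into vertical strips, one per time step of the recognizer. A minimal sketch, assuming a 2D list whose column count divides evenly into the number of sub-maps (the even split is our simplification):

```python
def split_columns(fmap, n):
    """Split a 2D feature map (rows x cols) into n column-wise sub feature maps."""
    cols = len(fmap[0])
    step = cols // n          # assumes cols is divisible by n, for illustration
    return [[row[i * step:(i + 1) * step] for row in fmap] for i in range(n)]

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
subs = split_columns(fmap, 2)   # two sub-maps, each keeping all rows
```

Each sub-map spans the full height of the text line (the column direction is perpendicular to the line), so one sub-map roughly corresponds to one horizontal slice of characters.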
  16. The method according to claim 13, wherein the second output network comprises a plurality of fully connected layers, the number of fully connected layers corresponding to the number of sub feature maps; and
    the step of inputting the sub feature maps into the second output network respectively and outputting the output matrix corresponding to each sub feature map comprises: inputting each sub feature map into the corresponding fully connected layer to obtain the output matrix, corresponding to the sub feature map, output by each fully connected layer.
  17. The method according to claim 13, wherein the classification function comprises a Softmax function:
    p_t(i) = e^(x_t(i)) / Σ_{m=1}^{K+1} e^(x_t(m))
    where e is the natural constant; t denotes the t-th probability matrix; K is the number of distinct characters contained in the target training text images of the training set; m ranges from 1 to K+1; Σ denotes summation; x_t(i) is the i-th element of the output matrix; and p_t(i) is the i-th element of the probability matrix p_t.
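The Softmax in claim 17 normalizes the K+1 scores of one output matrix into a probability distribution (the extra class beyond the K characters is typically a blank). A direct sketch of the formula:

```python
import math

def softmax(x):
    """p(i) = e^(x(i)) / sum_m e^(x(m)) over all K+1 entries of one output vector."""
    exps = [math.exp(v) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([1.0, 2.0, 3.0])   # toy output matrix with K + 1 = 3 entries
```

The outputs sum to 1 and preserve the ordering of the raw scores, which is what lets the decoder later pick the position of the maximum probability in each matrix.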
  18. The method according to claim 13, wherein the recognition loss function comprises L = −log p(y | {p_t}_{t=1…T}), where y is the pre-annotated probability matrix of the target training text image; t denotes the t-th probability matrix; p_t is the probability matrix corresponding to each sub feature map output by the classification function; T is the total number of probability matrices; p denotes the computed probability; and log denotes the logarithm operation.
  19. The method according to claim 13, wherein the step of training the second initial model according to the second loss value until the parameters in the second initial model converge to obtain the text recognition model comprises:
    updating the parameters in the second initial model according to the second loss value;
    judging whether the updated parameters have all converged;
    if the updated parameters have all converged, determining the second initial model with the updated parameters as the text recognition model; and
    if the updated parameters have not all converged, continuing to perform the step of determining a target training text image until all the updated parameters have converged.
  20. The method according to claim 19, wherein the step of updating each parameter in the second initial model according to the second loss value comprises:
    determining, according to a preset rule, a parameter to be updated from the second initial model;
    calculating the derivative of the second loss value with respect to the parameter to be updated, ∂L′/∂W′, where L′ is the loss value of the probability matrices and W′ is the parameter to be updated; and
    updating the parameter to be updated to obtain the updated parameter W′ − α′·(∂L′/∂W′), where α′ is a preset coefficient.
  21. The method according to claim 11, wherein the recognition result of the text region comprises a plurality of probability matrices corresponding to the text region; and
    the step of determining the text content in the text region according to the recognition result comprises:
    determining the position of the maximum probability value in each probability matrix;
    obtaining, from a preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the maximum probability value as a character to be arranged;
    arranging the characters to be arranged according to the arrangement order of the plurality of probability matrices to obtain arranged characters; and
    determining the text content in the text region according to the arranged characters.
  22. The method according to claim 21, wherein the step of determining the text content in the text region according to the arranged characters comprises:
    deleting repeated characters and empty characters from the arranged characters to obtain the text content in the text region.
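Claims 21-22 together describe a greedy decode in the CTC style: take the argmax position of each probability matrix, map it to a character, then collapse consecutive repeats and drop the empty (blank) character. A minimal sketch; the alphabet and blank marker are illustrative assumptions:

```python
def decode(prob_matrices, alphabet, blank=""):
    """Map each probability matrix to its argmax character, then collapse
    consecutive repeats and drop the blank, per claims 21-22."""
    chars = []
    for p in prob_matrices:                      # in the matrices' arrangement order
        chars.append(alphabet[p.index(max(p))])  # position of the maximum probability
    collapsed = []
    for c in chars:
        if collapsed and collapsed[-1] == c:     # delete repeated characters
            continue
        collapsed.append(c)
    return "".join(c for c in collapsed if c != blank)   # delete empty characters

alphabet = ["a", "b", ""]                # last position stands for the blank
mats = [[0.9, 0.05, 0.05],               # argmax -> 'a'
        [0.8, 0.1, 0.1],                 # argmax -> 'a' (repeat, collapsed)
        [0.1, 0.1, 0.8],                 # argmax -> blank
        [0.1, 0.8, 0.1]]                 # argmax -> 'b'
text = decode(mats, alphabet)
```

Collapsing before dropping blanks is what allows genuinely doubled letters (separated by a blank) to survive while spurious per-slice repeats are removed.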
  23. The method according to claim 11, wherein, after the step of determining the text content in the text region according to the recognition result, the method further comprises:
    determining, through a pre-established sensitive word library, whether the text content contains sensitive information.
  24. The method according to claim 23, wherein the step of determining, through the pre-established sensitive word library, whether the text content contains sensitive information comprises:
    performing a word segmentation operation on the acquired text content;
    matching, one by one, the segmented words obtained by the word segmentation operation against the pre-established sensitive word library; and
    if at least one segmented word is matched successfully, determining that the text content contains sensitive information.
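The check in claim 24 can be sketched in a few lines. This is an illustrative stand-in only: real word segmentation (especially for Chinese) needs a proper segmenter, so a whitespace split and a tiny word set are our assumptions:

```python
def contains_sensitive(text, sensitive_words):
    """Segment the text, match each token against the sensitive word set,
    and report success if at least one token matches (claim 24)."""
    tokens = text.split()          # stand-in for a real word segmentation operation
    matched = [t for t in tokens if t in sensitive_words]
    return len(matched) > 0, matched

flagged, hits = contains_sensitive("buy cheap meds now", {"meds", "guns"})
```

The matched tokens are kept so that, per claim 25, the text region they belong to (or the words themselves) can be marked in the image.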
  25. The method according to claim 24, wherein, after it is determined that the text content contains sensitive information, the method further comprises:
    determining the text region to which the successfully matched segmented word belongs as a region to be marked, and marking the region to be marked in the image;
    or, marking the successfully matched segmented word in the image.
  26. A text detection model training apparatus, comprising:
    a training image determination module, configured to determine a target training image;
    a training image input module, configured to input the target training image into a first initial model, the first initial model comprising a first feature extraction network, a feature fusion network, and a first output network;
    a feature extraction module, configured to extract a plurality of initial feature maps of the target training image through the first feature extraction network, the plurality of initial feature maps differing from one another in scale;
    a feature fusion module, configured to perform fusion processing on the plurality of initial feature maps through the feature fusion network to obtain a fused feature map;
    an output module, configured to input the fused feature map into the first output network and output candidate regions of a text region in the target training image and a probability value of each candidate region; and
    a loss value determination and training module, configured to determine, through a preset detection loss function, a first loss value of the candidate regions and of the probability value of each candidate region, and to train the first initial model according to the first loss value until the parameters in the first initial model converge, to obtain a text detection model.
  27. The apparatus according to claim 26, wherein the first feature extraction network comprises multiple groups of first convolutional networks connected in sequence, each group of first convolutional networks comprising a convolutional layer, a batch normalization layer, and an activation function layer connected in sequence.
  28. The apparatus according to claim 26, wherein the feature fusion module is further configured to:
    arrange the plurality of initial feature maps in sequence according to their scales, wherein the initial feature map at the topmost level has the smallest scale and the initial feature map at the bottommost level has the largest scale;
    for each level below the topmost level in turn, according to the arrangement order, fuse the initial feature map of that level with the fusion result of the level above it to obtain the fusion result of that level, wherein the fusion result of the topmost level is the initial feature map of the topmost level; and
    determine the fusion result of the lowest level as the fused feature map of the initial feature maps.
  29. The apparatus according to claim 26, wherein the first output network comprises a first convolutional layer and a second convolutional layer; and
    the output module is further configured to:
    input the fused feature map into the first convolutional layer and the second convolutional layer respectively;
    perform a first convolution operation on the fused feature map through the first convolutional layer to output a coordinate matrix, the coordinate matrix comprising vertex coordinates of the candidate regions of the text region in the target training image; and
    perform a second convolution operation on the fused feature map through the second convolutional layer to output a probability matrix, the probability matrix comprising the probability value of each candidate region.
  30. The apparatus according to claim 26, wherein the detection loss function comprises a first function and a second function;
    the first function is L1 = |G* − G|, where G* is the pre-annotated coordinate matrix of the text region in the target training image, and G is the coordinate matrix of the candidate regions of the text region in the target training image output by the first output network;
    the second function is L2 = −Y*·log Y − (1 − Y*)·log(1 − Y), where Y* is the pre-annotated probability matrix of the text region in the target training image, Y is the probability matrix of the candidate regions of the text region in the target training image output by the first output network, and log denotes the logarithm operation; and
    the first loss value of the candidate regions and of the probability value of each candidate region is L = L1 + L2.
  31. The apparatus according to claim 26, wherein the loss value determination and training module is further configured to:
    update the parameters in the first initial model according to the first loss value;
    judge whether the updated parameters have all converged;
    if the updated parameters have all converged, determine the first initial model with the updated parameters as the detection model; and
    if the updated parameters have not all converged, continue to perform the step of determining a target training image based on the preset training set until the updated parameters have all converged.
  32. The apparatus according to claim 31, wherein the loss value determination and training module is further configured to:
    determine, according to a preset rule, a parameter to be updated from the first initial model;
    calculate the derivative of the first loss value with respect to the parameter to be updated in the first initial model, ∂L/∂W, where L is the first loss value and W is the parameter to be updated; and
    update the parameter to be updated to obtain the updated parameter W − α·(∂L/∂W), where α is a preset coefficient.
  33. A text region determination apparatus, comprising:
    an image acquisition module, configured to acquire an image to be detected;
    a detection module, configured to input the image to be detected into a pre-trained text detection model and output a plurality of candidate regions of a text region in the image to be detected and a probability value of each candidate region, the text detection model being trained by the text detection model training method according to any one of claims 1-7; and
    a text region determination module, configured to determine the text region in the image to be detected from the plurality of candidate regions according to the probability values of the candidate regions and the degree of overlap among the plurality of candidate regions.
  34. The apparatus according to claim 33, wherein the text region determination module is further configured to:
    arrange the multiple candidate regions in sequence according to their probability values, wherein the first candidate region has the largest probability value and the last candidate region has the smallest probability value;
    for each candidate region in the arranged order, calculate one by one the degree of overlap between that candidate region and each of the other candidate regions, and remove, from the other candidate regions, those whose degree of overlap is greater than a preset overlap threshold;
    determine the candidate regions remaining after the removal as the text regions in the image to be detected.
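The sort-and-eliminate procedure of claim 34 is the familiar non-maximum suppression (NMS). A minimal Python sketch, where boxes as `(x1, y1, x2, y2)` tuples and intersection-over-union as the overlap measure are illustrative assumptions (the claim does not fix a region representation or an overlap metric):

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, overlap_threshold=0.5):
    # Arrange candidate regions by probability value, largest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Remove every remaining candidate whose overlap with the
        # current best region exceeds the preset overlap threshold.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= overlap_threshold]
    return keep
```

For example, two heavily overlapping boxes collapse onto the higher-scoring one, while a distant box survives: `nms([(0,0,10,10), (1,1,11,11), (50,50,60,60)], [0.9, 0.8, 0.7])` keeps indices 0 and 2.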
  35. The apparatus according to claim 34, wherein the apparatus further comprises: a region elimination module, configured to remove, from the multiple candidate regions, the candidate regions whose probability value is lower than a preset probability threshold, to obtain the final multiple candidate regions.
  36. A text content determination apparatus, the apparatus comprising:
    a region acquisition module, configured to acquire a text region in an image through the text region determination method according to any one of claims 8-10;
    a recognition module, configured to input the text region into a pre-trained text recognition model, and to output a recognition result of the text region;
    a text content determination module, configured to determine the text content in the text region according to the recognition result.
  37. The apparatus according to claim 36, wherein the apparatus further comprises: a normalization module, configured to normalize the text region according to a preset size, to obtain a processed text region;
    the recognition module being specifically configured to input the processed text region into the pre-trained recognition model.
  38. The apparatus according to claim 36, wherein the apparatus further comprises a text recognition model training module, configured to train the text recognition model in the following manner:
    determining a target training text image;
    inputting the target training text image into a second initial model, the second initial model comprising a second feature extraction network, a second output network, and a classification function;
    extracting a feature map of the target training text image through the second feature extraction network;
    splitting the feature map into at least one sub-feature map through the second initial model;
    inputting the sub-feature maps into the second output network respectively, and outputting an output matrix corresponding to each sub-feature map;
    inputting the output matrix corresponding to each sub-feature map into the classification function respectively, and outputting a probability matrix corresponding to each sub-feature map;
    determining a second loss value of the probability matrices through a preset recognition loss function, and training the second initial model according to the second loss value until the parameters in the second initial model converge, to obtain the text recognition model.
  39. The apparatus according to claim 38, wherein the second feature extraction network comprises multiple groups of second convolutional networks connected in sequence, and each group of the second convolutional networks comprises a convolutional layer, a pooling layer, and an activation function layer connected in sequence.
  40. The apparatus according to claim 38, wherein the recognition model training module is further configured to:
    split the feature map into at least one sub-feature map along the column direction of the feature map, the column direction of the feature map being perpendicular to the text row direction.
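The column-wise split of claim 40 can be sketched with NumPy; the height × width × channels layout and the concrete sizes are illustrative assumptions, not the patented dimensions:

```python
import numpy as np

# Illustrative feature map: height 4, width 6, 8 channels. For a
# horizontal text line, each of the 6 columns covers a vertical slice
# of the line (the column direction is perpendicular to the row).
feature_map = np.zeros((4, 6, 8))

# Split along the column (width) axis, yielding one sub-feature map
# per column position.
sub_maps = np.split(feature_map, feature_map.shape[1], axis=1)
```

Each resulting sub-feature map has shape `(4, 1, 8)`, and there are as many of them as the feature map has columns.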
  41. The apparatus according to claim 38, wherein the second output network comprises multiple fully connected layers, the number of the fully connected layers corresponding to the number of the sub-feature maps;
    the recognition model training module being further configured to: input each sub-feature map into the corresponding fully connected layer, to obtain the output matrix, corresponding to that sub-feature map, output by each fully connected layer.
  42. The apparatus according to claim 38, wherein the classification function comprises a Softmax function;
    the Softmax function being

    p_t^i = e^{z_t^i} / Σ_{m=1}^{K+1} e^{z_t^m}

    wherein e denotes the natural constant; t denotes the t-th probability matrix; K denotes the number of distinct characters contained in the target training text images of the training set; m ranges from 1 to K+1; Σ denotes summation; z_t^i is the i-th element of the output matrix; and p_t^i is the i-th element of the probability matrix p_t.
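The Softmax of claim 42 turns each (K+1)-dimensional output matrix into a probability matrix whose entries sum to 1. A minimal NumPy sketch (subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the claim):

```python
import numpy as np

def softmax(z):
    # p_t^i = e^{z_t^i} / sum_{m=1}^{K+1} e^{z_t^m}
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

p = softmax([1.0, 2.0, 3.0])
```

Here `p` sums to 1 and its largest entry sits at the position of the largest input logit, which is exactly the property the decoding step in claim 46 relies on.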
  43. The apparatus according to claim 38, wherein the recognition loss function comprises L = −log p(y|{p_t}_{t=1…T}), wherein y is the pre-labeled probability matrix of the target training text image; t denotes the t-th probability matrix; p_t is the probability matrix, corresponding to each sub-feature map, output by the classification function; T is the total number of probability matrices; p denotes the computed probability; and log denotes the logarithmic operation.
  44. The apparatus according to claim 38, wherein the recognition model training module is further configured to:
    update the parameters in the second initial model according to the second loss value;
    determine whether each of the updated parameters has converged;
    if each of the updated parameters has converged, determine the second initial model with the updated parameters as the text recognition model;
    if not all of the updated parameters have converged, continue to perform the step of determining a target training text image based on the preset training set, until all of the updated parameters have converged.
  45. The apparatus according to claim 44, wherein the recognition model training module is further configured to:
    determine a parameter to be updated from the second initial model according to a preset rule;
    calculate the derivative of the second loss value with respect to the parameter to be updated:

    ∂L′/∂W′

    wherein L′ is the loss value of the probability matrices and W′ is the parameter to be updated;
    update the parameter to be updated to obtain the updated parameter to be updated:

    W′ ← W′ − α′·(∂L′/∂W′)

    wherein α′ is a preset coefficient.
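The update rule in claims 32 and 45 is plain gradient descent, W ← W − α·∂L/∂W. A self-contained sketch on a scalar parameter, where the quadratic loss L(W) = (W − 3)² is an illustrative stand-in for the model's actual loss:

```python
alpha = 0.1   # preset coefficient (learning rate)
W = 0.0       # parameter to be updated

def grad(w):
    # Derivative dL/dW of the illustrative loss L(W) = (W - 3)^2.
    return 2.0 * (w - 3.0)

# Repeatedly apply the claimed update: W <- W - alpha * dL/dW.
for _ in range(100):
    W = W - alpha * grad(W)
```

After enough iterations W settles at the minimizer W = 3, i.e. the point where the derivative vanishes, which is the convergence condition checked in claim 44.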
  46. The apparatus according to claim 36, wherein the recognition result of the text region comprises multiple probability matrices corresponding to the text region;
    the text content determination module being further configured to:
    determine the position of the maximum probability value in each probability matrix;
    obtain, from a preset correspondence between positions in the probability matrix and characters, the character corresponding to the position of the maximum probability value, as a character to be arranged;
    arrange the characters to be arranged according to the arrangement order of the multiple probability matrices, to obtain the arranged characters;
    determine the text content in the text region according to the arranged characters.
  47. The apparatus according to claim 46, wherein the text content determination module is further configured to:
    delete repeated characters and blank characters from the arranged characters, to obtain the text content in the text region.
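Claims 46 and 47 together describe greedy decoding: take the position of the maximum probability in each probability matrix, map positions to characters, then collapse consecutive repeats and drop blanks — the usual decoding step of CTC-style recognizers. A sketch in which the position-to-character table and the use of index 0 as the blank are illustrative assumptions:

```python
def greedy_decode(prob_matrices, charset, blank=0):
    # Position of the maximum probability value in each matrix (claim 46).
    best = [max(range(len(p)), key=lambda i: p[i]) for p in prob_matrices]
    # Collapse consecutive duplicates, then drop the blank (claim 47).
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)

charset = ["", "c", "a", "t"]            # index 0 is the blank character
probs = [
    [0.1, 0.8, 0.05, 0.05],              # -> 'c'
    [0.1, 0.7, 0.1, 0.1],                # -> 'c' (repeat, collapsed)
    [0.9, 0.05, 0.03, 0.02],             # -> blank (dropped)
    [0.1, 0.1, 0.7, 0.1],                # -> 'a'
    [0.1, 0.1, 0.1, 0.7],                # -> 't'
]
text = greedy_decode(probs, charset)     # "cat"
```

Note the order of operations: repeats are collapsed before blanks are removed, so "cc␣at" decodes to "cat" while a blank between two identical characters would preserve both of them.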
  48. The apparatus according to claim 36, wherein the apparatus further comprises:
    a sensitive information determination module, configured to determine, through a pre-established sensitive word lexicon, whether the text content contains sensitive information.
  49. The apparatus according to claim 48, wherein the sensitive information determination module is further configured to:
    perform a word segmentation operation on the acquired text content;
    match the segments obtained by the word segmentation operation against the pre-established sensitive word lexicon one by one;
    if at least one segment is matched successfully, determine that the text content contains sensitive information.
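The check in claim 49 reduces to: segment the recognized text, then test each segment against the sensitive lexicon. A minimal Python sketch, in which whitespace splitting stands in for a real word segmenter (Chinese text would need a proper segmenter) and the example lexicon is invented for illustration:

```python
SENSITIVE_LEXICON = {"password", "secret"}   # illustrative pre-built lexicon

def contains_sensitive(text, lexicon=SENSITIVE_LEXICON):
    # Word segmentation; whitespace split is a stand-in for a segmenter.
    tokens = text.lower().split()
    # Sensitive information is present if at least one segment matches.
    return any(tok in lexicon for tok in tokens)
```

Set membership makes each lookup O(1), so the overall cost is linear in the number of segments regardless of lexicon size.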
  50. The apparatus according to claim 49, wherein the apparatus further comprises:
    a region identification module, configured to determine the text region to which the successfully matched segment belongs, as a region to be marked, and to mark the region to be marked in the image.
  51. An electronic device, comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor, and the processor executing the machine-executable instructions to implement the steps of the text detection model training method according to any one of claims 1 to 7, the text region determination method according to any one of claims 8 to 10, or the text content determination method according to any one of claims 11 to 25.
  52. A machine-readable storage medium, storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the steps of the text detection model training method according to any one of claims 1 to 7, the text region determination method according to any one of claims 8 to 10, or the text content determination method according to any one of claims 11 to 25.
  53. An executable program code, wherein the executable program code is configured to be run to execute the steps of the text detection model training method according to any one of claims 1 to 7, the text region determination method according to any one of claims 8 to 10, or the text content determination method according to any one of claims 11 to 25.
PCT/CN2020/087809 2019-04-30 2020-04-29 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus WO2020221298A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910367675.2A CN110110715A (en) 2019-04-30 2019-04-30 Text detection model training method, text filed, content determine method and apparatus
CN201910367675.2 2019-04-30

Publications (1)

Publication Number Publication Date
WO2020221298A1 true WO2020221298A1 (en) 2020-11-05

Family

ID=67488106

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087809 WO2020221298A1 (en) 2019-04-30 2020-04-29 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus

Country Status (2)

Country Link
CN (1) CN110110715A (en)
WO (1) WO2020221298A1 (en)

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328710A (en) * 2020-11-26 2021-02-05 北京百度网讯科技有限公司 Entity information processing method, entity information processing device, electronic equipment and storage medium
CN112418209A (en) * 2020-12-15 2021-02-26 润联软件系统(深圳)有限公司 Character recognition method and device, computer equipment and storage medium
CN112417847A (en) * 2020-11-19 2021-02-26 湖南红网新媒体集团有限公司 News content safety monitoring method, system, device and storage medium
CN112434510A (en) * 2020-11-24 2021-03-02 北京字节跳动网络技术有限公司 Information processing method and device, electronic equipment and storage medium
CN112541496A (en) * 2020-12-24 2021-03-23 北京百度网讯科技有限公司 Method, device and equipment for extracting POI name and computer storage medium
CN112560476A (en) * 2020-12-09 2021-03-26 中科讯飞互联(北京)信息科技有限公司 Text completion method, electronic device and storage device
CN112613376A (en) * 2020-12-17 2021-04-06 深圳集智数字科技有限公司 Re-recognition method and device and electronic equipment
CN112651373A (en) * 2021-01-04 2021-04-13 广联达科技股份有限公司 Identification method and device for text information of construction drawing
CN112686812A (en) * 2020-12-10 2021-04-20 广州广电运通金融电子股份有限公司 Bank card inclination correction detection method and device, readable storage medium and terminal
CN112734699A (en) * 2020-12-24 2021-04-30 浙江大华技术股份有限公司 Article state warning method and device, storage medium and electronic device
CN112784692A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Method, device and equipment for identifying text content of image and storage medium
CN112802139A (en) * 2021-02-05 2021-05-14 歌尔股份有限公司 Image processing method and device, electronic equipment and readable storage medium
CN112861739A (en) * 2021-02-10 2021-05-28 中国科学技术大学 End-to-end text recognition method, model training method and device
CN112927173A (en) * 2021-04-12 2021-06-08 平安科技(深圳)有限公司 Model compression method and device, computing equipment and storage medium
CN112949653A (en) * 2021-02-23 2021-06-11 科大讯飞股份有限公司 Text recognition method, electronic device and storage device
CN112966609A (en) * 2021-03-05 2021-06-15 北京百度网讯科技有限公司 Target detection method and device
CN112966690A (en) * 2021-03-03 2021-06-15 中国科学院自动化研究所 Scene character detection method based on anchor-free frame and suggestion frame
CN112989844A (en) * 2021-03-10 2021-06-18 北京奇艺世纪科技有限公司 Model training and text recognition method, device, equipment and storage medium
CN113011312A (en) * 2021-03-15 2021-06-22 中国科学技术大学 Training method of motion positioning model based on weak supervision text guidance
CN113076823A (en) * 2021-03-18 2021-07-06 深圳数联天下智能科技有限公司 Training method of age prediction model, age prediction method and related device
CN113139625A (en) * 2021-05-18 2021-07-20 北京世纪好未来教育科技有限公司 Model training method, electronic device and storage medium thereof
CN113139463A (en) * 2021-04-23 2021-07-20 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for training a model
CN113205041A (en) * 2021-04-29 2021-08-03 百度在线网络技术(北京)有限公司 Structured information extraction method, device, equipment and storage medium
CN113205047A (en) * 2021-04-30 2021-08-03 平安科技(深圳)有限公司 Drug name identification method and device, computer equipment and storage medium
CN113221718A (en) * 2021-05-06 2021-08-06 新东方教育科技集团有限公司 Formula identification method and device, storage medium and electronic equipment
CN113298079A (en) * 2021-06-28 2021-08-24 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113313022A (en) * 2021-05-27 2021-08-27 北京百度网讯科技有限公司 Training method of character recognition model and method for recognizing characters in image
CN113326887A (en) * 2021-06-16 2021-08-31 深圳思谋信息科技有限公司 Text detection method and device and computer equipment
CN113344027A (en) * 2021-05-10 2021-09-03 北京迈格威科技有限公司 Retrieval method, device, equipment and storage medium for object in image
CN113343987A (en) * 2021-06-30 2021-09-03 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium
CN113343970A (en) * 2021-06-24 2021-09-03 中国平安人寿保险股份有限公司 Text image detection method, device, equipment and storage medium
CN113361524A (en) * 2021-06-29 2021-09-07 北京百度网讯科技有限公司 Image processing method and device
CN113379500A (en) * 2021-06-21 2021-09-10 北京沃东天骏信息技术有限公司 Sequencing model training method and device, and article sequencing method and device
CN113378832A (en) * 2021-06-25 2021-09-10 北京百度网讯科技有限公司 Text detection model training method, text prediction box method and device
CN113379592A (en) * 2021-06-23 2021-09-10 北京百度网讯科技有限公司 Method and device for processing sensitive area in picture and electronic equipment
CN113469878A (en) * 2021-09-02 2021-10-01 北京世纪好未来教育科技有限公司 Text erasing method and training method and device of model thereof, and storage medium
CN113591893A (en) * 2021-01-26 2021-11-02 腾讯医疗健康(深圳)有限公司 Image processing method and device based on artificial intelligence and computer equipment
CN113762109A (en) * 2021-08-23 2021-12-07 北京百度网讯科技有限公司 Training method of character positioning model and character positioning method
CN113780087A (en) * 2021-08-11 2021-12-10 同济大学 Postal parcel text detection method and equipment based on deep learning
CN113780131A (en) * 2021-08-31 2021-12-10 众安在线财产保险股份有限公司 Text image orientation recognition method and text content recognition method, device and equipment
CN113806589A (en) * 2021-09-29 2021-12-17 云从科技集团股份有限公司 Video clip positioning method, device and computer readable storage medium
CN114022882A (en) * 2022-01-04 2022-02-08 北京世纪好未来教育科技有限公司 Text recognition model training method, text recognition device, text recognition equipment and medium
CN114419199A (en) * 2021-12-20 2022-04-29 北京百度网讯科技有限公司 Picture labeling method and device, electronic equipment and storage medium
CN114758332A (en) * 2022-06-13 2022-07-15 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN114821622A (en) * 2022-03-10 2022-07-29 北京百度网讯科技有限公司 Text extraction method, text extraction model training method, device and equipment
CN114827132A (en) * 2022-06-27 2022-07-29 河北东来工程技术服务有限公司 Ship traffic file transmission control method, system, device and storage medium
CN114842483A (en) * 2022-06-27 2022-08-02 齐鲁工业大学 Standard file information extraction method and system based on neural network and template matching
CN114937267A (en) * 2022-04-20 2022-08-23 北京世纪好未来教育科技有限公司 Training method and device for text recognition model and electronic equipment
CN115171110A (en) * 2022-06-30 2022-10-11 北京百度网讯科技有限公司 Text recognition method, apparatus, device, medium, and product
CN115601553A (en) * 2022-08-15 2023-01-13 杭州联汇科技股份有限公司(Cn) Visual model pre-training method based on multi-level picture description data
CN116226319A (en) * 2023-05-10 2023-06-06 浪潮电子信息产业股份有限公司 Hybrid heterogeneous model training method, device, equipment and readable storage medium
CN116503517A (en) * 2023-06-27 2023-07-28 江西农业大学 Method and system for generating image by long text
CN117315702A (en) * 2023-11-28 2023-12-29 山东正云信息科技有限公司 Text detection method, system and medium based on set prediction
CN117593752A (en) * 2024-01-18 2024-02-23 星云海数字科技股份有限公司 PDF document input method, PDF document input system, storage medium and electronic equipment
CN113344027B (en) * 2021-05-10 2024-04-23 北京迈格威科技有限公司 Method, device, equipment and storage medium for retrieving objects in image

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110715A (en) * 2019-04-30 2019-08-09 北京金山云网络技术有限公司 Text detection model training method, text filed, content determine method and apparatus
CN110610166B (en) * 2019-09-18 2022-06-07 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium
CN110674804A (en) * 2019-09-24 2020-01-10 上海眼控科技股份有限公司 Text image detection method and device, computer equipment and storage medium
CN110705460B (en) * 2019-09-29 2023-06-20 北京百度网讯科技有限公司 Image category identification method and device
CN110751146B (en) * 2019-10-23 2023-06-20 北京印刷学院 Text region detection method, device, electronic terminal and computer readable storage medium
CN112749704A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Text region detection method and device and server
CN111062385A (en) * 2019-11-18 2020-04-24 上海眼控科技股份有限公司 Network model construction method and system for image text information detection
CN110929647B (en) * 2019-11-22 2023-06-02 科大讯飞股份有限公司 Text detection method, device, equipment and storage medium
CN110942067A (en) * 2019-11-29 2020-03-31 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111104934A (en) * 2019-12-22 2020-05-05 上海眼控科技股份有限公司 Engine label detection method, electronic device and computer readable storage medium
CN113033593B (en) * 2019-12-25 2023-09-01 上海智臻智能网络科技股份有限公司 Text detection training method and device based on deep learning
CN111353442A (en) * 2020-03-03 2020-06-30 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111382740B (en) * 2020-03-13 2023-11-21 深圳前海环融联易信息科技服务有限公司 Text picture analysis method, text picture analysis device, computer equipment and storage medium
CN111784623A (en) * 2020-09-07 2020-10-16 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN112287763A (en) * 2020-09-27 2021-01-29 北京旷视科技有限公司 Image processing method, apparatus, device and medium
CN112541491B (en) * 2020-12-07 2024-02-02 沈阳雅译网络技术有限公司 End-to-end text detection and recognition method based on image character region perception
CN112818975A (en) * 2021-01-27 2021-05-18 北京金山数字娱乐科技有限公司 Text detection model training method and device and text detection method and device
CN112580656A (en) * 2021-02-23 2021-03-30 上海旻浦科技有限公司 End-to-end text detection method, system, terminal and storage medium
CN113076944A (en) * 2021-03-11 2021-07-06 国家电网有限公司 Document detection and identification method based on artificial intelligence
CN113807096A (en) * 2021-04-09 2021-12-17 京东科技控股股份有限公司 Text data processing method and device, computer equipment and storage medium
CN112801097B (en) * 2021-04-14 2021-07-16 北京世纪好未来教育科技有限公司 Training method and device of text detection model and readable storage medium
CN113112511B (en) * 2021-04-19 2024-01-05 新东方教育科技集团有限公司 Method and device for correcting test paper, storage medium and electronic equipment
CN113221711A (en) * 2021-04-30 2021-08-06 北京金山数字娱乐科技有限公司 Information extraction method and device
CN112990181B (en) * 2021-04-30 2021-08-24 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and storage medium
CN113205426A (en) * 2021-05-27 2021-08-03 中库(北京)数据系统有限公司 Method and device for predicting popularity level of social media content
CN113298156A (en) * 2021-05-28 2021-08-24 有米科技股份有限公司 Neural network training method and device for image gender classification
CN113205160B (en) * 2021-07-05 2022-03-04 北京世纪好未来教育科技有限公司 Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium
CN114005019B (en) * 2021-10-29 2023-09-22 北京有竹居网络技术有限公司 Method for identifying flip image and related equipment thereof
CN114065768B (en) * 2021-12-08 2022-12-09 马上消费金融股份有限公司 Feature fusion model training and text processing method and device
CN114663594A (en) * 2022-03-25 2022-06-24 中国电信股份有限公司 Image feature point detection method, device, medium, and apparatus
CN114724144B (en) * 2022-05-16 2024-02-09 北京百度网讯科技有限公司 Text recognition method, training device, training equipment and training medium for model
CN115205562B (en) * 2022-07-22 2023-03-14 四川云数赋智教育科技有限公司 Random test paper registration method based on feature points
CN116630755B (en) * 2023-04-10 2024-04-02 雄安创新研究院 Method, system and storage medium for detecting text position in scene image
CN116311320B (en) * 2023-05-22 2023-08-22 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193667A1 (en) * 2014-01-08 2015-07-09 Qualcomm Incorporated Processing text images with shadows
CN108288078A (en) * 2017-12-07 2018-07-17 腾讯科技(深圳)有限公司 Character identifying method, device and medium in a kind of image
CN108764228A (en) * 2018-05-28 2018-11-06 嘉兴善索智能科技有限公司 Word object detection method in a kind of image
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system
CN110110715A (en) * 2019-04-30 2019-08-09 北京金山云网络技术有限公司 Text detection model training method, text filed, content determine method and apparatus
CN110135248A (en) * 2019-04-03 2019-08-16 华南理工大学 A kind of natural scene Method for text detection based on deep learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519970B (en) * 2018-02-06 2021-08-31 平安科技(深圳)有限公司 Method for identifying sensitive information in text, electronic device and readable storage medium
CN108764226B (en) * 2018-04-13 2022-05-03 顺丰科技有限公司 Image text recognition method, device, equipment and storage medium thereof
CN109086756B (en) * 2018-06-15 2021-08-03 众安信息技术服务有限公司 Text detection analysis method, device and equipment based on deep neural network
CN109447469B (en) * 2018-10-30 2022-06-24 创新先进技术有限公司 Text detection method, device and equipment
CN109492638A (en) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Method for text detection, device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RED STONE: "Wu Enda's CS229, someone condensed it into 6 Chinese cheat sheets!", 12 February 2019 (2019-02-12), pages 1 - 8, XP055750860, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/56534902> *
XINYU ZHOU, CONG YAO, HE WEN, YUZHI WANG, SHUCHANG ZHOU, WEIRAN HE, AND JIAJUN LIANG: "EAST: An Efficient and Accurate Scene Text Detector", COMPUTER SCIENCE, 10 July 2017 (2017-07-10), pages 5551 - 5560, XP080762096, DOI: 10.1109/CVPR.2017.283 *
YU ZHENG ,WANG QING-QING ,LYU YUE: "Scene Text Detection Based on Feature Fusion Network", COMPUTER SYSTEMS AND APPLICATIONS, vol. 27, no. 10, 15 October 2018 (2018-10-15), pages 1 - 10, XP055750846, ISSN: 1003-3254, DOI: 10.15888/j.cnki.csa.006539 *

CN113221718B (en) * 2021-05-06 2024-01-16 新东方教育科技集团有限公司 Formula identification method, device, storage medium and electronic equipment
CN113221718A (en) * 2021-05-06 2021-08-06 新东方教育科技集团有限公司 Formula identification method and device, storage medium and electronic equipment
CN113344027B (en) * 2021-05-10 2024-04-23 北京迈格威科技有限公司 Method, device, equipment and storage medium for retrieving objects in image
CN113344027A (en) * 2021-05-10 2021-09-03 北京迈格威科技有限公司 Retrieval method, device, equipment and storage medium for object in image
CN113139625B (en) * 2021-05-18 2023-12-15 北京世纪好未来教育科技有限公司 Model training method, electronic equipment and storage medium thereof
CN113139625A (en) * 2021-05-18 2021-07-20 北京世纪好未来教育科技有限公司 Model training method, electronic device and storage medium thereof
CN113313022B (en) * 2021-05-27 2023-11-10 北京百度网讯科技有限公司 Training method of character recognition model and method for recognizing characters in image
CN113313022A (en) * 2021-05-27 2021-08-27 北京百度网讯科技有限公司 Training method of character recognition model and method for recognizing characters in image
CN113326887A (en) * 2021-06-16 2021-08-31 深圳思谋信息科技有限公司 Text detection method and device and computer equipment
CN113326887B (en) * 2021-06-16 2024-03-29 深圳思谋信息科技有限公司 Text detection method, device and computer equipment
CN113379500A (en) * 2021-06-21 2021-09-10 北京沃东天骏信息技术有限公司 Sequencing model training method and device, and article sequencing method and device
CN113379592A (en) * 2021-06-23 2021-09-10 北京百度网讯科技有限公司 Method and device for processing sensitive area in picture and electronic equipment
CN113379592B (en) * 2021-06-23 2023-09-01 北京百度网讯科技有限公司 Processing method and device for sensitive area in picture and electronic equipment
CN113343970B (en) * 2021-06-24 2024-03-08 中国平安人寿保险股份有限公司 Text image detection method, device, equipment and storage medium
CN113343970A (en) * 2021-06-24 2021-09-03 中国平安人寿保险股份有限公司 Text image detection method, device, equipment and storage medium
CN113378832A (en) * 2021-06-25 2021-09-10 北京百度网讯科技有限公司 Text detection model training method, text prediction box method and device
CN113298079B (en) * 2021-06-28 2023-10-27 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113298079A (en) * 2021-06-28 2021-08-24 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN113361524A (en) * 2021-06-29 2021-09-07 北京百度网讯科技有限公司 Image processing method and device
CN113343987A (en) * 2021-06-30 2021-09-03 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium
CN113343987B (en) * 2021-06-30 2023-08-22 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium
CN113780087B (en) * 2021-08-11 2024-04-26 同济大学 Postal package text detection method and equipment based on deep learning
CN113780087A (en) * 2021-08-11 2021-12-10 同济大学 Postal parcel text detection method and equipment based on deep learning
CN113762109A (en) * 2021-08-23 2021-12-07 北京百度网讯科技有限公司 Training method of character positioning model and character positioning method
CN113762109B (en) * 2021-08-23 2023-11-07 北京百度网讯科技有限公司 Training method of character positioning model and character positioning method
CN113780131A (en) * 2021-08-31 2021-12-10 众安在线财产保险股份有限公司 Text image orientation recognition method and text content recognition method, device and equipment
CN113780131B (en) * 2021-08-31 2024-04-12 众安在线财产保险股份有限公司 Text image orientation recognition method, text content recognition method, device and equipment
CN113469878B (en) * 2021-09-02 2021-11-12 北京世纪好未来教育科技有限公司 Text erasing method and training method and device of model thereof, and storage medium
CN113469878A (en) * 2021-09-02 2021-10-01 北京世纪好未来教育科技有限公司 Text erasing method and training method and device of model thereof, and storage medium
CN113806589A (en) * 2021-09-29 2021-12-17 云从科技集团股份有限公司 Video clip positioning method, device and computer readable storage medium
CN113806589B (en) * 2021-09-29 2024-03-08 云从科技集团股份有限公司 Video clip positioning method, device and computer readable storage medium
CN114419199B (en) * 2021-12-20 2023-11-07 北京百度网讯科技有限公司 Picture marking method and device, electronic equipment and storage medium
CN114419199A (en) * 2021-12-20 2022-04-29 北京百度网讯科技有限公司 Picture labeling method and device, electronic equipment and storage medium
CN114022882A (en) * 2022-01-04 2022-02-08 北京世纪好未来教育科技有限公司 Text recognition model training method, text recognition device, text recognition equipment and medium
CN114821622A (en) * 2022-03-10 2022-07-29 北京百度网讯科技有限公司 Text extraction method, text extraction model training method, device and equipment
CN114937267A (en) * 2022-04-20 2022-08-23 北京世纪好未来教育科技有限公司 Training method and device for text recognition model and electronic equipment
CN114937267B (en) * 2022-04-20 2024-04-02 北京世纪好未来教育科技有限公司 Training method and device for text recognition model and electronic equipment
CN114758332B (en) * 2022-06-13 2022-09-02 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN114758332A (en) * 2022-06-13 2022-07-15 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN114827132B (en) * 2022-06-27 2022-09-09 河北东来工程技术服务有限公司 Ship traffic file transmission control method, system, device and storage medium
CN114842483B (en) * 2022-06-27 2023-11-28 齐鲁工业大学 Standard file information extraction method and system based on neural network and template matching
CN114842483A (en) * 2022-06-27 2022-08-02 齐鲁工业大学 Standard file information extraction method and system based on neural network and template matching
CN114827132A (en) * 2022-06-27 2022-07-29 河北东来工程技术服务有限公司 Ship traffic file transmission control method, system, device and storage medium
CN115171110A (en) * 2022-06-30 2022-10-11 北京百度网讯科技有限公司 Text recognition method, apparatus, device, medium, and product
CN115171110B (en) * 2022-06-30 2023-08-22 北京百度网讯科技有限公司 Text recognition method and device, equipment, medium and product
CN115601553A (en) * 2022-08-15 2023-01-13 杭州联汇科技股份有限公司 Visual model pre-training method based on multi-level picture description data
CN115601553B (en) * 2022-08-15 2023-08-18 杭州联汇科技股份有限公司 Visual model pre-training method based on multi-level picture description data
CN116226319A (en) * 2023-05-10 2023-06-06 浪潮电子信息产业股份有限公司 Hybrid heterogeneous model training method, device, equipment and readable storage medium
CN116226319B (en) * 2023-05-10 2023-08-04 浪潮电子信息产业股份有限公司 Hybrid heterogeneous model training method, device, equipment and readable storage medium
CN116503517A (en) * 2023-06-27 2023-07-28 江西农业大学 Method and system for generating image by long text
CN116503517B (en) * 2023-06-27 2023-09-05 江西农业大学 Method and system for generating image by long text
CN117315702B (en) * 2023-11-28 2024-02-23 山东正云信息科技有限公司 Text detection method, system and medium based on set prediction
CN117315702A (en) * 2023-11-28 2023-12-29 山东正云信息科技有限公司 Text detection method, system and medium based on set prediction
CN117593752B (en) * 2024-01-18 2024-04-09 星云海数字科技股份有限公司 PDF document input method, PDF document input system, storage medium and electronic equipment
CN117593752A (en) * 2024-01-18 2024-02-23 星云海数字科技股份有限公司 PDF document input method, PDF document input system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110110715A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN108898137B (en) Natural image character recognition method and system based on deep neural network
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN113297975B (en) Table structure identification method and device, storage medium and electronic equipment
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
US9129191B2 (en) Semantic object selection
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
US9129192B2 (en) Semantic object proposal generation and validation
CN109815770B (en) Two-dimensional code detection method, device and system
CN110647829A (en) Bill text recognition method and system
CN110032998B (en) Method, system, device and storage medium for detecting characters of natural scene picture
CN111652217A (en) Text detection method and device, electronic equipment and computer storage medium
CN109993040A (en) Text recognition method and device
CN111259940A (en) Target detection method based on space attention map
CN106372624B (en) Face recognition method and system
CN111475622A (en) Text classification method, device, terminal and storage medium
CN110909724B (en) Thumbnail generation method of multi-target image
CN112347284A (en) Combined trademark image retrieval method
CN113420669B (en) Document layout analysis method and system based on multi-scale training and cascade detection
CN115457565A (en) OCR character recognition method, electronic equipment and storage medium
CN114187595A (en) Document layout recognition method and system based on fusion of visual features and semantic features
KR102405522B1 (en) Apparatus and method for contextual unethical detection reflecting hierarchical characteristics of text
CN116935411A (en) Radical-level ancient character recognition method based on character decomposition and reconstruction
CN108845999B (en) Trademark image retrieval method based on multi-scale regional feature comparison

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20798388

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 180222)

122 Ep: pct application non-entry in european phase

Ref document number: 20798388

Country of ref document: EP

Kind code of ref document: A1