CN115797947A - Image-based font identification method and device, electronic equipment and storage medium

Publication number: CN115797947A
Application number: CN202211559191.6A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Pending
Inventors: 缪瑜, 刘奎龙, 祁欣妍, 李可娜
Applicant and current assignee: Alibaba China Co Ltd
Prior art keywords: text, font, image, target, recognition

Abstract

According to the embodiments of the present application, an interfering text element other than the target text element is first determined in the image to be detected. A text relevance identification result is then obtained for the target image area outside the interfering text element; this result represents at least one text paragraph area into which the target image area is correspondingly divided. Finally, the target font corresponding to the target text element is obtained according to the font identification result corresponding to the at least one text paragraph area. With this scheme, the regions occupied by interfering text elements in the image to be detected can be excluded and font identification can be concentrated on the region of the target text element, which reduces the amount of computation required for font identification, improves the efficiency of font identification, and makes batch detection of images practical.

Description

Image-based font identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for font identification based on an image, an electronic device, and a storage medium.
Background
In electronic commerce, images are frequently published to promote goods, increase the number of advertising page views, and drive traffic to goods or stores in order to support business growth. Some merchants attract visitors by designing text with distinctive fonts in these images.
However, even the fonts bundled with a computer are not necessarily licensed for commercial use. To avoid font infringement and protect the intellectual property of font owners, it is necessary to provide a font detection service that assists with copyright self-checking and avoids copyright risk.
In the related art, manual assistance is usually required to determine the image area where the text is located, so detection efficiency is low and the user experience is poor.
Disclosure of Invention
Embodiments of the present application provide a font identification method and apparatus based on an image, an electronic device, and a storage medium, so as to solve one or more of the above technical problems.
In a first aspect, an embodiment of the present application provides an image-based font identification method, including:
determining an interference text element other than a target text element from an image to be detected;
acquiring a text relevance identification result of a target image area other than the interference text element, wherein the text relevance identification result represents at least one text paragraph area into which the target image area is correspondingly divided;
and obtaining a target font corresponding to the target text element according to the font identification result corresponding to the at least one text paragraph area.
In a second aspect, an embodiment of the present application provides an image-based font identification method, including:
submitting a target image to be identified on a text identification page;
acquiring a target font recognized for the target image to be identified, wherein the target font is determined according to a font identification result corresponding to at least one text paragraph area, and the text paragraph area is obtained by division according to a text relevance identification result of a target image area, the target image area being the area of an image to be detected other than interference text elements;
and displaying the target font and corresponding infringement analysis early warning information and/or infringement processing strategy on the text identification page.
In a third aspect, an embodiment of the present application provides an apparatus for recognizing fonts based on images, including:
the interference text determining module is used for determining interference text elements except the target text element from the image to be detected;
the text relevance identification module is used for acquiring a text relevance identification result of a target image area except the interference text element, and the text relevance identification result represents at least one text paragraph area correspondingly divided by the target image area;
and the target font obtaining module is used for obtaining a target font corresponding to the target text element according to the font identification result corresponding to the at least one text paragraph area.
In a fourth aspect, an embodiment of the present application provides an image-based font identification apparatus, including:
the detection image submitting module is used for submitting an image to be detected on the text recognition page;
the target font acquisition module is used for acquiring a target font recognized for the image to be detected, wherein the target font is determined according to a font identification result corresponding to at least one text paragraph area, and the text paragraph area is obtained by division according to a text relevance identification result of a target image area, the target image area being the area of the image to be detected other than interference text elements;
and the infringement information disclosure module is used for displaying the target font and corresponding infringement analysis early warning information and/or infringement processing strategy on the text identification page.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor implements the method of any one of the above when executing the computer program.
In a sixth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method of any one of the above.
Compared with the related art, the method has the following advantages:
according to the embodiment of the application, the interference text element except the target text element is determined from the image to be detected, then the text relevance identification result of at least one text paragraph area which is used for representing the corresponding division of the target image area and is in the target image area except the interference text element is obtained, and then the target font corresponding to the target text element is obtained according to the font identification result corresponding to the at least one text paragraph area. By adopting the scheme, the interference text element region in the image to be detected can be eliminated, the font identification is concentrated in the target image region in the image to be detected, the calculation amount of the font identification is reduced, the font identification efficiency is further improved, and the requirement of image batch detection is met. The image area corresponding to the text needing to be identified does not need to be confirmed manually, the font identification result of the target text element in the image can be obtained only by providing the image, the special requirement on the image to be detected is avoided, and the operation is simple and convenient. Moreover, the influence of the interfering text element font on the identification accuracy of the target text element font can be reduced.
The method is characterized in that the text relevance identification is carried out on the font identification results belonging to the same text paragraph area in the image to be detected, which is equivalent to the comprehensive judgment of the font by combining the relevant text, so that the inaccurate identification result of part of the text is eliminated, and the accuracy of the font identification can be improved.
The foregoing description is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer, and to make the above and other objects, features, and advantages of the present application more readily apparent, a detailed description of the present application follows.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are not to be considered limiting of its scope.
FIG. 1 is a diagram illustrating an example application of a font recognition scheme according to an embodiment of the present application;
FIG. 2 illustrates a flow diagram of a method for image-based font recognition in accordance with an embodiment of the present application;
FIG. 3 shows a flow diagram of an image-based font identification method of another embodiment of the present application;
FIG. 4 is a block diagram illustrating an exemplary embodiment of an apparatus for image-based font recognition;
FIG. 5 is a block diagram of an image-based font recognition apparatus according to another embodiment of the present application; and
FIG. 6 shows a block diagram of an electronic device used to implement an embodiment of the application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
To facilitate understanding of the technical solutions of the embodiments of the present application, the following describes related arts of the embodiments of the present application. The following related arts as alternatives can be arbitrarily combined with the technical solutions of the embodiments of the present application, and all of them belong to the scope of the embodiments of the present application.
At present, the copyright of most fonts on the market is held by font companies. If a font is to be used commercially in network images, authorization from the font company must be obtained and the corresponding copyright fee paid before commercial use. For various reasons, however, cases involving font infringement over the last decade have numbered in the hundreds; a large number of e-commerce sellers, content organizations, and the like have received rights-enforcement claims from different copyright holders and suffered losses from various font infringements, attorney-letter warnings have become a running joke among sellers and self-media, and annual compensation payments and the like reportedly amount to hundreds of millions. It is therefore necessary to provide a font identification scheme that identifies the font type of the text in network images, so as to assist font copyright self-checking, protect the intellectual property of font owners, and prevent font infringement.
In the prior art, the image area where the text is located must be determined with manual assistance, so font identification is inefficient, cannot meet the demand for batch detection of images, increases the cost of using font identification, and results in a poor user experience.
In view of the above, embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for image-based font recognition, so as to solve all or part of the above technical problems.
Fig. 1 is a schematic diagram of an application example of the font recognition scheme of an embodiment of the present application. The solution of the embodiments of the present application can be implemented as a font recognition apparatus. Corresponding to this font recognition apparatus, a client, page, or program plug-in for submitting font recognition requests can be provided. Demanding parties such as individuals, merchants, font copyright holders, or the platforms carrying the images can provide the images to be detected through this client, page, or plug-in, which submits the font recognition request to the detection end; after the font recognition apparatus at the detection end completes font recognition, the font recognition result is fed back.
Take, as an example, a merchant detecting the fonts in a banner (banner advertisement). Since a banner usually links to a commodity or a web page of the merchant, the merchant may use a variety of fonts in the banner to improve the click-through rate, which in turn creates a risk of font infringement. Referring to Fig. 1, after the merchant uploads the banner to the text recognition page of the client, the client sends a font recognition request to the detection end. The detection end is provided with a font recognition apparatus, which may contain deep learning models such as a semantic segmentation model, a target detection model, and a text box detection model, used to recognize the interfering text elements in the image to be detected, determine the target image area other than the interfering text elements, and perform text relevance recognition. Then, according to a layout detection model, at least one text paragraph area correspondingly divided in the target image area is taken as the text relevance recognition result of the target image area; after the one or more font recognition results within the same paragraph area are judged comprehensively, a font recognition result for the whole paragraph area is obtained and taken as the target font corresponding to the target text element. The detection end sends the obtained font recognition result to the client, where it is displayed on the text recognition page. In the banner image uploaded by the merchant, only "mini70, camera battery, CR2 battery x 2" belongs to the target text elements to be detected; the other text elements are interfering text elements of three types: the object identification element (the logo in the banner), the background element (the text on the camera in the banner), and the commodity object element (the text on the battery in the banner).
In the process, after the client sends the font identification request to the detection end and before the font detection result is output, the font identification device of the detection end completes the whole process of image font identification without additional operation. The image area corresponding to the text needing to be identified does not need to be confirmed manually, the font identification result of the target text element in the image can be obtained only by providing the image, the special requirement on the image to be detected is avoided, and the operation is simple and convenient.
Secondly, text recognition is performed on the image to be detected by the font recognition apparatus to recognize the interfering text elements in the image, such as text on commodities, people, backgrounds, and marks; these interfering text elements may be text and marks carried on the packaging of the commodities, and detecting them wastes detection resources, reduces detection efficiency, and can affect the accuracy of the final detection result. Once the interfering text elements in the image have been recognized, the regions they occupy in the image to be detected can be excluded, so that font identification is concentrated on the target image area, the amount of computation for font identification is reduced, the efficiency of font identification is improved, and the requirement of batch image detection is met. At the same time, the influence of the fonts of the interfering text elements on the identification accuracy for the font of the target text element can be reduced.
In addition, the font recognition apparatus segments the image to be detected. Text elements in the image to be detected may belong to different parts of the image, but some of them may have been handled as the same paragraph when edited into the image, and text within the same paragraph is very likely to use the same font. Therefore, through paragraph division, text relevance identification can be performed on the font identification results that may belong to the same text paragraph area in the image to be detected, which amounts to judging the font comprehensively in combination with the related text, eliminating inaccurate identification results for part of the text and improving the accuracy of font identification.
It should be noted that the above is only an exemplary application scenario of the present application; the embodiments of the present application can also be applied to any font identification scenario. Besides the foregoing example in which a merchant detects the fonts in a banner through the client, a font copyright holder may detect the fonts in an image found to contain a potentially infringing font, and the platform carrying the images (such as a transaction platform or a photo website) may detect the fonts in images uploaded to the platform by third parties; this is not limited in this application. The image to be detected may be a web poster image, an official-account illustration, or a website design interface, for example a dynamic web page in HTML5 (HyperText Markup Language 5, fifth edition); this is not limited in this application either.
The execution subject of the embodiments of the present application may be a functional module in the form of an application, a service, an instance, software, a virtual machine (VM), a container, or a cloud server, or a hardware device (such as a server or terminal device) or a hardware chip (such as a CPU, GPU, FPGA, NPU, AI accelerator card, or DPU) having a data processing function. The apparatus implementing font identification may be deployed on a computing device of the application side providing the corresponding service, or on a cloud computing platform providing computing power, storage, and network resources; the cloud computing platform may offer its services externally as IaaS (Infrastructure as a Service), PaaS (Platform as a Service), SaaS (Software as a Service), or DaaS (Data as a Service). Taking a platform providing SaaS (Software as a Service) as an example, the cloud computing platform can use its computing resources to train or deploy the model for the font recognition apparatus, and a specific application architecture can be built according to service requirements. For example, the platform may provide a model-based building service to applications or individuals using the platform's resources, and then invoke the model and implement the recognition function based on font recognition requests submitted by associated clients, servers, or other devices.
The following describes the technical solution of the present application and how to solve the above technical problems with specific embodiments. Several of the specific embodiments listed may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating an image-based font identification method according to an embodiment of the present application, which may include:
in step S201, an interfering text element other than the target text element is determined from the image to be detected.
The images involved in the embodiments of the present application contain text elements and image elements. A text element is a text object in the image and may take various textual forms such as Chinese, English, numbers, and symbols. The text object that needs to be recognized can be determined as the target text element; since the other text objects do not need font recognition, and recognizing them may affect the font recognition result of the target text element, some or all of the other text objects can be determined as interfering text elements. For example, when font recognition is required for the advertising text in a banner, the advertising text is the target text element, and the other portions that do not belong to the advertising text are interfering text elements that interfere with its recognition. Image elements are the other content excluding the text portion, such as icons, the subject object, and the background portion of the image.
The target text elements and the interfering text elements may be distributed over one or more areas and may each be composed of one or more text forms.
It can be understood that, by determining the interfering text elements in the image to be detected and removing their influence on the target text elements in the subsequent font identification, both the detection efficiency and the accuracy of the font identification result can be improved.
There may be one or more images to be detected, and detecting multiple images to be detected concurrently can improve detection efficiency. A submission entry for the image to be detected can be provided through a client, a page, or a program plug-in, and the party requiring font detection can upload the image to be detected through this entry; alternatively, the image to be detected can be actively acquired by the font recognition apparatus.
In some embodiments, the interfering text elements include at least one or more of a commodity object element, a background element, or an object identification element. A commodity object element may be a text object carried by the main subject of the image to be detected; for example, in a scenario where a merchant performs font detection on a banner in its commodity page, the commodity object element may be the text carried by the commodity shown in the banner image, such as characters printed on clothing, food, or household goods. A background element may be a text object carried by an auxiliary part of the image to be detected; for example, in an image with a keyboard as the background, the keyboard carries text objects such as the letters A, B, and C. The letters carried by the keyboard itself may not be targets that require font detection, but because the keyboard, as the image background, may occupy all or most of the image, the area where the target text element is located coincides with, or directly overlaps, the area where the background element is located (for example, the commercial content the image is meant to display is a keyboard cleaning cloth, and advertising text about the cloth's size, material, and so on is shown over the image area where the keyboard lies). Because the background element and the target element often sit on different layers of the image to be detected, the position of the background element can, for example, be determined by detecting the spatial positions of the pixels. An object identification element can be a visual symbol, such as the trademark or logo of a commodity or merchant, or a special image produced by a certain artistic treatment.
In a possible implementation, the interfering text elements include at least a commodity object element and a background element. The commodity object element and the background element in the image may themselves carry text elements, but these are not target text elements; the text they carry does not belong to the target text for font recognition and needs to be removed as interfering text. When the interfering text elements other than the target text element are determined from the image to be detected, the attention detection module of a semantic segmentation model can be called to detect the commodity object element and the background element from the image to be detected, and the main body segmentation module of the semantic segmentation model is then called to perform edge segmentation on the commodity object element and the background element, with the image area delimited after edge segmentation determined as the object image area corresponding to the commodity object element and the background element.
The attention detection module of the semantic segmentation model is used to classify the pixels in the image to be detected: each pixel is given a class label according to its features, and different weights are applied, so that the pixel is judged as belonging to a commodity object element or a background element of the image to be detected. The main body segmentation module of the semantic segmentation model performs edge segmentation on the labeled pixels within the region where the pixels marked by the attention detection module as commodity object elements or background elements lie, obtains the edge-segmented image region, and designates it as the object image area corresponding to the commodity object element or background element. The text falling inside an object image area can therefore be determined to be interfering text, and the text in the object image area is not examined in the subsequent font identification, which saves font identification resources and improves font identification efficiency.
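A minimal sketch (PyTorch) of how the pixel-level labels produced by such a segmentation model could be turned into object image areas and used to mark text boxes as interfering. The class indices, tensor shapes, and the 50% overlap threshold are illustrative assumptions, not details given by the application:

```python
import torch

# Hypothetical class indices output by the semantic segmentation model.
COMMODITY_OBJECT = 1    # e.g. text-bearing goods shown in the banner
BACKGROUND_ELEMENT = 2  # e.g. a keyboard used as the image background

def object_image_mask(seg_logits: torch.Tensor) -> torch.Tensor:
    """seg_logits: (num_classes, H, W) per-pixel scores.
    Returns a boolean (H, W) mask covering commodity-object and background areas."""
    labels = seg_logits.argmax(dim=0)
    return (labels == COMMODITY_OBJECT) | (labels == BACKGROUND_ELEMENT)

def is_interfering_text_box(box, mask: torch.Tensor, threshold: float = 0.5) -> bool:
    """box: (x1, y1, x2, y2) in pixels. A text box whose area mostly falls inside
    an object image area is treated as an interfering text element and skipped
    during font recognition."""
    x1, y1, x2, y2 = (int(v) for v in box)
    region = mask[y1:y2, x1:x2]
    return region.numel() > 0 and region.float().mean().item() >= threshold
```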
The attention detection module in the semantic segmentation model can extract the features of the image to be detected in a feature extraction mode to obtain a feature map of the image. For example, an image to be detected may be input into the attention detection module, and image features in the image to be detected are extracted by the attention detection module, where the extraction may be performed through one network layer or multiple network layers, and then feature maps extracted each time are fused to obtain an overall feature map; when the image to be detected is divided into a plurality of regions, feature extraction can be performed on the image in each region respectively, and an image feature map of the region can be obtained. Furthermore, the attention detection module may apply different attention weights to different features in the feature map according to the obtained feature map; or, the process of applying attention weight to different features in the attention detection module may be parallel to the process of feature extraction, and the attention weight is respectively applied to feature maps of a plurality of network layers to assist the fusion of different feature layers in different networks and determine the influence weight of each feature layer in the network on the output.
Optionally, the attention detection module in the semantic segmentation model obtains a weight distribution through pre-training and applies the weight distribution to the features extracted from the image. The weighting may keep and weight all components (soft attention), or select some components of the distribution with a certain sampling strategy and weight only those (hard attention); it can act at the spatial scale to weight different spatial regions, at the channel scale to weight different channel features, or on the feature map to weight every element. For example, a channel attention mechanism may be adopted, applying a weight distribution over the channels to weight different channel features.
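As an illustration of the soft, channel-scale weighting described above, the following is a minimal squeeze-and-excitation-style channel attention block. This is a generic, assumed implementation; the application does not specify the exact structure of its attention module:

```python
import torch
from torch import nn

class ChannelAttention(nn.Module):
    """Soft channel attention: a weight in (0, 1) is learned per channel and
    multiplied onto every channel of the feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, C, H, W)
        weights = self.fc(x.mean(dim=(2, 3)))             # (N, C), one weight per channel
        return x * weights[:, :, None, None]               # re-weight each channel
```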
Optionally, the attention detection module in the semantic segmentation model may use an FPN (Feature Pyramid Network) to extract features from the image to be detected. For example, feature extraction is performed on the image to be detected through the FPN to obtain the feature map output by the last layer of the top-down pathway. Because the top-down pathway of the FPN progressively restores resolution layer by layer, the feature map output by its last layer has the highest resolution, so the commodity object element features detected in the final feature map are more accurate.
Further, when feature extraction is performed in the FPN manner, it may be combined with an attention mechanism, for example a SAM (spatial attention module). Doing so improves the guided fusion of different network layers in the FPN, compensating for the differences in scale and semantics between the feature maps of different FPN layers, and effectively alleviates the FPN's layer-combination problem (that is, the problem of fusing features that differ in scale and semantics), which helps the semantic segmentation model obtain more edge information and makes it less likely to mis-segment even when a small edge area is missing. The SAM exploits the spatial relationships between feature maps so that the model attends to the spatial locations of different features. For example, the SAM may aggregate the pixel values at the same position across the feature maps along the channel axis to obtain two spatial attention maps, fuse them with a convolution, and then apply a sigmoid function (an S-shaped growth curve) to obtain a spatial matrix of the same dimensions as the feature map (which may be a binary mask) carrying the spatial attention weights; finally, the spatial matrix is multiplied element-wise onto the original feature map to obtain a new feature map.
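A minimal sketch of the spatial attention step just described: pool pixel values across channels (max and mean), fuse the two spatial maps with a convolution, squash with a sigmoid, and multiply the weight map back onto the features. The kernel size and module boundaries are assumptions:

```python
import torch
from torch import nn

class SpatialAttention(nn.Module):
    """Spatial attention: compute a (N, 1, H, W) weight map from channel-wise
    max and mean maps, then re-weight the feature map spatially."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (N, C, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)                 # (N, 1, H, W)
        mean_map = x.mean(dim=1, keepdim=True)                  # (N, 1, H, W)
        weight = torch.sigmoid(self.conv(torch.cat([max_map, mean_map], dim=1)))
        return x * weight                                        # spatially re-weighted features
```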
The main body segmentation module of the semantic segmentation model can perform edge segmentation on the region range where the commodity object element or the background element is located to obtain an image region after the edge segmentation, and the image region is defined as an object image region corresponding to the commodity object element or the background element.
Optionally, the main body segmentation module of the semantic segmentation model may be implemented with an HED (Holistically-nested Edge Detection) mechanism. Compared with other edge detection mechanisms, the edge maps generated by HED better preserve the boundaries of the target objects in the image and can retain all the boundaries of the detected target, so this mechanism improves the accuracy of edge detection.
For example, in combination with the attention detection module, the feature map extracted by the attention detection module may be input into the HED mechanism, HED performs edge detection on the feature map, and the HED-processed feature map is output to the SAM, so that the SAM generates a weight map that assists the fusion of the different feature layers in the FPN. The weight map is equivalent to a layer of attention weights parallel to the feature extraction network and determines how strongly each feature layer in the network influences the output. The feature map output by each layer of the FPN's top-down pathway is weighted with the attention weights, and the weighted result is fed back into the HED mechanism until, after SAM labeling, the final output feature map is obtained; alternatively, according to the correspondence between the top-down FPN layers and the attention weights, the attention weights are used as the weights of a weighted summation over the feature maps output by the top-down FPN layers to obtain the final feature map.
In a possible implementation, the interfering text elements include at least an object identification element. Since an object identification element is usually a unique, artistically processed image whose main role is to identify the source of the commodity (a logo, for example), it is not a text element that the demanding party is concerned may constitute font infringement, and the characters in it usually do not need font detection. Object identification elements are characterized by positions in the image that are not rotated or flipped, a limited number of occurrences, and a small occupied area, so a relatively accurate model is needed to identify them. Therefore, when the interfering text elements other than the target text element are determined from the image to be detected, a target detection model can be called to detect, from the image to be detected, the object identification element and the identification image area where it is located; the target detection model detects the object identification element and its corresponding identification image area in parallel.
The target detection model involved can sample a large number of regions in the image to be detected, judge whether the regions contain the target of interest, and adjust the region edges so as to predict the target's ground-truth bounding box more accurately. The target detection model may be a deep learning model; for example, a YOLO model ("You Only Look Once: Unified, Real-Time Object Detection", a detection system based on a single neural network) may be adopted. Compared with traditional object detection methods, YOLO offers a different idea: it turns object detection into a regression and classification problem, that is, given an input image, the bounding box of the object and its class are regressed directly at multiple positions in the image. The YOLO model includes, but is not limited to, YOLOv3, YOLOv4, and YOLOv5 (all different versions of YOLO); different versions differ in their weights, network structures, and algorithms, and in the region sampling methods they use.
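A hedged usage sketch of logo detection with a YOLOv5 model loaded through torch.hub (the ultralytics/yolov5 hub entry point is public; the checkpoint name logo_detector.pt and the image file are hypothetical stand-ins for a model fine-tuned on object identification elements):

```python
import torch

# Load a custom YOLOv5 checkpoint through torch.hub; 'logo_detector.pt' is a
# hypothetical model fine-tuned to detect object identification elements (logos).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='logo_detector.pt')

results = model('banner_to_check.jpg')           # run detection on the image to be checked
boxes = results.xyxy[0]                          # tensor of (x1, y1, x2, y2, conf, class)
logo_regions = [b[:4].tolist() for b in boxes]   # identification image areas to exclude
```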
The identification image area involved may be the image area, delimited by the target detection model, in which the object identification element is located; the image area delimited by edge segmentation is determined as the identification image area corresponding to the object identification element.
Optionally, the target detection model further includes an FPN feature extraction module and a PAN (Path Aggregation Network) feature extraction module, that is, an FPN + PAN mechanism. In traditional image feature extraction, the deeper the sampling layer, the larger the receptive field of the obtained feature map, the more abstract the features and the richer the semantic information, but the less the positional information, so small targets are detected with low precision; conversely, the shallower the sampling layer, the less semantic information and the more positional information the feature map contains, so small targets can be detected but are easily misclassified. Compared with a traditional feature extraction network, the FPN is a top-down feature pyramid that passes strong high-level semantic features downward and enhances the whole pyramid, but it only strengthens semantic information and does not convey localization information well (the transmission path upward is too long for it to propagate effectively). The PAN addresses exactly this point: a bottom-up pyramid is added after the FPN to supplement it and pass the low-level localization features upward, so that the resulting pyramid combines semantic information with localization information, getting the best of both. The FPN + PAN mechanism therefore contains both the bottom-up sampling layers of different scales and the top-down sampling layers of different scales of the traditional approach, so rich and varied image features can be extracted, small targets can be detected while classification accuracy for them is maintained, and the text boxes finally divided according to the image features are more detailed and accurate.
For example, the target detection model Yolov5l (a variant of YOLOv5) may be used to detect logos as targets in the image. A logo is characterized as a special, artistically processed image; the characters it contains are usually specially designed and do not belong to any existing font, so the logo can be regarded as a special image element, and logos in the image can be detected by recognizing pixel features. Yolov5l is chosen because, compared with other target detection models, it has a deeper network and a wider network width, so the extracted feature maps have more channels and the network's feature extraction, feature fusion, and learning capabilities are stronger. Specifically, the image to be detected is down-sampled 5 times by the pre-trained target detection model Yolov5l, and the feature maps obtained from the 5 down-sampling steps are fused to obtain the feature map of the image to be detected; meanwhile, the region edge of the logo in the image to be detected is determined from the feature map, the region edge is adjusted, and the real bounding box of the logo region is finally determined. The region edge may be adjusted by determining an edge from the result of each of the 5 down-sampling steps and then fusing the 5 edges, or the region edge may be determined directly from the fused feature map. Repeated down-sampling shrinks the scale of the image to be detected and yields smaller feature maps, so the pixels representing the logo can be detected quickly with a smaller amount of computation, improving detection efficiency and accuracy.
Before the above Yolov5l model is adopted to detect a logo target in an image, the model can be trained in advance, for example, an image is input, and the position (if any) of the logo in the image is told to the model, and after multiple times of training, the model can predict the position of the logo in the newly input image.
The mode of detecting the logo target by using the Yolov5l model can comprise the following four parts:
Firstly, at the input end, Mosaic data augmentation (a data augmentation mechanism), adaptive anchor box calculation, and/or an adaptive picture scaling mechanism can be used to pre-process the image to be detected, so that the target detection model can better distinguish object identification elements from background elements and the information in the feature map of the image to be detected is ultimately more accurate. Mosaic data augmentation, through pre-training in which several pictures are stitched and overlaid and the detection target must then be found, improves the separation of detection targets from the background, strengthening the Yolov5l model's ability to distinguish background elements from object identification elements. Whereas traditional target detection uses a window of fixed size, adaptive anchor box calculation uses a slidable window that can take different sizes and slides step by step, left to right and top to bottom, with a set stride; it adapts better to objects with large deformations, requires less computation, and improves detection precision. Through detection, the class probabilities of multiple anchor boxes in the image to be detected (that is, the probability that a region belongs to an object identification element or a background element) and the position of the bounding box containing the object identification element can be obtained. The adaptive picture scaling mechanism (letterbox) uniformly scales input images of different sizes to the same size so that detection works better; since the scaling process can lose image information, the adaptive scaling mechanism lets the network make full use of the receptive-field information in the picture. For example, each pixel in the last feature map of the Yolov5l model can correspond to a 32 x 32 area of the original image, so as long as the length and width are divisible by 32 while the overall scaling ratio is kept consistent, the receptive-field information can be used effectively and the accuracy of feature extraction for the picture to be detected improved.
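A sketch of the adaptive picture scaling (letterbox) idea described above, assuming OpenCV for resizing; the target size of 640 and padding value of 114 are conventional YOLOv5 defaults used here only for illustration:

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_size: int = 640, stride: int = 32, pad_value: int = 114):
    """Scale the image uniformly (no distortion), then pad only as much as needed
    so both sides are divisible by the network stride (32), instead of padding to
    a full square - keeping the receptive-field information usable with less waste."""
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)
    new_h, new_w = round(h * scale), round(w * scale)
    pad_h, pad_w = (-new_h) % stride, (-new_w) % stride   # minimal padding to a multiple of 32
    resized = cv2.resize(img, (new_w, new_h))
    top, left = pad_h // 2, pad_w // 2
    return cv2.copyMakeBorder(resized, top, pad_h - top, left, pad_w - left,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)
```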
Secondly, in the backbone structure, a Focus structure (a down-sampling network structure) and a CSP structure (CSPNet, a cross-stage partial fusion network structure) may be used. By integrating the feature maps at the beginning and end of a network stage, this design respects the variability of the gradients, so the finally output feature map of the image to be detected retains more image features. The Focus structure slices the image before the image to be detected enters the backbone of the target detection model: a value is taken at every other pixel of the image, similar to neighboring down-sampling, producing four complementary images with no information lost, so the image feature information is concentrated into the channel space and the number of input channels is expanded 4 times; the resulting new image is then convolved, yielding a 2x down-sampled feature map with no information loss and preserving more complete down-sampling information of the image. The CSP structure splits the original input into two branches, performs convolutions on each to halve the number of channels, and then merges the branches so that the input and output have the same size, enabling the detection model to capture more features.
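The slicing operation of the Focus structure can be illustrated in a few lines of PyTorch (a generic sketch; the kernel size and other details are assumptions):

```python
import torch
from torch import nn

class Focus(nn.Module):
    """Take every other pixel in both directions to form four complementary
    sub-images, stack them on the channel axis (4x the input channels), then
    convolve - a 2x downsampling that loses no pixels."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels * 4, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: (N, C, H, W)
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)                                    # (N, out_channels, H/2, W/2)
```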
Thirdly, in a feature fusion structure (namely the Neck side of the Yolov5l model), the FPN + PAN feature extraction module can be used, downsampling layers with different top-down scales can be obtained, abundant and various image features can be extracted, and the recognition capability and classification accuracy of small targets such as object identification elements are considered.
Finally, in the prediction structure, the GIoU loss (Generalized Intersection over Union loss) may be used. IoU here acts as a distance, that is, an index for evaluating two rectangular boxes, and this index has all the properties of a distance: symmetry, non-negativity, identity, and the triangle inequality. The IoU loss reflects the ratio between the intersection and the union of two detection boxes, so it is independent of the absolute size of the boxes and only scale-related; for a large target and a small target with the same loss value, the detection effect on the large target is better than on the small one, so detection of small targets is not flexible enough. GIoU is equivalent to adding, to the IoU loss, a penalty term based on the closure (the smallest enclosing box) formed by the ground-truth box and the predicted box: the smaller the proportion of the closure's area not covered by the union of the two boxes, the better. When two rectangular boxes do not intersect, IoU is always 0, whereas GIoU still takes different values that correlate positively with detection quality. The GIoU loss is thus optimized for the case where the two rectangular boxes do not overlap; when the two boxes lie very close together, the values of the GIoU loss and the IoU loss are very close, so in some scenarios models trained with the two losses may perform similarly, but GIoU should converge faster. Using the GIoU loss therefore predicts the positions of small targets better, allowing the target detection model to predict the position of the object identification element in the image to be detected more accurately.
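For concreteness, a self-contained sketch of the GIoU loss described above (IoU minus the fraction of the smallest enclosing box not covered by the union, so non-overlapping boxes still produce a meaningful, position-aware loss):

```python
import torch

def giou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    inter_x1 = torch.max(pred[:, 0], target[:, 0])
    inter_y1 = torch.max(pred[:, 1], target[:, 1])
    inter_x2 = torch.min(pred[:, 2], target[:, 2])
    inter_y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (inter_x2 - inter_x1).clamp(min=0) * (inter_y2 - inter_y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)

    # Smallest enclosing ("closure") box around both rectangles.
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    enclosure = (enc_w * enc_h).clamp(min=1e-7)

    giou = iou - (enclosure - union) / enclosure
    return (1.0 - giou).mean()
```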
In step S202, a text relevance identification result of a target image area outside the interference text element is obtained, where the text relevance identification result represents at least one text paragraph area divided by the target image area correspondingly. This step mainly solves the problem that performing font recognition of the target image area alone may result in an inaccurate recognition result. During the image production process, text elements in the image may belong to different parts in the image, and when a part of the text elements are edited into the image, the text elements may be processed as the same paragraph, and the possibility that the fonts used by the text in the same paragraph are the same is high. Therefore, through paragraph division, text relevance identification can be performed on the font identification results possibly belonging to the same text paragraph area in the image to be detected, which is equivalent to performing comprehensive judgment of the font by combining the relevant texts, so that the inaccurate identification result of part of the text is eliminated, and the accuracy of font identification can be improved.
The target image area related in the embodiment of the application may include a target text element to be detected, and does not include an interference text element. The number of the target image areas can be one or more, and each target image area at least contains one target text element.
The text relevance identification result is used for representing at least one text paragraph area correspondingly divided by the target image area, and further determining a text paragraph where the target text element is located, and specifically, features such as the position, format, area, shape, image layer where the target text element is located, character type, font identification result, relative position relation between different target text elements, character number difference and the like in the target image area can be extracted for relevance calculation, and relevant texts are divided into the same text paragraph.
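A hedged sketch of how such a weighted relevance calculation might look; the feature names, weights, and threshold below are illustrative assumptions, since the application only states that these features are combined for relevance calculation:

```python
# Hypothetical feature names and preset weights for pairwise text relevance.
FEATURE_WEIGHTS = {
    'same_font_result': 0.4,   # the two text boxes got the same font recognition result
    'same_layer': 0.2,         # the two boxes sit on the same image layer / in the same region
    'close_position': 0.3,     # small spatial distance between the two boxes
    'similar_size': 0.1,       # comparable character size / occupied area
}

def relevance_score(features: dict) -> float:
    """features maps each feature name to a value in [0, 1]; the weighted sum is
    used to decide whether two text boxes belong to the same text paragraph area."""
    return sum(FEATURE_WEIGHTS[name] * features.get(name, 0.0) for name in FEATURE_WEIGHTS)

# Two boxes are placed in the same text paragraph area when the score passes a threshold.
SAME_PARAGRAPH_THRESHOLD = 0.6
```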
The text relevance identification result can be determined through a layout detection model, after a text paragraph area is determined, font identification can be performed on target text elements belonging to the same paragraph, and a font identification result of the whole paragraph of target text is output. The layout detection model may also be a target detection model trained in advance.
In the embodiments of the present application, text relevance identification of the text in the target image area may be performed before, after, or simultaneously with the determination of the interfering text elements. The text in the target image area may further be partitioned into different text boxes using a text box detection model, where each text box may contain one or more characters.
Optionally, the text box detection model includes an OCR (Optical Character Recognition) character recognition module, and the character recognition capability of OCR can improve the accuracy of the region images and of the characters within them.
Optionally, the text box detection model further includes an FPN feature extraction module. Further, a text box detection model based on ResNet + FPN (a residual neural network plus a feature pyramid network) can be adopted. This model is suited to target detection tasks, the detection target can be set to be a text box, and the model adapts well to deformation, angle changes, and similar conditions that a text box in an image may undergo (for example, the text box is not always horizontal in the image and may involve rotation or flipping, and the model can still detect it).
For example, the backbone network ResNet outputs feature maps from several different network layers (which may be all of the layers P0 to P4); conventionally, features are extracted from these output feature maps and then fused. However, because the resolution of the feature maps from P0 to P4 halves level by level, the conventional approach yields final features whose resolution is not high enough, which affects detection accuracy. When the FPN pyramid network is adopted as the feature extractor, instead of extracting features from the P0 layer alone, features are selected adaptively from the P0-P4 layers, so high-resolution features are obtained and detection accuracy is improved. Meanwhile, because the number of characters on an input picture to be detected is limited, the number of proposals taken from the extracted feature maps can be reduced from 512 to 128 in order to reserve more GPU memory for the subsequent high-resolution feature maps; experiments show that reducing the number of proposals does not hurt the performance of the detection part. In short, ResNet + FPN takes the feature maps of the various layers produced while ResNet processes the image and feeds them into the FPN. Finally, the text box detection model outputs the image with the target text elements marked by text boxes.
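A minimal sketch of a ResNet + FPN feature extractor assembled from torchvision building blocks (the stage-to-level mapping and channel sizes follow ResNet-50 conventions; the detection head that would predict the text boxes is omitted):

```python
from collections import OrderedDict

import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

# Pull the outputs of the four residual stages out of ResNet-50...
body = create_feature_extractor(
    resnet50(),
    return_nodes={'layer1': 'p1', 'layer2': 'p2', 'layer3': 'p3', 'layer4': 'p4'})
# ...and fuse them into a feature pyramid with uniform 256-channel maps.
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

image = torch.randn(1, 3, 800, 800)        # dummy stand-in for the image to be detected
features = fpn(OrderedDict(body(image)))   # one fused, high-resolution map per pyramid level
# A text-box detection head (not shown) would predict text boxes, including rotated
# or flipped ones, from these fused feature maps.
```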
In some embodiments, the obtaining of the text relevance identification result of the target image area other than the interference text element may specifically include two ways, one is obtaining a font identification result of the target image area other than the interference text element; and the other method is to perform text relevance identification on the font identification result of the target image area to obtain the text relevance identification result of the target image area. Therefore, the target text element can be further processed according to the font identification result of the target image area and the text relevance identification result of the target image area, and finally the corresponding target font is obtained.
For example, suppose there are 3 target image areas other than the interfering text elements in the image to be detected, and the text box detection model detects 2 text boxes containing target text elements in each target image area, giving 6 text boxes of target text elements in the image to be detected. Font recognition can be performed on the target text elements in the 6 text boxes, and text relevance identification is then performed on the 6 font recognition results. Weights can be preset for the different types of text relevance identification result, and the text paragraph to which the target text element in each text box belongs is then determined. For example, if among the font identification results the target text elements in 2 text boxes are both identified as the Song typeface, and relevance identification of the positional relation of these 2 target text elements finds that they lie in the same target image area, the 2 target text elements can be placed in the same text paragraph; if they are found to lie in different target image areas, they may be placed in different text paragraphs. As another example, even though the font identification results of the target text elements in 2 text boxes are both the Song typeface and both lie in the same target image area, if comparison of their other text relevance identification results finds that the 2 target text elements sit on different image layers or occupy areas of very different size within the target image area, then after comprehensive calculation using the preset weights of the font identification result, the relative positional relation, the image layer, and the occupied area, the 2 target text elements may still be placed in different text paragraphs.
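A simplified, self-contained sketch of the paragraph-level consolidation this example describes: mutually relevant text boxes are grouped into one text paragraph area, and the paragraph's font is taken as the majority of its per-box results. The grouping criterion below (same target image area) is a deliberately simplified stand-in for the full weighted relevance calculation:

```python
from collections import Counter, defaultdict

# Each detected text box: (id of the target image area it belongs to,
#                          font recognition result for that box).
boxes = [
    (0, 'Song typeface'), (0, 'Song typeface'),
    (1, 'Kai typeface'),  (1, 'Song typeface'),
    (2, 'Hei typeface'),  (2, 'Hei typeface'),
]

# Group boxes into text paragraph areas (here: one paragraph per target image area).
paragraphs = defaultdict(list)
for region_id, font in boxes:
    paragraphs[region_id].append(font)

# Consolidate: the font of the whole paragraph is the majority of its per-box results,
# which suppresses isolated, possibly inaccurate recognitions.
paragraph_fonts = {pid: Counter(fonts).most_common(1)[0][0]
                   for pid, fonts in paragraphs.items()}
print(paragraph_fonts)   # e.g. {0: 'Song typeface', 1: ..., 2: 'Hei typeface'}
```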
One possible implementation manner is that the obtaining of the font identification result of the target image area other than the interfering text element may include: determining a target image area outside the interference text element according to the interference image area where the interference text element is located; and carrying out font identification on the target image area to obtain a font identification result of the target image area.
That is to say, the interference image area where the interference text element is located can be defined according to the position of the interference text element in the image to be detected, so that other areas except the interference image area in the image to be detected can be determined as the target image area. Further, determining the text in the target image area as target text elements, and performing text recognition on the target text elements to obtain a font recognition result of the target text elements in the target image area. One or more interference image areas can be provided, and one or more target image areas determined according to the interference image areas can also be provided; each target image region contains at least one target text element. By adopting the method, the interfering text elements can be eliminated before the primary font identification is carried out on the target text elements, the area needing font identification is reduced, and the efficiency of font identification is further improved.
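One straightforward way to realize "the target image area is what remains once the interference image areas are removed" is to blank the interference areas out of the image before font identification. This is an assumption for illustration; the application does not prescribe how the complement is formed:

```python
import numpy as np

def remove_interference_areas(image: np.ndarray, interference_boxes) -> np.ndarray:
    """image: H x W x 3 array of the image to be detected.
    interference_boxes: iterable of (x1, y1, x2, y2) interference image areas.
    Returns a copy in which those areas are whited out, so subsequent font
    identification only ever sees the target image area."""
    target = image.copy()
    for x1, y1, x2, y2 in interference_boxes:
        target[int(y1):int(y2), int(x1):int(x2)] = 255
    return target
```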
In another possible implementation manner, when obtaining the font identification result of the target image area other than the interfering text element, a font identification result obtained by performing font identification on the image to be detected may be obtained first; then the font identification result corresponding to the interference image area where the interfering text element is located is removed from the font identification result of the image to be detected, to obtain the font identification result of the target image area.
The font recognition result obtained by performing font recognition on the image to be detected can be obtained by performing font recognition on all text elements in the image to be detected, regardless of which area of the image each text element belongs to; the obtained font recognition result of the image to be detected can also be the font recognition result of text elements belonging to the different text boxes delimited by the text box detection model. Other models, such as the semantic segmentation model and the target detection model, can be called to detect the interfering text elements in the image to be detected, and the interference image area where an interfering text element is located is delimited according to its position in the image to be detected, so that the font recognition result corresponding to that interference image area is removed from the font recognition result of the image to be detected, yielding the font recognition result of the target image area. The interfering text elements may be detected after the font recognition results of all text elements are obtained, before font recognition is performed on the image to be detected, or simultaneously with the font recognition; the order of font recognition and region division is not particularly limited, so the manner of performing font recognition on the target text elements has a certain flexibility.
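As a hedged illustration of removing the font identification results that fall inside an interference image area, the following sketch assumes that each font recognition result carries a bounding box and a confidence; the data layout and the 0.5 overlap threshold are illustrative assumptions rather than the patented implementation.

```python
# A simplified sketch (assumed) of removing font recognition results whose
# text boxes fall inside an interference image area, keeping only results
# for the target image area.
def box_overlap_ratio(box, region):
    """Fraction of `box` covered by `region`; boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box[0], region[0]), max(box[1], region[1])
    ix2, iy2 = min(box[2], region[2]), min(box[3], region[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / box_area if box_area > 0 else 0.0

def filter_results(font_results, interference_regions, max_overlap=0.5):
    """Keep results whose boxes are not mostly inside any interference region.

    `font_results` is assumed to be a list of dicts like
    {"box": (x1, y1, x2, y2), "font": "SimSun", "confidence": 0.93}.
    """
    kept = []
    for r in font_results:
        if all(box_overlap_ratio(r["box"], reg) <= max_overlap
               for reg in interference_regions):
            kept.append(r)
    return kept
```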
In an embodiment of the present application, the font recognition may be implemented by at least one of the following steps: calling a hybrid recognition model to perform image recognition on the image area to be recognized so as to perform font detection on each character respectively according to image features, wherein the hybrid recognition model is trained to execute the tasks of font recognition and text recognition simultaneously; calling a single-character recognition model to perform image recognition on the image area to be recognized so as to perform font detection on each character respectively according to image features; and calling an overall recognition model to perform image recognition on the image area to be recognized so as to perform font detection on all the characters as a whole according to image features. The three steps perform font recognition on the target text elements in different modes, so that if they are carried out in parallel, performing font recognition several times in different modes is equivalent to cross-checking the results, which strengthens the font detection and improves the accuracy of the font detection result.
The image area to be recognized involved here may be the target image area determined in the foregoing step according to the interference image area where the interfering text element is located; or it may be the whole area, or a partial area, of the image to be detected, which may include the interference image area. Furthermore, the image area to be recognized can be segmented into a plurality of smaller image areas to be recognized, and the text box detection model is used to delimit the one or more text elements belonging to each of them.
In the related art, the font corresponding to the text content to be detected can only be determined by comparing the text image in the image to be detected with the font images in a preset database, that is, by feature comparison between one image and another. This type of font recognition not only places high storage requirements on the database, but also cannot recognize fonts in new images that are not stored in the database. In the embodiments of the present application, the font of the text in the image to be detected can be recognized not only by comparing images alone, but also through a hybrid "image + text" recognition mode, in which the recognition accuracy of the two related tasks, image and text, is constrained jointly; the recognition results of the two tasks influence each other, which improves the accuracy of the font recognition result. Meanwhile, the embodiments of the present application place no special requirements on the database or on the image to be detected.
The related font identification method can be realized by at least one of the three steps, and at least one font identification result is finally obtained. Different steps can adopt different recognition models to respectively perform font recognition on the text elements to be recognized in the target image area. The different font recognition models adopt different technical routes, and the font results recognized by the same text element to be recognized may be the same or different.
Further, in some embodiments, the font identification method may further include: and taking the recognition confidence degrees corresponding to the various font detection results as weights, and performing weighted operation on the various font recognition results to obtain weighted font recognition results. Therefore, the probability of inaccurate font identification results of partial text can be reduced, and the accuracy of font identification is improved.
For example, three different recognition models based on CRNN (Convolutional Recurrent Neural Network) may be used to perform font recognition on the text elements to be recognized; the font results recognized by the different models are combined by weighted calculation, and the weighted font corresponding to the text elements to be recognized is output. In one example, the CRNN includes a CNN (Convolutional Neural Network) feature extraction layer and an LSTM (Long Short-Term Memory, a recurrent neural network with a special structure) sequence feature extraction layer, and can implement image-based sequence recognition. It is mainly used to recognize text sequences of indefinite length end to end: instead of first cutting out single characters, it converts text recognition into a sequence learning problem with a temporal dependence.
Optionally, the three different CRNN-based recognition models include a character decoding module configured to decode the characters in the text box delimited by the text box detection model or recognized by the OCR module, and to determine the font corresponding to the text element to be recognized in the text box. The character decoding module may be one or more of a CTC module (Connectionist Temporal Classification, a neural-network-based temporal classification algorithm), an Attention Decoder (an attention-based decoding algorithm), or an ACE module (Aggregation Cross-Entropy, a weakly supervised algorithm for sequence problems). In a model combining a CRNN module and a CTC module, the CNN in the CRNN first extracts image convolution features, the LSTM in the CRNN then further extracts sequence features from those convolution features, and finally CTC is introduced to solve the problem that characters cannot be aligned during training.
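For readers unfamiliar with this architecture, the following is a minimal CRNN-with-CTC sketch in PyTorch; it is an assumed illustration of the CNN + LSTM + CTC pipeline described above, not the model actually trained in this application, and the layer sizes and class count are placeholders.

```python
# A minimal CRNN-style sketch: a small CNN extracts image features, a
# bidirectional LSTM models the horizontal sequence, and a CTC loss aligns
# per-column predictions with the unsegmented label sequence.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes: int, img_h: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_h = img_h // 4                      # height after two poolings
        self.rnn = nn.LSTM(128 * feat_h, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)    # num_classes includes the CTC blank

    def forward(self, x):                        # x: (B, 1, H, W)
        f = self.cnn(x)                          # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one step per image column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)      # (B, W', num_classes)

# CTC loss expects (T, B, C) log-probabilities.
model = TinyCRNN(num_classes=100)
logits = model(torch.randn(2, 1, 32, 128)).permute(1, 0, 2)
targets = torch.randint(1, 100, (2, 10))         # label indices, 0 reserved for blank
loss = nn.CTCLoss(blank=0)(logits, targets,
                           input_lengths=torch.full((2,), logits.size(0)),
                           target_lengths=torch.full((2,), 10))
```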
One recognition model may be named the CRNN-CTC-TEXT model and performs hybrid recognition on the image area to be recognized. The hybrid recognition includes performing image recognition on the image area to be recognized so as to perform font detection on each character respectively according to image features, and, at the same time, performing image recognition on the image area to be recognized so as to perform text detection on each character respectively according to image features. In the training process of the hybrid recognition model, the recognition accuracy of the two related tasks is constrained simultaneously; the recognition results of the two tasks influence each other, and the accuracy of both the font recognition result and the character recognition result can be improved.
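A hedged sketch of such a dual-task head is given below: two output branches share the same CRNN sequence features, one predicting character classes and one predicting font classes per time step, so that training both tasks constrains the shared representation; the layer sizes are illustrative assumptions.

```python
# An assumed sketch of the "image + text" hybrid idea: two heads share the
# CRNN sequence features, one for character recognition and one for font
# recognition, and both losses are summed during training.
import torch.nn as nn

class HybridHead(nn.Module):
    def __init__(self, feat_dim: int, num_chars: int, num_fonts: int):
        super().__init__()
        self.char_fc = nn.Linear(feat_dim, num_chars)   # text recognition branch
        self.font_fc = nn.Linear(feat_dim, num_fonts)   # font recognition branch

    def forward(self, seq_feats):                       # (B, T, feat_dim)
        return (self.char_fc(seq_feats).log_softmax(-1),
                self.font_fc(seq_feats).log_softmax(-1))

# Training would sum the two CTC losses, e.g.
# loss = ctc(char_logits, char_targets, ...) + ctc(font_logits, font_targets, ...)
```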
Another recognition model may be named the CRNN-CTC-ACE model and performs image recognition on the image area to be recognized so as to perform font detection on each character according to image features. This model is not concerned with the character content; it only outputs the font type corresponding to each character according to the image features.
Another recognition model may be named the CRNN-Attention model and performs image recognition on the image area to be recognized so as to perform font detection on all the characters as a whole according to image features. This model is not concerned with the character content and only outputs the font type according to the image features. Because it does not contain a CTC module and cannot handle sequences of different lengths well, it does not adopt a character-by-character recognition mode; that is, it cannot output the font type corresponding to each character in a text box, and only outputs the font type corresponding to the whole text box.
Further, the output results of the different models in the recognition task may be weighted and averaged: each recognition model may output a font type confidence corresponding to each character of the text box or to the whole text box, and a weighted calculation using these confidences as weights outputs the font type of the text box.
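The following small sketch (an assumption, not the disclosed implementation) shows one way to fuse per-text-box font predictions from several recognition models using the confidences as weights.

```python
# An assumed sketch of fusing font predictions for one text box, produced by
# several models (or by several characters), using confidence as the weight.
from collections import defaultdict

def fuse_font_predictions(predictions):
    """`predictions` is a list of (font_name, confidence) pairs."""
    scores = defaultdict(float)
    for font, conf in predictions:
        scores[font] += conf
    total = sum(scores.values()) or 1.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total      # fused font and its normalized score

# Example: three models vote on one text box.
print(fuse_font_predictions([("SimSun", 0.9), ("SimSun", 0.7), ("KaiTi", 0.6)]))
# -> ('SimSun', 0.727...)
```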
Yet another possible implementation is based on the observation that the text elements in an image may belong to different parts of the image, and that text elements edited into the image as the same paragraph usually have the same font. Therefore, through paragraph division, text relevance recognition can be performed on the font recognition results that may belong to the same text paragraph area in the image to be detected, which is equivalent to judging the font comprehensively in combination with the related text; inaccurate recognition results for part of the text are excluded, and the accuracy of font recognition can be improved. For example, whether target text elements may belong to the same paragraph can be determined according to relevance recognition results such as their position information, and comparing the font recognition results of the text elements within a paragraph reduces the possibility that part of the text receives an inaccurate result. Correspondingly, when the text relevance recognition result of the target image area is obtained by performing text relevance recognition on the font recognition result of the target image area, paragraph type detection can be performed on the target image area according to the font recognition result corresponding to the at least one text paragraph area, where the paragraph type may include a table, a text paragraph, or a text title; the image areas whose font recognition results belong to the same paragraph type are then divided into the same text paragraph area.
In the embodiments of the present application, a target detection model based on Faster R-CNN (Faster Regions with CNN features) may be used for paragraph type detection. The model is suitable for target detection tasks, and three detection targets may be set, namely "table", "text paragraph", and "text title".
Faster-RCNN is a deep learning detection algorithm. By adding an RPN (Region Proposal Network) and an Anchor mechanism, it integrates feature extraction, candidate frame selection, frame regression and classification into one network, effectively improving detection accuracy and detection efficiency. The Anchor box is a multi-scale sliding window, also called a prior box, used to detect objects in the image: anchors of different sizes and different aspect ratios are generated at each unit element of the image feature map in order to determine candidate frames.
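As an aside, the anchor generation described here can be sketched as follows; the scales and aspect ratios are illustrative assumptions.

```python
# A small sketch (assumption, not the patent's code) of generating anchor
# boxes of several scales and aspect ratios around one feature-map location,
# as the Anchor mechanism does at every position.
def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * (r ** 0.5)       # width/height chosen so that w * h ≈ s * s
            h = s / (r ** 0.5)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes                     # 9 anchors per location for 3 scales x 3 ratios

print(len(anchors_at(16, 16)))       # -> 9
```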
The detection steps of Faster-RCNN are as follows. Firstly, the image to be detected is input. Secondly, candidate regions are extracted: candidate regions are extracted from the input image and mapped to the final convolutional feature layer according to the spatial position relationship; that is, the input image is scaled and passed through the convolutional layers to extract features and obtain a feature map, which is then fed into the RPN (Region Proposal Network, a fully convolutional network) to generate a series of possible candidate frames. The tasks in the RPN network have two parts. One is classification: judging whether each preset anchor is positive or negative (that is, whether the anchor contains a target). The other is regression: using bbox regression (bounding box regression) to correct the anchors and obtain more accurate proposals (coordinates). The RPN network therefore performs part of the detection in advance, namely determining whether there is a target and correcting the anchors so that the positions of the candidate frames are more accurate. Thirdly, region normalization: each candidate region formed by the original feature map and the candidate frames output by the RPN is input into a RoI Pooling layer (Region of Interest Pooling layer) to obtain features of fixed dimension. After RoI Pooling, the proposals can be pooled into fully connected inputs of fixed length, without deformation and without loss of information. Fourthly, classification and regression: the extracted features are input into the fully connected layers to perform target classification and coordinate regression, for example classifying the features with Softmax (the normalized exponential function) and regressing the positions of the candidate regions; alternatively, the specific category to which the candidate region in each candidate frame belongs can be calculated from the proposal feature maps, and a final accurate position of the candidate frame is obtained by performing bbox regression (bounding box regression) once more.
For example, firstly, the image to be detected, or the target detection image region segmented by the semantic segmentation model, or the image detected by the text box detection model, is input at the input end of the layout detection model; the Anchor mechanism in the layout detection model is called, and anchors of different sizes and different aspect ratios are generated on the input image. Secondly, the feature map is sent to the RPN network; in the RPN, 9 anchors are predicted for each pixel of the input feature map, and the center coordinates of each anchor are the coordinates at which the current pixel is mapped back onto the input image. The widths and heights of the anchor boxes can be preset manually. The convolutions in the RPN do not change the resolution of the feature map, but the number of output feature map channels is related to the anchor boxes and expresses the probability that each anchor box is foreground or background; whether an anchor box (the values before coordinate adjustment) belongs to the positive or negative samples is judged, and the candidate frames are then determined. For negative samples only the classification loss (classified as background) is calculated; for positive samples both the classification loss (classified as foreground) and the regression loss (correcting the coordinate values of the anchor boxes) are calculated. Thirdly, each candidate region formed by the original feature map and the candidate frames output by the RPN is input into the RoI Pooling layer to obtain features of fixed dimension. The RPN and Fast R-CNN share convolutional layer features, and the task of the Fast R-CNN part is to classify and regress the feature vector of each candidate frame region input from the RPN. Furthermore, the ROI (Region of Interest) region on the feature map is obtained from the mapping of the candidate frame, and ROI pooling is performed to obtain a feature vector of fixed length; finally, the feature vector is sent into the fully connected layers to obtain the classification category vector and the position coordinates.
Optionally, the Faster-RCNN-based model has two outputs, namely Softmax (the normalized exponential function) and bbox regression (bounding box regression). Softmax corresponds to the classification result, that is, which class the current candidate frame belongs to, and the bbox regressor outputs the position of the current candidate frame in the picture. In the embodiments of the present application, the output classes of the Softmax may be limited in advance for detecting the different targets, and three detection targets may be set, namely "table", "text paragraph", and "text title".
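For illustration, a detector restricted to these three layout targets could be configured with the torchvision reference implementation of Faster R-CNN roughly as follows; the class mapping, score threshold and input size are assumptions, and this is not the model configuration of the present application.

```python
# A minimal sketch of a Faster R-CNN layout detector with three classes
# ("table", "text paragraph", "text title") plus background, using
# torchvision (>= 0.13). Training details are omitted.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

LAYOUT_CLASSES = {1: "table", 2: "text paragraph", 3: "text title"}  # 0 = background

model = fasterrcnn_resnet50_fpn(weights=None, num_classes=len(LAYOUT_CLASSES) + 1)
model.eval()

with torch.no_grad():
    image = torch.rand(3, 800, 600)              # one RGB image, values in [0, 1]
    outputs = model([image])                     # list with one dict per image
    for box, label, score in zip(outputs[0]["boxes"],
                                 outputs[0]["labels"],
                                 outputs[0]["scores"]):
        if score > 0.5:
            print(LAYOUT_CLASSES[int(label)], box.tolist(), float(score))
```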
In step S203, a target font corresponding to the target text element is obtained according to the font identification result corresponding to the at least one text paragraph area.
The font recognition result involved here can be the font recognition result of the target text elements in each area obtained by segmenting the target text elements of different areas; or all the font recognition results obtained by performing font recognition on the target text elements multiple times; or a font recognition result obtained by integrating and correcting (for example, weighting) the font recognition results of the target text elements. The output font recognition result can be of one type or of multiple types; the probability of each font recognition result can be provided at the same time, or only the one or several font recognition results with the highest probability can be output.
In this embodiment of the present application, when obtaining a target font corresponding to a target text element according to a font identification result corresponding to at least one text paragraph region, a weighting operation may be performed on the at least one font identification result according to a text region area and an identification confidence corresponding to the at least one font identification result in the text paragraph region, so as to obtain the target font corresponding to the target text element in the text paragraph region.
In one possible implementation, regions that may belong to the same paragraph are marked out according to the characteristics of the text elements in the image to be detected. The paragraph area is divided according to the text box formats in the target image area, and text boxes conforming to a table format, a paragraph format, a title format and the like are identified as text boxes of the same paragraph area. The texts of the text boxes belonging to the same paragraph area can be subjected to overall character recognition detection, and the text relevance recognition result within the paragraph is output; alternatively, no overall character recognition detection is performed on the text of the paragraph area, and instead a weighted calculation is performed on the recognition results of the text boxes in the paragraph area, based on the text relevance recognition results of the text boxes in the target image area, outputting a weighted font detection result for the characters of the text boxes in the paragraph area. Because different text boxes correspond to different font confidences, the font confidences can be set as first weights and the text box areas as second weights; the proportion between the first weights and the second weights is set, and the target font corresponding to the target text elements in the paragraph is then output through weighted calculation.
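A minimal sketch of this paragraph-level weighting is given below; the 0.7/0.3 split between the confidence weight and the area weight, and the data layout, are illustrative assumptions.

```python
# A hedged sketch of paragraph-level font voting: each text box votes for its
# recognized font with a weight combining recognition confidence (first
# weight) and box area (second weight).
from collections import defaultdict

def paragraph_font(boxes, conf_weight=0.7, area_weight=0.3):
    """`boxes` is a list of dicts like
    {"font": "SimSun", "confidence": 0.9, "area": 1200.0} for one paragraph."""
    total_area = sum(b["area"] for b in boxes) or 1.0
    votes = defaultdict(float)
    for b in boxes:
        votes[b["font"]] += (conf_weight * b["confidence"]
                             + area_weight * b["area"] / total_area)
    return max(votes, key=votes.get)

print(paragraph_font([
    {"font": "SimSun", "confidence": 0.92, "area": 900.0},
    {"font": "SimSun", "confidence": 0.85, "area": 700.0},
    {"font": "KaiTi", "confidence": 0.60, "area": 200.0},
]))  # -> SimSun
```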
Optionally, font infringement analysis may be performed according to the identified target font, and corresponding infringement analysis early warning information and/or infringement processing policy may be provided.
The infringement analysis early warning information involved in the embodiments of the present application includes, but is not limited to: the target font; whether the target font belongs to a copyright-protected font; information on the copyright owner of the target font; price information of the target font; suggestions for changing the target font (for example, which other fonts it can be changed to without infringement risk); ways to obtain a font-changing service; and the quotation and time of the font-changing service. The information on the copyright owner of the target font may include the name of the copyright owner, contact information, font authorization information, font sales information, and the like. Through this information, the font detection result is obtained, and it can further be learned whether the font in the image to be detected may carry an infringement risk, and what subsequent cost and time may be incurred.
For example, when the identified target font belongs to a font that requires authorization from a font vendor, the display page of the client can indicate which texts in the image to be detected may belong to that font, display the probability percentage that those texts and the target font are the same font, and show the font with the highest similarity for reference; meanwhile, suggestions for changing the font are provided, or an image modification function is provided, through which professional designers then provide the service of modifying the font types in the image.
Fig. 3 is a flowchart of an image-based font identification method according to another embodiment of the present application, which may include:
in step S301, an image to be detected is rendered on the text recognition page.
The text recognition page can be deployed on a server side or a client side; it can be a static or dynamic page on the display screen of various electronic media such as a mobile phone, a desktop computer, a notebook computer, a tablet computer or a smart watch; it can also be the display interface of a program function plug-in. The present application does not limit this.
In step S302, a target font correspondingly identified to the image to be detected is obtained; the target font is determined according to a font identification result corresponding to at least one text paragraph area, and the text paragraph area is obtained by dividing a text relevance identification result of the target image area in the target image area except for interference character elements in the image to be detected.
The related text paragraph region, the interfering word element, the target image region, and the text relevance identification result may refer to the concept or process description in other embodiments, and are not described herein again.
Optionally, the font recognition result may be determined according to the matching degree between the characters recognized in the image to be detected and the fonts in a font library. For example, candidate fonts corresponding to the target image region may be selected from a plurality of font types based on the probabilities of the target image region under the plurality of font types. If the probability of the region image under a certain font type in the font library is larger than a probability threshold, that font type is determined as a candidate font corresponding to the target image area; alternatively, the font types ranked in the first few places from the largest probability to the smallest may be selected to determine several candidate fonts corresponding to the target image area.
In one possible mode, if the matching degree of the characters in the candidate font and the character image corresponding to the target image area is greater than the matching threshold, determining the candidate font as the target font to which the characters in the target image area belong; in another possible mode, among the characters in the candidate fonts, the character in the candidate font which has the highest matching degree with the characters in the target image area is determined, and the candidate font is determined as the target font; in yet another possible manner, among the characters in the candidate fonts, the character in the candidate font which has the highest degree of matching with the characters in the target image region and the degree of matching is greater than the matching threshold is determined, and the candidate font is determined as the target font.
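The candidate-selection and matching logic of this step can be sketched as follows; the probability threshold, top-k value and matching threshold are illustrative assumptions, as are the font names.

```python
# A simplified sketch of this step: fonts whose probability for the region
# exceeds a threshold (or the top-k fonts) become candidates, then the
# candidate with the highest matching degree above a matching threshold is
# returned as the target font.
def select_candidates(font_probs, prob_threshold=0.2, top_k=3):
    """`font_probs` maps font name -> probability for the target image region."""
    above = [f for f, p in font_probs.items() if p > prob_threshold]
    ranked = sorted(font_probs, key=font_probs.get, reverse=True)[:top_k]
    return above or ranked

def target_font(candidates, match_degrees, match_threshold=0.8):
    """`match_degrees` maps font name -> matching degree between the font's
    glyphs and the character image in the target image region."""
    best = max(candidates, key=lambda f: match_degrees.get(f, 0.0))
    return best if match_degrees.get(best, 0.0) > match_threshold else None

cands = select_candidates({"SimSun": 0.55, "KaiTi": 0.25, "SimHei": 0.05})
print(target_font(cands, {"SimSun": 0.91, "KaiTi": 0.62}))   # -> SimSun
```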
In step S303, the target font and the corresponding infringement analysis early warning information and/or infringement processing policy are displayed on the text identification page.
In a possible implementation manner, if the target font displayed on the text recognition page belongs to a copyright-protected font, the client may further perform at least one of the following steps: firstly, showing the target font and the probability that it belongs to a copyright-protected font, and giving the price of purchasing the target font, the way of modifying it, or a modification suggestion for it; secondly, displaying a purchase page for the target font or a purchase page for the service of modifying the target font; and finally, displaying a font modification page corresponding to the image to be detected.
Corresponding to the application scenarios and the method provided by the embodiments of the application, an embodiment of the application further provides an image-based font recognition apparatus. Fig. 4 is a block diagram illustrating the structure of an exemplary image-based font recognition apparatus according to an embodiment of the present application, and the apparatus may include:
and an interference text determining module 401, configured to determine, from the image to be detected, an interference text element other than the target text element.
A text relevance identification module 402, configured to obtain a text relevance identification result of a target image area outside the interfering text element, where the text relevance identification result represents at least one text paragraph area correspondingly divided by the target image area.
A target font obtaining module 403, configured to obtain a target font corresponding to the target text element according to the font identification result corresponding to the at least one text paragraph area.
In a possible implementation manner, the interference text determination module may include:
and the attention detection calling submodule is used for calling an attention detection module of the semantic segmentation model to detect commodity object elements and background elements from the image to be detected.
And the main body segmentation calling submodule is used for calling a main body segmentation module of the semantic segmentation model to perform edge segmentation on the commodity object elements and the background elements, and determining an image area defined after the edge segmentation as an object image area corresponding to the commodity object elements and the background elements.
In another possible implementation manner, the interference text determination module may include:
and the target detection calling submodule is used for calling a target detection model to detect the object identification elements and the identification image areas where the object identification elements are located from the image to be detected, and the target detection model is used for detecting the object identification elements and the corresponding identification image areas in parallel.
In some embodiments, the text association identification module may include:
and the recognition result acquisition submodule is used for acquiring the font recognition result of the target image area except the interference text element.
And the text relevance identification submodule is used for performing text relevance identification on the font identification result of the target image area to obtain the text relevance identification result of the target image area.
In a possible implementation manner, the recognition result obtaining sub-module is specifically configured to determine, according to an interference image area where the interference text element is located, a target image area other than the interference text element; and carrying out font identification on the target image area to obtain a font identification result of the target image area.
In another possible implementation manner, the recognition result obtaining sub-module is specifically configured to obtain a font recognition result obtained by performing font recognition on the image to be detected; and removing the font identification result corresponding to the interference image area where the interference text element is positioned from the font identification result of the image to be detected to obtain the font identification result of the target image area.
In some embodiments, font recognition may be implemented by at least one of:
the mixed recognition unit is used for calling a mixed recognition model to perform image recognition on an image area to be recognized so as to respectively perform font detection on each character according to image characteristics, and the mixed recognition model is trained to simultaneously execute the tasks of font recognition and text recognition;
the single character recognition unit is used for calling a single character recognition model to perform image recognition on the image area to be recognized so as to respectively perform font detection on each character according to image characteristics;
and the overall recognition unit is used for calling an overall recognition model to carry out image recognition on the image area to be recognized so as to carry out font detection on all the characters as a whole according to the image features.
In other embodiments, font recognition may also be implemented by at least one of:
and the recognition result weighting unit is used for weighting the recognition results of the various fonts by taking the recognition confidence degrees corresponding to the detection results of the various fonts as weights, so as to obtain the weighted font recognition results.
In a possible implementation manner, the text relevance identification submodule may include:
a paragraph type detection unit, configured to perform paragraph type detection on the target area image according to a font identification result corresponding to the at least one text paragraph area, where the paragraph type includes a table, a text paragraph, or a text heading;
and the paragraph area dividing unit is used for dividing the image areas where the font identification results belonging to the same paragraph type are located into the same text paragraph area.
In some possible implementations, the target font obtaining module may include:
and the recognition result weighting submodule is used for carrying out weighting operation on at least one font recognition result according to the text area and the recognition confidence coefficient corresponding to at least one font recognition result in the text paragraph area to obtain a target font corresponding to a target text element in the text paragraph area.
In other possible implementations, the apparatus may further include:
and the font infringement analysis module is used for carrying out font infringement analysis according to the identified target font and providing corresponding infringement analysis early warning information and/or an infringement processing strategy.
Corresponding to the application scenarios and the method provided by the embodiments of the application, an embodiment of the application further provides an image-based font recognition apparatus. Fig. 5 is a block diagram illustrating the structure of an image-based font recognition apparatus according to an embodiment of the present application, and the apparatus may include:
and a detection image submitting module 501, configured to submit the image to be detected on the text recognition page.
A target font obtaining module 502, configured to obtain a target font correspondingly identified for the image to be detected; the target font is determined according to a font identification result corresponding to at least one text paragraph area, and the text paragraph area is obtained by dividing the target image area other than the interference character elements in the image to be detected based on the text relevance identification result of the target image area.
And an infringement information disclosure module 503, configured to display the target font and corresponding infringement analysis early warning information and/or infringement processing policy on the text identification page.
The functions of the modules in the apparatuses in the embodiment of the present application may refer to the corresponding descriptions in the above method, and have corresponding beneficial effects, which are not described herein again.
FIG. 6 is a block diagram of an electronic device used to implement embodiments of the present application. As shown in fig. 6, the electronic device includes: a memory 601 and a processor 602, wherein the memory 601 stores a computer program executable on the processor 602. The processor 602, when executing the computer program, implements the methods in the above embodiments. There may be one or more memories 601 and processors 602.
The electronic device further includes:
the communication interface 603 is configured to communicate with an external device, and perform data interactive transmission.
If the memory 601, the processor 602 and the communication interface 603 are implemented independently, the memory 601, the processor 602 and the communication interface 603 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
Optionally, in a specific implementation, if the memory 601, the processor 602, and the communication interface 603 are integrated on a chip, the memory 601, the processor 602, and the communication interface 603 may complete mutual communication through an internal interface.
Embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and run an instruction stored in a memory from the memory, so that a communication device in which the chip is installed executes the method provided in the embodiment of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machine (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory. The memory may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may include a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may include a Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM may be used, for example Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method described in a flowchart or otherwise herein may be understood as representing a module, segment, or portion of code, which includes one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps described in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the above embodiments may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium and which, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only an exemplary embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various changes or substitutions within the technical scope of the present application, and these should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. An image-based font recognition method, comprising:
determining interference text elements except the target text elements from the image to be detected;
acquiring a text relevance identification result of a target image area except the interference text element, wherein the text relevance identification result represents at least one text paragraph area correspondingly divided by the target image area;
and obtaining a target font corresponding to the target text element according to the font identification result corresponding to the at least one text paragraph area.
2. The method of claim 1, wherein the interfering text elements include at least a merchandise object element and a background element, and the determining the interfering text elements other than the target text element from the image to be detected includes:
calling an attention detection module of a semantic segmentation model to detect commodity object elements and background elements from the image to be detected;
and calling a main body segmentation module of the semantic segmentation model to perform edge segmentation on the commodity object element and the background element, and determining an image area defined after the edge segmentation as an object image area corresponding to the commodity object element and the background element.
3. The method according to claim 1, wherein the interfering text elements include at least an object identification element, and the determining the interfering text elements other than the target text element from the image to be detected includes:
and calling a target detection model to detect the object identification elements and the identification image areas where the object identification elements are located from the image to be detected, wherein the target detection model is used for detecting the object identification elements and the corresponding identification image areas in parallel.
4. The method of claim 1, wherein the obtaining of the text relevance identification result of the target image region outside the interfering text element comprises:
acquiring a font identification result of a target image area except the interference text element;
and performing text relevance identification on the font identification result of the target image area to obtain the text relevance identification result of the target image area.
5. The method of claim 4, wherein the obtaining of the font identification result for the target image area outside the interfering text element comprises:
determining a target image area outside the interference text element according to the interference image area where the interference text element is located;
and carrying out font identification on the target image area to obtain a font identification result of the target image area.
6. The method of claim 4, wherein the obtaining of the font identification result of the target image area outside the interfering text element comprises:
acquiring a font identification result obtained by carrying out font identification on the image to be detected;
and removing the font identification result corresponding to the interference image area where the interference text element is positioned from the font identification result of the image to be detected to obtain the font identification result of the target image area.
7. The method according to claim 5 or 6, wherein said font recognition is achieved by at least one of the following steps:
calling a hybrid recognition model to perform image recognition on an image area to be recognized so as to perform font detection on each character according to image characteristics, wherein the hybrid recognition model is trained to execute a task of font recognition and text recognition at the same time;
calling a single character recognition model to perform image recognition on the image area to be recognized so as to respectively perform font detection on each character according to image characteristics;
and calling an overall recognition model to perform image recognition on the image area to be recognized so as to perform font detection on all character overall according to image characteristics.
8. The method of claim 7, wherein the font identification further comprises:
and taking the recognition confidence degrees corresponding to the various font detection results as weights, and performing weighted operation on the various font recognition results to obtain weighted font recognition results.
9. The method according to claim 4, wherein the performing text relevance identification on the font identification result of the target image area, and obtaining the text relevance identification result of the target image area comprises:
detecting paragraph types of the target area images according to font identification results corresponding to the at least one text paragraph area, wherein the paragraph types comprise tables, text paragraphs or text titles;
and dividing the image area where the font identification results belonging to the same paragraph type are located into the same text paragraph area.
10. The method of claim 1, wherein the obtaining a target font corresponding to the target text element according to the font identification result corresponding to the at least one text paragraph region comprises:
and performing weighting operation on at least one font recognition result according to the text area and the recognition confidence corresponding to at least one font recognition result in the text paragraph area to obtain a target font corresponding to a target text element in the text paragraph area.
11. The method of claim 1, wherein the method further comprises:
and carrying out font infringement analysis according to the identified target font, and providing corresponding infringement analysis early warning information and/or an infringement processing strategy.
12. An image-based font recognition method, comprising:
submitting an image to be detected on a text recognition page;
acquiring a target font correspondingly identified to the image to be detected; the target font is determined according to a font identification result corresponding to at least one text paragraph area, and the text paragraph area is obtained by dividing a text relevance identification result of the target image area in the target image area except interference character elements in an image to be detected;
and displaying the target font and corresponding infringement analysis early warning information and/or infringement processing strategy on the text identification page.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the method of any one of claims 1-12 when executing the computer program.
14. A computer-readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-12.
CN202211559191.6A 2022-12-06 2022-12-06 Image-based font identification method and device, electronic equipment and storage medium Pending CN115797947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211559191.6A CN115797947A (en) 2022-12-06 2022-12-06 Image-based font identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115797947A (en) 2023-03-14

Family

ID=85417417



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination